Parallel Seq Scan

Started by Amit Kapila, about 11 years ago (496 messages)
#1 Amit Kapila
amit.kapila16@gmail.com
1 attachment(s)

As per the discussion on another thread about using
custom scan nodes to prototype parallel sequence scan,
I have developed such a prototype, but directly, by adding
new nodes for parallel sequence scan. There might be
some advantages to developing this as a contrib
module using custom scan nodes; however, I think
we might get stuck at some point because of the limits of
custom scan node capability, as pointed out by Andres.

The basic idea is that while evaluating the cheapest
path for a scan, the optimizer will also evaluate whether it can use
a parallel seq path. Currently I have kept a very simple
model for the cost of a parallel sequence path:
divide the CPU and disk cost by the available number
of worker backends (we can enhance it based on further
experiments and discussion; we need to consider worker startup
and dynamic shared memory setup cost as well). The work, i.e. the
scan of blocks, is divided equally among all workers (except in
corner cases where the blocks can't be divided equally among workers;
then the last worker is responsible for scanning the remaining blocks).
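
The cost model and block split described above can be sketched in C
as follows. This is only an illustration under stated assumptions:
BlockRange, parallel_seqscan_cost() and worker_block_range() are
hypothetical names, not functions from the patch, and the patch may
compute these values differently.

```c
#include <assert.h>

/* Hypothetical type for illustration: a contiguous run of heap blocks. */
typedef struct BlockRange
{
	unsigned	start;		/* first block this worker scans */
	unsigned	nblocks;	/* number of blocks this worker scans */
} BlockRange;

/*
 * Simple model from the mail: the serial scan cost (CPU plus disk) is
 * divided by the number of workers.  Worker startup and shared memory
 * setup costs are deliberately ignored here, as in the description.
 */
double
parallel_seqscan_cost(double serial_cost, int nworkers)
{
	assert(nworkers > 0);
	return serial_cost / nworkers;
}

/*
 * Split total_blocks equally; the last worker also takes whatever
 * remains when the blocks do not divide evenly.
 */
BlockRange
worker_block_range(unsigned total_blocks, int nworkers, int worker)
{
	BlockRange	range;
	unsigned	per_worker;

	assert(nworkers > 0 && worker >= 0 && worker < nworkers);

	per_worker = total_blocks / (unsigned) nworkers;
	range.start = (unsigned) worker * per_worker;
	range.nblocks = per_worker;
	if (worker == nworkers - 1)
		range.nblocks = total_blocks - range.start;

	return range;
}
```

With a serial cost of 101.00, 100 blocks and 4 workers (the values in
the EXPLAIN output in this mail), the sketch yields a cost of 25.25 and
25 blocks per worker. With 10 blocks and 4 workers, the last worker
would scan 4 blocks and the others 2 each.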

The number of worker backends that can be used for
parallel seq scan is configured with a new GUC,
parallel_seqscan_degree. Its default value is zero,
which means parallel seq scan will not be considered unless
the user configures this value.

In the ExecutorStart phase, the master backend starts the required number
of workers as per the parallel seq scan plan, sets up dynamic shared
memory, and shares the information each worker needs to execute the scan.
Currently I share just the relId, the targetlist and the number
of blocks to be scanned by each worker; however, I think we might want
to generate a plan for each worker in the master backend and
then share that with the individual worker.
To fetch the data from the multiple queues, one per
worker, a simple mechanism is used: fetch from the first queue
until all its data is consumed, then fetch from the second
queue, and so on. The master backend is responsible only for
getting the data from workers and passing it back to the client.
I am sure we can improve this strategy in many ways,
for example by making the master backend also scan some
of the blocks rather than just collecting data from workers, and
by using a better strategy to fetch data from multiple queues.
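
The "drain one queue, then move to the next" strategy can be sketched
as below. This is a simplified model, not the patch's shm_getnext():
the queues are plain in-memory arrays of integers instead of shm_mq
queues, and DemoQueue, DemoScan and demo_getnext() are hypothetical
names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one worker's result queue. */
typedef struct DemoQueue
{
	const int  *items;		/* tuples produced by the worker */
	size_t		len;		/* number of tuples in the queue */
	size_t		pos;		/* next tuple to hand out */
} DemoQueue;

/* Hypothetical scan state: remembers which queue is being drained. */
typedef struct DemoScan
{
	DemoQueue  *queues;
	int			nqueues;
	int			current;	/* index of the queue used by the last fetch */
} DemoScan;

/*
 * Return 1 and store the next tuple in *out, or return 0 once every
 * queue is exhausted.  A queue is consumed completely before the scan
 * advances to the following one.
 */
int
demo_getnext(DemoScan *scan, int *out)
{
	assert(scan->nqueues > 0);

	while (scan->current < scan->nqueues)
	{
		DemoQueue  *queue = &scan->queues[scan->current];

		if (queue->pos < queue->len)
		{
			*out = queue->items[queue->pos++];
			return 1;
		}
		scan->current++;	/* this queue is empty; try the next one */
	}
	return 0;
}
```

One consequence visible in the sketch is that results from the second
worker are not consumed until the first worker has finished, which is
the limitation that a better fetch strategy would address.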

The worker backend receives the scan information
from the master backend, generates a plan from it, and
executes that plan. The work to scan the data after
generating the plan is very similar to exec_simple_query()
(i.e. create the portal and run it based on the planned statement),
except that the worker backend initializes the block range it wants to
scan in the executor initialization phase (ExecInitSeqScan()).
Workers exit after sending the data to the master backend,
which essentially means that for each execution we need
to start the workers again. I think we can improve this by giving
control of the workers to the postmaster so that we don't need to
initialize them each time during execution; however, this can
be a totally separate optimization, which is better done
independently of this patch.
As we currently don't have a mechanism to share transaction
state, I have used a separate transaction in the worker backend to
execute the plan.

Any error in the master backend, whether reported by a worker backend or
caused by another issue in the master backend itself, should terminate all
the workers before aborting the transaction.
We can't do it with the error context callback mechanism
(error_context_stack) which we use at other places in the code, as
for this case we need it from the time the workers are started until
the execution is complete (error_context_stack could get reset
once control leaves the function which has set it).
One way could be to maintain the callback information in
TransactionState and use it to kill the workers before aborting the
transaction in the main backend. Another could be to have another
variable similar to error_context_stack (used
specifically for storing the workers' state), and kill the workers
in errfinish via a callback. Currently I have handled it at the time of
detaching from shared memory.
Another point that needs care in the worker backend is
that if any error occurs, we should *not* abort the transaction, as
the transaction state is shared across all workers.
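
The first alternative above (keeping the cleanup callback with the
transaction state so that it outlives any single function call) could
look roughly like this. It is a toy model only: DemoXactState is not
PostgreSQL's TransactionState, no real processes are signalled, and
every name in the sketch is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*DemoWorkerCleanup) (void *arg);

/* Hypothetical per-transaction state holding the worker cleanup hook. */
typedef struct DemoXactState
{
	DemoWorkerCleanup cleanup;	/* NULL when no workers are running */
	void	   *cleanup_arg;
	int			aborted;
} DemoXactState;

/* Hypothetical bookkeeping for the workers started by one scan. */
typedef struct DemoWorkerSet
{
	int			nworkers;
	int			running[8];		/* 1 while the worker is alive */
} DemoWorkerSet;

/* Called when workers are started; the hook stays set until abort. */
void
demo_register_workers(DemoXactState *xact, DemoWorkerCleanup fn, void *arg)
{
	assert(xact->cleanup == NULL);
	xact->cleanup = fn;
	xact->cleanup_arg = arg;
}

/* Stand-in for terminating every worker in the set. */
void
demo_stop_workers(void *arg)
{
	DemoWorkerSet *set = arg;

	for (int i = 0; i < set->nworkers; i++)
		set->running[i] = 0;
}

/* Abort path: stop the workers first, then abort the transaction. */
void
demo_abort_transaction(DemoXactState *xact)
{
	if (xact->cleanup != NULL)
	{
		xact->cleanup(xact->cleanup_arg);
		xact->cleanup = NULL;
	}
	xact->aborted = 1;
}
```

Because the hook lives in the transaction-level structure rather than
on a function's stack frame, it remains reachable from the moment the
workers are started until execution completes.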

Currently parallel seq scan will not be considered
for statements other than SELECT, or if there is a join in
the statement, or if the statement contains quals, or if the target
list contains non-Var fields. We can definitely support
simple quals and target lists containing more than plain Vars. By simple,
I mean that they should not contain functions or other
conditions which can't be pushed down to the worker backend.
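
The restrictions listed above amount to a simple eligibility check,
sketched here. DemoQueryShape and parallel_seqscan_allowed() are
hypothetical; the patch makes these decisions inside the planner on
real query structures, not on a set of flags like this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of the query properties the mail mentions. */
typedef struct DemoQueryShape
{
	bool		is_select;				/* statement is a SELECT */
	bool		has_join;				/* more than one relation is joined */
	bool		has_quals;				/* WHERE clause conditions are present */
	bool		targetlist_all_vars;	/* every target entry is a plain Var */
} DemoQueryShape;

/*
 * Return true only when the current prototype would consider a
 * parallel seq scan: the GUC is non-zero and none of the unsupported
 * features is present.
 */
bool
parallel_seqscan_allowed(const DemoQueryShape *query,
						 int parallel_seqscan_degree)
{
	assert(query != NULL);

	if (parallel_seqscan_degree <= 0)
		return false;			/* default of zero disables the feature */
	if (!query->is_select || query->has_join || query->has_quals)
		return false;
	return query->targetlist_all_vars;
}
```

For "select c1 from t1" with parallel_seqscan_degree set to 4 the check
passes; adding a WHERE clause or an expression in the select list makes
it fail.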

The behaviour of some simple statements with the patch is shown below:

postgres=# create table t1(c1 int, c2 char(500)) with (fillfactor=10);
CREATE TABLE

postgres=# insert into t1 values(generate_series(1,100),'amit');
INSERT 0 100

postgres=# explain select c1 from t1;
QUERY PLAN
------------------------------------------------------
Seq Scan on t1 (cost=0.00..101.00 rows=100 width=4)
(1 row)

postgres=# set parallel_seqscan_degree=4;
SET
postgres=# explain select c1 from t1;
QUERY PLAN
--------------------------------------------------------------
Parallel Seq Scan on t1 (cost=0.00..25.25 rows=100 width=4)
Number of Workers: 4
Number of Blocks Per Workers: 25
(3 rows)

postgres=# explain select Distinct(c1) from t1;
QUERY PLAN
--------------------------------------------------------------------
HashAggregate (cost=25.50..26.50 rows=100 width=4)
Group Key: c1
-> Parallel Seq Scan on t1 (cost=0.00..25.25 rows=100 width=4)
Number of Workers: 4
Number of Blocks Per Workers: 25
(5 rows)

The attached patch is just to facilitate the discussion about
parallel seq scan and maybe some other dependent tasks, like
sharing various states such as combocid and snapshot with parallel
workers. It is by no means ready for any complex test; of course,
I will work towards making it more robust, both in terms of adding
more functionality and doing performance optimizations.

Thoughts/Suggestions?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

parallel_seqscan_v1.patch (application/octet-stream)
diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index 21721b4..823d5c3 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/access
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS	    = brin common gin gist hash heap index nbtree rmgrdesc spgist transam
+SUBDIRS	    = brin common gin gist hash heap index nbtree rmgrdesc shmmq spgist transam
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/shmmq/Makefile b/src/backend/access/shmmq/Makefile
new file mode 100644
index 0000000..aeae8d9
--- /dev/null
+++ b/src/backend/access/shmmq/Makefile
@@ -0,0 +1,17 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for access/shmmq
+#
+# IDENTIFICATION
+#    src/backend/access/shmmq/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/shmmq
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = shmmqam.o 
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/shmmq/shmmqam.c b/src/backend/access/shmmq/shmmqam.c
new file mode 100644
index 0000000..7be7ba8
--- /dev/null
+++ b/src/backend/access/shmmq/shmmqam.c
@@ -0,0 +1,374 @@
+/*-------------------------------------------------------------------------
+ *
+ * shmmqam.c
+ *	  shared memory queue access method code
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/shmmq/shmmqam.c
+ *
+ *
+ * INTERFACE ROUTINES
+ *		shm_getnext	- retrieve next tuple in queue
+ *
+ * NOTES
+ *	  This file contains the shmmq_ routines which implement
+ *	  the POSTGRES shared memory access method used for all POSTGRES
+ *	  relations.
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup.h"
+#include "access/htup_details.h"
+#include "access/shmmqam.h"
+#include "access/tupdesc.h"
+#include "fmgr.h"
+#include "libpq/libpq.h"
+#include "libpq/pqformat.h"
+#include "utils/lsyscache.h"
+
+
+static HeapTuple
+form_result_tuple(worker_result resultState, TupleDesc tupdesc,
+				  StringInfo msg);
+
+/*
+ * Indicate that an error came from a particular worker.
+ */
+static void
+worker_error_callback(void *arg)
+{
+	pid_t	pid = * (pid_t *) arg;
+
+	errcontext("worker backend, pid %d", pid);
+}
+
+/*
+ * shm_beginscan -
+ *		Initializes the shared memory scan descriptor to retrieve tuples
+ *		from worker backends. 
+ */
+ShmScanDesc
+shm_beginscan(int num_queues)
+{
+	ShmScanDesc		shmscan;
+
+	shmscan = palloc(sizeof(ShmScanDescData));
+
+	shmscan->num_shm_queues = num_queues;
+	shmscan->ss_cqueue = -1;
+	shmscan->shmscan_inited	= false;
+
+	return shmscan;
+}
+
+/*
+ * ExecInitWorkerResult -
+ *		Initializes the result state to retrieve tuples from worker backends. 
+ */
+worker_result
+ExecInitWorkerResult(TupleDesc tupdesc)
+{
+	worker_result	workerResult;
+	int				i;
+	int	natts = tupdesc->natts;
+
+	workerResult = palloc0(sizeof(worker_result_state));
+	workerResult->receive_functions = palloc(sizeof(FmgrInfo) * natts);
+	workerResult->typioparams = palloc(sizeof(Oid) * natts);
+
+	for (i = 0;	i < natts; ++i)
+	{
+		Oid	receive_function_id;
+
+		getTypeBinaryInputInfo(tupdesc->attrs[i]->atttypid,
+							   &receive_function_id,
+							   &workerResult->typioparams[i]);
+		fmgr_info(receive_function_id, &workerResult->receive_functions[i]);
+	}
+
+	return workerResult;
+}
+
+
+/*
+ * shm_getnext -
+ *		Get the next tuple from shared memory queue.  This function
+ *	is responsible for fetching tuples from all the queues associated
+ *	with worker backends used in parallel sequential scan.
+ */
+HeapTuple
+shm_getnext(ShmScanDesc shmScan, worker_result resultState,
+			shm_mq_handle **responseq, TupleDesc tupdesc)
+{
+	shm_mq_result	res;
+	char			msgtype;
+	Size			nbytes;
+	void			*data;
+	StringInfoData	msg;
+	int32			pid = 1234;
+	int				queueId = 0;
+
+
+	/*state = palloc0(sizeof(worker_result_state));
+	state->receive_functions = palloc(sizeof(FmgrInfo) * natts);
+	state->typioparams = palloc(sizeof(Oid) * natts);
+
+	for (i = 0;	i < natts; ++i)
+	{
+		Oid	receive_function_id;
+
+		getTypeBinaryInputInfo(tupdesc->attrs[i]->atttypid,
+							   &receive_function_id,
+							   &state->typioparams[i]);
+		fmgr_info(receive_function_id, &state->receive_functions[i]);
+	}*/
+
+	/*
+	 * calculate next starting queue used for fetching tuples
+	 */
+	if(!shmScan->shmscan_inited)
+	{
+		shmScan->shmscan_inited = true;
+		Assert(shmScan->num_shm_queues > 0);
+		queueId = 0;
+		--shmScan->num_shm_queues;
+	}
+	else
+		queueId = shmScan->ss_cqueue;
+
+	/* Initialize message buffer. */
+	initStringInfo(&msg);
+
+	/* Read and process messages from the shared memory queues. */
+	for(;;)
+	{
+		for (;;)
+		{
+			/*
+			 * mark current queue used for fetching tuples, this is used
+			 * to fetch consecutive tuples from queue used in previous
+			 * fetch.
+			 */
+			shmScan->ss_cqueue = queueId;
+
+			/* Get next message. */
+			res = shm_mq_receive(responseq[queueId], &nbytes, &data, false);
+			if (res != SHM_MQ_SUCCESS)
+				break;
+
+			/*
+			 * Message-parsing routines operate on a null-terminated StringInfo,
+			 * so we must construct one.
+			 */
+			resetStringInfo(&msg);
+			enlargeStringInfo(&msg, nbytes);
+			msg.len = nbytes;
+			memcpy(msg.data, data, nbytes);
+			msg.data[nbytes] = '\0';
+			msgtype = pq_getmsgbyte(&msg);
+
+			/* Dispatch on message type. */
+			switch (msgtype)
+			{
+				case 'E':
+				case 'N':
+					{
+						ErrorData	edata;
+						ErrorContextCallback context;
+
+						/* Parse ErrorResponse or NoticeResponse. */
+						pq_parse_errornotice(&msg, &edata);
+
+						/*
+						 * Limit the maximum error level to ERROR.  We don't want
+						 * a FATAL inside the backend worker to kill the user
+						 * session.
+						 */
+						if (edata.elevel > ERROR)
+							edata.elevel = ERROR;
+
+						/*
+						 * Rethrow the error with an appropriate context method.
+						 * On error, we need to ensure that master backend stop
+						 * all other workers before propagating the error, so
+						 * we need to pass the pid's of all workers, so that same
+						 * can be done in error callback.
+						 * XXX - For now, I am just sending some random number, this
+						 * needs to be fixed.
+						 */
+						context.callback = worker_error_callback;
+						context.arg = (void *) &pid;
+						context.previous = error_context_stack;
+						error_context_stack = &context;
+						ThrowErrorData(&edata);
+						error_context_stack = context.previous;
+
+						break;
+					}
+				case 'A':
+					{
+						/* Propagate NotifyResponse. */
+						pq_putmessage(msg.data[0], &msg.data[1], nbytes - 1);
+						break;
+					}
+				case 'T':
+					{
+						int16	natts = pq_getmsgint(&msg, 2);
+						int16	i;
+
+						if (resultState->has_row_description)
+							elog(ERROR, "multiple RowDescription messages");
+						resultState->has_row_description = true;
+						if (natts != tupdesc->natts)
+							ereport(ERROR,
+									(errcode(ERRCODE_DATATYPE_MISMATCH),
+										errmsg("worker result rowtype does not match "
+										"the specified FROM clause rowtype")));
+
+						for (i = 0; i < natts; ++i)
+						{
+							Oid		type_id;
+
+							(void) pq_getmsgstring(&msg);	/* name */
+							(void) pq_getmsgint(&msg, 4);	/* table OID */
+							(void) pq_getmsgint(&msg, 2);	/* table attnum */
+							type_id = pq_getmsgint(&msg, 4);	/* type OID */
+							(void) pq_getmsgint(&msg, 2);	/* type length */
+							(void) pq_getmsgint(&msg, 4);	/* typmod */
+							(void) pq_getmsgint(&msg, 2);	/* format code */
+
+							if (type_id != tupdesc->attrs[i]->atttypid)
+								ereport(ERROR,
+										(errcode(ERRCODE_DATATYPE_MISMATCH),
+											errmsg("remote query result rowtype does not match "
+											"the specified FROM clause rowtype")));
+						}
+
+						pq_getmsgend(&msg);
+
+						break;
+					}
+				case 'D':
+					{
+						/* Handle DataRow message. */
+						HeapTuple	result;
+
+						result = form_result_tuple(resultState, tupdesc, &msg);
+						return result;
+					}
+				case 'C':
+					{
+						/*
+						 * Handle CommandComplete message. Ignore tags sent by
+						 * worker backend as we are anyway going to use tag of
+						 * master backend for sending the same to client.
+						 */
+						(void) pq_getmsgstring(&msg);
+						break;
+					}
+				case 'G':
+				case 'H':
+				case 'W':
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+									errmsg("COPY protocol not allowed in worker")));
+					}
+
+				case 'Z':
+					{
+						/* Handle ReadyForQuery message. */
+						resultState->complete = true;
+						break;
+					}
+				default:
+					elog(WARNING, "unknown message type: %c (%zu bytes)",
+							msg.data[0], nbytes);
+					break;
+			}
+		}
+
+		/* Check whether the connection was broken prematurely. */
+		if (!resultState->complete)
+			ereport(ERROR,
+					(errcode(ERRCODE_CONNECTION_FAILURE),
+					 errmsg("lost connection to worker process with PID %d",
+					 pid)));
+
+		/*
+		 * if we have exhausted data from all worker queues, then terminate
+		 * processing data from queues.
+		 */
+		if (shmScan->num_shm_queues <=0)
+			break;
+		else
+		{
+			++queueId;
+			--shmScan->num_shm_queues;
+			resultState->has_row_description = false;
+		}
+	}
+
+	return NULL;
+}
+
+/*
+ * form_result_tuple -
+ * Parse a DataRow message and form a result tuple.
+ */
+static HeapTuple
+form_result_tuple(worker_result resultState, TupleDesc tupdesc,
+				  StringInfo msg)
+{
+	/* Handle DataRow message. */
+	int16	natts = pq_getmsgint(msg, 2);
+	int16	i;
+	Datum  *values = NULL;
+	bool   *isnull = NULL;
+	StringInfoData	buf;
+
+	if (!resultState->has_row_description)
+		elog(ERROR, "DataRow not preceded by RowDescription");
+	if (natts != tupdesc->natts)
+		elog(ERROR, "malformed DataRow");
+	if (natts > 0)
+	{
+		values = palloc(natts * sizeof(Datum));
+		isnull = palloc(natts * sizeof(bool));
+	}
+	initStringInfo(&buf);
+
+	for (i = 0; i < natts; ++i)
+	{
+		int32	bytes = pq_getmsgint(msg, 4);
+
+		if (bytes < 0)
+		{
+			values[i] = ReceiveFunctionCall(&resultState->receive_functions[i],
+											NULL,
+											resultState->typioparams[i],
+											tupdesc->attrs[i]->atttypmod);
+			isnull[i] = true;
+		}
+		else
+		{
+			resetStringInfo(&buf);
+			appendBinaryStringInfo(&buf, pq_getmsgbytes(msg, bytes), bytes);
+			values[i] = ReceiveFunctionCall(&resultState->receive_functions[i],
+											&buf,
+											resultState->typioparams[i],
+											tupdesc->attrs[i]->atttypmod);
+			isnull[i] = false;
+		}
+	}
+
+	pq_getmsgend(msg);
+
+	return heap_form_tuple(tupdesc, values, isnull);
+}
\ No newline at end of file
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 332f04a..f158583 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -714,6 +714,7 @@ ExplainPreScanNode(PlanState *planstate, Bitmapset **rels_used)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_ParallelSeqScan:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -910,6 +911,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_SeqScan:
 			pname = sname = "Seq Scan";
 			break;
+		case T_ParallelSeqScan:
+			pname = sname = "Parallel Seq Scan";
+			break;
 		case T_IndexScan:
 			pname = sname = "Index Scan";
 			break;
@@ -1059,6 +1063,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_ParallelSeqScan:
 		case T_BitmapHeapScan:
 		case T_TidScan:
 		case T_SubqueryScan:
@@ -1325,6 +1330,16 @@ ExplainNode(PlanState *planstate, List *ancestors,
 				show_instrumentation_count("Rows Removed by Filter", 1,
 										   planstate, es);
 			break;
+		case T_ParallelSeqScan:
+			show_scan_qual(plan->qual, "Filter", planstate, ancestors, es);
+			if (plan->qual)
+				show_instrumentation_count("Rows Removed by Filter", 1,
+										   planstate, es);
+			ExplainPropertyInteger("Number of Workers",
+				((ParallelSeqScan *) plan)->num_workers, es);
+			ExplainPropertyInteger("Number of Blocks Per Workers",
+				((ParallelSeqScan *) plan)->num_blocks_per_worker, es);
+			break;
 		case T_FunctionScan:
 			if (es->verbose)
 			{
@@ -2142,6 +2157,7 @@ ExplainTargetRel(Plan *plan, Index rti, ExplainState *es)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_ParallelSeqScan:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index af707b0..9a8ca75 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -21,7 +21,7 @@ OBJS = execAmi.o execCurrent.o execGrouping.o execJunk.o execMain.o \
        nodeLimit.o nodeLockRows.o \
        nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
        nodeNestloop.o nodeFunctionscan.o nodeRecursiveunion.o nodeResult.o \
-       nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
+       nodeSeqscan.o nodeParallelSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
        nodeValuesscan.o nodeCtescan.o nodeWorktablescan.o \
        nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
        nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o spi.o
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index e27c062..a28e74e 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -100,6 +100,7 @@
 #include "executor/nodeMergejoin.h"
 #include "executor/nodeModifyTable.h"
 #include "executor/nodeNestloop.h"
+#include "executor/nodeParallelSeqscan.h"
 #include "executor/nodeRecursiveunion.h"
 #include "executor/nodeResult.h"
 #include "executor/nodeSeqscan.h"
@@ -190,6 +191,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												   estate, eflags);
 			break;
 
+		case T_ParallelSeqScan:
+			result = (PlanState *) ExecInitParallelSeqScan((ParallelSeqScan *) node,
+														   estate, eflags);
+			break;
+
 		case T_IndexScan:
 			result = (PlanState *) ExecInitIndexScan((IndexScan *) node,
 													 estate, eflags);
@@ -406,6 +412,10 @@ ExecProcNode(PlanState *node)
 			result = ExecSeqScan((SeqScanState *) node);
 			break;
 
+		case T_ParallelSeqScanState:
+			result = ExecParallelSeqScan((ParallelSeqScanState *) node);
+			break;
+
 		case T_IndexScanState:
 			result = ExecIndexScan((IndexScanState *) node);
 			break;
@@ -644,6 +654,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSeqScan((SeqScanState *) node);
 			break;
 
+		case T_ParallelSeqScanState:
+			ExecEndParallelSeqScan((ParallelSeqScanState *) node);
+			break;
+
 		case T_IndexScanState:
 			ExecEndIndexScan((IndexScanState *) node);
 			break;
diff --git a/src/backend/executor/nodeParallelSeqscan.c b/src/backend/executor/nodeParallelSeqscan.c
new file mode 100644
index 0000000..3d651b5
--- /dev/null
+++ b/src/backend/executor/nodeParallelSeqscan.c
@@ -0,0 +1,441 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeParallelSeqscan.c
+ *	  Support routines for parallel sequential scans of relations.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeParallelSeqscan.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * INTERFACE ROUTINES
+ *		ExecParallelSeqScan				sequentially scans a relation.
+ *		ExecSeqNext				retrieve next tuple in sequential order.
+ *		ExecInitParallelSeqScan			creates and initializes a parallel seqscan node.
+ *		ExecEndParallelSeqScan			releases any storage allocated.
+ */
+#include "postgres.h"
+
+#include "access/relscan.h"
+#include "access/shmmqam.h"
+#include "commands/dbcommands.h"
+#include "executor/execdebug.h"
+#include "executor/nodeSeqscan.h"
+#include "executor/nodeParallelSeqscan.h"
+#include "postmaster/backendworker.h"
+#include "utils/rel.h"
+
+
+
+/* ----------------------------------------------------------------
+ *						Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ParallelSeqNext
+ *
+ *		This is a workhorse for ExecParallelSeqScan
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ParallelSeqNext(ParallelSeqScanState *node)
+{
+	HeapTuple	tuple;
+	HeapScanDesc scandesc;
+	EState	   *estate;
+	ScanDirection direction;
+	TupleTableSlot *slot;
+
+	/*
+	 * get information from the estate and scan state
+	 */
+	scandesc = node->ss.ss_currentScanDesc;
+	estate = node->ss.ps.state;
+	direction = estate->es_direction;
+	slot = node->ss.ss_ScanTupleSlot;
+
+	/*
+	 * get the next tuple from the table based on result tuple descriptor.
+	 */
+	tuple = shm_getnext(node->pss_currentShmScanDesc, node->pss_workerResult,
+						node->responseq,
+						node->ss.ps.ps_ResultTupleSlot->tts_tupleDescriptor);
+
+	/*
+	 * save the tuple and the buffer returned to us by the access methods in
+	 * our scan tuple slot and return the slot.  Note: we pass 'false' because
+	 * tuples returned by heap_getnext() are pointers onto disk pages and were
+	 * not created with palloc() and so should not be pfree()'d.  Note also
+	 * that ExecStoreTuple will increment the refcount of the buffer; the
+	 * refcount will not be dropped until the tuple table slot is cleared.
+	 */
+	if (tuple)
+		ExecStoreTuple(tuple,	/* tuple to store */
+					   slot,	/* slot to store in */
+					   scandesc->rs_cbuf,		/* buffer associated with this
+												 * tuple */
+					   false);	/* don't pfree this pointer */
+	else
+		ExecClearTuple(slot);
+
+	return slot;
+}
+
+/*
+ * ParallelSeqRecheck -- access method routine to recheck a tuple in EvalPlanQual
+ */
+static bool
+ParallelSeqRecheck(SeqScanState *node, TupleTableSlot *slot)
+{
+	/*
+	 * Note that unlike IndexScan, ParallelSeqScan never use keys in
+	 * heap_beginscan (and this is very bad) - so, here we do not check
+	 * are keys ok or not.
+	 */
+	return true;
+}
+
+/* ----------------------------------------------------------------
+ *		InitParallelScanRelation
+ *
+ *		Set up to access the scan relation.
+ * ----------------------------------------------------------------
+ */
+static void
+InitParallelScanRelation(SeqScanState *node, EState *estate, int eflags)
+{
+	Relation	currentRelation;
+	HeapScanDesc currentScanDesc;
+
+	/*
+	 * get the relation object id from the relid'th entry in the range table,
+	 * open that relation and acquire appropriate lock on it.
+	 */
+	currentRelation = ExecOpenScanRelation(estate,
+									  ((SeqScan *) node->ps.plan)->scanrelid,
+										   eflags);
+
+	/* initialize a heapscan */
+	currentScanDesc = heap_beginscan(currentRelation,
+									 estate->es_snapshot,
+									 0,
+									 NULL);
+
+	node->ss_currentRelation = currentRelation;
+	node->ss_currentScanDesc = currentScanDesc;
+
+	/* and report the scan tuple slot's rowtype */
+	ExecAssignScanType(node, RelationGetDescr(currentRelation));
+}
+
+
+/* ----------------------------------------------------------------
+ *		ExecInitParallelSeqScan
+ * ----------------------------------------------------------------
+ */
+ParallelSeqScanState *
+ExecInitParallelSeqScan(ParallelSeqScan *node, EState *estate, int eflags)
+{
+	ParallelSeqScanState *parallelscanstate;
+	ShmScanDesc			 currentShmScanDesc;
+	worker_result		 workerResult;
+
+	/*
+	 * Once upon a time it was possible to have an outerPlan of a SeqScan, but
+	 * not any more.
+	 */
+	Assert(outerPlan(node) == NULL);
+	Assert(innerPlan(node) == NULL);
+
+	/*
+	 * create state structure
+	 */
+	parallelscanstate = makeNode(ParallelSeqScanState);
+	parallelscanstate->ss.ps.plan = (Plan *) node;
+	parallelscanstate->ss.ps.state = estate;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * create expression context for node
+	 */
+	ExecAssignExprContext(estate, &parallelscanstate->ss.ps);
+
+	/*
+	 * initialize child expressions
+	 */
+	parallelscanstate->ss.ps.targetlist = (List *)
+		ExecInitExpr((Expr *) node->scan.plan.targetlist,
+					 (PlanState *) parallelscanstate);
+	parallelscanstate->ss.ps.qual = (List *)
+		ExecInitExpr((Expr *) node->scan.plan.qual,
+					 (PlanState *) parallelscanstate);
+
+	/*
+	 * tuple table initialization
+	 */
+	ExecInitResultTupleSlot(estate, &parallelscanstate->ss.ps);
+	ExecInitScanTupleSlot(estate, &parallelscanstate->ss);
+
+	/*
+	 * initialize scan relation
+	 */
+	InitParallelScanRelation(&parallelscanstate->ss, estate, eflags);
+
+	parallelscanstate->ss.ps.ps_TupFromTlist = false;
+
+	/*
+	 * Initialize result tuple type and projection info.
+	 */
+	ExecAssignResultTypeFromTL(&parallelscanstate->ss.ps);
+	ExecAssignScanProjectionInfo(&parallelscanstate->ss);
+
+	/* Initialize the workers required to perform parallel scan. */
+	InitiateWorkers(parallelscanstate->ss.ss_currentRelation->rd_id,
+					node->scan.plan.targetlist,
+					&parallelscanstate->responseq,
+					&parallelscanstate->seg,
+					node->num_blocks_per_worker,
+					node->num_workers);
+
+	
+	/*
+	 * use result tuple descriptor to fetch data from shared memory queues
+	 * as the worker backends would have put the data after projection.
+	 * number of queue's must be equal to number of worker backends.
+	 */
+	currentShmScanDesc = shm_beginscan(node->num_workers);
+	workerResult = ExecInitWorkerResult(parallelscanstate->ss.ps.ps_ResultTupleSlot->tts_tupleDescriptor);
+
+	parallelscanstate->pss_currentShmScanDesc = currentShmScanDesc;
+	parallelscanstate->pss_workerResult	= workerResult;
+
+	return parallelscanstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecParallelSeqScan(node)
+ *
+ *		Scans the relation sequentially from multiple workers and returns
+ *		the next qualifying tuple.
+ *		We call the ExecScan() routine and pass it the appropriate
+ *		access method functions.
+ * ----------------------------------------------------------------
+ */
+TupleTableSlot *
+ExecParallelSeqScan(ParallelSeqScanState *node)
+{
+	return ExecScan((ScanState *) &node->ss,
+					(ExecScanAccessMtd) ParallelSeqNext,
+					(ExecScanRecheckMtd) ParallelSeqRecheck);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndParallelSeqScan
+ *
+ *		frees any storage allocated through C routines.
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndParallelSeqScan(ParallelSeqScanState *node)
+{
+	Relation	relation;
+	HeapScanDesc scanDesc;
+
+	/*
+	 * get information from node
+	 */
+	relation = node->ss.ss_currentRelation;
+	scanDesc = node->ss.ss_currentScanDesc;
+
+	/*
+	 * Free the exprcontext
+	 */
+	ExecFreeExprContext(&node->ss.ps);
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+
+	/*
+	 * close heap scan
+	 */
+	heap_endscan(scanDesc);
+
+	/*
+	 * close the heap relation.
+	 */
+	ExecCloseScanRelation(relation);
+
+	/* detach from dynamic shared memory. */
+	dsm_detach(node->seg);
+}
+
+/*
+ * EstimateScanRelationIdSpace:
+ * Returns the size needed to store the ScanRelationId for the current query
+ */
+Size
+EstimateScanRelationIdSpace(Oid relId)
+{
+	Size		size;
+
+	size = sizeof(relId);
+
+	return size;
+}
+
+/*
+ * SerializeScanRelationId:
+ * Dumps the relationId onto the memory location at start_address.
+ */
+void
+SerializeScanRelationId(Oid relId, Size maxsize, char *start_address)
+{
+	memcpy(start_address, &relId, maxsize);
+}
+
+/*
+ * RestoreScanRelationId:
+ * Reads the relationId from the specified address, restore it into given
+ * relationId.
+ */
+void
+RestoreScanRelationId(Oid *relId, char *start_address)
+{
+	memcpy(relId, start_address, sizeof(Oid));
+}
+
+/*
+ * EstimateTargetListSpace:
+ * Returns the size needed to store the Targetlist for the current query
+ */
+Size
+EstimateTargetListSpace(List *targetList)
+{
+	Size		size;
+
+	/* Add space reqd for saving the data size of the targetlist */
+	size = sizeof(Size);
+
+	size = add_size(size,
+					mul_size(targetList->length, sizeof(TargetEntry)));
+
+	/*
+	 * For now, lets just support for Var type of nodes.
+	 *
+	 * FIXME - we need to traverse target list and allocate depending
+	 *		   on the node type.
+	 */
+	/*size = add_size(size, mul_size(targetList->length, sizeof(Expr)));*/
+	size = add_size(size, mul_size(targetList->length, sizeof(Var)));
+	
+	/*
+	 * Account for column names, we could get exact length for each column
+	 * name, however as NAMEDATALEN is not too big, this seems okay.
+	 */
+	size = add_size(size, mul_size(targetList->length, NAMEDATALEN));
+
+	return size;
+}
+
+/*
+ * SerializeTargetList:
+ * Dumps each target entry onto the memory location at start_address.
+ */
+void
+SerializeTargetList(List *targetList, Size maxsize, char *start_address)
+{
+	Size	targetListSize;
+	char	*curptr;
+	ListCell   *l;
+	TargetEntry	*srctargetEntry;
+	TargetEntry	*desttargetEntry;
+
+	targetListSize = targetList->length;
+
+	/* copy target list size */
+	memcpy(start_address, &targetListSize, sizeof(targetListSize));
+	curptr = start_address + sizeof(targetListSize);
+	maxsize -= sizeof(targetListSize);
+
+	/* copy each target list entry */
+	foreach(l, (List *) targetList)
+	{
+		if (maxsize < sizeof(TargetEntry))
+			elog(ERROR, "not enough space to serialize given target list");
+		maxsize -= sizeof(TargetEntry);
+		srctargetEntry = (TargetEntry *) lfirst(l);
+
+		desttargetEntry = (TargetEntry *)curptr;
+		memcpy(desttargetEntry, srctargetEntry, sizeof(TargetEntry));
+
+		/*
+		 * For now, let's just support Var nodes.
+		 *
+		 * FIXME - we need to traverse target list and serialize depending
+		 *		   on the node type.
+		 */
+		desttargetEntry->expr = (Expr*) ((char*) desttargetEntry + sizeof(TargetEntry));
+		memcpy(desttargetEntry->expr, srctargetEntry->expr, sizeof(Var));
+		desttargetEntry->resname = 
+			(char*) ((char*) desttargetEntry + sizeof(TargetEntry) + sizeof(Var));
+		memcpy(desttargetEntry->resname, (char*) srctargetEntry->resname,
+			   strlen(srctargetEntry->resname)+1);
+
+		curptr += sizeof(TargetEntry);
+		curptr += sizeof(Var);
+		curptr += NAMEDATALEN;	/* name slot, as in EstimateTargetListSpace */
+	}
+}
+
+/*
+ * RestoreTargetList:
+ * Reads the targetlist from the specified address and restores it into the
+ * given targetlist.
+ */
+void
+RestoreTargetList(List **targetList, char *start_address)
+{
+	Size	targetListSize;
+	char	*curptr;
+	char	*colname;
+	TargetEntry	*srctargetEntry;
+	TargetEntry	*desttargetEntry;
+
+	memcpy(&targetListSize, start_address, sizeof(targetListSize));
+	curptr = start_address + sizeof(targetListSize);
+
+	while (targetListSize-- > 0)
+	{
+		desttargetEntry = makeNode(TargetEntry);
+		srctargetEntry = (TargetEntry *)curptr;
+		memcpy(desttargetEntry, srctargetEntry, sizeof(TargetEntry));
+
+		desttargetEntry->expr = (Expr*) copyObject((Expr*)((char*)srctargetEntry + sizeof(TargetEntry)));
+
+		/*
+		 * For now, let's just support Var nodes.
+		 *
+		 * FIXME - we need to traverse target list and restore depending
+		 *		   on the node type.
+		 */
+		colname = (char*)((char*)srctargetEntry + sizeof(TargetEntry) + sizeof(Var));
+
+		desttargetEntry->resname = colname ? pstrdup(colname) : (char*) NULL;
+
+		*targetList = lappend(*targetList, desttargetEntry);
+
+		curptr += sizeof(TargetEntry);
+		curptr += sizeof(Var);
+		curptr += NAMEDATALEN;	/* must match the stride in SerializeTargetList */
+	}
+}
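
The three target-list routines above agree on a flat layout: a Size count, then one fixed-width slot per entry holding the TargetEntry, its Var, and a NAMEDATALEN-byte name field. Below is a minimal standalone sketch of that layout, not the patch code. DemoEntry, DemoVar, NAME_SLOT and the demo_* functions are hypothetical stand-ins, and the point is only that the estimate, the writer and the reader must all step by the same slot size.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NAME_SLOT 64			/* hypothetical stand-in for NAMEDATALEN */

/* Hypothetical stand-ins for TargetEntry and Var; not the PostgreSQL structs. */
typedef struct DemoEntry { int resno; } DemoEntry;
typedef struct DemoVar { int varno; int varattno; } DemoVar;

/* One slot: entry header, its Var, then a fixed-width name field. */
#define SLOT_SIZE (sizeof(DemoEntry) + sizeof(DemoVar) + NAME_SLOT)

/* Space needed for a count header plus n fixed-width slots. */
size_t
demo_estimate(size_t n)
{
	return sizeof(size_t) + n * SLOT_SIZE;
}

/* Returns bytes written, or 0 if buf is too small or a name does not fit. */
size_t
demo_serialize(const DemoEntry *entries, const DemoVar *vars,
			   const char *const *names, size_t n,
			   char *buf, size_t bufsize)
{
	char	   *cur = buf;
	size_t		i;

	if (bufsize < demo_estimate(n))
		return 0;

	memcpy(cur, &n, sizeof(n));
	cur += sizeof(n);

	for (i = 0; i < n; i++)
	{
		char	   *name_slot = cur + sizeof(DemoEntry) + sizeof(DemoVar);

		if (strlen(names[i]) >= NAME_SLOT)
			return 0;
		memcpy(cur, &entries[i], sizeof(DemoEntry));
		memcpy(cur + sizeof(DemoEntry), &vars[i], sizeof(DemoVar));
		memset(name_slot, 0, NAME_SLOT);
		strcpy(name_slot, names[i]);
		cur += SLOT_SIZE;		/* same stride the reader uses */
	}
	return (size_t) (cur - buf);
}

/* Reads slot i back out; name must have room for NAME_SLOT bytes. */
void
demo_restore_one(const char *buf, size_t i,
				 DemoEntry *entry, DemoVar *var, char *name)
{
	const char *slot = buf + sizeof(size_t) + i * SLOT_SIZE;

	memcpy(entry, slot, sizeof(DemoEntry));
	memcpy(var, slot + sizeof(DemoEntry), sizeof(DemoVar));
	memcpy(name, slot + sizeof(DemoEntry) + sizeof(DemoVar), NAME_SLOT);
}
```

Note that in C, sizeof applied to an integer constant such as NAMEDATALEN yields the size of an int, not the value of the constant, so a name field reserved with NAMEDATALEN bytes has to be stepped over with NAMEDATALEN itself.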
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 53cfda5..131cfc5 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -139,6 +139,22 @@ InitScanRelation(SeqScanState *node, EState *estate, int eflags)
 									 0,
 									 NULL);
 
+	/*
+	 * set the scan limits, if requested by plan.  If the end block
+	 * is not specified, then scan all the blocks till end.
+	 */
+	if (((SeqScan *) node->ps.plan)->startblock != InvalidBlockNumber &&
+		((SeqScan *) node->ps.plan)->endblock != InvalidBlockNumber)
+		heap_setscanlimits(currentScanDesc,
+						   ((SeqScan *) node->ps.plan)->startblock,
+						   (((SeqScan *) node->ps.plan)->endblock -
+						   ((SeqScan *) node->ps.plan)->startblock));
+	else if (((SeqScan *) node->ps.plan)->startblock != InvalidBlockNumber)
+			 heap_setscanlimits(currentScanDesc,
+								((SeqScan *) node->ps.plan)->startblock,
+								(currentScanDesc->rs_nblocks -
+								((SeqScan *) node->ps.plan)->startblock));
+
 	node->ss_currentRelation = currentRelation;
 	node->ss_currentScanDesc = currentScanDesc;
 
diff --git a/src/backend/optimizer/path/Makefile b/src/backend/optimizer/path/Makefile
index 6864a62..6e462b1 100644
--- a/src/backend/optimizer/path/Makefile
+++ b/src/backend/optimizer/path/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = allpaths.o clausesel.o costsize.o equivclass.o indxpath.o \
-       joinpath.o joinrels.o pathkeys.o tidpath.o
+       joinpath.o joinrels.o pathkeys.o parallelpath.o tidpath.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index c97355e..3a0583a 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -410,6 +410,9 @@ set_plain_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
 	/* Consider sequential scan */
 	add_path(rel, create_seqscan_path(root, rel, required_outer));
 
+	/* Consider parallel scans */
+	create_parallelscan_paths(root, rel);
+
 	/* Consider index scans */
 	create_index_paths(root, rel);
 
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 659daa2..0296323 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -106,6 +106,8 @@ int			effective_cache_size = DEFAULT_EFFECTIVE_CACHE_SIZE;
 
 Cost		disable_cost = 1.0e10;
 
+int	parallel_seqscan_degree = 0;
+
 bool		enable_seqscan = true;
 bool		enable_indexscan = true;
 bool		enable_indexonlyscan = true;
@@ -219,6 +221,63 @@ cost_seqscan(Path *path, PlannerInfo *root,
 }
 
 /*
+ * cost_parallelseqscan
+ *	  Determines and returns the cost of scanning a relation in parallel.
+ *
+ * 'baserel' is the relation to be scanned
+ * 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
+ */
+void
+cost_parallelseqscan(ParallelSeqPath *path, PlannerInfo *root,
+			 RelOptInfo *baserel, ParamPathInfo *param_info, int nWorkers)
+{
+	Cost		startup_cost = 0;
+	Cost		run_cost = 0;
+	double		spc_seq_page_cost;
+	QualCost	qpqual_cost;
+	Cost		cpu_per_tuple;
+
+	/* Should only be applied to base relations */
+	Assert(baserel->relid > 0);
+	Assert(baserel->rtekind == RTE_RELATION);
+
+	/* Mark the path with the correct row estimate */
+	if (param_info)
+		path->path.rows = param_info->ppi_rows;
+	else
+		path->path.rows = baserel->rows;
+
+	if (!enable_seqscan)
+		startup_cost += disable_cost;
+
+	/* fetch estimated page cost for tablespace containing table */
+	get_tablespace_page_costs(baserel->reltablespace,
+							  NULL,
+							  &spc_seq_page_cost);
+
+	/*
+	 * disk costs
+	 */
+	run_cost += spc_seq_page_cost * baserel->pages;
+
+	/* CPU costs */
+	get_restriction_qual_cost(root, baserel, param_info, &qpqual_cost);
+
+	startup_cost += qpqual_cost.startup;
+	cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple;
+	run_cost += cpu_per_tuple * baserel->tuples;
+
+	/*
+	 * We simply assume that cost will be equally shared by parallel
+	 * workers which might not be true especially for doing disk access.
+	 * XXX - We would like to change these values based on some concrete
+	 * tests.
+	 */
+	path->path.startup_cost = startup_cost / nWorkers;
+	path->path.total_cost = (startup_cost + run_cost) / nWorkers;
+}
+
+/*
  * cost_index
  *	  Determines and returns the cost of scanning a relation using an index.
  *
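
As a worked illustration of the cost model in cost_parallelseqscan above, the sketch below reproduces only its arithmetic outside the planner: disk cost plus CPU cost, with both startup and total cost divided by the number of workers. DemoScanCost and demo_parallel_seqscan_cost are hypothetical names, and the inputs are plain numbers rather than RelOptInfo fields.

```c
#include <assert.h>

/* Hypothetical result type: the two costs the planner stores on a path. */
typedef struct DemoScanCost
{
	double		startup;
	double		total;
} DemoScanCost;

/*
 * Same arithmetic as the patch: run cost is disk cost plus per-tuple CPU
 * cost, and everything is divided evenly by the worker count.
 */
DemoScanCost
demo_parallel_seqscan_cost(double pages, double tuples,
						   double seq_page_cost, double cpu_per_tuple,
						   double qual_startup, int nworkers)
{
	DemoScanCost c;
	double		run_cost;

	assert(nworkers > 0);		/* a zero worker count would divide by zero */

	run_cost = seq_page_cost * pages + cpu_per_tuple * tuples;
	c.startup = qual_startup / nworkers;
	c.total = (qual_startup + run_cost) / nworkers;
	return c;
}
```

For example, with 1000 pages, 100000 tuples, seq_page_cost = 1.0, cpu_tuple_cost = 0.01 and no quals, the run cost is about 2000, so four workers give a total of about 500. Worker startup and dynamic shared memory setup are not part of this figure, which matches the simplification described at the top of the thread.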
diff --git a/src/backend/optimizer/path/parallelpath.c b/src/backend/optimizer/path/parallelpath.c
new file mode 100644
index 0000000..823abbe
--- /dev/null
+++ b/src/backend/optimizer/path/parallelpath.c
@@ -0,0 +1,97 @@
+/*-------------------------------------------------------------------------
+ *
+ * parallelpath.c
+ *	  Routines to determine which conditions are usable for scanning
+ *	  a given relation, and create ParallelPaths accordingly.
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/optimizer/path/parallelpath.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "optimizer/cost.h"
+#include "optimizer/pathnode.h"
+#include "optimizer/paths.h"
+
+
+/*
+ *	IsTargetListContainNonVars -
+ *		Check if target list contains non-Var entries.
+ */
+static bool
+IsTargetListContainNonVars(List *targetlist)
+{
+	ListCell   *l;
+
+	foreach(l, targetlist)
+	{
+		TargetEntry *te = (TargetEntry *) lfirst(l);
+
+		if (!IsA(te, TargetEntry))
+			continue;			/* probably should never happen */
+		if (!IsA(te->expr, Var))
+			return true;
+	}
+	return false;
+}
+
+/*
+ * create_parallelscan_paths
+ *	  Create paths corresponding to parallel scans of the given rel.
+ *	  Currently we only support parallel sequential scan.
+ *
+ *	  Candidate paths are added to the rel's pathlist (using add_path).
+ */
+void
+create_parallelscan_paths(PlannerInfo *root, RelOptInfo *rel)
+{
+	int num_parallel_workers = 0;
+
+	/*
+	 * parallel scan is possible only if user has set
+	 * parallel_seqscan_degree to value greater than 0.
+	 */
+	if (parallel_seqscan_degree <= 0)
+		return;
+	/*
+	 * parallel scan is not supported for joins or queries containing quals.
+	 *
+	 * XXX - There is no reason not to support quals, so they should be
+	 * supported in future versions of this patch.
+	 */
+	if (root->simple_rel_array_size > 2 || rel->baserestrictinfo != NULL)
+		return;
+
+	/* parallel scan is supported only for Select statements. */
+	if (root->parse->commandType != CMD_SELECT)
+		return;
+
+	/*
+	 * parallel scan is not supported for non-var target list.
+	 *
+	 * XXX - This is to keep the implementation simple, we can do this
+	 * in future.  Here we are checking by passing root->parse->targetList
+	 * instead of rel->reltargetlist because rel->reltargetlist always
+	 * contains Vars (refer build_base_rel_tlists).
+	 */
+	if (IsTargetListContainNonVars(root->parse->targetList))
+		return;
+
+	/* Need at least one page per worker; an empty rel gets no parallel path. */
+	if (rel->pages == 0)
+		return;
+	if (parallel_seqscan_degree <= rel->pages)
+		num_parallel_workers = parallel_seqscan_degree;
+	else
+		num_parallel_workers = rel->pages;
+
+	add_path(rel, (Path *) create_parallelseqscan_path(root, rel,
+													   num_parallel_workers));
+}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index bf8dbe0..6b54e1b 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -57,6 +57,9 @@ static Material *create_material_plan(PlannerInfo *root, MaterialPath *best_path
 static Plan *create_unique_plan(PlannerInfo *root, UniquePath *best_path);
 static SeqScan *create_seqscan_plan(PlannerInfo *root, Path *best_path,
 					List *tlist, List *scan_clauses);
+static Scan *create_parallelseqscan_plan(PlannerInfo *root,
+										 ParallelSeqPath *best_path,
+										 List *tlist, List *scan_clauses);
 static Scan *create_indexscan_plan(PlannerInfo *root, IndexPath *best_path,
 					  List *tlist, List *scan_clauses, bool indexonly);
 static BitmapHeapScan *create_bitmap_scan_plan(PlannerInfo *root,
@@ -99,6 +102,9 @@ static List *order_qual_clauses(PlannerInfo *root, List *clauses);
 static void copy_path_costsize(Plan *dest, Path *src);
 static void copy_plan_costsize(Plan *dest, Plan *src);
 static SeqScan *make_seqscan(List *qptlist, List *qpqual, Index scanrelid);
+static ParallelSeqScan *make_parallelseqscan(List *qptlist, List *qpqual,
+											 Index scanrelid, int nworkers,
+											 BlockNumber nblocksperworker);
 static IndexScan *make_indexscan(List *qptlist, List *qpqual, Index scanrelid,
 			   Oid indexid, List *indexqual, List *indexqualorig,
 			   List *indexorderby, List *indexorderbyorig,
@@ -227,6 +233,7 @@ create_plan_recurse(PlannerInfo *root, Path *best_path)
 	switch (best_path->pathtype)
 	{
 		case T_SeqScan:
+		case T_ParallelSeqScan:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -342,6 +349,13 @@ create_scan_plan(PlannerInfo *root, Path *best_path)
 												scan_clauses);
 			break;
 
+		case T_ParallelSeqScan:
+			plan = (Plan *) create_parallelseqscan_plan(root,
+														(ParallelSeqPath *) best_path,
+														tlist,
+														scan_clauses);
+			break;
+
 		case T_IndexScan:
 			plan = (Plan *) create_indexscan_plan(root,
 												  (IndexPath *) best_path,
@@ -1132,6 +1146,71 @@ create_seqscan_plan(PlannerInfo *root, Path *best_path,
 }
 
 /*
+ * create_worker_seqscan_plan
+ *	 Returns a seqscan plan for the base relation scanned by worker
+ *	 with restriction clauses 'scan_clauses' and targetlist 'tlist'.
+ */
+SeqScan *
+create_worker_seqscan_plan(List *targetList, List *scan_clauses,
+						   BlockNumber startBlock, BlockNumber endBlock)
+{
+	SeqScan    *scan_plan;
+
+	/*
+	 * Pass scan_relid as 1, this is okay for now as sequence scan worker
+	 * is allowed to operate on just one relation.
+	 * XXX - we should ideally get scanrelid from master backend.
+	 */
+	scan_plan = make_seqscan(targetList,
+							 scan_clauses,
+							 1);
+
+	scan_plan->startblock = startBlock;
+	scan_plan->endblock = endBlock;
+	return scan_plan;
+}
+
+/*
+ * create_parallelseqscan_plan
+ *	 Returns a seqscan plan for the base relation scanned by 'best_path'
+ *	 with restriction clauses 'scan_clauses' and targetlist 'tlist'.
+ */
+static Scan *
+create_parallelseqscan_plan(PlannerInfo *root, ParallelSeqPath *best_path,
+					List *tlist, List *scan_clauses)
+{
+	Scan    *scan_plan;
+	Index		scan_relid = best_path->path.parent->relid;
+
+	/* it should be a base rel... */
+	Assert(scan_relid > 0);
+	Assert(best_path->path.parent->rtekind == RTE_RELATION);
+
+	/* Sort clauses into best execution order */
+	scan_clauses = order_qual_clauses(root, scan_clauses);
+
+	/* Reduce RestrictInfo list to bare expressions; ignore pseudoconstants */
+	scan_clauses = extract_actual_clauses(scan_clauses, false);
+
+	/* Replace any outer-relation variables with nestloop params */
+	if (best_path->path.param_info)
+	{
+		scan_clauses = (List *)
+			replace_nestloop_params(root, (Node *) scan_clauses);
+	}
+
+	scan_plan = (Scan *) make_parallelseqscan(tlist,
+											  scan_clauses,
+											  scan_relid,
+											  best_path->num_workers,
+											  best_path->num_blocks_per_worker);
+
+	copy_path_costsize(&scan_plan->plan, &best_path->path);
+
+	return scan_plan;
+}
+
+/*
  * create_indexscan_plan
  *	  Returns an indexscan plan for the base relation scanned by 'best_path'
  *	  with restriction clauses 'scan_clauses' and targetlist 'tlist'.
@@ -3313,6 +3392,30 @@ make_seqscan(List *qptlist,
 	plan->lefttree = NULL;
 	plan->righttree = NULL;
 	node->scanrelid = scanrelid;
+	node->startblock = InvalidBlockNumber;
+	node->endblock = InvalidBlockNumber;
+
+	return node;
+}
+
+static ParallelSeqScan *
+make_parallelseqscan(List *qptlist,
+			   List *qpqual,
+			   Index scanrelid,
+			   int nworkers,
+			   BlockNumber nblocksperworker)
+{
+	ParallelSeqScan *node = makeNode(ParallelSeqScan);
+	Plan	   *plan = &node->scan.plan;
+
+	/* cost should be inserted by caller */
+	plan->targetlist = qptlist;
+	plan->qual = qpqual;
+	plan->lefttree = NULL;
+	plan->righttree = NULL;
+	node->scan.scanrelid = scanrelid;
+	node->num_workers = nworkers;
+	node->num_blocks_per_worker = nblocksperworker;
 
 	return node;
 }
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index fb74d6b..49359e3 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -260,6 +260,55 @@ standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
 	return result;
 }
 
+/*
+ * create_worker_seqscan_plannedstmt
+ *	Returns a planned statement to be used by worker for execution.
+ *	Ideally, master backend should form worker's planned statement
+ *	and pass the same to worker, however for now master backend
+ *	just passes the required information and PlannedStmt is then
+ *	constructed by worker.
+ */
+PlannedStmt	*
+create_worker_seqscan_plannedstmt(worker_stmt *workerstmt)
+{
+	AclMode		required_access = ACL_SELECT;
+	RangeTblEntry *rte;
+	SeqScan    *scan_plan;
+	PlannedStmt	*result;
+
+	rte = makeNode(RangeTblEntry);
+	rte->rtekind = RTE_RELATION;
+	rte->relid = workerstmt->relId;
+	rte->relkind = 'r';
+	rte->requiredPerms = required_access;
+
+	scan_plan = create_worker_seqscan_plan(workerstmt->targetList, NIL,
+										   workerstmt->startBlock,
+										   workerstmt->endBlock);
+
+	/* build the PlannedStmt result */
+	result = makeNode(PlannedStmt);
+
+	result->commandType = CMD_SELECT;
+	result->queryId = 0;
+	result->hasReturning = 0;
+	result->hasModifyingCTE = 0;
+	result->canSetTag = 1;
+	result->transientPlan = 0;
+	result->planTree = (Plan*) scan_plan;
+	result->rtable = list_make1(rte);
+	result->resultRelations = NIL;
+	result->utilityStmt = NULL;
+	result->subplans = NIL;
+	result->rewindPlanIDs = NULL;
+	result->rowMarks = NIL;
+	result->relationOids = lappend_oid(result->relationOids, rte->relid);
+	result->invalItems = NIL;
+	result->nParamExec = 0;
+	result->hasRowSecurity = false;
+
+	return result;
+}
 
 /*--------------------
  * subquery_planner
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index e630d0b..220b92b 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -436,6 +436,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_ParallelSeqScan:
 			{
 				SeqScan    *splan = (SeqScan *) plan;
 
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 319e8b2..ce3df40 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -706,6 +706,37 @@ create_seqscan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer)
 }
 
 /*
+ * create_parallelseqscan_path
+ *	  Creates a path corresponding to a parallel sequential scan, returning the
+ *	  pathnode.
+ */
+ParallelSeqPath *
+create_parallelseqscan_path(PlannerInfo *root, RelOptInfo *rel, int nWorkers)
+{
+	ParallelSeqPath	   *pathnode = makeNode(ParallelSeqPath);
+
+	pathnode->path.pathtype = T_ParallelSeqScan;
+	pathnode->path.parent = rel;
+	pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
+													 false);
+	pathnode->path.pathkeys = NIL;	/* seqscan has unordered result */
+
+	pathnode->num_workers = nWorkers;
+	/*
+	 * Divide the blocks equally among the workers.  When the division
+	 * is not exact (for example, 10 blocks and 3 workers gives 3 blocks
+	 * per worker by the calculation below), the last worker is also
+	 * responsible for scanning the remaining blocks (refer
+	 * exec_worker_message).
+	 */
+	pathnode->num_blocks_per_worker = rel->pages / nWorkers;
+
+	cost_parallelseqscan(pathnode, root, rel, pathnode->path.param_info, nWorkers);
+
+	return pathnode;
+}
+
+/*
  * create_index_path
  *	  Creates a path node for an index scan.
  *
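
The block division described in create_parallelseqscan_path can be sketched as below. This is an illustration under assumptions, not the worker code: the worker-side computation lives in exec_worker_message, which is not fully shown here, so the sketch assumes worker numbers start at 1 and that a range is half-open (the end block is excluded), which is consistent with InitScanRelation passing endblock - startblock as the block count. DemoRange, demo_worker_count and demo_worker_range are hypothetical names.

```c
#include <assert.h>

/* Hypothetical half-open block range [start, end). */
typedef struct DemoRange
{
	unsigned	start;
	unsigned	end;
} DemoRange;

/* At most one worker per page; zero means no parallel scan is possible. */
int
demo_worker_count(int degree, unsigned pages)
{
	if (degree <= 0 || pages == 0)
		return 0;
	return ((unsigned) degree <= pages) ? degree : (int) pages;
}

/*
 * Range for one worker: every worker gets total / nworkers blocks, and the
 * last worker additionally takes whatever the integer division left over.
 */
DemoRange
demo_worker_range(unsigned total_blocks, int nworkers, int worker_number)
{
	DemoRange	r;
	unsigned	per_worker;

	assert(nworkers > 0);
	assert(worker_number >= 1 && worker_number <= nworkers);

	per_worker = total_blocks / (unsigned) nworkers;
	r.start = (unsigned) (worker_number - 1) * per_worker;
	r.end = (worker_number == nworkers) ? total_blocks : r.start + per_worker;
	return r;
}
```

With 10 blocks and 3 workers this gives [0, 3), [3, 6) and [6, 10), so the last worker scans four blocks while the others scan three.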
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 71c2321..f056bd5 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -12,7 +12,8 @@ subdir = src/backend/postmaster
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = autovacuum.o bgworker.o bgwriter.o checkpointer.o fork_process.o \
-	pgarch.o pgstat.o postmaster.o startup.o syslogger.o walwriter.o
+OBJS = autovacuum.o backendworker.o bgworker.o bgwriter.o checkpointer.o \
+	fork_process.o pgarch.o pgstat.o postmaster.o startup.o syslogger.o \
+	walwriter.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/backendworker.c b/src/backend/postmaster/backendworker.c
new file mode 100644
index 0000000..3b796dd
--- /dev/null
+++ b/src/backend/postmaster/backendworker.c
@@ -0,0 +1,579 @@
+/*-------------------------------------------------------------------------
+ *
+ * backendworker.c
+ *	  Support routines for setting up backend workers.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/postmaster/backendworker.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * INTERFACE ROUTINES
+ *		InitiateWorkers				Setup dynamic shared memory and parallel backend workers.
+ */
+#include "postgres.h"
+
+#include "commands/dbcommands.h"
+#include "commands/async.h"
+#include "executor/nodeParallelSeqscan.h"
+#include "miscadmin.h"
+#include "nodes/parsenodes.h"
+#include "postmaster/backendworker.h"
+#include "storage/ipc.h"
+#include "storage/procsignal.h"
+#include "storage/procarray.h"
+#include "storage/shm_toc.h"
+#include "storage/spin.h"
+#include "tcop/tcopprot.h"
+#include "utils/builtins.h"
+#include "utils/memutils.h"
+#include "utils/resowner.h"
+
+
+#define SHM_PARALLEL_SCAN_QUEUE_SIZE					65536
+
+/*
+ * This structure is stored in the dynamic shared memory segment.  We use
+ * it to determine whether all workers started up OK and successfully
+ * attached to their respective shared message queues.
+ */
+typedef struct
+{
+	slock_t		mutex;
+	int			workers_total;
+	int			workers_attached;
+	int			workers_ready;
+} shm_mq_header;
+
+/* Fixed-size data passed via our dynamic shared memory segment. */
+typedef struct worker_fixed_data
+{
+	Oid	database_id;
+	Oid	authenticated_user_id;
+	Oid	current_user_id;
+	int	sec_context;
+	NameData	database;
+	NameData	authenticated_user;
+} worker_fixed_data;
+
+/* Private state maintained by the launching backend for IPC. */
+typedef struct worker_info
+{
+	pid_t		pid;
+	Oid			current_user_id;
+	dsm_segment *seg;
+	BackgroundWorkerHandle *handle;
+	shm_mq_handle *responseq;
+	bool		consumed;
+} worker_info;
+
+typedef struct
+{
+	int			nworkers;
+	BackgroundWorkerHandle *handle[FLEXIBLE_ARRAY_MEMBER];
+} worker_state;
+
+
+/* Table-of-contents constants for our dynamic shared memory segment. */
+#define PG_WORKER_MAGIC				0x50674267
+#define PG_WORKER_KEY_HDR_DATA		0
+#define PG_WORKER_KEY_FIXED_DATA	1
+#define PG_WORKER_KEY_RELID			2
+#define PG_WORKER_KEY_TARGETLIST	3
+#define PG_WORKER_KEY_BLOCKS		4
+#define PG_WORKER_FIXED_NKEYS		5
+
+void
+exec_worker_message(Datum) __attribute__((noreturn));
+
+static void
+setup_dynamic_shared_memory(Oid relId, List *targetList,
+							shm_mq_handle ***responseq,
+							dsm_segment **segp, shm_mq_header **hdrp,
+							BlockNumber numBlocksPerWorker, int nWorkers);
+static worker_state *setup_backend_workers(dsm_segment *seg, int nworkers);
+static void cleanup_background_workers(dsm_segment *seg, Datum arg);
+static void
+wait_for_workers_to_become_ready(worker_state *wstate,
+								 volatile shm_mq_header *hdr);
+static bool check_worker_status(worker_state *wstate);
+static void bkworker_sigterm_handler(SIGNAL_ARGS);
+
+
+/*
+ * InitiateWorkers
+ *		It sets up the required infrastructure for backend workers to
+ *	perform execution and return results to the main backend.
+ */
+void
+InitiateWorkers(Oid relId, List *targetList, shm_mq_handle ***responseqp,
+				dsm_segment **segp, BlockNumber numBlocksPerWorker,
+				int nWorkers)
+{
+	shm_mq_header *hdr;
+	worker_state *wstate;
+	int			i;
+
+	/* Create dynamic shared memory and table of contents. */
+	setup_dynamic_shared_memory(relId, targetList, responseqp,
+								segp, &hdr, numBlocksPerWorker, nWorkers);
+
+	/* Register backend workers. */
+	wstate = setup_backend_workers(*segp, nWorkers);
+
+	for (i = 0; i < nWorkers; ++i)
+		shm_mq_set_handle((*responseqp)[i], wstate->handle[i]);
+
+	/* Wait for workers to become ready. */
+	wait_for_workers_to_become_ready(wstate, hdr);
+}
+
+/*
+ * Set up a dynamic shared memory segment.
+ *
+ * We set up a small control region that contains only a shm_mq_header,
+ * plus one region per message queue.  There are as many message queues as
+ * the number of workers.
+ */
+static void
+setup_dynamic_shared_memory(Oid relId, List *targetList,
+							shm_mq_handle ***responseqp,
+							dsm_segment **segp, shm_mq_header **hdrp,
+							BlockNumber numBlocksPerWorker, int nWorkers)
+{
+	Size		segsize, relid_len, targetlist_len;
+	dsm_segment *seg;
+	shm_toc_estimator e;
+	shm_toc    *toc;
+	worker_fixed_data *fdata;
+	char	   *relidp;
+	char	   *targetlistdata;
+	int		   i;
+	shm_mq	   *mq;
+	shm_mq_header *hdr;
+	BlockNumber	*num_blocks_per_worker;
+
+	/* Allocate memory for shared memory queue handles. */
+	*responseqp = (shm_mq_handle**) palloc(nWorkers * sizeof(shm_mq_handle*));
+
+	/* Create dynamic shared memory and table of contents. */
+	shm_toc_initialize_estimator(&e);
+
+	shm_toc_estimate_chunk(&e, sizeof(shm_mq_header));
+
+	shm_toc_estimate_chunk(&e, sizeof(worker_fixed_data));
+
+	relid_len = EstimateScanRelationIdSpace(relId);
+	shm_toc_estimate_chunk(&e, relid_len);
+
+	targetlist_len = EstimateTargetListSpace(targetList);
+	shm_toc_estimate_chunk(&e, targetlist_len);
+
+	shm_toc_estimate_chunk(&e, sizeof(BlockNumber));
+
+	for (i = 0; i < nWorkers; ++i)
+		 shm_toc_estimate_chunk(&e, (Size) SHM_PARALLEL_SCAN_QUEUE_SIZE);
+
+	shm_toc_estimate_keys(&e, PG_WORKER_FIXED_NKEYS + nWorkers);
+
+	segsize = shm_toc_estimate(&e);
+
+	seg = dsm_create(segsize);
+	toc = shm_toc_create(PG_WORKER_MAGIC, dsm_segment_address(seg),
+						 segsize);
+
+	/* Set up the header region. */
+	hdr = shm_toc_allocate(toc, sizeof(shm_mq_header));
+	SpinLockInit(&hdr->mutex);
+	hdr->workers_total = nWorkers;
+	hdr->workers_attached = 0;
+	hdr->workers_ready = 0;
+	shm_toc_insert(toc, PG_WORKER_KEY_HDR_DATA, hdr);
+
+	/* Store fixed-size data in dynamic shared memory. */
+	fdata = shm_toc_allocate(toc, sizeof(worker_fixed_data));
+	fdata->database_id = MyDatabaseId;
+	fdata->authenticated_user_id = GetAuthenticatedUserId();
+	GetUserIdAndSecContext(&fdata->current_user_id, &fdata->sec_context);
+	namestrcpy(&fdata->database, get_database_name(MyDatabaseId));
+	namestrcpy(&fdata->authenticated_user,
+			   GetUserNameFromId(fdata->authenticated_user_id));
+	shm_toc_insert(toc, PG_WORKER_KEY_FIXED_DATA, fdata);
+
+	/* Store scan relation id in dynamic shared memory. */
+	relidp = shm_toc_allocate(toc, relid_len);
+	SerializeScanRelationId(relId, relid_len, relidp);
+	shm_toc_insert(toc, PG_WORKER_KEY_RELID, relidp);
+
+	/* Store target list in dynamic shared memory. */
+	targetlistdata = shm_toc_allocate(toc, targetlist_len);
+	SerializeTargetList(targetList, targetlist_len, targetlistdata);
+	shm_toc_insert(toc, PG_WORKER_KEY_TARGETLIST, targetlistdata);
+
+	/* Store blocks to be scanned by each worker in dynamic shared memory. */
+	num_blocks_per_worker = shm_toc_allocate(toc, sizeof(BlockNumber));
+	*num_blocks_per_worker = numBlocksPerWorker;
+	shm_toc_insert(toc, PG_WORKER_KEY_BLOCKS, num_blocks_per_worker);
+
+	/* Establish one message queue per worker in dynamic shared memory. */
+	for (i = 1; i <= nWorkers; ++i)
+	{
+		mq = shm_mq_create(shm_toc_allocate(toc, (Size) SHM_PARALLEL_SCAN_QUEUE_SIZE),
+						   (Size) SHM_PARALLEL_SCAN_QUEUE_SIZE);
+		shm_toc_insert(toc, PG_WORKER_FIXED_NKEYS + i, mq);
+		shm_mq_set_receiver(mq, MyProc);
+
+		/*
+		 * Attach the queue before launching a worker, so that we'll automatically
+		 * detach the queue if we error out.  (Otherwise, the worker might sit
+		 * there trying to write the queue long after we've gone away.)
+		 */
+		(*responseqp)[i-1] = shm_mq_attach(mq, seg, NULL);
+	}
+
+	/* Return results to caller. */
+	*segp = seg;
+	*hdrp = hdr;
+}
+
+/*
+ * Register backend workers.
+ */
+static worker_state *
+setup_backend_workers(dsm_segment *seg, int nWorkers)
+{
+	MemoryContext oldcontext;
+	BackgroundWorker worker;
+	worker_state *wstate;
+	int			i;
+
+	/*
+	 * We need the worker_state object and the background worker handles to
+	 * which it points to be allocated in CurTransactionContext rather than
+	 * ExprContext; otherwise, they'll be destroyed before the on_dsm_detach
+	 * hooks run.
+	 */
+	oldcontext = MemoryContextSwitchTo(CurTransactionContext);
+
+	/* Create worker state object. */
+	wstate = MemoryContextAlloc(TopTransactionContext,
+								offsetof(worker_state, handle) +
+								sizeof(BackgroundWorkerHandle *) * nWorkers);
+	wstate->nworkers = 0;
+
+	/*
+	 * Arrange to kill all the workers if we abort before or after all workers
+	 * are finished hooking themselves up to the dynamic shared memory segment.
+	 *
+	 * XXX - For killing workers, we need to have mechanism with which it can be
+	 * done before aborting the transaction.
+	 */
+
+	on_dsm_detach(seg, cleanup_background_workers,
+				  PointerGetDatum(wstate));
+
+	/* Configure a worker. */
+	worker.bgw_flags = 
+		BGWORKER_SHMEM_ACCESS | BGWORKER_BACKEND_DATABASE_CONNECTION;
+	worker.bgw_start_time = BgWorkerStart_ConsistentState;
+	worker.bgw_restart_time = BGW_NEVER_RESTART;
+	worker.bgw_main = exec_worker_message;
+	snprintf(worker.bgw_name, BGW_MAXLEN, "backend_worker");
+	worker.bgw_main_arg = UInt32GetDatum(dsm_segment_handle(seg));
+	/* set bgw_notify_pid, so we can detect if the worker stops */
+	worker.bgw_notify_pid = MyProcPid;
+
+	/* Register the workers. */
+	for (i = 0; i < nWorkers; ++i)
+	{
+		if (!RegisterDynamicBackgroundWorker(&worker, &wstate->handle[i]))
+			ereport(ERROR,
+					(errcode(ERRCODE_INSUFFICIENT_RESOURCES),
+					 errmsg("could not register background process"),
+				 errhint("You may need to increase max_worker_processes.")));
+		++wstate->nworkers;
+	}
+
+	/* All done. */
+	MemoryContextSwitchTo(oldcontext);
+	return wstate;
+}
+
+static void
+wait_for_workers_to_become_ready(worker_state *wstate,
+								 volatile shm_mq_header *hdr)
+{
+	bool		save_set_latch_on_sigusr1;
+	bool		result = false;
+
+	save_set_latch_on_sigusr1 = set_latch_on_sigusr1;
+	set_latch_on_sigusr1 = true;
+
+	PG_TRY();
+	{
+		for (;;)
+		{
+			int			workers_ready;
+
+			/* If all the workers are ready, we have succeeded. */
+			SpinLockAcquire(&hdr->mutex);
+			workers_ready = hdr->workers_ready;
+			SpinLockRelease(&hdr->mutex);
+			if (workers_ready >= wstate->nworkers)
+			{
+				result = true;
+				break;
+			}
+
+			/* If any workers (or the postmaster) have died, we have failed. */
+			if (!check_worker_status(wstate))
+			{
+				result = false;
+				break;
+			}
+
+			/* Wait to be signalled. */
+			WaitLatch(&MyProc->procLatch, WL_LATCH_SET, 0);
+
+			/* An interrupt may have occurred while we were waiting. */
+			CHECK_FOR_INTERRUPTS();
+
+			/* Reset the latch so we don't spin. */
+			ResetLatch(&MyProc->procLatch);
+		}
+	}
+	PG_CATCH();
+	{
+		set_latch_on_sigusr1 = save_set_latch_on_sigusr1;
+		PG_RE_THROW();
+	}
+	PG_END_TRY();
+
+	if (!result)
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_RESOURCES),
+				 errmsg("one or more background workers failed to start")));
+}
+
+static bool
+check_worker_status(worker_state *wstate)
+{
+	int			n;
+
+	/* If any workers (or the postmaster) have died, we have failed. */
+	for (n = 0; n < wstate->nworkers; ++n)
+	{
+		BgwHandleStatus status;
+		pid_t		pid;
+
+		status = GetBackgroundWorkerPid(wstate->handle[n], &pid);
+		/*if (status == BGWH_STOPPED || status == BGWH_POSTMASTER_DIED)*/
+		/*
+		 * XXX - Do we need to consider BGWH_STOPPED status?  If we directly
+		 * return false for BGWH_STOPPED, it is quite possible that the worker
+		 * has exited after completing its work, in which case the caller won't
+		 * wait for the other workers' status and the main backend will raise
+		 * an error even though everything is normal.
+		 */
+		if (status == BGWH_POSTMASTER_DIED)
+			return false;
+	}
+
+	/* Otherwise, things still look OK. */
+	return true;
+}
+
+static void
+cleanup_background_workers(dsm_segment *seg, Datum arg)
+{
+	worker_state *wstate = (worker_state *) arg;
+
+	while (wstate->nworkers > 0)
+	{
+		--wstate->nworkers;
+		TerminateBackgroundWorker(wstate->handle[wstate->nworkers]);
+	}
+}
+
+
+/*
+ * exec_worker_message
+ *
+ * Main entry point for a backend worker (see bgw_main in setup_backend_workers)
+ */
+void
+exec_worker_message(Datum main_arg)
+{
+	dsm_segment *seg;
+	shm_toc     *toc;
+	worker_fixed_data *fdata;
+	char	    *relidp;
+	char	    *targetlistdata;
+	BlockNumber *num_blocks_per_worker;
+	BlockNumber  start_block;
+	BlockNumber  end_block;
+	shm_mq	    *mq;
+	shm_mq_handle *responseq;
+	int			myworkernumber;
+	volatile shm_mq_header *hdr;
+	Oid			relId;
+	List		*targetList = NIL;
+	PGPROC	    *registrant;
+	worker_stmt	*workerstmt;
+
+	/* Establish signal handlers. */
+	pqsignal(SIGTERM, bkworker_sigterm_handler);
+	BackgroundWorkerUnblockSignals();
+
+	/* Set up a memory context and resource owner. */
+	Assert(CurrentResourceOwner == NULL);
+	CurrentResourceOwner = ResourceOwnerCreate(NULL, "backend_worker");
+	CurrentMemoryContext = AllocSetContextCreate(TopMemoryContext,
+												 "backend worker",
+												 ALLOCSET_DEFAULT_MINSIZE,
+												 ALLOCSET_DEFAULT_INITSIZE,
+												 ALLOCSET_DEFAULT_MAXSIZE);
+	/*while(1)
+	{
+	}*/
+
+	/* Connect to the dynamic shared memory segment. */
+	seg = dsm_attach(DatumGetInt32(main_arg));
+	if (seg == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("unable to map dynamic shared memory segment")));
+	toc = shm_toc_attach(PG_WORKER_MAGIC, dsm_segment_address(seg));
+	if (toc == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+			   errmsg("bad magic number in dynamic shared memory segment")));
+
+	/* Find data structures in dynamic shared memory. */
+	hdr = shm_toc_lookup(toc, PG_WORKER_KEY_HDR_DATA);
+	fdata = shm_toc_lookup(toc, PG_WORKER_KEY_FIXED_DATA);
+	relidp = shm_toc_lookup(toc, PG_WORKER_KEY_RELID);
+	targetlistdata = shm_toc_lookup(toc, PG_WORKER_KEY_TARGETLIST);
+	num_blocks_per_worker = shm_toc_lookup(toc, PG_WORKER_KEY_BLOCKS);
+
+	/*
+	 * Acquire a worker number.
+	 *
+	 * Our worker number gives our identity: there may be just one
+	 * worker involved in this parallel operation, or there may be many.
+	 */
+	SpinLockAcquire(&hdr->mutex);
+	myworkernumber = ++hdr->workers_attached;
+	SpinLockRelease(&hdr->mutex);
+	if (myworkernumber > hdr->workers_total)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("too many backend workers already attached")));
+
+	mq = shm_toc_lookup(toc, PG_WORKER_FIXED_NKEYS + myworkernumber);
+	shm_mq_set_sender(mq, MyProc);
+	responseq = shm_mq_attach(mq, seg, NULL);
+
+	end_block = myworkernumber * (*num_blocks_per_worker);
+	start_block = end_block - (*num_blocks_per_worker);
+
+	/*
+	 * Indicate that we're fully initialized and ready to begin the main part
+	 * of the parallel operation.
+	 *
+	 * Once we signal that we're ready, the user backend is entitled to assume
+	 * that our on_dsm_detach callbacks will fire before we disconnect from
+	 * the shared memory segment and exit.  Generally, that means we must have
+	 * attached to all relevant dynamic shared memory data structures by now.
+	 */
+	SpinLockAcquire(&hdr->mutex);
+	++hdr->workers_ready;
+	SpinLockRelease(&hdr->mutex);
+	registrant = BackendPidGetProc(MyBgworkerEntry->bgw_notify_pid);
+	if (registrant == NULL)
+	{
+		elog(DEBUG1, "registrant backend has exited prematurely");
+		proc_exit(1);
+	}
+	SetLatch(&registrant->procLatch);
+
+
+	/* Redirect protocol messages to responseq. */
+	pq_redirect_to_shm_mq(mq, responseq);
+
+	/*
+	 * Initialize our user and database ID based on the string version of
+	 * the data, and then go back and check that we actually got the database
+	 * and user ID that we intended to get.  We do this because it's not
+	 * impossible for the process that started us to die before we get here,
+	 * and the user or database could be renamed in the meantime.  We don't
+	 * want to latch on the wrong object by accident.  There should probably
+	 * be a variant of BackgroundWorkerInitializeConnection that accepts OIDs
+	 * rather than strings.
+	 */
+	BackgroundWorkerInitializeConnection(NameStr(fdata->database),
+										 NameStr(fdata->authenticated_user));
+	if (fdata->database_id != MyDatabaseId ||
+		fdata->authenticated_user_id != GetAuthenticatedUserId())
+		ereport(ERROR,
+				(errmsg("user or database renamed during backend worker startup")));
+
+	/* Restore RelationId and TargetList from main backend. */
+	RestoreScanRelationId(&relId, relidp);
+	RestoreTargetList(&targetList, targetlistdata);
+
+	/* Handle local_preload_libraries and session_preload_libraries. */
+	process_session_preload_libraries();
+
+	/* Restore user ID and security context. */
+	SetUserIdAndSecContext(fdata->current_user_id, fdata->sec_context);
+
+	workerstmt = palloc(sizeof(worker_stmt));
+
+	workerstmt->relId = relId;
+	workerstmt->targetList = targetList;
+	workerstmt->startBlock = start_block;
+
+	/* last worker should scan all the remaining blocks. */
+	if (myworkernumber == hdr->workers_total)
+		workerstmt->endBlock = InvalidBlockNumber;
+	else
+		workerstmt->endBlock = end_block;
+
+	/* Execute the worker command. */
+	exec_worker_stmt(workerstmt);
+
+	ProcessCompletedNotifies();
+
+	/* Signal that we are done. */
+	ReadyForQuery(DestRemote);
+
+	proc_exit(1);
+}
+
+/*
+ * When we receive a SIGTERM, we set InterruptPending and ProcDiePending just
+ * like a normal backend.  The next CHECK_FOR_INTERRUPTS() will do the right
+ * thing.
+ */
+static void
+bkworker_sigterm_handler(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	if (MyProc)
+		SetLatch(&MyProc->procLatch);
+
+	if (!proc_exit_inprogress)
+	{
+		InterruptPending = true;
+		ProcDiePending = true;
+	}
+
+	errno = save_errno;
+}
\ No newline at end of file
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 6220a8e..11db15e 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -99,6 +99,7 @@
 #include "miscadmin.h"
 #include "pg_getopt.h"
 #include "pgstat.h"
+#include "optimizer/cost.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/bgworker_internals.h"
 #include "postmaster/fork_process.h"
@@ -830,6 +831,12 @@ PostmasterMain(int argc, char *argv[])
 		ereport(ERROR,
 				(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"archive\", \"hot_standby\", or \"logical\"")));
 
+	if (parallel_seqscan_degree >= MaxConnections)
+	{
+		write_stderr("%s: parallel_seqscan_degree must be less than max_connections\n", progname);
+		ExitPostmaster(1);
+	}
+
 	/*
 	 * Other one-time internal sanity checks can go here, if they are fast.
 	 * (Put any slow processing further down, after postmaster.pid creation.)
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index cc62b2c..7de5e0e 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -55,6 +55,7 @@
 #include "pg_getopt.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/postmaster.h"
+#include "postmaster/backendworker.h"
 #include "replication/slot.h"
 #include "replication/walsender.h"
 #include "rewrite/rewriteHandler.h"
@@ -1132,6 +1133,105 @@ exec_simple_query(const char *query_string)
 }
 
 /*
+ * exec_worker_stmt
+ *
+ * Execute the plan for backend worker.
+ */
+void
+exec_worker_stmt(worker_stmt *workerstmt)
+{
+	Portal		portal;
+	int16		format = 1;
+	DestReceiver *receiver;
+	bool		isTopLevel = true;
+	PlannedStmt	*planned_stmt;
+	MemoryContext oldcontext;
+	MemoryContext	plancontext;
+
+	set_ps_display("SELECT", false);
+	BeginCommand("SELECT", DestNone);
+
+	/* Make sure we are in a transaction command */
+	start_xact_command();
+
+	/*
+	 * Unlike exec_simple_query(), in backend worker we won't allow
+	 * transaction control statements, so we can allow plancontext
+	 * to be created in TopTransaction context.
+	 */
+	plancontext = AllocSetContextCreate(CurrentMemoryContext,
+										 "worker plan",
+										 ALLOCSET_DEFAULT_MINSIZE,
+										 ALLOCSET_DEFAULT_INITSIZE,
+										 ALLOCSET_DEFAULT_MAXSIZE);
+
+	oldcontext = MemoryContextSwitchTo(plancontext);
+
+	planned_stmt = create_worker_seqscan_plannedstmt(workerstmt);
+	/*
+	 * Create unnamed portal to run the query or queries in. If there
+	 * already is one, silently drop it.
+	 */
+	portal = CreatePortal("", true, true);
+	/* Don't display the portal in pg_cursors */
+	portal->visible = false;
+
+	/*
+	 * We don't have to copy anything into the portal, because everything
+	 * we are passing here is in MessageContext, which will outlive the
+	 * portal anyway.
+	 */
+	PortalDefineQuery(portal,
+					  NULL,
+					  "",
+					  "",
+					  list_make1(planned_stmt),
+					  NULL);
+
+	/*
+	 * Start the portal.  No parameters here.
+	 */
+	PortalStart(portal, NULL, 0, InvalidSnapshot);
+
+	/* We always use binary format, for efficiency. */
+	PortalSetResultFormat(portal, 1, &format);
+
+	receiver = CreateDestReceiver(DestRemote);
+	SetRemoteDestReceiverParams(receiver, portal);
+
+	/*
+	 * Only once the portal and destreceiver have been established can
+	 * we return to the transaction context.  All that stuff needs to
+	 * survive an internal commit inside PortalRun!
+	 */
+	MemoryContextSwitchTo(oldcontext);
+
+	/*
+	 * Run the portal to completion, and then drop it (and the receiver).
+	 */
+	(void) PortalRun(portal,
+					 FETCH_ALL,
+					 isTopLevel,
+					 receiver,
+					 receiver,
+					 NULL);
+
+	(*receiver->rDestroy) (receiver);
+
+	PortalDrop(portal, false);
+
+	finish_xact_command();
+
+	/*
+	 * Send appropriate CommandComplete to client.  There is no
+	 * need to send a completion tag from the worker, since only
+	 * the completion tag of the master backend is sent to the
+	 * client.
+	 */
+	EndCommand("", DestRemote);
+}
+
+/*
  * exec_parse_message
  *
  * Execute a "Parse" protocol message.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 23cbe90..69de3b8 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -619,6 +619,8 @@ const char *const config_group_names[] =
 	gettext_noop("Statistics / Query and Index Statistics Collector"),
 	/* AUTOVACUUM */
 	gettext_noop("Autovacuum"),
+	/* PARALLEL_QUERY */
+	gettext_noop("Parallel Query"),
 	/* CLIENT_CONN */
 	gettext_noop("Client Connection Defaults"),
 	/* CLIENT_CONN_STATEMENT */
@@ -2425,6 +2427,16 @@ static struct config_int ConfigureNamesInt[] =
 	},
 
 	{
+		{"parallel_seqscan_degree", PGC_SUSET, PARALLEL_QUERY,
+			gettext_noop("Sets the maximum number of simultaneously running backend worker processes."),
+			NULL
+		},
+		&parallel_seqscan_degree,
+		0, 0, MAX_BACKENDS,
+		NULL, NULL, NULL
+	},
+
+	{
 		{"autovacuum_work_mem", PGC_SIGHUP, RESOURCES_MEM,
 			gettext_noop("Sets the maximum memory to be used by each autovacuum worker process."),
 			NULL,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 4a89cb7..3a6b037 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -494,6 +494,11 @@
 					# autovacuum, -1 means use
 					# vacuum_cost_limit
 
+#------------------------------------------------------------------------------
+# PARALLEL_QUERY PARAMETERS
+#------------------------------------------------------------------------------
+
+#parallel_seqscan_degree = 0		# max number of worker backend subprocesses
 
 #------------------------------------------------------------------------------
 # CLIENT CONNECTION DEFAULTS
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index f2c7ca1..f88ef2e 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -20,7 +20,6 @@
 #include "access/itup.h"
 #include "access/tupdesc.h"
 
-
 typedef struct HeapScanDescData
 {
 	/* scan parameters */
@@ -105,4 +104,13 @@ typedef struct SysScanDescData
 	Snapshot	snapshot;		/* snapshot to unregister at end of scan */
 }	SysScanDescData;
 
+/* struct for scanning shared memory queues */
+typedef struct ShmScanDescData
+{
+	/* scan current state */
+	int			num_shm_queues;	/* number of shared memory queues used in scan. */
+	int			ss_cqueue;		/* current queue # in scan, if any */
+	bool		shmscan_inited;		/* false = scan not init'd yet */
+}	ShmScanDescData;
+
 #endif   /* RELSCAN_H */
diff --git a/src/include/access/shmmqam.h b/src/include/access/shmmqam.h
new file mode 100644
index 0000000..aa444bc
--- /dev/null
+++ b/src/include/access/shmmqam.h
@@ -0,0 +1,39 @@
+/*-------------------------------------------------------------------------
+ *
+ * shmmqam.h
+ *	  POSTGRES shared memory queue access method definitions.
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/shmmqam.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef SHMMQAM_H
+#define SHMMQAM_H
+
+#include "access/relscan.h"
+#include "libpq/pqmq.h"
+
+
+/* Private state maintained across calls to shm_getnext. */
+typedef struct worker_result_state
+{
+	FmgrInfo   *receive_functions;
+	Oid		   *typioparams;
+	bool		has_row_description;
+	bool		complete;
+} worker_result_state;
+
+typedef struct worker_result_state *worker_result;
+
+typedef struct ShmScanDescData *ShmScanDesc;
+
+extern worker_result ExecInitWorkerResult(TupleDesc tupdesc);
+extern ShmScanDesc shm_beginscan(int num_queues);
+extern HeapTuple shm_getnext(ShmScanDesc shmScan, worker_result resultState,
+							 shm_mq_handle **responseq, TupleDesc tupdesc);
+
+#endif   /* SHMMQAM_H */
diff --git a/src/include/executor/nodeParallelSeqscan.h b/src/include/executor/nodeParallelSeqscan.h
new file mode 100644
index 0000000..b638a24
--- /dev/null
+++ b/src/include/executor/nodeParallelSeqscan.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeParallelSeqscan.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeParallelSeqscan.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEPARALLELSEQSCAN_H
+#define NODEPARALLELSEQSCAN_H
+
+#include "nodes/execnodes.h"
+
+extern ParallelSeqScanState *ExecInitParallelSeqScan(ParallelSeqScan *node, EState *estate, int eflags);
+extern TupleTableSlot *ExecParallelSeqScan(ParallelSeqScanState *node);
+extern void ExecEndParallelSeqScan(ParallelSeqScanState *node);
+
+extern Size EstimateScanRelationIdSpace(Oid relId);
+extern void SerializeScanRelationId(Oid relId, Size maxsize,
+									char *start_address);
+extern void RestoreScanRelationId(Oid *relId, char *start_address);
+
+extern Size EstimateTargetListSpace(List *targetList);
+extern void SerializeTargetList(List *targetList, Size maxsize,
+								char *start_address);
+extern void RestoreTargetList(List **targetList, char *start_address);
+
+#endif   /* NODEPARALLELSEQSCAN_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 41b13b2..19ec043 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -16,9 +16,11 @@
 
 #include "access/genam.h"
 #include "access/heapam.h"
+#include "access/shmmqam.h"
 #include "executor/instrument.h"
 #include "nodes/params.h"
 #include "nodes/plannodes.h"
+#include "storage/shm_mq.h"
 #include "utils/reltrigger.h"
 #include "utils/sortsupport.h"
 #include "utils/tuplestore.h"
@@ -1212,6 +1214,23 @@ typedef struct ScanState
 typedef ScanState SeqScanState;
 
 /*
+ * ParallelScanState extends ScanState by storing additional information
+ * related to parallel workers.
+ *		dsm_segment		dynamic shared memory segment to setup worker queues
+ *		responseq		shared memory queues to receive data from workers
+ */
+typedef struct ParallelScanState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	dsm_segment *seg;
+	shm_mq_handle **responseq;
+	ShmScanDesc pss_currentShmScanDesc;
+	worker_result	pss_workerResult;
+} ParallelScanState;
+
+typedef ParallelScanState ParallelSeqScanState;
+
+/*
  * These structs store information about index quals that don't have simple
  * constant right-hand sides.  See comments for ExecIndexBuildScanKeys()
  * for discussion.
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index bc71fea..c48df6c 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -51,6 +51,7 @@ typedef enum NodeTag
 	T_BitmapOr,
 	T_Scan,
 	T_SeqScan,
+	T_ParallelSeqScan,
 	T_IndexScan,
 	T_IndexOnlyScan,
 	T_BitmapIndexScan,
@@ -97,6 +98,7 @@ typedef enum NodeTag
 	T_BitmapOrState,
 	T_ScanState,
 	T_SeqScanState,
+	T_ParallelSeqScanState,
 	T_IndexScanState,
 	T_IndexOnlyScanState,
 	T_BitmapIndexScanState,
@@ -217,6 +219,7 @@ typedef enum NodeTag
 	T_IndexOptInfo,
 	T_ParamPathInfo,
 	T_Path,
+	T_ParallelSeqPath,
 	T_IndexPath,
 	T_BitmapHeapPath,
 	T_BitmapAndPath,
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 3e4f815..54efdc1 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -23,6 +23,7 @@
 #include "nodes/bitmapset.h"
 #include "nodes/primnodes.h"
 #include "nodes/value.h"
+#include "storage/block.h"
 #include "utils/lockwaitpolicy.h"
 
 /* Possible sources of a Query */
@@ -156,6 +157,14 @@ typedef struct Query
 								 * depends on to be semantically valid */
 } Query;
 
+/* worker statement required for execution. */
+typedef struct worker_stmt
+{
+	Oid			relId;
+	List		*targetList;
+	BlockNumber startBlock;
+	BlockNumber endBlock;
+} worker_stmt;
 
 /****************************************************************************
  *	Supporting data structures for Parse Trees
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 7f9eaf0..0375ce1 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -18,6 +18,7 @@
 #include "lib/stringinfo.h"
 #include "nodes/bitmapset.h"
 #include "nodes/primnodes.h"
+#include "storage/block.h"
 #include "utils/lockwaitpolicy.h"
 
 
@@ -269,6 +270,8 @@ typedef struct Scan
 {
 	Plan		plan;
 	Index		scanrelid;		/* relid is index into the range table */
+	BlockNumber startblock;		/* block to start seq scan */
+	BlockNumber endblock;		/* block up to which scan has to be done */
 } Scan;
 
 /* ----------------
@@ -278,6 +281,17 @@ typedef struct Scan
 typedef Scan SeqScan;
 
 /* ----------------
+ *		parallel sequential scan node
+ * ----------------
+ */
+typedef struct ParallelSeqScan
+{
+	Scan		scan;
+	int			num_workers;
+	BlockNumber	num_blocks_per_worker;
+} ParallelSeqScan;
+
+/* ----------------
  *		index scan node
  *
  * indexqualorig is an implicitly-ANDed list of index qual expressions, each
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 810b9c8..3a38270 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -737,6 +737,13 @@ typedef struct Path
 	/* pathkeys is a List of PathKey nodes; see above */
 } Path;
 
+typedef struct ParallelSeqPath
+{
+	Path		path;
+	int			num_workers;
+	BlockNumber	num_blocks_per_worker;
+} ParallelSeqPath;
+
 /* Macro for extracting a path's parameterization relids; beware double eval */
 #define PATH_REQ_OUTER(path)  \
 	((path)->param_info ? (path)->param_info->ppi_req_outer : (Relids) NULL)
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 75e2afb..a738c54 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -50,6 +50,7 @@ extern PGDLLIMPORT double cpu_index_tuple_cost;
 extern PGDLLIMPORT double cpu_operator_cost;
 extern PGDLLIMPORT int effective_cache_size;
 extern Cost disable_cost;
+extern int	parallel_seqscan_degree;
 extern bool enable_seqscan;
 extern bool enable_indexscan;
 extern bool enable_indexonlyscan;
@@ -68,6 +69,8 @@ extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,
 					double index_pages, PlannerInfo *root);
 extern void cost_seqscan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 			 ParamPathInfo *param_info);
+extern void cost_parallelseqscan(ParallelSeqPath *path, PlannerInfo *root,
+			 RelOptInfo *baserel, ParamPathInfo *param_info, int nWorkers);
 extern void cost_index(IndexPath *path, PlannerInfo *root,
 		   double loop_count);
 extern void cost_bitmap_heap_scan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 26b17f5..901c792 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -32,6 +32,8 @@ extern bool add_path_precheck(RelOptInfo *parent_rel,
 
 extern Path *create_seqscan_path(PlannerInfo *root, RelOptInfo *rel,
 					Relids required_outer);
+extern ParallelSeqPath *create_parallelseqscan_path(PlannerInfo *root,
+					RelOptInfo *rel, int nWorkers);
 extern IndexPath *create_index_path(PlannerInfo *root,
 				  IndexOptInfo *index,
 				  List *indexclauses,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index afa5f9b..d2a2760 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -46,6 +46,13 @@ extern void debug_print_rel(PlannerInfo *root, RelOptInfo *rel);
 #endif
 
 /*
+ * parallelpath.c
+ *	  routines to generate parallel scan paths
+ */
+
+extern void create_parallelscan_paths(PlannerInfo *root, RelOptInfo *rel);
+
+/*
  * indxpath.c
  *	  routines to generate index paths
  */
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index 3fdc2cb..b382a27 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -41,6 +41,9 @@ extern Plan *optimize_minmax_aggregates(PlannerInfo *root, List *tlist,
  * prototypes for plan/createplan.c
  */
 extern Plan *create_plan(PlannerInfo *root, Path *best_path);
+extern SeqScan *
+create_worker_seqscan_plan(List *targetList, List *scan_clauses,
+						   BlockNumber startBlock, BlockNumber endBlock);
 extern SubqueryScan *make_subqueryscan(List *qptlist, List *qpqual,
 				  Index scanrelid, Plan *subplan);
 extern ForeignScan *make_foreignscan(List *qptlist, List *qpqual,
diff --git a/src/include/optimizer/planner.h b/src/include/optimizer/planner.h
index 1e942c5..752bd16 100644
--- a/src/include/optimizer/planner.h
+++ b/src/include/optimizer/planner.h
@@ -14,6 +14,7 @@
 #ifndef PLANNER_H
 #define PLANNER_H
 
+#include "nodes/parsenodes.h"
 #include "nodes/plannodes.h"
 #include "nodes/relation.h"
 
@@ -29,6 +30,8 @@ extern PlannedStmt *planner(Query *parse, int cursorOptions,
 		ParamListInfo boundParams);
 extern PlannedStmt *standard_planner(Query *parse, int cursorOptions,
 				 ParamListInfo boundParams);
+extern PlannedStmt *
+create_worker_seqscan_plannedstmt(worker_stmt *workerstmt);
 
 extern Plan *subquery_planner(PlannerGlobal *glob, Query *parse,
 				 PlannerInfo *parent_root,
diff --git a/src/include/postmaster/backendworker.h b/src/include/postmaster/backendworker.h
new file mode 100644
index 0000000..68f2023
--- /dev/null
+++ b/src/include/postmaster/backendworker.h
@@ -0,0 +1,29 @@
+/*--------------------------------------------------------------------
+ * backendworker.h
+ *		POSTGRES backend workers interface
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/postmaster/backendworker.h
+ *--------------------------------------------------------------------
+ */
+#ifndef BACKENDWORKER_H
+#define BACKENDWORKER_H
+
+/*---------------------------------------------------------------------
+ * External module API.
+ *---------------------------------------------------------------------
+ */
+
+#include "libpq/pqmq.h"
+
+extern int	parallel_seqscan_degree;
+extern void InitiateWorkers(Oid relId, List *targetList,
+							shm_mq_handle ***responseqp,
+							dsm_segment **segp,
+							BlockNumber numBlocksPerWorker,
+							int nWorkers);
+
+#endif   /* BACKENDWORKER_H */
diff --git a/src/include/tcop/tcopprot.h b/src/include/tcop/tcopprot.h
index 60f7532..6087b5e 100644
--- a/src/include/tcop/tcopprot.h
+++ b/src/include/tcop/tcopprot.h
@@ -83,5 +83,6 @@ extern void set_debug_options(int debug_flag,
 extern bool set_plan_disabling_options(const char *arg,
 						   GucContext context, GucSource source);
 extern const char *get_stats_option_name(const char *arg);
+extern void exec_worker_stmt(worker_stmt *workerstmt);
 
 #endif   /* TCOPPROT_H */
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 47ff880..532d2db 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -85,6 +85,7 @@ enum config_group
 	STATS_MONITORING,
 	STATS_COLLECTOR,
 	AUTOVACUUM,
+	PARALLEL_QUERY,
 	CLIENT_CONN,
 	CLIENT_CONN_STATEMENT,
 	CLIENT_CONN_LOCALE,
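
The block partitioning done in exec_worker_message() in the patch above can be sketched on its own. This is a minimal standalone illustration, not the patch's code: the WorkerBlockRange type and worker_block_range() function are hypothetical names, BlockNumber and InvalidBlockNumber are redefined locally to mirror PostgreSQL's definitions, and worker numbers are assumed to start at 1 as they do in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins mirroring PostgreSQL's storage/block.h definitions. */
typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/* Hypothetical helper type: the block range assigned to one worker. */
typedef struct WorkerBlockRange
{
	BlockNumber start_block;	/* first block this worker scans */
	BlockNumber end_block;		/* upper bound, or InvalidBlockNumber
								 * meaning "scan to end of relation" */
} WorkerBlockRange;

/*
 * Compute the range for worker_number (1-based), following the arithmetic
 * in exec_worker_message(): each worker gets blocks_per_worker blocks and
 * the last worker takes whatever remains.
 */
WorkerBlockRange
worker_block_range(int worker_number, int workers_total,
				   BlockNumber blocks_per_worker)
{
	WorkerBlockRange r;

	assert(worker_number >= 1 && worker_number <= workers_total);

	r.start_block = (BlockNumber) (worker_number - 1) * blocks_per_worker;
	if (worker_number == workers_total)
		r.end_block = InvalidBlockNumber;	/* last worker: all remaining */
	else
		r.end_block = (BlockNumber) worker_number * blocks_per_worker;
	return r;
}
```

With 4 workers and 25 blocks per worker, worker 3 starts at block 50 and worker 4 starts at block 75 with an open-ended upper bound, so any blocks left over by the integer division are covered by the last worker.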
#2José Luis Tallón
jltallon@adv-solutions.net
In reply to: Amit Kapila (#1)
Re: Parallel Seq Scan

On 12/04/2014 07:35 AM, Amit Kapila wrote:

[snip]

The number of worker backends that can be used for
parallel seq scan can be configured by using a new GUC
parallel_seqscan_degree, the default value of which is zero
and it means parallel seq scan will not be considered unless
user configures this value.

The number of parallel workers should be capped (of course!) at the
maximum amount of "processors" (cores/vCores, threads/hyperthreads)
available.

Moreover, when load goes up, the relative cost of parallel working
should go up as well.
Something like:
p = number of cores
l = 1min-load

additional_cost = tuple estimate * cpu_tuple_cost * (l+1)/(c-1)

(for c>1, of course)
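
The formula above can be written out as a small function. This is only a sketch under stated assumptions: the post labels the core count "p" but the formula uses "c", and the code assumes both mean the number of cores; the function name parallel_additional_cost() and its parameter names are hypothetical, and how the 1-minute load would be obtained is left to the caller.

```c
#include <assert.h>

/*
 * Load-dependent surcharge for a parallel plan, as proposed in the thread:
 *
 *   additional_cost = tuple_estimate * cpu_tuple_cost * (l + 1) / (c - 1)
 *
 * Assumption: c is the number of cores and l is the 1-minute load average.
 * The formula is only defined for c > 1.
 */
double
parallel_additional_cost(double tuple_estimate, double cpu_tuple_cost,
						 double load_1min, int ncores)
{
	assert(ncores > 1);

	return tuple_estimate * cpu_tuple_cost
		* (load_1min + 1.0) / (double) (ncores - 1);
}
```

The surcharge grows linearly with load and shrinks as cores are added, so a busy machine with few cores is penalised most.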

In ExecutorStart phase, initiate the required number of workers
as per parallel seq scan plan and setup dynamic shared memory and
share the information required for worker to execute the scan.
Currently I have just shared the relId, targetlist and number
of blocks to be scanned by worker, however I think we might want
to generate a plan for each of the workers in master backend and
then share the same to individual worker.

[snip]

Attached patch is just to facilitate the discussion about the
parallel seq scan and maybe some other dependent tasks, like
sharing of various states, such as combocid and snapshot, with
parallel workers. It is by no means ready for any complex test; of
course, I will work towards making it more robust, both in terms of
adding more stuff and doing performance optimizations.

Thoughts/Suggestions?

Not directly (I haven't had the time to read the code yet), but I'm
thinking about the ability to simply *replace* executor methods from an
extension.
This could be an alternative to providing additional nodes that the
planner can include in the final plan tree, ready to be executed.

The parallel seq scan nodes are definitely the best approach for
"parallel query", since the planner can optimize them based on cost.
I'm wondering about the ability to modify the implementation of some
methods themselves once at execution time: given a previously planned
query, chances are that, at execution time (I'm specifically thinking
about prepared statements here), a different implementation of the same
"node" might be more suitable and could be used instead while the
condition holds.

If this latter line of thinking is too off-topic within this thread and
there is any interest, we can move the comments to another thread and
I'd begin work on a PoC patch. It might also make sense to implement
the executor overloading mechanism alongside the custom plan API, though.
Any comments appreciated.

Thank you for your work, Amit

Regards,

/ J.L.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Stephen Frost
sfrost@snowman.net
In reply to: José Luis Tallón (#2)
Re: Parallel Seq Scan

José,

* José Luis Tallón (jltallon@adv-solutions.net) wrote:

On 12/04/2014 07:35 AM, Amit Kapila wrote:

The number of worker backends that can be used for
parallel seq scan can be configured by using a new GUC
parallel_seqscan_degree, the default value of which is zero
and it means parallel seq scan will not be considered unless
user configures this value.

The number of parallel workers should be capped (of course!) at the
maximum amount of "processors" (cores/vCores, threads/hyperthreads)
available.

Moreover, when load goes up, the relative cost of parallel working
should go up as well.
Something like:
p = number of cores
l = 1min-load

additional_cost = tuple estimate * cpu_tuple_cost * (l+1)/(c-1)

(for c>1, of course)

While I agree in general that we'll need to come up with appropriate
acceptance criteria, etc, I don't think we want to complicate this patch
with that initially. A SUSET GUC which caps the parallel GUC would be
enough for an initial implementation, imv.

Not directly (I haven't had the time to read the code yet), but I'm
thinking about the ability to simply *replace* executor methods from
an extension.

You probably want to look at the CustomScan thread+patch directly then..

Thanks,

Stephen

#4Stephen Frost
sfrost@snowman.net
In reply to: Amit Kapila (#1)
Re: Parallel Seq Scan

Amit,

* Amit Kapila (amit.kapila16@gmail.com) wrote:

postgres=# explain select c1 from t1;
QUERY PLAN
------------------------------------------------------
Seq Scan on t1 (cost=0.00..101.00 rows=100 width=4)
(1 row)

postgres=# set parallel_seqscan_degree=4;
SET
postgres=# explain select c1 from t1;
QUERY PLAN
--------------------------------------------------------------
Parallel Seq Scan on t1 (cost=0.00..25.25 rows=100 width=4)
Number of Workers: 4
Number of Blocks Per Workers: 25
(3 rows)

This is all great and interesting, but I feel like folks might be
waiting to see just what kind of performance results come from this (and
what kind of hardware is needed to see gains..). There are likely to be
situations where this change is an improvement, as well as cases
where it makes things worse.
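
The figures in the EXPLAIN output quoted above are consistent with the simple model from the first message, in which the sequential scan cost is divided by the number of workers. A minimal sketch of that arithmetic follows. It assumes the stock defaults seq_page_cost = 1.0 and cpu_tuple_cost = 0.01, and it assumes t1 occupies 100 pages (inferred from the plan, not stated in the thread). The function names are hypothetical; this is not the patch's cost_parallelseqscan(), and startup cost and qual evaluation are ignored.

```c
#include <assert.h>

/* Simplified seq scan cost: disk cost per page plus CPU cost per tuple. */
double
seqscan_total_cost(double pages, double tuples,
				   double seq_page_cost, double cpu_tuple_cost)
{
	return seq_page_cost * pages + cpu_tuple_cost * tuples;
}

/*
 * Simplified parallel seq scan cost: the serial cost divided evenly among
 * the workers.  Worker startup and shared memory setup are not charged,
 * which the first message lists as future work.
 */
double
parallel_seqscan_total_cost(double pages, double tuples,
							double seq_page_cost, double cpu_tuple_cost,
							int nworkers)
{
	assert(nworkers > 0);

	return seqscan_total_cost(pages, tuples, seq_page_cost, cpu_tuple_cost)
		/ (double) nworkers;
}
```

With 100 pages and 100 tuples the serial cost is 1.0 * 100 + 0.01 * 100 = 101.00, and dividing by 4 workers gives 25.25, matching the two plans; 100 pages over 4 workers is also 25 blocks per worker.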

One really interesting case would be parallel seq scans which are
executing against foreign tables/FDWs..

Thanks!

Stephen

#5Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: José Luis Tallón (#2)
Re: Parallel Seq Scan

On 12/5/14, 9:08 AM, José Luis Tallón wrote:

Moreover, when load goes up, the relative cost of parallel working should go up as well.
Something like:
p = number of cores
l = 1min-load

additional_cost = tuple estimate * cpu_tuple_cost * (l+1)/(c-1)

(for c>1, of course)

...

The parallel seq scan nodes are definitely the best approach for "parallel query", since the planner can optimize them based on cost.
I'm wondering about the ability to modify the implementation of some methods themselves once at execution time: given a previously planned query, chances are that, at execution time (I'm specifically thinking about prepared statements here), a different implementation of the same "node" might be more suitable and could be used instead while the condition holds.

These comments got me wondering... would it be better to decide on parallelism during execution instead of at plan time? That would allow us to dynamically scale parallelism based on system load. If we don't even consider parallelism until we've pulled some number of tuples/pages from a relation, this would also eliminate all parallel overhead on small relations.
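
The execution-time idea can be sketched as a decision function that is consulted while the scan runs. Everything here is an assumption made for illustration: the name runtime_parallel_degree(), the page threshold, and the rule of reserving one core for the leader and treating the truncated load average as the number of busy cores are not taken from the patch or the thread.

```c
#include <assert.h>

/*
 * Decide how many workers to launch once a scan is already under way.
 * Returns 0 (stay serial) until page_threshold pages have been read, so
 * small relations never pay any parallel overhead.  After that, the degree
 * is limited by both the configured maximum and the cores that look idle.
 */
int
runtime_parallel_degree(unsigned long pages_scanned,
						unsigned long page_threshold,
						int configured_degree,
						double load_1min,
						int ncores)
{
	int			busy_cores;
	int			spare_cores;

	assert(configured_degree >= 0 && ncores >= 1);

	if (pages_scanned < page_threshold)
		return 0;				/* too early to justify workers */

	if (!(load_1min >= 0.0))
		load_1min = 0.0;		/* treat negative or NaN load as idle */
	if (load_1min >= (double) ncores)
		return 0;				/* machine is saturated */

	busy_cores = (int) load_1min;			/* truncates toward zero */
	spare_cores = ncores - busy_cores - 1;	/* keep one for the leader */
	if (spare_cores <= 0)
		return 0;

	return spare_cores < configured_degree ? spare_cores : configured_degree;
}
```

On an 8-core machine with a configured degree of 4, a load of 0.5 yields 4 workers once the threshold is passed, while a load of 6.2 leaves room for only 1.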
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


#6Amit Kapila
amit.kapila16@gmail.com
In reply to: José Luis Tallón (#2)
Re: Parallel Seq Scan

On Fri, Dec 5, 2014 at 8:38 PM, José Luis Tallón <jltallon@adv-solutions.net> wrote:

On 12/04/2014 07:35 AM, Amit Kapila wrote:

[snip]

The number of worker backends that can be used for
parallel seq scan can be configured by using a new GUC
parallel_seqscan_degree, the default value of which is zero
and it means parallel seq scan will not be considered unless
user configures this value.

The number of parallel workers should be capped (of course!) at the
maximum amount of "processors" (cores/vCores, threads/hyperthreads)
available.

Also, it should consider MaxConnections configured by user.
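
The two caps mentioned here can be sketched as a simple clamp. This is an illustration only: the function name and its parameters are hypothetical, and it assumes that parallel workers would draw on the same pool of slots as regular connections, which the thread has not settled.

```c
#include <assert.h>

/*
 * Clamp the requested parallel degree to the number of cores and to the
 * connection slots that are still free.  All names are hypothetical.
 */
int
cap_parallel_degree(int requested, int ncores,
					int max_connections, int active_backends)
{
	int			free_slots;
	int			degree = requested;

	assert(requested >= 0 && ncores >= 1);
	assert(max_connections >= 0 && active_backends >= 0);

	/* Slots left once existing backends are accounted for. */
	free_slots = max_connections - active_backends;
	if (free_slots < 0)
		free_slots = 0;

	if (degree > ncores)
		degree = ncores;
	if (degree > free_slots)
		degree = free_slots;
	return degree;
}
```

For example, a requested degree of 8 on a 4-core machine is reduced to 4, and a request made when only 3 slots remain is reduced to 3.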

Moreover, when load goes up, the relative cost of parallel working
should go up as well.

Something like:
p = number of cores
l = 1min-load

additional_cost = tuple estimate * cpu_tuple_cost * (l+1)/(c-1)

(for c>1, of course)

How would you measure the load in the above formula, and what exactly is 'c'
(is it the number of parallel workers involved)?

For now, I have managed this simply by having a configuration
variable, and it seems to me that this should be good enough for
the first version. We can definitely enhance it in a future
version by dynamically allocating the number of workers based
on their availability and the needs of the query, but I think
let's leave that for another day.
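
To make the quoted load-based formula concrete, here is a small sketch that evaluates it. The meaning of 'c' is left open in the thread; the code below assumes it is the number of cores (the 'p' of the quoted definition), and that assumption is mine, not the proposer's. The function name is hypothetical.

```c
#include <assert.h>

/*
 * additional_cost = tuple_estimate * cpu_tuple_cost * (l + 1) / (c - 1)
 *
 * ASSUMPTION: 'c' is taken to be the number of cores; the thread does not
 * define it.  load1 is the 1-minute load average ('l' in the formula).
 */
double
parallel_additional_cost(double tuple_estimate, double cpu_tuple_cost,
						 double load1, int c)
{
	/* The formula is only defined for c > 1. */
	assert(c > 1);

	return tuple_estimate * cpu_tuple_cost * (load1 + 1.0) / (double) (c - 1);
}
```

With 1,000,000 estimated tuples, cpu_tuple_cost at its default of 0.01 and c = 5, a load average of 1.0 gives an additional cost of 5000, and a load average of 3.0 doubles it to 10000, so the penalty grows linearly with load.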

In ExecutorStart phase, initiate the required number of workers
as per parallel seq scan plan and setup dynamic shared memory and
share the information required for worker to execute the scan.
Currently I have just shared the relId, targetlist and number
of blocks to be scanned by worker, however I think we might want
to generate a plan for each of the workers in master backend and
then share the same to individual worker.
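
The block assignment described in the opening message of this thread (blocks divided equally, with the last worker taking whatever remains) can be sketched as follows. The struct and function names are hypothetical and are not taken from the patch.

```c
#include <assert.h>

/* Range of heap blocks handed to one worker (hypothetical type). */
typedef struct BlockRange
{
	unsigned	start;			/* first block to scan */
	unsigned	count;			/* number of blocks to scan */
} BlockRange;

/*
 * Split nblocks evenly among nworkers.  When the division is not exact,
 * the last worker also scans the leftover blocks.
 */
BlockRange
worker_block_range(unsigned nblocks, int nworkers, int worker_id)
{
	BlockRange	r;
	unsigned	per_worker;

	assert(nworkers > 0 && worker_id >= 0 && worker_id < nworkers);

	per_worker = nblocks / (unsigned) nworkers;
	r.start = (unsigned) worker_id * per_worker;
	r.count = per_worker;

	/* The last worker picks up the remainder. */
	if (worker_id == nworkers - 1)
		r.count = nblocks - r.start;
	return r;
}
```

For 10 blocks and 3 workers this yields ranges of 3, 3 and 4 blocks starting at blocks 0, 3 and 6.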

[snip]

Attached patch is just to facilitate the discussion about the
parallel seq scan and maybe some other dependent tasks like
sharing of various states like combocid, snapshot with parallel
workers. It is by no means ready to do any complex test; of course
I will work towards making it more robust both in terms of adding
more stuff and doing performance optimizations.

Thoughts/Suggestions?

Not directly (I haven't had the time to read the code yet), but I'm
thinking about the ability to simply *replace* executor methods from an
extension.

This could be an alternative to providing additional nodes that the
planner can include in the final plan tree, ready to be executed.

The parallel seq scan nodes are definitely the best approach for
"parallel query", since the planner can optimize them based on cost.

I'm wondering about the ability to modify the implementation of some
methods themselves once at execution time: given a previously planned
query, chances are that, at execution time (I'm specifically thinking about
prepared statements here), a different implementation of the same "node"
might be more suitable and could be used instead while the condition holds.

The idea sounds interesting, and I think in some cases a
different implementation of the same node might help. But maybe
at this stage, if we focus on one kind of implementation (which is
a win for a reasonable number of cases) and make it successful,
then doing alternative implementations will be comparatively
easier and have a better chance of success.

If this latter line of thinking is too off-topic within this thread and
there is any interest, we can move the comments to another thread and I'd
begin work on a PoC patch. It might as well make sense to implement the
executor overloading mechanism alongside the custom plan API, though.

Sure, please go ahead whichever way you would like to proceed.
If you want to contribute to this area/patch, then you are
welcome.

Any comments appreciated.

Thank you for your work, Amit

Many thanks to you as well for showing interest.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#7David Rowley
dgrowleyml@gmail.com
In reply to: Amit Kapila (#1)
1 attachment(s)
Re: Parallel Seq Scan

On 4 December 2014 at 19:35, Amit Kapila <amit.kapila16@gmail.com> wrote:

Attached patch is just to facilitate the discussion about the
parallel seq scan and maybe some other dependent tasks like
sharing of various states like combocid, snapshot with parallel
workers. It is by no means ready to do any complex test; of course
I will work towards making it more robust both in terms of adding
more stuff and doing performance optimizations.

Thoughts/Suggestions?

This is good news!
I've not gotten to look at the patch yet, but I thought you may be able to
make use of the attached at some point.

It's bare-bones core support for allowing aggregate states to be merged
together with another aggregate state. I would imagine that if a query such
as:

SELECT MAX(value) FROM bigtable;

was run, then a series of parallel workers could go off and each find the
max value from their portion of the table and then perhaps some other node
type would then take all the intermediate results from the workers, once
they're finished, and join all of the aggregate states into one and return
that. Naturally, you'd need to check that all aggregates used in the
targetlist had a merge function first.

This is just a few hours of work. I've not really tested the pg_dump
support or anything yet. I've also not added any new functions to allow
AVG() or COUNT() to work, I've really just re-used existing functions where
I could, as things like MAX() and BOOL_OR() can just make use of the
existing transition function. I thought that this might be enough for early
tests.
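
A small sketch of why MAX() can reuse its transition function for merging while COUNT() needs a separate function, which matches the patch leaving the merge column as `-` for the count aggregates. The helpers below are simplified stand-ins over plain `int64_t` values, not the real PostgreSQL functions, and NULL handling is ignored.

```c
#include <assert.h>
#include <stdint.h>

/* MAX: "larger of state and value" works for rows and for partial states. */
int64_t
max_trans(int64_t state, int64_t value)
{
	return value > state ? value : state;
}

int64_t
max_merge(int64_t a, int64_t b)
{
	return max_trans(a, b);		/* same operation, reused as-is */
}

/* COUNT: the transition step adds one per row ... */
int64_t
count_trans(int64_t state)
{
	return state + 1;
}

/* ... but merging two partial counts must add them, a different function. */
int64_t
count_merge(int64_t a, int64_t b)
{
	return a + b;
}

/* One worker's share of the scan: rows values[lo] .. values[hi - 1]. */
int64_t
partial_max(const int64_t *values, int lo, int hi)
{
	int64_t		state;
	int			i;

	assert(lo < hi);
	state = values[lo];
	for (i = lo + 1; i < hi; i++)
		state = max_trans(state, values[i]);
	return state;
}

int64_t
partial_count(int lo, int hi)
{
	int64_t		state = 0;
	int			i;

	for (i = lo; i < hi; i++)
		state = count_trans(state);
	return state;
}
```

Splitting the six values {3, 9, 4, 7, 1, 8} into two halves gives partial maxima of 9 and 8, which merge to 9, and partial counts of 3 and 3, which merge to 6.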

I'd imagine such a workload, ignoring IO overhead, should scale pretty much
linearly with the number of worker processes. Of course, if there was a
GROUP BY clause then the merger code would have to perform more work.

If you think you might be able to make use of this, then I'm willing to go
off and write all the other merge functions required for the other
aggregates.

Regards

David Rowley

Attachments:

merge_aggregate_state_v1.patch (application/octet-stream)
diff --git a/doc/src/sgml/ref/create_aggregate.sgml b/doc/src/sgml/ref/create_aggregate.sgml
index eaa410b..407dc66 100644
--- a/doc/src/sgml/ref/create_aggregate.sgml
+++ b/doc/src/sgml/ref/create_aggregate.sgml
@@ -27,6 +27,8 @@ CREATE AGGREGATE <replaceable class="parameter">name</replaceable> ( [ <replacea
     [ , SSPACE = <replaceable class="PARAMETER">state_data_size</replaceable> ]
     [ , FINALFUNC = <replaceable class="PARAMETER">ffunc</replaceable> ]
     [ , FINALFUNC_EXTRA ]
+    [ , MERGEFUNC = <replaceable class="PARAMETER">mfunc</replaceable> ]
+    [ , MERGEFUNC_EXTRA ]
     [ , INITCOND = <replaceable class="PARAMETER">initial_condition</replaceable> ]
     [ , MSFUNC = <replaceable class="PARAMETER">msfunc</replaceable> ]
     [ , MINVFUNC = <replaceable class="PARAMETER">minvfunc</replaceable> ]
@@ -45,6 +47,8 @@ CREATE AGGREGATE <replaceable class="parameter">name</replaceable> ( [ [ <replac
     [ , SSPACE = <replaceable class="PARAMETER">state_data_size</replaceable> ]
     [ , FINALFUNC = <replaceable class="PARAMETER">ffunc</replaceable> ]
     [ , FINALFUNC_EXTRA ]
+    [ , MERGEFUNC = <replaceable class="PARAMETER">mfunc</replaceable> ]
+    [ , MERGEFUNC_EXTRA ]
     [ , INITCOND = <replaceable class="PARAMETER">initial_condition</replaceable> ]
     [ , HYPOTHETICAL ]
 )
@@ -58,6 +62,8 @@ CREATE AGGREGATE <replaceable class="PARAMETER">name</replaceable> (
     [ , SSPACE = <replaceable class="PARAMETER">state_data_size</replaceable> ]
     [ , FINALFUNC = <replaceable class="PARAMETER">ffunc</replaceable> ]
     [ , FINALFUNC_EXTRA ]
+    [ , MERGEFUNC = <replaceable class="PARAMETER">mfunc</replaceable> ]
+    [ , MERGEFUNC_EXTRA ]
     [ , INITCOND = <replaceable class="PARAMETER">initial_condition</replaceable> ]
     [ , MSFUNC = <replaceable class="PARAMETER">msfunc</replaceable> ]
     [ , MINVFUNC = <replaceable class="PARAMETER">minvfunc</replaceable> ]
diff --git a/src/backend/catalog/pg_aggregate.c b/src/backend/catalog/pg_aggregate.c
index 1ad923c..199b9bd 100644
--- a/src/backend/catalog/pg_aggregate.c
+++ b/src/backend/catalog/pg_aggregate.c
@@ -57,10 +57,12 @@ AggregateCreate(const char *aggName,
 				Oid variadicArgType,
 				List *aggtransfnName,
 				List *aggfinalfnName,
+				List *aggmergefnName,
 				List *aggmtransfnName,
 				List *aggminvtransfnName,
 				List *aggmfinalfnName,
 				bool finalfnExtraArgs,
+				bool mergefnExtraArgs,
 				bool mfinalfnExtraArgs,
 				List *aggsortopName,
 				Oid aggTransType,
@@ -77,6 +79,7 @@ AggregateCreate(const char *aggName,
 	Form_pg_proc proc;
 	Oid			transfn;
 	Oid			finalfn = InvalidOid;	/* can be omitted */
+	Oid			mergefn = InvalidOid;	/* can be omitted */
 	Oid			mtransfn = InvalidOid;	/* can be omitted */
 	Oid			minvtransfn = InvalidOid;		/* can be omitted */
 	Oid			mfinalfn = InvalidOid;	/* can be omitted */
@@ -90,6 +93,7 @@ AggregateCreate(const char *aggName,
 	Oid			fnArgs[FUNC_MAX_ARGS];
 	int			nargs_transfn;
 	int			nargs_finalfn;
+	int			nargs_mergefn;
 	Oid			procOid;
 	TupleDesc	tupDesc;
 	int			i;
@@ -396,6 +400,50 @@ AggregateCreate(const char *aggName,
 	}
 	Assert(OidIsValid(finaltype));
 
+	/* handle the mergefn, if supplied */
+	if (aggmergefnName)
+	{
+		/*
+		 * If mergefnExtraArgs is specified, the mergefn takes two transtype
+		 * args plus all agg args; otherwise, just those two plus any
+		 * direct args.  (Non-direct args are useless at runtime, and are
+		 * actually passed as NULLs, but we may need them in the function
+		 * signature to allow resolution of a polymorphic agg's result type.)
+		 */
+		Oid			mfnVariadicArgType = variadicArgType;
+
+		/* the 1st and 2nd args must be the trans type */
+		fnArgs[0] = aggTransType;
+		fnArgs[1] = aggTransType;
+		memcpy(fnArgs + 2, aggArgTypes, numArgs * sizeof(Oid));
+		if (mergefnExtraArgs)
+			nargs_mergefn = numArgs + 2;
+		else
+		{
+			nargs_mergefn = numDirectArgs + 2;
+			if (numDirectArgs < numArgs)
+			{
+				/* variadic argument doesn't affect mergefn */
+				mfnVariadicArgType = InvalidOid;
+			}
+		}
+
+		mergefn = lookup_agg_function(aggmergefnName, nargs_mergefn,
+									  fnArgs, mfnVariadicArgType,
+									  &finaltype);
+
+		/*
+		 * When mergefnExtraArgs is specified, the mergefn will certainly be
+		 * passed at least one null argument, so complain if it's strict.
+		 * Nothing bad would happen at runtime (you'd just get a null result),
+		 * but it's surely not what the user wants, so let's complain now.
+		 */
+		if (mergefnExtraArgs && func_strict(mergefn))
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_FUNCTION_DEFINITION),
+					 errmsg("merge function with extra arguments must not be declared STRICT")));
+	}
+
 	/*
 	 * If finaltype (i.e. aggregate return type) is polymorphic, inputs must
 	 * be polymorphic also, else parser will fail to deduce result type.
@@ -423,6 +471,7 @@ AggregateCreate(const char *aggName,
 				 errmsg("unsafe use of pseudo-type \"internal\""),
 				 errdetail("A function returning \"internal\" must have at least one \"internal\" argument.")));
 
+
 	/*
 	 * If a moving-aggregate implementation is supplied, look up its finalfn
 	 * if any, and check that the implied aggregate result type matches the
diff --git a/src/backend/commands/aggregatecmds.c b/src/backend/commands/aggregatecmds.c
index fcf86dd..50d6c6e 100644
--- a/src/backend/commands/aggregatecmds.c
+++ b/src/backend/commands/aggregatecmds.c
@@ -61,10 +61,12 @@ DefineAggregate(List *name, List *args, bool oldstyle, List *parameters,
 	char		aggKind = AGGKIND_NORMAL;
 	List	   *transfuncName = NIL;
 	List	   *finalfuncName = NIL;
+	List	   *mergefuncName = NIL;
 	List	   *mtransfuncName = NIL;
 	List	   *minvtransfuncName = NIL;
 	List	   *mfinalfuncName = NIL;
 	bool		finalfuncExtraArgs = false;
+	bool		mergefuncExtraArgs = false;
 	bool		mfinalfuncExtraArgs = false;
 	List	   *sortoperatorName = NIL;
 	TypeName   *baseType = NULL;
@@ -124,6 +126,8 @@ DefineAggregate(List *name, List *args, bool oldstyle, List *parameters,
 			transfuncName = defGetQualifiedName(defel);
 		else if (pg_strcasecmp(defel->defname, "finalfunc") == 0)
 			finalfuncName = defGetQualifiedName(defel);
+		else if (pg_strcasecmp(defel->defname, "mergefunc") == 0)
+			mergefuncName = defGetQualifiedName(defel);
 		else if (pg_strcasecmp(defel->defname, "msfunc") == 0)
 			mtransfuncName = defGetQualifiedName(defel);
 		else if (pg_strcasecmp(defel->defname, "minvfunc") == 0)
@@ -132,6 +136,8 @@ DefineAggregate(List *name, List *args, bool oldstyle, List *parameters,
 			mfinalfuncName = defGetQualifiedName(defel);
 		else if (pg_strcasecmp(defel->defname, "finalfunc_extra") == 0)
 			finalfuncExtraArgs = defGetBoolean(defel);
+		else if (pg_strcasecmp(defel->defname, "mergefunc_extra") == 0)
+			mergefuncExtraArgs = defGetBoolean(defel);
 		else if (pg_strcasecmp(defel->defname, "mfinalfunc_extra") == 0)
 			mfinalfuncExtraArgs = defGetBoolean(defel);
 		else if (pg_strcasecmp(defel->defname, "sortop") == 0)
@@ -383,10 +389,12 @@ DefineAggregate(List *name, List *args, bool oldstyle, List *parameters,
 						   variadicArgType,
 						   transfuncName,		/* step function name */
 						   finalfuncName,		/* final function name */
+						   mergefuncName,		/* merge function name */
 						   mtransfuncName,		/* fwd trans function name */
 						   minvtransfuncName,	/* inv trans function name */
 						   mfinalfuncName,		/* final function name */
 						   finalfuncExtraArgs,
+						   mergefuncExtraArgs,
 						   mfinalfuncExtraArgs,
 						   sortoperatorName,	/* sort operator name */
 						   transTypeId, /* transition data type */
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 4175ddc..6569a07 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -11898,10 +11898,12 @@ dumpAgg(Archive *fout, DumpOptions *dopt, AggInfo *agginfo)
 	PGresult   *res;
 	int			i_aggtransfn;
 	int			i_aggfinalfn;
+	int			i_aggmergefn;
 	int			i_aggmtransfn;
 	int			i_aggminvtransfn;
 	int			i_aggmfinalfn;
 	int			i_aggfinalextra;
+	int			i_aggmergeextra;
 	int			i_aggmfinalextra;
 	int			i_aggsortop;
 	int			i_hypothetical;
@@ -11914,10 +11916,12 @@ dumpAgg(Archive *fout, DumpOptions *dopt, AggInfo *agginfo)
 	int			i_convertok;
 	const char *aggtransfn;
 	const char *aggfinalfn;
+	const char *aggmergefn;
 	const char *aggmtransfn;
 	const char *aggminvtransfn;
 	const char *aggmfinalfn;
 	bool		aggfinalextra;
+	bool		aggmergeextra;
 	bool		aggmfinalextra;
 	const char *aggsortop;
 	char	   *aggsortconvop;
@@ -11944,7 +11948,26 @@ dumpAgg(Archive *fout, DumpOptions *dopt, AggInfo *agginfo)
 	selectSourceSchema(fout, agginfo->aggfn.dobj.namespace->dobj.name);
 
 	/* Get aggregate-specific details */
-	if (fout->remoteVersion >= 90400)
+	if (fout->remoteVersion >= 90500) /* FIXME 9.5? Maybe 10.0? */
+	{
+		appendPQExpBuffer(query, "SELECT aggtransfn, "
+			"aggfinalfn, aggtranstype::pg_catalog.regtype, "
+			"aggmergefn, aggmtransfn, aggminvtransfn, "
+			"aggmfinalfn, aggmtranstype::pg_catalog.regtype, "
+			"aggfinalextra, aggmergeextra, aggmfinalextra, "
+			"aggsortop::pg_catalog.regoperator, "
+			"(aggkind = 'h') AS hypothetical, "
+			"aggtransspace, agginitval, "
+			"aggmtransspace, aggminitval, "
+			"true AS convertok, "
+			"pg_catalog.pg_get_function_arguments(p.oid) AS funcargs, "
+			"pg_catalog.pg_get_function_identity_arguments(p.oid) AS funciargs "
+			"FROM pg_catalog.pg_aggregate a, pg_catalog.pg_proc p "
+			"WHERE a.aggfnoid = p.oid "
+			"AND p.oid = '%u'::pg_catalog.oid",
+			agginfo->aggfn.dobj.catId.oid);
+	}
+	else if (fout->remoteVersion >= 90400)
 	{
 		appendPQExpBuffer(query, "SELECT aggtransfn, "
 						  "aggfinalfn, aggtranstype::pg_catalog.regtype, "
@@ -12054,10 +12077,12 @@ dumpAgg(Archive *fout, DumpOptions *dopt, AggInfo *agginfo)
 
 	i_aggtransfn = PQfnumber(res, "aggtransfn");
 	i_aggfinalfn = PQfnumber(res, "aggfinalfn");
+	i_aggmergefn = PQfnumber(res, "aggmergefn");
 	i_aggmtransfn = PQfnumber(res, "aggmtransfn");
 	i_aggminvtransfn = PQfnumber(res, "aggminvtransfn");
 	i_aggmfinalfn = PQfnumber(res, "aggmfinalfn");
 	i_aggfinalextra = PQfnumber(res, "aggfinalextra");
+	i_aggmergeextra = PQfnumber(res, "aggmergeextra");
 	i_aggmfinalextra = PQfnumber(res, "aggmfinalextra");
 	i_aggsortop = PQfnumber(res, "aggsortop");
 	i_hypothetical = PQfnumber(res, "hypothetical");
@@ -12071,10 +12096,12 @@ dumpAgg(Archive *fout, DumpOptions *dopt, AggInfo *agginfo)
 
 	aggtransfn = PQgetvalue(res, 0, i_aggtransfn);
 	aggfinalfn = PQgetvalue(res, 0, i_aggfinalfn);
+	aggmergefn = PQgetvalue(res, 0, i_aggmergefn);
 	aggmtransfn = PQgetvalue(res, 0, i_aggmtransfn);
 	aggminvtransfn = PQgetvalue(res, 0, i_aggminvtransfn);
 	aggmfinalfn = PQgetvalue(res, 0, i_aggmfinalfn);
 	aggfinalextra = (PQgetvalue(res, 0, i_aggfinalextra)[0] == 't');
+	aggmergeextra = (PQgetvalue(res, 0, i_aggmergeextra)[0] == 't');
 	aggmfinalextra = (PQgetvalue(res, 0, i_aggmfinalextra)[0] == 't');
 	aggsortop = PQgetvalue(res, 0, i_aggsortop);
 	hypothetical = (PQgetvalue(res, 0, i_hypothetical)[0] == 't');
@@ -12159,6 +12186,14 @@ dumpAgg(Archive *fout, DumpOptions *dopt, AggInfo *agginfo)
 			appendPQExpBufferStr(details, ",\n    FINALFUNC_EXTRA");
 	}
 
+	if (strcmp(aggmergefn, "-") != 0)
+	{
+		appendPQExpBuffer(details, ",\n    MERGEFUNC = %s",
+			aggmergefn);
+		if (aggmergeextra)
+			appendPQExpBufferStr(details, ",\n    MERGEFUNC_EXTRA");
+	}
+
 	if (strcmp(aggmtransfn, "-") != 0)
 	{
 		appendPQExpBuffer(details, ",\n    MSFUNC = %s,\n    MINVFUNC = %s,\n    MSTYPE = %s",
diff --git a/src/include/catalog/pg_aggregate.h b/src/include/catalog/pg_aggregate.h
index 3279353..7251b7c 100644
--- a/src/include/catalog/pg_aggregate.h
+++ b/src/include/catalog/pg_aggregate.h
@@ -32,10 +32,12 @@
  *	aggnumdirectargs	number of arguments that are "direct" arguments
  *	aggtransfn			transition function
  *	aggfinalfn			final function (0 if none)
+ *	aggmergefn			merge function (0 if none)
  *	aggmtransfn			forward function for moving-aggregate mode (0 if none)
  *	aggminvtransfn		inverse function for moving-aggregate mode (0 if none)
  *	aggmfinalfn			final function for moving-aggregate mode (0 if none)
  *	aggfinalextra		true to pass extra dummy arguments to aggfinalfn
+ *	aggmergeextra		true to pass extra dummy arguments to aggmergefn
  *	aggmfinalextra		true to pass extra dummy arguments to aggmfinalfn
  *	aggsortop			associated sort operator (0 if none)
  *	aggtranstype		type of aggregate's transition (state) data
@@ -55,10 +57,12 @@ CATALOG(pg_aggregate,2600) BKI_WITHOUT_OIDS
 	int16		aggnumdirectargs;
 	regproc		aggtransfn;
 	regproc		aggfinalfn;
+	regproc		aggmergefn;
 	regproc		aggmtransfn;
 	regproc		aggminvtransfn;
 	regproc		aggmfinalfn;
 	bool		aggfinalextra;
+	bool		aggmergeextra;
 	bool		aggmfinalextra;
 	Oid			aggsortop;
 	Oid			aggtranstype;
@@ -84,24 +88,26 @@ typedef FormData_pg_aggregate *Form_pg_aggregate;
  * ----------------
  */
 
-#define Natts_pg_aggregate					17
+#define Natts_pg_aggregate					19
 #define Anum_pg_aggregate_aggfnoid			1
 #define Anum_pg_aggregate_aggkind			2
 #define Anum_pg_aggregate_aggnumdirectargs	3
 #define Anum_pg_aggregate_aggtransfn		4
 #define Anum_pg_aggregate_aggfinalfn		5
-#define Anum_pg_aggregate_aggmtransfn		6
-#define Anum_pg_aggregate_aggminvtransfn	7
-#define Anum_pg_aggregate_aggmfinalfn		8
-#define Anum_pg_aggregate_aggfinalextra		9
-#define Anum_pg_aggregate_aggmfinalextra	10
-#define Anum_pg_aggregate_aggsortop			11
-#define Anum_pg_aggregate_aggtranstype		12
-#define Anum_pg_aggregate_aggtransspace		13
-#define Anum_pg_aggregate_aggmtranstype		14
-#define Anum_pg_aggregate_aggmtransspace	15
-#define Anum_pg_aggregate_agginitval		16
-#define Anum_pg_aggregate_aggminitval		17
+#define Anum_pg_aggregate_aggmergefn		6
+#define Anum_pg_aggregate_aggmtransfn		7
+#define Anum_pg_aggregate_aggminvtransfn	8
+#define Anum_pg_aggregate_aggmfinalfn		9
+#define Anum_pg_aggregate_aggfinalextra		10
+#define Anum_pg_aggregate_aggmergeextra		11
+#define Anum_pg_aggregate_aggmfinalextra	12
+#define Anum_pg_aggregate_aggsortop			13
+#define Anum_pg_aggregate_aggtranstype		14
+#define Anum_pg_aggregate_aggtransspace		15
+#define Anum_pg_aggregate_aggmtranstype		16
+#define Anum_pg_aggregate_aggmtransspace	17
+#define Anum_pg_aggregate_agginitval		18
+#define Anum_pg_aggregate_aggminitval		19
 
 /*
  * Symbolic values for aggkind column.  We distinguish normal aggregates
@@ -125,180 +131,180 @@ typedef FormData_pg_aggregate *Form_pg_aggregate;
  */
 
 /* avg */
-DATA(insert ( 2100	n 0 int8_avg_accum	numeric_avg		int8_avg_accum	int8_accum_inv	numeric_avg		f f 0	2281	128 2281	128 _null_ _null_ ));
-DATA(insert ( 2101	n 0 int4_avg_accum	int8_avg		int4_avg_accum	int4_avg_accum_inv	int8_avg	f f 0	1016	0	1016	0	"{0,0}" "{0,0}" ));
-DATA(insert ( 2102	n 0 int2_avg_accum	int8_avg		int2_avg_accum	int2_avg_accum_inv	int8_avg	f f 0	1016	0	1016	0	"{0,0}" "{0,0}" ));
-DATA(insert ( 2103	n 0 numeric_avg_accum numeric_avg	numeric_avg_accum numeric_accum_inv numeric_avg f f 0	2281	128 2281	128 _null_ _null_ ));
-DATA(insert ( 2104	n 0 float4_accum	float8_avg		-				-				-				f f 0	1022	0	0		0	"{0,0,0}" _null_ ));
-DATA(insert ( 2105	n 0 float8_accum	float8_avg		-				-				-				f f 0	1022	0	0		0	"{0,0,0}" _null_ ));
-DATA(insert ( 2106	n 0 interval_accum	interval_avg	interval_accum	interval_accum_inv interval_avg f f 0	1187	0	1187	0	"{0 second,0 second}" "{0 second,0 second}" ));
+DATA(insert ( 2100	n 0 int8_avg_accum	numeric_avg		-	int8_avg_accum	int8_accum_inv	numeric_avg		f f f 0	2281	128 2281	128 _null_ _null_ ));
+DATA(insert ( 2101	n 0 int4_avg_accum	int8_avg		-	int4_avg_accum	int4_avg_accum_inv	int8_avg	f f f 0	1016	0	1016	0	"{0,0}" "{0,0}" ));
+DATA(insert ( 2102	n 0 int2_avg_accum	int8_avg		-	int2_avg_accum	int2_avg_accum_inv	int8_avg	f f f 0	1016	0	1016	0	"{0,0}" "{0,0}" ));
+DATA(insert ( 2103	n 0 numeric_avg_accum numeric_avg	-	numeric_avg_accum numeric_accum_inv numeric_avg f f f 0	2281	128 2281	128 _null_ _null_ ));
+DATA(insert ( 2104	n 0 float4_accum	float8_avg		-	-				-				-				f f f 0	1022	0	0		0	"{0,0,0}" _null_ ));
+DATA(insert ( 2105	n 0 float8_accum	float8_avg		-	-				-				-				f f f 0	1022	0	0		0	"{0,0,0}" _null_ ));
+DATA(insert ( 2106	n 0 interval_accum	interval_avg	-	interval_accum	interval_accum_inv interval_avg f f f 0	1187	0	1187	0	"{0 second,0 second}" "{0 second,0 second}" ));
 
 /* sum */
-DATA(insert ( 2107	n 0 int8_avg_accum	numeric_sum		int8_avg_accum	int8_accum_inv	numeric_sum		f f 0	2281	128 2281	128 _null_ _null_ ));
-DATA(insert ( 2108	n 0 int4_sum		-				int4_avg_accum	int4_avg_accum_inv int2int4_sum f f 0	20		0	1016	0	_null_ "{0,0}" ));
-DATA(insert ( 2109	n 0 int2_sum		-				int2_avg_accum	int2_avg_accum_inv int2int4_sum f f 0	20		0	1016	0	_null_ "{0,0}" ));
-DATA(insert ( 2110	n 0 float4pl		-				-				-				-				f f 0	700		0	0		0	_null_ _null_ ));
-DATA(insert ( 2111	n 0 float8pl		-				-				-				-				f f 0	701		0	0		0	_null_ _null_ ));
-DATA(insert ( 2112	n 0 cash_pl			-				cash_pl			cash_mi			-				f f 0	790		0	790		0	_null_ _null_ ));
-DATA(insert ( 2113	n 0 interval_pl		-				interval_pl		interval_mi		-				f f 0	1186	0	1186	0	_null_ _null_ ));
-DATA(insert ( 2114	n 0 numeric_avg_accum	numeric_sum numeric_avg_accum numeric_accum_inv numeric_sum f f 0	2281	128 2281	128 _null_ _null_ ));
+DATA(insert ( 2107	n 0 int8_avg_accum	numeric_sum		-			int8_avg_accum	int8_accum_inv	numeric_sum		f f f 0	2281	128 2281	128 _null_ _null_ ));
+DATA(insert ( 2108	n 0 int4_sum		-				-			int4_avg_accum	int4_avg_accum_inv int2int4_sum f f f 0	20		0	1016	0	_null_ "{0,0}" ));
+DATA(insert ( 2109	n 0 int2_sum		-				-			int2_avg_accum	int2_avg_accum_inv int2int4_sum f f f 0	20		0	1016	0	_null_ "{0,0}" ));
+DATA(insert ( 2110	n 0 float4pl		-				float4pl	-				-				-				f f f 0	700		0	0		0	_null_ _null_ ));
+DATA(insert ( 2111	n 0 float8pl		-				float8pl	-				-				-				f f f 0	701		0	0		0	_null_ _null_ ));
+DATA(insert ( 2112	n 0 cash_pl			-				cash_pl		cash_pl			cash_mi			-				f f f 0	790		0	790		0	_null_ _null_ ));
+DATA(insert ( 2113	n 0 interval_pl		-				interval_pl	interval_pl		interval_mi		-				f f f 0	1186	0	1186	0	_null_ _null_ ));
+DATA(insert ( 2114	n 0 numeric_avg_accum	numeric_sum -			numeric_avg_accum numeric_accum_inv numeric_sum f f f 0	2281	128 2281	128 _null_ _null_ ));
 
 /* max */
-DATA(insert ( 2115	n 0 int8larger		-				-				-				-				f f 413		20		0	0		0	_null_ _null_ ));
-DATA(insert ( 2116	n 0 int4larger		-				-				-				-				f f 521		23		0	0		0	_null_ _null_ ));
-DATA(insert ( 2117	n 0 int2larger		-				-				-				-				f f 520		21		0	0		0	_null_ _null_ ));
-DATA(insert ( 2118	n 0 oidlarger		-				-				-				-				f f 610		26		0	0		0	_null_ _null_ ));
-DATA(insert ( 2119	n 0 float4larger	-				-				-				-				f f 623		700		0	0		0	_null_ _null_ ));
-DATA(insert ( 2120	n 0 float8larger	-				-				-				-				f f 674		701		0	0		0	_null_ _null_ ));
-DATA(insert ( 2121	n 0 int4larger		-				-				-				-				f f 563		702		0	0		0	_null_ _null_ ));
-DATA(insert ( 2122	n 0 date_larger		-				-				-				-				f f 1097	1082	0	0		0	_null_ _null_ ));
-DATA(insert ( 2123	n 0 time_larger		-				-				-				-				f f 1112	1083	0	0		0	_null_ _null_ ));
-DATA(insert ( 2124	n 0 timetz_larger	-				-				-				-				f f 1554	1266	0	0		0	_null_ _null_ ));
-DATA(insert ( 2125	n 0 cashlarger		-				-				-				-				f f 903		790		0	0		0	_null_ _null_ ));
-DATA(insert ( 2126	n 0 timestamp_larger	-			-				-				-				f f 2064	1114	0	0		0	_null_ _null_ ));
-DATA(insert ( 2127	n 0 timestamptz_larger	-			-				-				-				f f 1324	1184	0	0		0	_null_ _null_ ));
-DATA(insert ( 2128	n 0 interval_larger -				-				-				-				f f 1334	1186	0	0		0	_null_ _null_ ));
-DATA(insert ( 2129	n 0 text_larger		-				-				-				-				f f 666		25		0	0		0	_null_ _null_ ));
-DATA(insert ( 2130	n 0 numeric_larger	-				-				-				-				f f 1756	1700	0	0		0	_null_ _null_ ));
-DATA(insert ( 2050	n 0 array_larger	-				-				-				-				f f 1073	2277	0	0		0	_null_ _null_ ));
-DATA(insert ( 2244	n 0 bpchar_larger	-				-				-				-				f f 1060	1042	0	0		0	_null_ _null_ ));
-DATA(insert ( 2797	n 0 tidlarger		-				-				-				-				f f 2800	27		0	0		0	_null_ _null_ ));
-DATA(insert ( 3526	n 0 enum_larger		-				-				-				-				f f 3519	3500	0	0		0	_null_ _null_ ));
-DATA(insert ( 3564	n 0 network_larger	-				-				-				-				f f 1205	869		0	0		0	_null_ _null_ ));
+DATA(insert ( 2115	n 0 int8larger		-				int8larger			-				-				-				f f f 413		20		0	0		0	_null_ _null_ ));
+DATA(insert ( 2116	n 0 int4larger		-				int4larger			-				-				-				f f f 521		23		0	0		0	_null_ _null_ ));
+DATA(insert ( 2117	n 0 int2larger		-				int2larger			-				-				-				f f f 520		21		0	0		0	_null_ _null_ ));
+DATA(insert ( 2118	n 0 oidlarger		-				oidlarger			-				-				-				f f f 610		26		0	0		0	_null_ _null_ ));
+DATA(insert ( 2119	n 0 float4larger	-				float4larger		-				-				-				f f f 623		700		0	0		0	_null_ _null_ ));
+DATA(insert ( 2120	n 0 float8larger	-				float8larger		-				-				-				f f f 674		701		0	0		0	_null_ _null_ ));
+DATA(insert ( 2121	n 0 int4larger		-				int4larger			-				-				-				f f f 563		702		0	0		0	_null_ _null_ ));
+DATA(insert ( 2122	n 0 date_larger		-				date_larger			-				-				-				f f f 1097	1082	0	0		0	_null_ _null_ ));
+DATA(insert ( 2123	n 0 time_larger		-				time_larger			-				-				-				f f f 1112	1083	0	0		0	_null_ _null_ ));
+DATA(insert ( 2124	n 0 timetz_larger	-				timetz_larger		-				-				-				f f f 1554	1266	0	0		0	_null_ _null_ ));
+DATA(insert ( 2125	n 0 cashlarger		-				cashlarger			-				-				-				f f f 903		790		0	0		0	_null_ _null_ ));
+DATA(insert ( 2126	n 0 timestamp_larger	-			timestamp_larger	-				-				-				f f f 2064	1114	0	0		0	_null_ _null_ ));
+DATA(insert ( 2127	n 0 timestamptz_larger	-			timestamptz_larger	-				-				-				f f f 1324	1184	0	0		0	_null_ _null_ ));
+DATA(insert ( 2128	n 0 interval_larger -				interval_larger		-				-				-				f f f 1334	1186	0	0		0	_null_ _null_ ));
+DATA(insert ( 2129	n 0 text_larger		-				text_larger			-				-				-				f f f 666		25		0	0		0	_null_ _null_ ));
+DATA(insert ( 2130	n 0 numeric_larger	-				numeric_larger		-				-				-				f f f 1756	1700	0	0		0	_null_ _null_ ));
+DATA(insert ( 2050	n 0 array_larger	-				array_larger		-				-				-				f f f 1073	2277	0	0		0	_null_ _null_ ));
+DATA(insert ( 2244	n 0 bpchar_larger	-				bpchar_larger		-				-				-				f f f 1060	1042	0	0		0	_null_ _null_ ));
+DATA(insert ( 2797	n 0 tidlarger		-				tidlarger			-				-				-				f f f 2800	27		0	0		0	_null_ _null_ ));
+DATA(insert ( 3526	n 0 enum_larger		-				enum_larger			-				-				-				f f f 3519	3500	0	0		0	_null_ _null_ ));
+DATA(insert ( 3564	n 0 network_larger	-				network_larger		-				-				-				f f f 1205	869		0	0		0	_null_ _null_ ));
 
 /* min */
-DATA(insert ( 2131	n 0 int8smaller		-				-				-				-				f f 412		20		0	0		0	_null_ _null_ ));
-DATA(insert ( 2132	n 0 int4smaller		-				-				-				-				f f 97		23		0	0		0	_null_ _null_ ));
-DATA(insert ( 2133	n 0 int2smaller		-				-				-				-				f f 95		21		0	0		0	_null_ _null_ ));
-DATA(insert ( 2134	n 0 oidsmaller		-				-				-				-				f f 609		26		0	0		0	_null_ _null_ ));
-DATA(insert ( 2135	n 0 float4smaller	-				-				-				-				f f 622		700		0	0		0	_null_ _null_ ));
-DATA(insert ( 2136	n 0 float8smaller	-				-				-				-				f f 672		701		0	0		0	_null_ _null_ ));
-DATA(insert ( 2137	n 0 int4smaller		-				-				-				-				f f 562		702		0	0		0	_null_ _null_ ));
-DATA(insert ( 2138	n 0 date_smaller	-				-				-				-				f f 1095	1082	0	0		0	_null_ _null_ ));
-DATA(insert ( 2139	n 0 time_smaller	-				-				-				-				f f 1110	1083	0	0		0	_null_ _null_ ));
-DATA(insert ( 2140	n 0 timetz_smaller	-				-				-				-				f f 1552	1266	0	0		0	_null_ _null_ ));
-DATA(insert ( 2141	n 0 cashsmaller		-				-				-				-				f f 902		790		0	0		0	_null_ _null_ ));
-DATA(insert ( 2142	n 0 timestamp_smaller	-			-				-				-				f f 2062	1114	0	0		0	_null_ _null_ ));
-DATA(insert ( 2143	n 0 timestamptz_smaller -			-				-				-				f f 1322	1184	0	0		0	_null_ _null_ ));
-DATA(insert ( 2144	n 0 interval_smaller	-			-				-				-				f f 1332	1186	0	0		0	_null_ _null_ ));
-DATA(insert ( 2145	n 0 text_smaller	-				-				-				-				f f 664		25		0	0		0	_null_ _null_ ));
-DATA(insert ( 2146	n 0 numeric_smaller -				-				-				-				f f 1754	1700	0	0		0	_null_ _null_ ));
-DATA(insert ( 2051	n 0 array_smaller	-				-				-				-				f f 1072	2277	0	0		0	_null_ _null_ ));
-DATA(insert ( 2245	n 0 bpchar_smaller	-				-				-				-				f f 1058	1042	0	0		0	_null_ _null_ ));
-DATA(insert ( 2798	n 0 tidsmaller		-				-				-				-				f f 2799	27		0	0		0	_null_ _null_ ));
-DATA(insert ( 3527	n 0 enum_smaller	-				-				-				-				f f 3518	3500	0	0		0	_null_ _null_ ));
-DATA(insert ( 3565	n 0 network_smaller -				-				-				-				f f 1203	869		0	0		0	_null_ _null_ ));
+DATA(insert ( 2131	n 0 int8smaller		-				int8smaller			-				-				-				f f f 412		20		0	0		0	_null_ _null_ ));
+DATA(insert ( 2132	n 0 int4smaller		-				int4smaller			-				-				-				f f f 97		23		0	0		0	_null_ _null_ ));
+DATA(insert ( 2133	n 0 int2smaller		-				int2smaller			-				-				-				f f f 95		21		0	0		0	_null_ _null_ ));
+DATA(insert ( 2134	n 0 oidsmaller		-				oidsmaller			-				-				-				f f f 609		26		0	0		0	_null_ _null_ ));
+DATA(insert ( 2135	n 0 float4smaller	-				float4smaller		-				-				-				f f f 622		700		0	0		0	_null_ _null_ ));
+DATA(insert ( 2136	n 0 float8smaller	-				float8smaller		-				-				-				f f f 672		701		0	0		0	_null_ _null_ ));
+DATA(insert ( 2137	n 0 int4smaller		-				int4smaller			-				-				-				f f f 562		702		0	0		0	_null_ _null_ ));
+DATA(insert ( 2138	n 0 date_smaller	-				date_smaller		-				-				-				f f f 1095	1082	0	0		0	_null_ _null_ ));
+DATA(insert ( 2139	n 0 time_smaller	-				time_smaller		-				-				-				f f f 1110	1083	0	0		0	_null_ _null_ ));
+DATA(insert ( 2140	n 0 timetz_smaller	-				timetz_smaller		-				-				-				f f f 1552	1266	0	0		0	_null_ _null_ ));
+DATA(insert ( 2141	n 0 cashsmaller		-				cashsmaller			-				-				-				f f f 902		790		0	0		0	_null_ _null_ ));
+DATA(insert ( 2142	n 0 timestamp_smaller	-			timestamp_smaller	-				-				-				f f f 2062	1114	0	0		0	_null_ _null_ ));
+DATA(insert ( 2143	n 0 timestamptz_smaller -			timestamptz_smaller	-				-				-				f f f 1322	1184	0	0		0	_null_ _null_ ));
+DATA(insert ( 2144	n 0 interval_smaller	-			interval_smaller	-				-				-				f f f 1332	1186	0	0		0	_null_ _null_ ));
+DATA(insert ( 2145	n 0 text_smaller	-				text_smaller		-				-				-				f f f 664		25		0	0		0	_null_ _null_ ));
+DATA(insert ( 2146	n 0 numeric_smaller -				numeric_smaller		-				-				-				f f f 1754	1700	0	0		0	_null_ _null_ ));
+DATA(insert ( 2051	n 0 array_smaller	-				array_smaller		-				-				-				f f f 1072	2277	0	0		0	_null_ _null_ ));
+DATA(insert ( 2245	n 0 bpchar_smaller	-				bpchar_smaller		-				-				-				f f f 1058	1042	0	0		0	_null_ _null_ ));
+DATA(insert ( 2798	n 0 tidsmaller		-				tidsmaller			-				-				-				f f f 2799	27		0	0		0	_null_ _null_ ));
+DATA(insert ( 3527	n 0 enum_smaller	-				enum_smaller		-				-				-				f f f 3518	3500	0	0		0	_null_ _null_ ));
+DATA(insert ( 3565	n 0 network_smaller -				network_smaller		-				-				-				f f f 1203	869		0	0		0	_null_ _null_ ));
 
 /* count */
-DATA(insert ( 2147	n 0 int8inc_any		-				int8inc_any		int8dec_any		-				f f 0		20		0	20		0	"0" "0" ));
-DATA(insert ( 2803	n 0 int8inc			-				int8inc			int8dec			-				f f 0		20		0	20		0	"0" "0" ));
+DATA(insert ( 2147	n 0 int8inc_any		-				-	int8inc_any		int8dec_any		-				f f f 0		20		0	20		0	"0" "0" ));
+DATA(insert ( 2803	n 0 int8inc			-				-	int8inc			int8dec			-				f f f 0		20		0	20		0	"0" "0" ));
 
 /* var_pop */
-DATA(insert ( 2718	n 0 int8_accum	numeric_var_pop		int8_accum		int8_accum_inv	numeric_var_pop f f 0	2281	128 2281	128 _null_ _null_ ));
-DATA(insert ( 2719	n 0 int4_accum	numeric_var_pop		int4_accum		int4_accum_inv	numeric_var_pop f f 0	2281	128 2281	128 _null_ _null_ ));
-DATA(insert ( 2720	n 0 int2_accum	numeric_var_pop		int2_accum		int2_accum_inv	numeric_var_pop f f 0	2281	128 2281	128 _null_ _null_ ));
-DATA(insert ( 2721	n 0 float4_accum	float8_var_pop	-				-				-				f f 0	1022	0	0		0	"{0,0,0}" _null_ ));
-DATA(insert ( 2722	n 0 float8_accum	float8_var_pop	-				-				-				f f 0	1022	0	0		0	"{0,0,0}" _null_ ));
-DATA(insert ( 2723	n 0 numeric_accum	numeric_var_pop numeric_accum numeric_accum_inv numeric_var_pop f f 0	2281	128 2281	128 _null_ _null_ ));
+DATA(insert ( 2718	n 0 int8_accum	numeric_var_pop		-	int8_accum		int8_accum_inv	numeric_var_pop f f f 0	2281	128 2281	128 _null_ _null_ ));
+DATA(insert ( 2719	n 0 int4_accum	numeric_var_pop		-	int4_accum		int4_accum_inv	numeric_var_pop f f f 0	2281	128 2281	128 _null_ _null_ ));
+DATA(insert ( 2720	n 0 int2_accum	numeric_var_pop		-	int2_accum		int2_accum_inv	numeric_var_pop f f f 0	2281	128 2281	128 _null_ _null_ ));
+DATA(insert ( 2721	n 0 float4_accum	float8_var_pop	-	-				-				-				f f f 0	1022	0	0		0	"{0,0,0}" _null_ ));
+DATA(insert ( 2722	n 0 float8_accum	float8_var_pop	-	-				-				-				f f f 0	1022	0	0		0	"{0,0,0}" _null_ ));
+DATA(insert ( 2723	n 0 numeric_accum	numeric_var_pop -	numeric_accum numeric_accum_inv numeric_var_pop f f f 0	2281	128 2281	128 _null_ _null_ ));
 
 /* var_samp */
-DATA(insert ( 2641	n 0 int8_accum	numeric_var_samp	int8_accum		int8_accum_inv	numeric_var_samp f f 0	2281	128 2281	128 _null_ _null_ ));
-DATA(insert ( 2642	n 0 int4_accum	numeric_var_samp	int4_accum		int4_accum_inv	numeric_var_samp f f 0	2281	128 2281	128 _null_ _null_ ));
-DATA(insert ( 2643	n 0 int2_accum	numeric_var_samp	int2_accum		int2_accum_inv	numeric_var_samp f f 0	2281	128 2281	128 _null_ _null_ ));
-DATA(insert ( 2644	n 0 float4_accum	float8_var_samp -				-				-				f f 0	1022	0	0		0	"{0,0,0}" _null_ ));
-DATA(insert ( 2645	n 0 float8_accum	float8_var_samp -				-				-				f f 0	1022	0	0		0	"{0,0,0}" _null_ ));
-DATA(insert ( 2646	n 0 numeric_accum	numeric_var_samp numeric_accum numeric_accum_inv numeric_var_samp f f 0 2281	128 2281	128 _null_ _null_ ));
+DATA(insert ( 2641	n 0 int8_accum	numeric_var_samp		-	int8_accum		int8_accum_inv	numeric_var_samp	f f f 0	2281	128 2281	128 _null_ _null_ ));
+DATA(insert ( 2642	n 0 int4_accum	numeric_var_samp		-	int4_accum		int4_accum_inv	numeric_var_samp	f f f 0	2281	128 2281	128 _null_ _null_ ));
+DATA(insert ( 2643	n 0 int2_accum	numeric_var_samp		-	int2_accum		int2_accum_inv	numeric_var_samp	f f f 0	2281	128 2281	128 _null_ _null_ ));
+DATA(insert ( 2644	n 0 float4_accum	float8_var_samp		-	-				-				-					f f f 0	1022	0	0		0	"{0,0,0}" _null_ ));
+DATA(insert ( 2645	n 0 float8_accum	float8_var_samp		-	-				-				-					f f f 0	1022	0	0		0	"{0,0,0}" _null_ ));
+DATA(insert ( 2646	n 0 numeric_accum	numeric_var_samp	- numeric_accum numeric_accum_inv numeric_var_samp		f f f 0 2281	128 2281	128 _null_ _null_ ));
 
 /* variance: historical Postgres syntax for var_samp */
-DATA(insert ( 2148	n 0 int8_accum	numeric_var_samp	int8_accum		int8_accum_inv	numeric_var_samp f f 0	2281	128 2281	128 _null_ _null_ ));
-DATA(insert ( 2149	n 0 int4_accum	numeric_var_samp	int4_accum		int4_accum_inv	numeric_var_samp f f 0	2281	128 2281	128 _null_ _null_ ));
-DATA(insert ( 2150	n 0 int2_accum	numeric_var_samp	int2_accum		int2_accum_inv	numeric_var_samp f f 0	2281	128 2281	128 _null_ _null_ ));
-DATA(insert ( 2151	n 0 float4_accum	float8_var_samp -				-				-				f f 0	1022	0	0		0	"{0,0,0}" _null_ ));
-DATA(insert ( 2152	n 0 float8_accum	float8_var_samp -				-				-				f f 0	1022	0	0		0	"{0,0,0}" _null_ ));
-DATA(insert ( 2153	n 0 numeric_accum	numeric_var_samp numeric_accum numeric_accum_inv numeric_var_samp f f 0 2281	128 2281	128 _null_ _null_ ));
+DATA(insert ( 2148	n 0 int8_accum	numeric_var_samp		-	int8_accum		int8_accum_inv	numeric_var_samp	f f f 0	2281	128 2281	128 _null_ _null_ ));
+DATA(insert ( 2149	n 0 int4_accum	numeric_var_samp		-	int4_accum		int4_accum_inv	numeric_var_samp	f f f 0	2281	128 2281	128 _null_ _null_ ));
+DATA(insert ( 2150	n 0 int2_accum	numeric_var_samp		-	int2_accum		int2_accum_inv	numeric_var_samp	f f f 0	2281	128 2281	128 _null_ _null_ ));
+DATA(insert ( 2151	n 0 float4_accum	float8_var_samp		-	-				-				-					f f f 0	1022	0	0		0	"{0,0,0}" _null_ ));
+DATA(insert ( 2152	n 0 float8_accum	float8_var_samp		-	-				-				-					f f f 0	1022	0	0		0	"{0,0,0}" _null_ ));
+DATA(insert ( 2153	n 0 numeric_accum	numeric_var_samp	-	numeric_accum numeric_accum_inv numeric_var_samp	f f f 0 2281	128 2281	128 _null_ _null_ ));
 
 /* stddev_pop */
-DATA(insert ( 2724	n 0 int8_accum	numeric_stddev_pop		int8_accum	int8_accum_inv	numeric_stddev_pop	f f 0	2281	128 2281	128 _null_ _null_ ));
-DATA(insert ( 2725	n 0 int4_accum	numeric_stddev_pop		int4_accum	int4_accum_inv	numeric_stddev_pop	f f 0	2281	128 2281	128 _null_ _null_ ));
-DATA(insert ( 2726	n 0 int2_accum	numeric_stddev_pop		int2_accum	int2_accum_inv	numeric_stddev_pop	f f 0	2281	128 2281	128 _null_ _null_ ));
-DATA(insert ( 2727	n 0 float4_accum	float8_stddev_pop	-				-				-				f f 0	1022	0	0		0	"{0,0,0}" _null_ ));
-DATA(insert ( 2728	n 0 float8_accum	float8_stddev_pop	-				-				-				f f 0	1022	0	0		0	"{0,0,0}" _null_ ));
-DATA(insert ( 2729	n 0 numeric_accum	numeric_stddev_pop numeric_accum numeric_accum_inv numeric_stddev_pop f f 0 2281	128 2281	128 _null_ _null_ ));
+DATA(insert ( 2724	n 0 int8_accum	numeric_stddev_pop		-	int8_accum	int8_accum_inv	numeric_stddev_pop		f f f 0	2281	128 2281	128 _null_ _null_ ));
+DATA(insert ( 2725	n 0 int4_accum	numeric_stddev_pop		-	int4_accum	int4_accum_inv	numeric_stddev_pop		f f f 0	2281	128 2281	128 _null_ _null_ ));
+DATA(insert ( 2726	n 0 int2_accum	numeric_stddev_pop		-	int2_accum	int2_accum_inv	numeric_stddev_pop		f f f 0	2281	128 2281	128 _null_ _null_ ));
+DATA(insert ( 2727	n 0 float4_accum	float8_stddev_pop	-	-			-				-						f f f 0	1022	0	0		0	"{0,0,0}" _null_ ));
+DATA(insert ( 2728	n 0 float8_accum	float8_stddev_pop	-	-			-				-						f f f 0	1022	0	0		0	"{0,0,0}" _null_ ));
+DATA(insert ( 2729	n 0 numeric_accum	numeric_stddev_pop	-	numeric_accum numeric_accum_inv numeric_stddev_pop	f f f 0 2281	128 2281	128 _null_ _null_ ));
 
 /* stddev_samp */
-DATA(insert ( 2712	n 0 int8_accum	numeric_stddev_samp		int8_accum	int8_accum_inv	numeric_stddev_samp f f 0	2281	128 2281	128 _null_ _null_ ));
-DATA(insert ( 2713	n 0 int4_accum	numeric_stddev_samp		int4_accum	int4_accum_inv	numeric_stddev_samp f f 0	2281	128 2281	128 _null_ _null_ ));
-DATA(insert ( 2714	n 0 int2_accum	numeric_stddev_samp		int2_accum	int2_accum_inv	numeric_stddev_samp f f 0	2281	128 2281	128 _null_ _null_ ));
-DATA(insert ( 2715	n 0 float4_accum	float8_stddev_samp	-				-				-				f f 0	1022	0	0		0	"{0,0,0}" _null_ ));
-DATA(insert ( 2716	n 0 float8_accum	float8_stddev_samp	-				-				-				f f 0	1022	0	0		0	"{0,0,0}" _null_ ));
-DATA(insert ( 2717	n 0 numeric_accum	numeric_stddev_samp numeric_accum numeric_accum_inv numeric_stddev_samp f f 0 2281	128 2281	128 _null_ _null_ ));
+DATA(insert ( 2712	n 0 int8_accum	numeric_stddev_samp		-	int8_accum	int8_accum_inv	numeric_stddev_samp		f f f 0	2281	128 2281	128 _null_ _null_ ));
+DATA(insert ( 2713	n 0 int4_accum	numeric_stddev_samp		-	int4_accum	int4_accum_inv	numeric_stddev_samp		f f f 0	2281	128 2281	128 _null_ _null_ ));
+DATA(insert ( 2714	n 0 int2_accum	numeric_stddev_samp		-	int2_accum	int2_accum_inv	numeric_stddev_samp		f f f 0	2281	128 2281	128 _null_ _null_ ));
+DATA(insert ( 2715	n 0 float4_accum	float8_stddev_samp	-	-			-				-						f f f 0	1022	0	0		0	"{0,0,0}" _null_ ));
+DATA(insert ( 2716	n 0 float8_accum	float8_stddev_samp	-	-			-				-						f f f 0	1022	0	0		0	"{0,0,0}" _null_ ));
+DATA(insert ( 2717	n 0 numeric_accum	numeric_stddev_samp	-	numeric_accum numeric_accum_inv numeric_stddev_samp	f f f 0 2281	128 2281	128 _null_ _null_ ));
 
 /* stddev: historical Postgres syntax for stddev_samp */
-DATA(insert ( 2154	n 0 int8_accum	numeric_stddev_samp		int8_accum	int8_accum_inv	numeric_stddev_samp f f 0	2281	128 2281	128 _null_ _null_ ));
-DATA(insert ( 2155	n 0 int4_accum	numeric_stddev_samp		int4_accum	int4_accum_inv	numeric_stddev_samp f f 0	2281	128 2281	128 _null_ _null_ ));
-DATA(insert ( 2156	n 0 int2_accum	numeric_stddev_samp		int2_accum	int2_accum_inv	numeric_stddev_samp f f 0	2281	128 2281	128 _null_ _null_ ));
-DATA(insert ( 2157	n 0 float4_accum	float8_stddev_samp	-				-				-				f f 0	1022	0	0		0	"{0,0,0}" _null_ ));
-DATA(insert ( 2158	n 0 float8_accum	float8_stddev_samp	-				-				-				f f 0	1022	0	0		0	"{0,0,0}" _null_ ));
-DATA(insert ( 2159	n 0 numeric_accum	numeric_stddev_samp numeric_accum numeric_accum_inv numeric_stddev_samp f f 0 2281	128 2281	128 _null_ _null_ ));
+DATA(insert ( 2154	n 0 int8_accum	numeric_stddev_samp		-	int8_accum	int8_accum_inv	numeric_stddev_samp		f f f 0	2281	128 2281	128 _null_ _null_ ));
+DATA(insert ( 2155	n 0 int4_accum	numeric_stddev_samp		-	int4_accum	int4_accum_inv	numeric_stddev_samp		f f f 0	2281	128 2281	128 _null_ _null_ ));
+DATA(insert ( 2156	n 0 int2_accum	numeric_stddev_samp		-	int2_accum	int2_accum_inv	numeric_stddev_samp		f f f 0	2281	128 2281	128 _null_ _null_ ));
+DATA(insert ( 2157	n 0 float4_accum	float8_stddev_samp	-	-			-				-						f f f 0	1022	0	0		0	"{0,0,0}" _null_ ));
+DATA(insert ( 2158	n 0 float8_accum	float8_stddev_samp	-	-			-				-						f f f 0	1022	0	0		0	"{0,0,0}" _null_ ));
+DATA(insert ( 2159	n 0 numeric_accum	numeric_stddev_samp	-	numeric_accum numeric_accum_inv numeric_stddev_samp	f f f 0 2281	128 2281	128 _null_ _null_ ));
 
 /* SQL2003 binary regression aggregates */
-DATA(insert ( 2818	n 0 int8inc_float8_float8	-					-				-				-				f f 0	20		0	0		0	"0" _null_ ));
-DATA(insert ( 2819	n 0 float8_regr_accum	float8_regr_sxx			-				-				-				f f 0	1022	0	0		0	"{0,0,0,0,0,0}" _null_ ));
-DATA(insert ( 2820	n 0 float8_regr_accum	float8_regr_syy			-				-				-				f f 0	1022	0	0		0	"{0,0,0,0,0,0}" _null_ ));
-DATA(insert ( 2821	n 0 float8_regr_accum	float8_regr_sxy			-				-				-				f f 0	1022	0	0		0	"{0,0,0,0,0,0}" _null_ ));
-DATA(insert ( 2822	n 0 float8_regr_accum	float8_regr_avgx		-				-				-				f f 0	1022	0	0		0	"{0,0,0,0,0,0}" _null_ ));
-DATA(insert ( 2823	n 0 float8_regr_accum	float8_regr_avgy		-				-				-				f f 0	1022	0	0		0	"{0,0,0,0,0,0}" _null_ ));
-DATA(insert ( 2824	n 0 float8_regr_accum	float8_regr_r2			-				-				-				f f 0	1022	0	0		0	"{0,0,0,0,0,0}" _null_ ));
-DATA(insert ( 2825	n 0 float8_regr_accum	float8_regr_slope		-				-				-				f f 0	1022	0	0		0	"{0,0,0,0,0,0}" _null_ ));
-DATA(insert ( 2826	n 0 float8_regr_accum	float8_regr_intercept	-				-				-				f f 0	1022	0	0		0	"{0,0,0,0,0,0}" _null_ ));
-DATA(insert ( 2827	n 0 float8_regr_accum	float8_covar_pop		-				-				-				f f 0	1022	0	0		0	"{0,0,0,0,0,0}" _null_ ));
-DATA(insert ( 2828	n 0 float8_regr_accum	float8_covar_samp		-				-				-				f f 0	1022	0	0		0	"{0,0,0,0,0,0}" _null_ ));
-DATA(insert ( 2829	n 0 float8_regr_accum	float8_corr				-				-				-				f f 0	1022	0	0		0	"{0,0,0,0,0,0}" _null_ ));
+DATA(insert ( 2818	n 0 int8inc_float8_float8	-					-	-				-				-			f f f 0	20		0	0		0	"0" _null_ ));
+DATA(insert ( 2819	n 0 float8_regr_accum	float8_regr_sxx			-	-				-				-			f f f 0	1022	0	0		0	"{0,0,0,0,0,0}" _null_ ));
+DATA(insert ( 2820	n 0 float8_regr_accum	float8_regr_syy			-	-				-				-			f f f 0	1022	0	0		0	"{0,0,0,0,0,0}" _null_ ));
+DATA(insert ( 2821	n 0 float8_regr_accum	float8_regr_sxy			-	-				-				-			f f f 0	1022	0	0		0	"{0,0,0,0,0,0}" _null_ ));
+DATA(insert ( 2822	n 0 float8_regr_accum	float8_regr_avgx		-	-				-				-			f f f 0	1022	0	0		0	"{0,0,0,0,0,0}" _null_ ));
+DATA(insert ( 2823	n 0 float8_regr_accum	float8_regr_avgy		-	-				-				-			f f f 0	1022	0	0		0	"{0,0,0,0,0,0}" _null_ ));
+DATA(insert ( 2824	n 0 float8_regr_accum	float8_regr_r2			-	-				-				-			f f f 0	1022	0	0		0	"{0,0,0,0,0,0}" _null_ ));
+DATA(insert ( 2825	n 0 float8_regr_accum	float8_regr_slope		-	-				-				-			f f f 0	1022	0	0		0	"{0,0,0,0,0,0}" _null_ ));
+DATA(insert ( 2826	n 0 float8_regr_accum	float8_regr_intercept	-	-				-				-			f f f 0	1022	0	0		0	"{0,0,0,0,0,0}" _null_ ));
+DATA(insert ( 2827	n 0 float8_regr_accum	float8_covar_pop		-	-				-				-			f f f 0	1022	0	0		0	"{0,0,0,0,0,0}" _null_ ));
+DATA(insert ( 2828	n 0 float8_regr_accum	float8_covar_samp		-	-				-				-			f f f 0	1022	0	0		0	"{0,0,0,0,0,0}" _null_ ));
+DATA(insert ( 2829	n 0 float8_regr_accum	float8_corr				-	-				-				-			f f f 0	1022	0	0		0	"{0,0,0,0,0,0}" _null_ ));
 
 /* boolean-and and boolean-or */
-DATA(insert ( 2517	n 0 booland_statefunc	-			bool_accum		bool_accum_inv	bool_alltrue	f f 58	16		0	2281	16	_null_ _null_ ));
-DATA(insert ( 2518	n 0 boolor_statefunc	-			bool_accum		bool_accum_inv	bool_anytrue	f f 59	16		0	2281	16	_null_ _null_ ));
-DATA(insert ( 2519	n 0 booland_statefunc	-			bool_accum		bool_accum_inv	bool_alltrue	f f 58	16		0	2281	16	_null_ _null_ ));
+DATA(insert ( 2517	n 0 booland_statefunc	-	-	bool_accum	bool_accum_inv	bool_alltrue	f f f 58	16		0	2281	16	_null_ _null_ ));
+DATA(insert ( 2518	n 0 boolor_statefunc	-	-	bool_accum	bool_accum_inv	bool_anytrue	f f f 59	16		0	2281	16	_null_ _null_ ));
+DATA(insert ( 2519	n 0 booland_statefunc	-	-	bool_accum	bool_accum_inv	bool_alltrue	f f f 58	16		0	2281	16	_null_ _null_ ));
 
 /* bitwise integer */
-DATA(insert ( 2236	n 0 int2and		-					-				-				-				f f 0	21		0	0		0	_null_ _null_ ));
-DATA(insert ( 2237	n 0 int2or		-					-				-				-				f f 0	21		0	0		0	_null_ _null_ ));
-DATA(insert ( 2238	n 0 int4and		-					-				-				-				f f 0	23		0	0		0	_null_ _null_ ));
-DATA(insert ( 2239	n 0 int4or		-					-				-				-				f f 0	23		0	0		0	_null_ _null_ ));
-DATA(insert ( 2240	n 0 int8and		-					-				-				-				f f 0	20		0	0		0	_null_ _null_ ));
-DATA(insert ( 2241	n 0 int8or		-					-				-				-				f f 0	20		0	0		0	_null_ _null_ ));
-DATA(insert ( 2242	n 0 bitand		-					-				-				-				f f 0	1560	0	0		0	_null_ _null_ ));
-DATA(insert ( 2243	n 0 bitor		-					-				-				-				f f 0	1560	0	0		0	_null_ _null_ ));
+DATA(insert ( 2236	n 0 int2and		-					int2and	-				-				-				f f f 0	21		0	0		0	_null_ _null_ ));
+DATA(insert ( 2237	n 0 int2or		-					int2or	-				-				-				f f f 0	21		0	0		0	_null_ _null_ ));
+DATA(insert ( 2238	n 0 int4and		-					int4and	-				-				-				f f f 0	23		0	0		0	_null_ _null_ ));
+DATA(insert ( 2239	n 0 int4or		-					int4or	-				-				-				f f f 0	23		0	0		0	_null_ _null_ ));
+DATA(insert ( 2240	n 0 int8and		-					int8and	-				-				-				f f f 0	20		0	0		0	_null_ _null_ ));
+DATA(insert ( 2241	n 0 int8or		-					int8or	-				-				-				f f f 0	20		0	0		0	_null_ _null_ ));
+DATA(insert ( 2242	n 0 bitand		-					bitand	-				-				-				f f f 0	1560	0	0		0	_null_ _null_ ));
+DATA(insert ( 2243	n 0 bitor		-					bitor	-				-				-				f f f 0	1560	0	0		0	_null_ _null_ ));
 
 /* xml */
-DATA(insert ( 2901	n 0 xmlconcat2	-					-				-				-				f f 0	142		0	0		0	_null_ _null_ ));
+DATA(insert ( 2901	n 0 xmlconcat2	-					-	-				-				-				f f f 0	142		0	0		0	_null_ _null_ ));
 
 /* array */
-DATA(insert ( 2335	n 0 array_agg_transfn	array_agg_finalfn	-				-				-				t f 0	2281	0	0		0	_null_ _null_ ));
-DATA(insert ( 4053	n 0 array_agg_array_transfn array_agg_array_finalfn -		-				-				t f 0	2281	0	0		0	_null_ _null_ ));
+DATA(insert ( 2335	n 0 array_agg_transfn		array_agg_finalfn		-	-		-				-				t f f 0	2281	0	0		0	_null_ _null_ ));
+DATA(insert ( 4053	n 0 array_agg_array_transfn array_agg_array_finalfn	-	-		-				-				t f f 0	2281	0	0		0	_null_ _null_ ));
 
 /* text */
-DATA(insert ( 3538	n 0 string_agg_transfn	string_agg_finalfn	-				-				-				f f 0	2281	0	0		0	_null_ _null_ ));
+DATA(insert ( 3538	n 0 string_agg_transfn	string_agg_finalfn	-	-				-				-				f f f 0	2281	0	0		0	_null_ _null_ ));
 
 /* bytea */
-DATA(insert ( 3545	n 0 bytea_string_agg_transfn	bytea_string_agg_finalfn	-				-				-		f f 0	2281	0	0		0	_null_ _null_ ));
+DATA(insert ( 3545	n 0 bytea_string_agg_transfn	bytea_string_agg_finalfn	-	-				-				-		f f f 0	2281	0	0		0	_null_ _null_ ));
 
 /* json */
-DATA(insert ( 3175	n 0 json_agg_transfn	json_agg_finalfn			-				-				-				f f 0	2281	0	0		0	_null_ _null_ ));
-DATA(insert ( 3197	n 0 json_object_agg_transfn json_object_agg_finalfn -				-				-				f f 0	2281	0	0		0	_null_ _null_ ));
+DATA(insert ( 3175	n 0 json_agg_transfn	json_agg_finalfn			-	-				-				-				f f f 0	2281	0	0		0	_null_ _null_ ));
+DATA(insert ( 3197	n 0 json_object_agg_transfn json_object_agg_finalfn -	-				-				-				f f f 0	2281	0	0		0	_null_ _null_ ));
 
 /* ordered-set and hypothetical-set aggregates */
-DATA(insert ( 3972	o 1 ordered_set_transition			percentile_disc_final					-		-		-		t f 0	2281	0	0		0	_null_ _null_ ));
-DATA(insert ( 3974	o 1 ordered_set_transition			percentile_cont_float8_final			-		-		-		f f 0	2281	0	0		0	_null_ _null_ ));
-DATA(insert ( 3976	o 1 ordered_set_transition			percentile_cont_interval_final			-		-		-		f f 0	2281	0	0		0	_null_ _null_ ));
-DATA(insert ( 3978	o 1 ordered_set_transition			percentile_disc_multi_final				-		-		-		t f 0	2281	0	0		0	_null_ _null_ ));
-DATA(insert ( 3980	o 1 ordered_set_transition			percentile_cont_float8_multi_final		-		-		-		f f 0	2281	0	0		0	_null_ _null_ ));
-DATA(insert ( 3982	o 1 ordered_set_transition			percentile_cont_interval_multi_final	-		-		-		f f 0	2281	0	0		0	_null_ _null_ ));
-DATA(insert ( 3984	o 0 ordered_set_transition			mode_final								-		-		-		t f 0	2281	0	0		0	_null_ _null_ ));
-DATA(insert ( 3986	h 1 ordered_set_transition_multi	rank_final								-		-		-		t f 0	2281	0	0		0	_null_ _null_ ));
-DATA(insert ( 3988	h 1 ordered_set_transition_multi	percent_rank_final						-		-		-		t f 0	2281	0	0		0	_null_ _null_ ));
-DATA(insert ( 3990	h 1 ordered_set_transition_multi	cume_dist_final							-		-		-		t f 0	2281	0	0		0	_null_ _null_ ));
-DATA(insert ( 3992	h 1 ordered_set_transition_multi	dense_rank_final						-		-		-		t f 0	2281	0	0		0	_null_ _null_ ));
+DATA(insert ( 3972	o 1 ordered_set_transition			percentile_disc_final					-	-		-		-		t f f 0	2281	0	0		0	_null_ _null_ ));
+DATA(insert ( 3974	o 1 ordered_set_transition			percentile_cont_float8_final			-	-		-		-		f f f 0	2281	0	0		0	_null_ _null_ ));
+DATA(insert ( 3976	o 1 ordered_set_transition			percentile_cont_interval_final			-	-		-		-		f f f 0	2281	0	0		0	_null_ _null_ ));
+DATA(insert ( 3978	o 1 ordered_set_transition			percentile_disc_multi_final				-	-		-		-		t f f 0	2281	0	0		0	_null_ _null_ ));
+DATA(insert ( 3980	o 1 ordered_set_transition			percentile_cont_float8_multi_final		-	-		-		-		f f f 0	2281	0	0		0	_null_ _null_ ));
+DATA(insert ( 3982	o 1 ordered_set_transition			percentile_cont_interval_multi_final	-	-		-		-		f f f 0	2281	0	0		0	_null_ _null_ ));
+DATA(insert ( 3984	o 0 ordered_set_transition			mode_final								-	-		-		-		t f f 0	2281	0	0		0	_null_ _null_ ));
+DATA(insert ( 3986	h 1 ordered_set_transition_multi	rank_final								-	-		-		-		t f f 0	2281	0	0		0	_null_ _null_ ));
+DATA(insert ( 3988	h 1 ordered_set_transition_multi	percent_rank_final						-	-		-		-		t f f 0	2281	0	0		0	_null_ _null_ ));
+DATA(insert ( 3990	h 1 ordered_set_transition_multi	cume_dist_final							-	-		-		-		t f f 0	2281	0	0		0	_null_ _null_ ));
+DATA(insert ( 3992	h 1 ordered_set_transition_multi	dense_rank_final						-	-		-		-		t f f 0	2281	0	0		0	_null_ _null_ ));
 
 
 /*
@@ -317,10 +323,12 @@ extern Oid AggregateCreate(const char *aggName,
 				Oid variadicArgType,
 				List *aggtransfnName,
 				List *aggfinalfnName,
+				List *aggmergefnName,
 				List *aggmtransfnName,
 				List *aggminvtransfnName,
 				List *aggmfinalfnName,
 				bool finalfnExtraArgs,
+				bool mergefnExtraArgs,
 				bool mfinalfnExtraArgs,
 				List *aggsortopName,
 				Oid aggTransType,
diff --git a/src/test/regress/expected/create_aggregate.out b/src/test/regress/expected/create_aggregate.out
index 82a34fb..3446b00 100644
--- a/src/test/regress/expected/create_aggregate.out
+++ b/src/test/regress/expected/create_aggregate.out
@@ -101,6 +101,13 @@ CREATE AGGREGATE sumdouble (float8)
     msfunc = float8pl,
     minvfunc = float8mi
 );
+-- aggregate merge functions
+CREATE AGGREGATE mymax (int)
+(
+	stype = int4,
+	sfunc = int4larger,
+	mergefunc = int4larger
+);
 -- invalid: nonstrict inverse with strict forward function
 CREATE FUNCTION float8mi_n(float8, float8) RETURNS float8 AS
 $$ SELECT $1 - $2; $$
diff --git a/src/test/regress/sql/create_aggregate.sql b/src/test/regress/sql/create_aggregate.sql
index 0ec1572..1c18ffd 100644
--- a/src/test/regress/sql/create_aggregate.sql
+++ b/src/test/regress/sql/create_aggregate.sql
@@ -115,6 +115,14 @@ CREATE AGGREGATE sumdouble (float8)
     minvfunc = float8mi
 );
 
+-- aggregate merge functions
+CREATE AGGREGATE mymax (int)
+(
+	stype = int4,
+	sfunc = int4larger,
+	mergefunc = int4larger
+);
+
 -- invalid: nonstrict inverse with strict forward function
 
 CREATE FUNCTION float8mi_n(float8, float8) RETURNS float8 AS
#8Amit Kapila
amit.kapila16@gmail.com
In reply to: Stephen Frost (#4)
Re: Parallel Seq Scan

On Fri, Dec 5, 2014 at 8:46 PM, Stephen Frost <sfrost@snowman.net> wrote:

Amit,

* Amit Kapila (amit.kapila16@gmail.com) wrote:

postgres=# explain select c1 from t1;
QUERY PLAN
------------------------------------------------------
Seq Scan on t1 (cost=0.00..101.00 rows=100 width=4)
(1 row)

postgres=# set parallel_seqscan_degree=4;
SET
postgres=# explain select c1 from t1;
QUERY PLAN
--------------------------------------------------------------
Parallel Seq Scan on t1 (cost=0.00..25.25 rows=100 width=4)
Number of Workers: 4
Number of Blocks Per Workers: 25
(3 rows)
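
The numbers in the two plans above match the simple model described in the first mail of this thread: the serial cost is divided by the number of workers, and the blocks are split evenly, with the last worker taking any remainder. A minimal sketch of that arithmetic follows. The function name and the total of 100 blocks (inferred from 4 workers x 25 blocks) are assumptions made for illustration; this is not code from the patch.

```python
def parallel_seqscan_estimate(serial_cost, total_blocks, workers):
    """Return (parallel_cost, blocks_per_worker, last_worker_blocks).

    Illustrative only: mirrors the "divide cost by number of workers"
    model from the mail, ignoring worker startup and shared memory setup.
    """
    if workers < 1:
        raise ValueError("workers must be >= 1")
    parallel_cost = serial_cost / workers
    blocks_per_worker = total_blocks // workers
    # The last worker also scans whatever is left after the even split.
    last_worker_blocks = blocks_per_worker + total_blocks % workers
    return parallel_cost, blocks_per_worker, last_worker_blocks


# Serial cost 101.00 from the first plan; 100 blocks is an assumed total.
print(parallel_seqscan_estimate(101.00, 100, 4))  # (25.25, 25, 25)
```

With 10 blocks and 3 workers the same function gives 3 blocks per worker and 4 for the last one, which is the corner case mentioned in the original proposal.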

This is all great and interesting, but I feel like folks might be
waiting to see just what kind of performance results come from this (and
what kind of hardware is needed to see gains..).

Initially I thought we should first discuss whether the design
and idea used in the patch are sane, but since you have asked, and
Robert has asked the same off-list, I will collect performance
data next week. (Another reason I have not taken any data yet
is that the work to push qualifications down to the workers is
still pending, which I feel is quite important.) However, I still
think it would be good to get some feedback on the basic points
below.

1. As the patch currently stands, it just shares the relevant
data (relid, target list, the block range each worker should
scan, etc.) with the worker. The worker receives that data,
forms the planned statement, executes it, and sends the results
back to the master backend. So the question is: do you think
this is reasonable, or should we try to form the complete plan
for each worker in the master and share that, maybe along with
other required information such as range table entries? My
personal gut feeling is that in the long term it might be better
to form the complete plan for each worker in the master and
share it. However, I think the current approach in the patch
(which admittedly needs some improvement) is also not bad and
is quite a bit easier to implement.

2. A related question is what the output of Explain Plan
should be. Since each worker is currently responsible for
forming its own plan, Explain Plan cannot show the detailed
plan for each worker. Is that okay?

3. Some places where optimizations are possible:
- Currently, after getting the tuple from the heap, the worker deforms
it and sends it via the message queue to the master backend. The master
backend then forms the tuple and sends it to the upper layer, which
deforms it again via slot_getallattrs(slot) before sending it to the
frontend.
- The master backend currently receives the data from multiple workers
serially. We could optimize this so that it checks the other queues
if there is no data in the current queue.
- The master backend is only responsible for coordination among workers.
It shares the required information with the workers and then fetches the
data processed by each worker. With some more logic, we might be able
to make the master backend also fetch data from the heap rather than
just coordinating the workers.

I think we can optimise all of the above places. However, we can
do that later as well, unless they hurt performance badly for the
cases people care about most.
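
The second optimization in point 3 (check the other queues when the current one has no data) can be sketched as follows. This is a toy model, not the patch's mechanism: each worker's message queue is stood in for by a `collections.deque`, the function names are hypothetical, and "all queues empty" is treated as "all workers finished", whereas a real implementation would need an explicit end-of-scan signal from each worker.

```python
from collections import deque


def next_ready_queue(queues, current):
    """Index of the first non-empty queue, starting at `current` and
    wrapping around; None if every queue is empty."""
    n = len(queues)
    for offset in range(n):
        idx = (current + offset) % n
        if queues[idx]:
            return idx
    return None


def gather(queues):
    """Collect (worker_index, tuple) pairs without stalling on an empty
    queue. Moving on after each tuple is a fairness choice made for this
    sketch; staying on a queue until it runs dry would also fit the mail."""
    out = []
    current = 0
    while True:
        idx = next_ready_queue(queues, current)
        if idx is None:
            break  # toy model: nothing left anywhere means we are done
        out.append((idx, queues[idx].popleft()))
        current = (idx + 1) % len(queues)
    return out


worker_queues = [deque(), deque(["a", "b"]), deque(["c"])]
print(gather(worker_queues))  # [(1, 'a'), (2, 'c'), (1, 'b')]
```

Worker 0 has produced nothing yet, so the master moves straight on to workers 1 and 2 instead of waiting for it.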

4. Should the value of parallel_seqscan_degree depend on other
backend processes, as MaxConnections, max_worker_processes and
autovacuum_max_workers do, or should it be independent like
max_wal_senders?

I think it is better to make it depend on the other backend processes;
however, for simplicity, I have kept it similar to max_wal_senders for now.

There are likely to be
situations where this change is an improvement, as well as cases
where it makes things worse.

Agreed, and I think that will become clearer after doing some
performance tests.

One really interesting case would be parallel seq scans which are
executing against foreign tables/FDWs..

Sure.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#9Amit Kapila
amit.kapila16@gmail.com
In reply to: Stephen Frost (#3)
Re: Parallel Seq Scan

On Fri, Dec 5, 2014 at 8:43 PM, Stephen Frost <sfrost@snowman.net> wrote:

José,

* José Luis Tallón (jltallon@adv-solutions.net) wrote:

On 12/04/2014 07:35 AM, Amit Kapila wrote:

The number of worker backends that can be used for
parallel seq scan can be configured by using a new GUC
parallel_seqscan_degree, the default value of which is zero
and it means parallel seq scan will not be considered unless
user configures this value.

The number of parallel workers should be capped (of course!) at the
maximum amount of "processors" (cores/vCores, threads/hyperthreads)
available.

Moreover, when load goes up, the relative cost of parallel working
should go up as well.
Something like:
c = number of cores
l = 1min-load

additional_cost = tuple estimate * cpu_tuple_cost * (l+1)/(c-1)

(for c>1, of course)
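
The proposed load-based penalty can be written out directly. A minimal sketch, assuming the formula exactly as stated in the mail, with the core count being the `c` of the formula; the function and parameter names are made up for illustration, and 0.01 is PostgreSQL's default `cpu_tuple_cost`.

```python
def additional_parallel_cost(tuple_estimate, cpu_tuple_cost, load_1min, cores):
    """additional_cost = tuple estimate * cpu_tuple_cost * (l + 1) / (c - 1)

    `cores` is c in the formula and `load_1min` is l (the 1-minute load
    average). The formula is only defined for more than one core.
    """
    if cores <= 1:
        raise ValueError("formula is only defined for cores > 1")
    return tuple_estimate * cpu_tuple_cost * (load_1min + 1) / (cores - 1)


# Same table and core count under two different loads.
idle = additional_parallel_cost(100, 0.01, load_1min=0.0, cores=5)
busy = additional_parallel_cost(100, 0.01, load_1min=3.0, cores=5)
print(idle, busy)
```

The penalty grows linearly with load and shrinks as cores are added, which is the behaviour the mail asks for.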

While I agree in general that we'll need to come up with appropriate
acceptance criteria, etc, I don't think we want to complicate this patch
with that initially.

A SUSET GUC which caps the parallel GUC would be
enough for an initial implementation, imv.

This is exactly what I have done in the patch.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#10Amit Kapila
amit.kapila16@gmail.com
In reply to: Jim Nasby (#5)
Re: Parallel Seq Scan

On Sat, Dec 6, 2014 at 12:27 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:

On 12/5/14, 9:08 AM, José Luis Tallón wrote:

Moreover, when load goes up, the relative cost of parallel working
should go up as well.

Something like:
c = number of cores
l = 1min-load

additional_cost = tuple estimate * cpu_tuple_cost * (l+1)/(c-1)

(for c>1, of course)

...

The parallel seq scan nodes are definitely the best approach for
"parallel query", since the planner can optimize them based on cost.

I'm wondering about the ability to modify the implementation of some
methods themselves once at execution time: given a previously planned
query, chances are that, at execution time (I'm specifically thinking about
prepared statements here), a different implementation of the same "node"
might be more suitable and could be used instead while the condition holds.

These comments got me wondering... would it be better to decide on
parallelism during execution instead of at plan time? That would allow us
to dynamically scale parallelism based on system load. If we don't even
consider parallelism until we've pulled some number of tuples/pages from a
relation, this would also eliminate all parallel overhead on small relations.
--

I think we have access to this information in the planner
(RelOptInfo -> pages). If we want, we can use that to exclude small
relations from parallelism, but the question is how big a relation
must be before we consider it for parallelism. One way is to determine
this via tests, which I am planning to do. Do you think there is any
heuristic we can use to decide how big a relation should be to be
considered for parallelism?
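
One shape such a plan-time check could take is sketched below. It is purely hypothetical: the function name and the `min_pages_per_worker` threshold of 1000 are placeholders, since the right cutoff is exactly what the planned tests are meant to find. Only the idea of using the relation's page count (RelOptInfo -> pages) comes from the mail.

```python
def workers_for_relation(rel_pages, degree, min_pages_per_worker=1000):
    """Number of workers to plan for, or 0 to fall back to a plain seq scan.

    rel_pages            -- page count of the relation, as the planner sees it
    degree               -- configured parallel_seqscan_degree
    min_pages_per_worker -- hypothetical cutoff; to be settled by testing
    """
    if degree <= 0 or min_pages_per_worker <= 0:
        return 0
    # Small relations get 0 workers; larger ones scale up to `degree`.
    return min(degree, rel_pages // min_pages_per_worker)


for pages in (100, 2500, 100000):
    print(pages, workers_for_relation(pages, degree=4))
```

With these placeholder values a 100-page table is never parallelized, a 2500-page table gets 2 workers, and a 100000-page table gets the full degree of 4.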

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#11Amit Kapila
amit.kapila16@gmail.com
In reply to: David Rowley (#7)
Re: Parallel Seq Scan

On Sat, Dec 6, 2014 at 10:43 AM, David Rowley <dgrowleyml@gmail.com> wrote:

On 4 December 2014 at 19:35, Amit Kapila <amit.kapila16@gmail.com> wrote:

Attached patch is just to facilitate the discussion about the
parallel seq scan and maybe some other dependent tasks like
sharing of various states like combocid, snapshot with parallel
workers. It is by no means ready for any complex test; of course
I will work towards making it more robust both in terms of adding
more stuff and doing performance optimizations.

Thoughts/Suggestions?

This is good news!

Thanks.

I've not gotten to look at the patch yet, but I thought you may be able
to make use of the attached at some point.

I agree; it can be used in the near future to enhance
and add more value to the parallel scan feature. Thanks
for taking the initiative to do the leg-work for supporting
aggregates.

It's bare-bones core support for allowing aggregate states to be merged
together with another aggregate state. I would imagine that if a query such
as:

SELECT MAX(value) FROM bigtable;

was run, then a series of parallel workers could go off and each find the
max value from their portion of the table and then perhaps some other node
type would then take all the intermediate results from the workers, once
they're finished, and join all of the aggregate states into one and return
that. Naturally, you'd need to check that all aggregates used in the
targetlist had a merge function first.

The direction sounds right.
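
The flow described above (each worker aggregates its own portion, then the partial states are merged into one) can be sketched as below. This is a toy model under stated assumptions: lists stand in for the workers' portions of the table, the function name is hypothetical, and the "first input becomes the state" rule imitates a strict transition function with a NULL initial state. For MAX, the transition function doubles as the merge function, as `int4larger` does for `mymax` in the regression test of the attached patch.

```python
def parallel_aggregate(chunks, transfn, mergefn=None):
    """Aggregate each chunk separately, then merge the partial states.

    chunks  -- one iterable per (simulated) worker
    transfn -- transition function: (state, value) -> state
    mergefn -- merge function: (state, state) -> state; required
    """
    if mergefn is None:
        # Mirrors the check in the mail: no merge function, no parallel plan.
        raise ValueError("aggregate has no merge function")

    partial_states = []
    for chunk in chunks:  # in reality each of these runs in its own worker
        state = None
        for value in chunk:
            state = value if state is None else transfn(state, value)
        partial_states.append(state)

    result = None
    for state in partial_states:
        if state is None:
            continue  # this worker saw no rows
        result = state if result is None else mergefn(result, state)
    return result


# MAX(value) over three workers' portions; the third portion is empty.
print(parallel_aggregate([[1, 5], [9, 2], []], max, max))  # 9
```

An aggregate such as AVG() would need a separate merge function that combines (sum, count) pairs, which is why the existing transition functions cannot simply be reused there.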

This is just a few hours of work. I've not really tested the pg_dump
support or anything yet. I've also not added any new functions to allow
AVG() or COUNT() to work, I've really just re-used existing functions where
I could, as things like MAX() and BOOL_OR() can just make use of the
existing transition function. I thought that this might be enough for early
tests.

I'd imagine such a workload, ignoring IO overhead, should scale pretty
much linearly with the number of worker processes. Of course, if there was
a GROUP BY clause then the merger code would have to perform more work.

Agreed.

If you think you might be able to make use of this, then I'm willing to
go off and write all the other merge functions required for the other
aggregates.

Don't you think we should first stabilize the basic parallel
scan (target list and quals that can be independently evaluated
by workers) and only then move on to such enhancements?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#12Stephen Frost
sfrost@snowman.net
In reply to: Amit Kapila (#8)
Re: Parallel Seq Scan

* Amit Kapila (amit.kapila16@gmail.com) wrote:

1. As the patch currently stands, it just shares the relevant
data (like relid, target list, block range each worker should
perform on etc.) to the worker and then worker receives that
data and form the planned statement which it will execute and
send the results back to master backend. So the question
here is do you think it is reasonable or should we try to form
the complete plan for each worker and then share the same
and may be other information as well like range table entries
which are required. My personal gut feeling in this matter
is that for long term it might be better to form the complete
plan of each worker in master and share the same, however
I think the current way as done in patch (okay that needs
some improvement) is also not bad and quite easier to implement.

For my 2c, I'd like to see it support exactly what the SeqScan node
supports and then also what Foreign Scan supports. That would mean we'd
then be able to push filtering down to the workers which would be great.
Even better would be figuring out how to parallelize an Append node
(perhaps only possible when the nodes underneath are all SeqScan or
ForeignScan nodes) since that would allow us to then parallelize the
work across multiple tables and remote servers.

One of the big reasons why I was asking about performance data is that,
today, we can't easily split a single relation across multiple i/o
channels. Sure, we can use RAID and get the i/o channel that the table
sits on faster than a single disk and possibly fast enough that a single
CPU can't keep up, but that's not quite the same. The historical
recommendations for Hadoop nodes is around one CPU per drive (of course,
it'll depend on workload, etc, etc, but still) and while there's still a
lot of testing, etc, to be done before we can be sure about the 'right'
answer for PG (and it'll also vary based on workload, etc), that strikes
me as a pretty reasonable rule-of-thumb to go on.

Of course, I'm aware that this won't be as easy to implement..

2. Next question related to above is what should be the
output of ExplainPlan, as currently worker is responsible
for forming its own plan, Explain Plan is not able to show
the detailed plan for each worker, is that okay?

I'm not entirely following this. How can the worker be responsible for
its own "plan" when the information passed to it (per the above
paragraph..) is pretty minimal? In general, I don't think we need to
have specifics like "this worker is going to do exactly X" because we
will eventually need some communication to happen between the worker and
the master process where the worker can ask for more work because it's
finished what it was tasked with and the master will need to give it
another chunk of work to do. I don't think we want exactly what each
worker process will do to be fully formed at the outset because, even
with the best information available, given concurrent load on the
system, it's not going to be perfect and we'll end up starving workers.
The plan, as formed by the master, should be more along the lines of
"this is what I'm gonna have my workers do" along w/ how many workers,
etc, and then it goes and does it. Perhaps for an 'explain analyze' we
return information about what workers actually *did* what, but that's a
whole different discussion.

3. Some places where optimizations are possible:
- Currently after getting the tuple from heap, it is deformed by
worker and sent via message queue to master backend, master
backend then forms the tuple and send it to upper layer which
before sending it to frontend again deforms it via slot_getallattrs(slot).

If this is done as I was proposing above, we might be able to avoid
this, but I don't know that it's a huge issue either way.. The bigger
issue is getting the filtering pushed down.

- Master backend currently receives the data from multiple workers
serially. We can optimize in a way that it can check other queues,
if there is no data in current queue.

Yes, this is pretty critical. In fact, it's one of the recommendations
I made previously about how to change the Append node to parallelize
Foreign Scan node work.
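
The queue-polling idea discussed above (move on to another worker's queue when the current one is empty) can be sketched roughly as follows. This is only an illustration, not code from the patch: ToyQueue, toy_queue_push, toy_queue_try_receive and receive_from_any are hypothetical names, and a plain in-process ring buffer stands in for a shared memory message queue.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_CAPACITY 8

/* Hypothetical stand-in for one worker's result queue. */
typedef struct ToyQueue
{
    int    items[QUEUE_CAPACITY];
    size_t head;    /* index of the next item to read */
    size_t count;   /* number of items currently queued */
} ToyQueue;

/* Append an item; returns 1 on success, 0 if the queue is full. */
int
toy_queue_push(ToyQueue *q, int item)
{
    if (q->count == QUEUE_CAPACITY)
        return 0;
    q->items[(q->head + q->count) % QUEUE_CAPACITY] = item;
    q->count++;
    return 1;
}

/* Non-blocking receive; returns 1 and stores the item, or 0 if empty. */
int
toy_queue_try_receive(ToyQueue *q, int *item)
{
    if (q->count == 0)
        return 0;
    *item = q->items[q->head];
    q->head = (q->head + 1) % QUEUE_CAPACITY;
    q->count--;
    return 1;
}

/*
 * Poll every queue once, starting at *cursor, instead of waiting on a
 * single queue. Returns 1 and stores an item if any queue had data,
 * or 0 if all queues are currently empty. The cursor advances past the
 * queue that produced data so that no single worker is favoured.
 */
int
receive_from_any(ToyQueue *queues, size_t nqueues, size_t *cursor, int *item)
{
    size_t tried;

    assert(nqueues > 0);
    for (tried = 0; tried < nqueues; tried++)
    {
        size_t idx = (*cursor + tried) % nqueues;

        if (toy_queue_try_receive(&queues[idx], item))
        {
            *cursor = (idx + 1) % nqueues;
            return 1;
        }
    }
    return 0;
}
```

A real master backend would presumably sleep on a latch or similar when receive_from_any returns 0 rather than spin; that part is left out of the sketch.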

- Master backend is just responsible for coordination among workers
It shares the required information to workers and then fetch the
data processed by each worker, by using some more logic, we might
be able to make master backend also fetch data from heap rather than
doing just co-ordination among workers.

I don't think this is really necessary...

I think in all above places we can do some optimisation, however
we can do that later as well, unless they hit the performance badly for
cases which people care most.

I agree that we can improve the performance through various
optimizations later, but it's important to get the general structure and
design right or we'll end up having to reimplement a lot of it.

4. Should parallel_seqscan_degree value be dependent on other
backend processes like MaxConnections, max_worker_processes,
autovacuum_max_workers do or should it be independent like
max_wal_senders?

Well, we're not going to be able to spin off more workers than we have
process slots, but I'm not sure we need anything more than that? In any
case, this is definitely an area we can work on improving later and I
don't think it really impacts the rest of the design.

Thanks,

Stephen

#13Amit Kapila
amit.kapila16@gmail.com
In reply to: Stephen Frost (#12)
Re: Parallel Seq Scan

On Sat, Dec 6, 2014 at 5:37 PM, Stephen Frost <sfrost@snowman.net> wrote:

* Amit Kapila (amit.kapila16@gmail.com) wrote:

1. As the patch currently stands, it just shares the relevant
data (like relid, target list, block range each worker should
perform on etc.) to the worker and then worker receives that
data and form the planned statement which it will execute and
send the results back to master backend. So the question
here is do you think it is reasonable or should we try to form
the complete plan for each worker and then share the same
and may be other information as well like range table entries
which are required. My personal gut feeling in this matter
is that for long term it might be better to form the complete
plan of each worker in master and share the same, however
I think the current way as done in patch (okay that needs
some improvement) is also not bad and quite easier to implement.

For my 2c, I'd like to see it support exactly what the SeqScan node
supports and then also what Foreign Scan supports. That would mean we'd
then be able to push filtering down to the workers which would be great.
Even better would be figuring out how to parallelize an Append node
(perhaps only possible when the nodes underneath are all SeqScan or
ForeignScan nodes) since that would allow us to then parallelize the
work across multiple tables and remote servers.

One of the big reasons why I was asking about performance data is that,
today, we can't easily split a single relation across multiple i/o
channels. Sure, we can use RAID and get the i/o channel that the table
sits on faster than a single disk and possibly fast enough that a single
CPU can't keep up, but that's not quite the same. The historical
recommendations for Hadoop nodes is around one CPU per drive (of course,
it'll depend on workload, etc, etc, but still) and while there's still a
lot of testing, etc, to be done before we can be sure about the 'right'
answer for PG (and it'll also vary based on workload, etc), that strikes
me as a pretty reasonable rule-of-thumb to go on.

Of course, I'm aware that this won't be as easy to implement..

2. Next question related to above is what should be the
output of ExplainPlan, as currently worker is responsible
for forming its own plan, Explain Plan is not able to show
the detailed plan for each worker, is that okay?

I'm not entirely following this. How can the worker be responsible for
its own "plan" when the information passed to it (per the above
paragraph..) is pretty minimal?

Because for a simple sequence scan that much information is sufficient,
basically if we have scanrelid, target list, qual and then RTE (primarily
relOid), then worker can form and perform sequence scan.

In general, I don't think we need to
have specifics like "this worker is going to do exactly X" because we
will eventually need some communication to happen between the worker and
the master process where the worker can ask for more work because it's
finished what it was tasked with and the master will need to give it
another chunk of work to do. I don't think we want exactly what each
worker process will do to be fully formed at the outset because, even
with the best information available, given concurrent load on the
system, it's not going to be perfect and we'll end up starving workers.
The plan, as formed by the master, should be more along the lines of
"this is what I'm gonna have my workers do" along w/ how many workers,
etc, and then it goes and does it.

I think here you want to say that work allocation for workers should be
dynamic rather than fixed, which I think makes sense; however, we can try
such an optimization after some initial performance data.

Perhaps for an 'explain analyze' we
return information about what workers actually *did* what, but that's a
whole different discussion.

Agreed.

3. Some places where optimizations are possible:
- Currently after getting the tuple from heap, it is deformed by
worker and sent via message queue to master backend, master
backend then forms the tuple and send it to upper layer which
before sending it to frontend again deforms it via slot_getallattrs(slot).

If this is done as I was proposing above, we might be able to avoid
this, but I don't know that it's a huge issue either way.. The bigger
issue is getting the filtering pushed down.

- Master backend currently receives the data from multiple workers
serially. We can optimize in a way that it can check other queues,
if there is no data in current queue.

Yes, this is pretty critical. In fact, it's one of the recommendations
I made previously about how to change the Append node to parallelize
Foreign Scan node work.

- Master backend is just responsible for coordination among workers
It shares the required information to workers and then fetch the
data processed by each worker, by using some more logic, we might
be able to make master backend also fetch data from heap rather than
doing just co-ordination among workers.

I don't think this is really necessary...

I think in all above places we can do some optimisation, however
we can do that later as well, unless they hit the performance badly for
cases which people care most.

I agree that we can improve the performance through various
optimizations later, but it's important to get the general structure and
design right or we'll end up having to reimplement a lot of it.

So to summarize my understanding, below are the set of things
which I should work on and in the order they are listed.

1. Push down qualification
2. Performance Data
3. Improve the way to push down the information related to worker.
4. Dynamic allocation of work for workers.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#14Robert Haas
robertmhaas@gmail.com
In reply to: David Rowley (#7)
Re: Parallel Seq Scan

On Sat, Dec 6, 2014 at 12:13 AM, David Rowley <dgrowleyml@gmail.com> wrote:

It's bare-bones core support for allowing aggregate states to be merged
together with another aggregate state. I would imagine that if a query such
as:

SELECT MAX(value) FROM bigtable;

was run, then a series of parallel workers could go off and each find the
max value from their portion of the table and then perhaps some other node
type would then take all the intermediate results from the workers, once
they're finished, and join all of the aggregate states into one and return
that. Naturally, you'd need to check that all aggregates used in the
targetlist had a merge function first.
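
To make the merge idea concrete, here is a minimal sketch of how per-worker MAX states might be combined. It is an illustration under stated assumptions and not the attached patch: MaxState, max_transition, max_merge and parallel_max are hypothetical names, and the "workers" are simulated by a loop over slices of an in-memory array. It does reflect the point made in the thread that MAX can reuse its transition function as the merge step.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-worker aggregate state for MAX(int). */
typedef struct MaxState
{
    int has_value;  /* 0 until at least one row has been seen */
    int value;      /* running maximum, valid only if has_value */
} MaxState;

/* Transition step: fold one input value into a state. */
static void
max_transition(MaxState *state, int input)
{
    if (!state->has_value || input > state->value)
    {
        state->value = input;
        state->has_value = 1;
    }
}

/* Merge step: for MAX this is just the transition applied to a state. */
static void
max_merge(MaxState *combined, const MaxState *partial)
{
    if (partial->has_value)
        max_transition(combined, partial->value);
}

/*
 * Simulate nworkers workers, each computing MAX over its own slice,
 * followed by a merge of the partial states into one result.
 */
MaxState
parallel_max(const int *values, size_t nvalues, size_t nworkers)
{
    MaxState combined = {0, 0};
    size_t   chunk;
    size_t   w;

    assert(nworkers > 0);
    chunk = (nvalues + nworkers - 1) / nworkers;    /* ceiling division */

    for (w = 0; w < nworkers; w++)
    {
        MaxState partial = {0, 0};
        size_t   start = w * chunk;
        size_t   end = (start + chunk < nvalues) ? start + chunk : nvalues;
        size_t   i;

        for (i = start; i < end; i++)
            max_transition(&partial, values[i]);

        max_merge(&combined, &partial);
    }
    return combined;
}
```

An aggregate such as AVG would need a distinct merge function, because its state (sum and count) cannot be merged by feeding one state through the ordinary transition function.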

I think this is great infrastructure and could also be useful for
pushing down aggregates in cases involving foreign data wrappers. But
I suggest we discuss it on a separate thread because it's not related
to parallel seq scan per se.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#10)
Re: Parallel Seq Scan

On Sat, Dec 6, 2014 at 1:50 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think we have access to this information in planner (RelOptInfo -> pages),
if we want, we can use that to eliminate the small relations from
parallelism, but question is how big relations do we want to consider
for parallelism, one way is to check via tests which I am planning to
follow, do you think we have any heuristic which we can use to decide
how big relations should be consider for parallelism?

Surely the Path machinery needs to decide this in particular cases
based on cost. We should assign some cost to starting a parallel
worker via some new GUC, like parallel_startup_cost = 100,000. And
then we should also assign a cost to the act of relaying a tuple from
the parallel worker to the master, maybe cpu_tuple_cost (or some new
GUC). For a small relation, or a query with a LIMIT clause, the
parallel startup cost will make starting a lot of workers look
unattractive, but for bigger relations it will make sense from a cost
perspective, which is exactly what we want.
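
As a rough illustration of the costing described above, the sketch below charges a startup cost per worker and a relay cost per tuple returned to the master, and divides the scan cost by the number of workers as in the simple model from the first mail. Everything here is an assumption for illustration: the struct, the function names and the tuple_relay_cost field are hypothetical, and the real planner cost functions are considerably more involved.

```c
#include <assert.h>

/* Hypothetical cost parameters; names and values are illustrative only. */
typedef struct ParallelCostParams
{
    double parallel_startup_cost;   /* charged once per worker started */
    double tuple_relay_cost;        /* charged per tuple sent to the master */
} ParallelCostParams;

/*
 * Estimated cost of a parallel scan: per-worker startup, the serial scan
 * cost split evenly across workers, and the cost of relaying every
 * output tuple back to the master.
 */
double
parallel_seqscan_cost(const ParallelCostParams *p, double serial_run_cost,
                      double output_tuples, int nworkers)
{
    assert(nworkers > 0);
    return p->parallel_startup_cost * nworkers
        + serial_run_cost / nworkers
        + p->tuple_relay_cost * output_tuples;
}

/* Returns 1 if the parallel estimate beats the serial estimate. */
int
parallel_plan_is_cheaper(const ParallelCostParams *p, double serial_run_cost,
                         double output_tuples, int nworkers)
{
    return parallel_seqscan_cost(p, serial_run_cost, output_tuples, nworkers)
        < serial_run_cost;
}
```

With a startup cost of 100,000 per worker, a scan whose serial cost is 1,000 never looks attractive in parallel, while one whose serial cost is 10,000,000 does, unless so many tuples are returned that the relay term dominates.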

There are probably other important considerations based on goals for
overall resource utilization, and also because at a certain point
adding more workers won't help because the disk will be saturated. I
don't know exactly what we should do about those issues yet, but the
steps described in the previous paragraph seem like a good place to
start anyway.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#16Robert Haas
robertmhaas@gmail.com
In reply to: Stephen Frost (#12)
Re: Parallel Seq Scan

On Sat, Dec 6, 2014 at 7:07 AM, Stephen Frost <sfrost@snowman.net> wrote:

For my 2c, I'd like to see it support exactly what the SeqScan node
supports and then also what Foreign Scan supports. That would mean we'd
then be able to push filtering down to the workers which would be great.
Even better would be figuring out how to parallelize an Append node
(perhaps only possible when the nodes underneath are all SeqScan or
ForeignScan nodes) since that would allow us to then parallelize the
work across multiple tables and remote servers.

I don't see how we can support the stuff ForeignScan does; presumably
any parallelism there is up to the FDW to implement, using whatever
in-core tools we provide. I do agree that parallelizing Append nodes
is useful; but let's get one thing done first before we start trying
to do thing #2.

I'm not entirely following this. How can the worker be responsible for
its own "plan" when the information passed to it (per the above
paragraph..) is pretty minimal? In general, I don't think we need to
have specifics like "this worker is going to do exactly X" because we
will eventually need some communication to happen between the worker and
the master process where the worker can ask for more work because it's
finished what it was tasked with and the master will need to give it
another chunk of work to do. I don't think we want exactly what each
worker process will do to be fully formed at the outset because, even
with the best information available, given concurrent load on the
system, it's not going to be perfect and we'll end up starving workers.
The plan, as formed by the master, should be more along the lines of
"this is what I'm gonna have my workers do" along w/ how many workers,
etc, and then it goes and does it. Perhaps for an 'explain analyze' we
return information about what workers actually *did* what, but that's a
whole different discussion.

I agree with this. For a first version, I think it's OK to start a
worker up for a particular sequential scan and have it help with that
sequential scan until the scan is completed, and then exit. It should
not, as the present version of the patch does, assign a fixed block
range to each worker; instead, workers should allocate a block or
chunk of blocks to work on until no blocks remain. That way, even if
every worker but one gets stuck, the rest of the scan can still
finish.
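
One way to picture the "claim a chunk until no blocks remain" scheme is a shared counter that every worker advances atomically. The sketch below uses C11 atomics for this; it is an assumption-laden illustration rather than the patch's approach, and SharedScanState, shared_scan_init and shared_scan_claim are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define INVALID_BLOCK UINT32_MAX

/* Hypothetical scan state that would live in shared memory. */
typedef struct SharedScanState
{
    uint32_t         nblocks;       /* total blocks in the relation */
    uint32_t         chunk_size;    /* blocks handed out per request */
    _Atomic uint32_t next_block;    /* first block not yet handed out */
} SharedScanState;

void
shared_scan_init(SharedScanState *s, uint32_t nblocks, uint32_t chunk_size)
{
    assert(chunk_size > 0);
    s->nblocks = nblocks;
    s->chunk_size = chunk_size;
    atomic_init(&s->next_block, 0);
}

/*
 * Claim the next chunk of blocks. Returns the first block of the chunk
 * and stores its length in *count, or returns INVALID_BLOCK with
 * *count = 0 once the relation is exhausted. This sketch ignores
 * counter wraparound for very large relations or very many calls.
 */
uint32_t
shared_scan_claim(SharedScanState *s, uint32_t *count)
{
    uint32_t start = atomic_fetch_add(&s->next_block, s->chunk_size);
    uint32_t remaining;

    if (start >= s->nblocks)
    {
        *count = 0;
        return INVALID_BLOCK;
    }
    remaining = s->nblocks - start;
    *count = (remaining < s->chunk_size) ? remaining : s->chunk_size;
    return start;
}
```

Because no block range is tied to a particular worker, a stalled worker only delays the chunk it currently holds; the other workers keep claiming whatever is left.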

Eventually, we will want to be smarter about sharing workers between
multiple parts of the plan, but I think it is just fine to leave that
as a future enhancement for now.

- Master backend is just responsible for coordination among workers
It shares the required information to workers and then fetch the
data processed by each worker, by using some more logic, we might
be able to make master backend also fetch data from heap rather than
doing just co-ordination among workers.

I don't think this is really necessary...

I think it would be an awfully good idea to make this work. The
master thread may be significantly faster than any of the others
because it has no IPC costs. We don't want to leave our best resource
sitting on the bench.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#17Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#15)
Re: Parallel Seq Scan

On Mon, Dec 8, 2014 at 11:21 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Sat, Dec 6, 2014 at 1:50 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think we have access to this information in planner (RelOptInfo -> pages),
if we want, we can use that to eliminate the small relations from
parallelism, but question is how big relations do we want to consider
for parallelism, one way is to check via tests which I am planning to
follow, do you think we have any heuristic which we can use to decide
how big a relation should be before it is considered for parallelism?

Surely the Path machinery needs to decide this in particular cases
based on cost. We should assign some cost to starting a parallel
worker via some new GUC, like parallel_startup_cost = 100,000. And
then we should also assign a cost to the act of relaying a tuple from
the parallel worker to the master, maybe cpu_tuple_cost (or some new
GUC). For a small relation, or a query with a LIMIT clause, the
parallel startup cost will make starting a lot of workers look
unattractive, but for bigger relations it will make sense from a cost
perspective, which is exactly what we want.

Sounds sensible. cpu_tuple_cost is already used for another purpose,
so I am not sure it is the right thing to overload that parameter.
How about cpu_tuple_communication_cost or cpu_tuple_comm_cost?

There are probably other important considerations based on goals for
overall resource utilization, and also because at a certain point
adding more workers won't help because the disk will be saturated. I
don't know exactly what we should do about those issues yet, but the
steps described in the previous paragraph seem like a good place to
start anyway.

Agreed.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#18Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#16)
Re: Parallel Seq Scan

On Mon, Dec 8, 2014 at 11:27 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Sat, Dec 6, 2014 at 7:07 AM, Stephen Frost <sfrost@snowman.net> wrote:

For my 2c, I'd like to see it support exactly what the SeqScan node
supports and then also what Foreign Scan supports. That would mean we'd
then be able to push filtering down to the workers which would be great.
Even better would be figuring out how to parallelize an Append node
(perhaps only possible when the nodes underneath are all SeqScan or
ForeignScan nodes) since that would allow us to then parallelize the
work across multiple tables and remote servers.

I don't see how we can support the stuff ForeignScan does; presumably
any parallelism there is up to the FDW to implement, using whatever
in-core tools we provide. I do agree that parallelizing Append nodes
is useful; but let's get one thing done first before we start trying
to do thing #2.

I'm not entirely following this. How can the worker be responsible for
its own "plan" when the information passed to it (per the above
paragraph..) is pretty minimal? In general, I don't think we need to
have specifics like "this worker is going to do exactly X" because we
will eventually need some communication to happen between the worker and
the master process where the worker can ask for more work because it's
finished what it was tasked with and the master will need to give it
another chunk of work to do. I don't think we want exactly what each
worker process will do to be fully formed at the outset because, even
with the best information available, given concurrent load on the
system, it's not going to be perfect and we'll end up starving workers.
The plan, as formed by the master, should be more along the lines of
"this is what I'm gonna have my workers do" along w/ how many workers,
etc, and then it goes and does it. Perhaps for an 'explain analyze' we
return information about what workers actually *did* what, but that's a
whole different discussion.

I agree with this. For a first version, I think it's OK to start a
worker up for a particular sequential scan and have it help with that
sequential scan until the scan is completed, and then exit. It should
not, as the present version of the patch does, assign a fixed block
range to each worker; instead, workers should allocate a block or
chunk of blocks to work on until no blocks remain. That way, even if
every worker but one gets stuck, the rest of the scan can still
finish.

I will check on this point and see if it is feasible to do something on
those lines, basically currently at Executor initialization phase, we
set the scan limits and then during Executor Run phase use
heap_getnext to fetch the tuples accordingly, but doing it dynamically
means at ExecutorRun phase we need to reset the scan limit for
which page/pages to scan, still I have to check if there is any problem
with such an idea. Do you have any different idea in mind?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#19Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#18)
Re: Parallel Seq Scan

On Tue, Dec 9, 2014 at 12:46 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I agree with this. For a first version, I think it's OK to start a
worker up for a particular sequential scan and have it help with that
sequential scan until the scan is completed, and then exit. It should
not, as the present version of the patch does, assign a fixed block
range to each worker; instead, workers should allocate a block or
chunk of blocks to work on until no blocks remain. That way, even if
every worker but one gets stuck, the rest of the scan can still
finish.

I will check on this point and see if it is feasible to do something on
those lines, basically currently at Executor initialization phase, we
set the scan limits and then during Executor Run phase use
heap_getnext to fetch the tuples accordingly, but doing it dynamically
means at ExecutorRun phase we need to reset the scan limit for
which page/pages to scan, still I have to check if there is any problem
with such an idea. Do you have any different idea in mind?

Hmm. Well, it looks like there are basically two choices: you can
either (as you propose) deal with this above the level of the
heap_beginscan/heap_getnext API by scanning one or a few pages at a
time and then resetting the scan to a new starting page via
heap_setscanlimits; or alternatively, you can add a callback to
HeapScanDescData that, if non-NULL, will be invoked to get the next
block number to scan. I'm not entirely sure which is better.
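
The second option (an optional callback on the scan descriptor that supplies the next block number) might look roughly like the sketch below. It deliberately uses a toy descriptor, not the real HeapScanDescData, and every name in it (ToyScanDesc, next_block_fn, toy_scan_next_block, StrideState, stride_next_block) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NO_MORE_BLOCKS UINT32_MAX

/* Callback type: returns the next block to scan, or NO_MORE_BLOCKS. */
typedef uint32_t (*next_block_fn) (void *arg);

/* Toy stand-in for a scan descriptor with an optional callback. */
typedef struct ToyScanDesc
{
    uint32_t      nblocks;          /* total blocks in the relation */
    uint32_t      cur_block;        /* cursor for the default order */
    next_block_fn next_block_cb;    /* NULL means plain sequential order */
    void         *cb_arg;           /* opaque state passed to the callback */
} ToyScanDesc;

/* Pick the next block: ask the callback if set, else go sequentially. */
uint32_t
toy_scan_next_block(ToyScanDesc *scan)
{
    assert(scan != NULL);
    if (scan->next_block_cb != NULL)
        return scan->next_block_cb(scan->cb_arg);
    if (scan->cur_block >= scan->nblocks)
        return NO_MORE_BLOCKS;
    return scan->cur_block++;
}

/* Example callback state: visit every stride-th block from a start. */
typedef struct StrideState
{
    uint32_t next;
    uint32_t stride;
    uint32_t nblocks;
} StrideState;

uint32_t
stride_next_block(void *arg)
{
    StrideState *st = arg;
    uint32_t     blk = st->next;

    if (blk >= st->nblocks)
        return NO_MORE_BLOCKS;
    st->next += st->stride;
    return blk;
}
```

The appeal of this shape is that the scan loop itself stays unchanged; only the source of block numbers differs between a serial scan and a worker taking part in a parallel scan.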

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#20Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#13)
Re: Parallel Seq Scan

On Mon, Dec 8, 2014 at 10:40 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Sat, Dec 6, 2014 at 5:37 PM, Stephen Frost <sfrost@snowman.net> wrote:

So to summarize my understanding, below are the set of things
which I should work on and in the order they are listed.

1. Push down qualification
2. Performance Data
3. Improve the way to push down the information related to worker.
4. Dynamic allocation of work for workers.

I have worked on the patch to accomplish the above-mentioned points
1, 2 and partly 3, and would like to share the progress with the community.
If the statement contains quals that don't have volatile functions, then
they will be pushed down and the parallel scan will be considered for
cost evaluation. I think eventually we might need some better way
to decide which kinds of functions are okay to push down.
I have also unified the way information is passed from the master backend
to worker backends: each node that has to be passed is converted to a
string, and the workers later convert the string back to a node. This has
simplified the related code.

I have taken performance data for different selectivities and complexities of
qual expressions. I understand that there will be other kinds of scenarios
which we need to consider; however, I think the current set of tests is a good
place to start. Please feel free to comment on the kinds of scenarios which you
want me to check.

Performance Data
------------------------------
*m/c details*
IBM POWER-8 24 cores, 192 hardware threads
RAM = 492GB
*non-default settings in postgresql.conf*
max_connections=300
shared_buffers = 8GB
checkpoint_segments = 300
checkpoint_timeout = 30min
max_worker_processes=100

create table tbl_perf(c1 int, c2 char(1000));

30 million rows
------------------------
insert into tbl_perf values(generate_series(1,10000000),'aaaaa');
insert into tbl_perf values(generate_series(10000000,30000000),'aaaaa');

Function used in quals
-----------------------------------
A simple function which will perform some calculation and return
the value passed which can be used in qual condition.

create or replace function calc_factorial(a integer, fact_val integer)
returns integer
as $$
begin
perform (fact_val)!;
return a;
end;
$$ language plpgsql STABLE;

In the data below,
num_workers - number of parallel workers configured using
parallel_seqscan_degree. 0 means a plain sequence scan is executed,
and a value greater than 0 means a parallel sequence scan.

exec_time - Execution Time given by Explain Analyze statement.

*Tests having quals containing function evaluation in qual*
*expressions.*

*Test-1*
*Query -* Explain analyze select c1 from tbl_perf where
c1 > calc_factorial(29700000,10) and c2 like '%aa%';
*Selection_criteria – *1% of rows will be selected

*num_workers*  *exec_time (ms)*
0              229534
2              121741
4              67051
8              35607
16             24743

*Test-2*
*Query - *Explain analyze select c1 from tbl_perf where
c1 > calc_factorial(27000000,10) and c2 like '%aa%';
*Selection_criteria – *10% of rows will be selected

*num_workers*  *exec_time (ms)*
0              226671
2              151587
4              93648
8              70540
16             55466

*Test-3*
*Query -* Explain analyze select c1 from tbl_perf
where c1 > calc_factorial(22500000,10) and c2 like '%aa%';
*Selection_criteria –* 25% of rows will be selected

*num_workers*  *exec_time (ms)*
0              232673
2              197609
4              142686
8              111664
16             98097

*Tests having quals containing simple expressions in qual.*

*Test-4*
*Query - *Explain analyze select c1 from tbl_perf
where c1 > 29700000 and c2 like '%aa%';
*Selection_criteria –* 1% of rows will be selected

*num_workers*  *exec_time (ms)*
0              15505
2              9155
4              6030
8              4523
16             4459
32             8259
64             13388

*Test-5*
*Query - *Explain analyze select c1 from tbl_perf
where c1 > 28500000 and c2 like '%aa%';
*Selection_criteria –* 5% of rows will be selected

*num_workers*  *exec_time (ms)*
0              18906
2              13446
4              8970
8              7887
16             10403

*Test-6*
*Query -* Explain analyze select c1 from tbl_perf
where c1 > 27000000 and c2 like '%aa%';
*Selection_criteria – *10% of rows will be selected

*num_workers*  *exec_time (ms)*
0              16132
2              23780
4              20275
8              11390
16             11418

Conclusion
------------------
1. Parallel workers help a lot when there is an expensive qualification
to be evaluated; the more expensive the qualification, the better the
results.
2. It works well for low selectivity quals and as the selectivity increases,
the benefit tends to go down due to additional tuple communication cost
between workers and master backend.
3. After a certain point, adding more workers won't help and instead has
a negative impact; refer to Test-4.
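
For reading the numbers above, it can help to turn execution times into speedup (serial time divided by parallel time) and per-worker efficiency (speedup divided by worker count). The helper below is only a small arithmetic sketch with hypothetical function names; the figures quoted afterwards are computed from the exec_time values reported in this mail.

```c
#include <assert.h>

/* Speedup of a parallel run relative to the 0-worker (serial) run. */
double
speedup(double serial_ms, double parallel_ms)
{
    assert(parallel_ms > 0.0);
    return serial_ms / parallel_ms;
}

/* Fraction of ideal linear scaling achieved with nworkers workers. */
double
efficiency(double serial_ms, double parallel_ms, int nworkers)
{
    assert(nworkers > 0);
    return speedup(serial_ms, parallel_ms) / nworkers;
}
```

For Test-1 at 16 workers this gives 229534 / 24743, a speedup of roughly 9.3. For Test-4 the speedup is roughly 3.5 at 16 workers (15505 / 4459) but falls to about 1.16 at 64 workers (15505 / 13388), which is the regression that point 3 refers to.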

I think as discussed previously we need to introduce 2 additional cost
variables (parallel_startup_cost, cpu_tuple_communication_cost) to
estimate the parallel seq scan cost so that when the tables are small
or selectivity is high, it should increase the cost of parallel plan.

Thoughts and feedback on the current state of the patch are welcome.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#21Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#20)
1 attachment(s)
Re: Parallel Seq Scan

On Thu, Dec 18, 2014 at 9:22 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Mon, Dec 8, 2014 at 10:40 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sat, Dec 6, 2014 at 5:37 PM, Stephen Frost <sfrost@snowman.net> wrote:

So to summarize my understanding, below are the set of things
which I should work on and in the order they are listed.

1. Push down qualification
2. Performance Data
3. Improve the way to push down the information related to worker.
4. Dynamic allocation of work for workers.

I have worked on the patch to accomplish above mentioned points
1, 2 and partly 3 and would like to share the progress with community.

Sorry forgot to attach updated patch in last mail, attaching it now.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

parallel_seqscan_v2.patch (application/octet-stream)
diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index 21721b4..823d5c3 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/access
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS	    = brin common gin gist hash heap index nbtree rmgrdesc spgist transam
+SUBDIRS	    = brin common gin gist hash heap index nbtree rmgrdesc shmmq spgist transam
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/shmmq/Makefile b/src/backend/access/shmmq/Makefile
new file mode 100644
index 0000000..aeae8d9
--- /dev/null
+++ b/src/backend/access/shmmq/Makefile
@@ -0,0 +1,17 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for access/shmmq
+#
+# IDENTIFICATION
+#    src/backend/access/shmmq/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/shmmq
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = shmmqam.o 
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/shmmq/shmmqam.c b/src/backend/access/shmmq/shmmqam.c
new file mode 100644
index 0000000..91fbea5
--- /dev/null
+++ b/src/backend/access/shmmq/shmmqam.c
@@ -0,0 +1,359 @@
+/*-------------------------------------------------------------------------
+ *
+ * shmmqam.c
+ *	  shared memory queue access method code
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/shmmq/shmmqam.c
+ *
+ *
+ * INTERFACE ROUTINES
+ *		shm_getnext	- retrieve next tuple in queue
+ *
+ * NOTES
+ *	  This file contains the shm_ routines which retrieve tuples from
+ *	  the shared memory queues filled by worker backends during a
+ *	  parallel sequential scan.
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup.h"
+#include "access/htup_details.h"
+#include "access/shmmqam.h"
+#include "access/tupdesc.h"
+#include "fmgr.h"
+#include "libpq/libpq.h"
+#include "libpq/pqformat.h"
+#include "utils/lsyscache.h"
+
+
+static HeapTuple
+form_result_tuple(worker_result resultState, TupleDesc tupdesc,
+				  StringInfo msg);
+
+/*
+ * Indicate that an error came from a particular worker.
+ */
+static void
+worker_error_callback(void *arg)
+{
+	pid_t	pid = * (pid_t *) arg;
+
+	errcontext("worker backend, pid %d", pid);
+}
+
+/*
+ * shm_beginscan -
+ *		Initializes the shared memory scan descriptor to retrieve tuples
+ *		from worker backends. 
+ */
+ShmScanDesc
+shm_beginscan(int num_queues)
+{
+	ShmScanDesc		shmscan;
+
+	shmscan = palloc(sizeof(ShmScanDescData));
+
+	shmscan->num_shm_queues = num_queues;
+	shmscan->ss_cqueue = -1;
+	shmscan->shmscan_inited	= false;
+
+	return shmscan;
+}
+
+/*
+ * ExecInitWorkerResult -
+ *		Initializes the result state to retrieve tuples from worker backends. 
+ */
+worker_result
+ExecInitWorkerResult(TupleDesc tupdesc)
+{
+	worker_result	workerResult;
+	int				i;
+	int	natts = tupdesc->natts;
+
+	workerResult = palloc0(sizeof(worker_result_state));
+	workerResult->receive_functions = palloc(sizeof(FmgrInfo) * natts);
+	workerResult->typioparams = palloc(sizeof(Oid) * natts);
+
+	for (i = 0;	i < natts; ++i)
+	{
+		Oid	receive_function_id;
+
+		getTypeBinaryInputInfo(tupdesc->attrs[i]->atttypid,
+							   &receive_function_id,
+							   &workerResult->typioparams[i]);
+		fmgr_info(receive_function_id, &workerResult->receive_functions[i]);
+	}
+
+	return workerResult;
+}
+
+
+/*
+ * shm_getnext -
+ *		Get the next tuple from shared memory queue.  This function
+ *	is responsible for fetching tuples from all the queues associated
+ *	with worker backends used in parallel sequential scan.
+ */
+HeapTuple
+shm_getnext(ShmScanDesc shmScan, worker_result resultState,
+			shm_mq_handle **responseq, TupleDesc tupdesc)
+{
+	shm_mq_result	res;
+	char			msgtype;
+	Size			nbytes;
+	void			*data;
+	StringInfoData	msg;
+	int32			pid = 1234;
+	int				queueId = 0;
+
+	/*
+	 * calculate next starting queue used for fetching tuples
+	 */
+	if(!shmScan->shmscan_inited)
+	{
+		shmScan->shmscan_inited = true;
+		Assert(shmScan->num_shm_queues > 0);
+		queueId = 0;
+		--shmScan->num_shm_queues;
+	}
+	else
+		queueId = shmScan->ss_cqueue;
+
+	/* Initialize message buffer. */
+	initStringInfo(&msg);
+
+	/* Read and process messages from the shared memory queues. */
+	for(;;)
+	{
+		for (;;)
+		{
+			/*
+			 * mark current queue used for fetching tuples, this is used
+			 * to fetch consecutive tuples from queue used in previous
+			 * fetch.
+			 */
+			shmScan->ss_cqueue = queueId;
+
+			/* Get next message. */
+			res = shm_mq_receive(responseq[queueId], &nbytes, &data, false);
+			if (res != SHM_MQ_SUCCESS)
+				break;
+
+			/*
+			 * Message-parsing routines operate on a null-terminated StringInfo,
+			 * so we must construct one.
+			 */
+			resetStringInfo(&msg);
+			enlargeStringInfo(&msg, nbytes);
+			msg.len = nbytes;
+			memcpy(msg.data, data, nbytes);
+			msg.data[nbytes] = '\0';
+			msgtype = pq_getmsgbyte(&msg);
+
+			/* Dispatch on message type. */
+			switch (msgtype)
+			{
+				case 'E':
+				case 'N':
+					{
+						ErrorData	edata;
+						ErrorContextCallback context;
+
+						/* Parse ErrorResponse or NoticeResponse. */
+						pq_parse_errornotice(&msg, &edata);
+
+						/*
+						 * Limit the maximum error level to ERROR.  We don't want
+						 * a FATAL inside the backend worker to kill the user
+						 * session.
+						 */
+						if (edata.elevel > ERROR)
+							edata.elevel = ERROR;
+
+						/*
+						 * Rethrow the error with an appropriate context method.
+						 * On error, the master backend must stop all other
+						 * workers before propagating the error, so we need to
+						 * pass the pids of all workers to the error callback
+						 * so that it can do so.
+						 * XXX - For now, I am just sending some random number, this
+						 * needs to be fixed.
+						 */
+						context.callback = worker_error_callback;
+						context.arg = (void *) &pid;
+						context.previous = error_context_stack;
+						error_context_stack = &context;
+						ThrowErrorData(&edata);
+						error_context_stack = context.previous;
+
+						break;
+					}
+				case 'A':
+					{
+						/* Propagate NotifyResponse. */
+						pq_putmessage(msg.data[0], &msg.data[1], nbytes - 1);
+						break;
+					}
+				case 'T':
+					{
+						int16	natts = pq_getmsgint(&msg, 2);
+						int16	i;
+
+						if (resultState->has_row_description)
+							elog(ERROR, "multiple RowDescription messages");
+						resultState->has_row_description = true;
+						if (natts != tupdesc->natts)
+							ereport(ERROR,
+									(errcode(ERRCODE_DATATYPE_MISMATCH),
+										errmsg("worker result rowtype does not match "
+										"the specified FROM clause rowtype")));
+
+						for (i = 0; i < natts; ++i)
+						{
+							Oid		type_id;
+
+							(void) pq_getmsgstring(&msg);	/* name */
+							(void) pq_getmsgint(&msg, 4);	/* table OID */
+							(void) pq_getmsgint(&msg, 2);	/* table attnum */
+							type_id = pq_getmsgint(&msg, 4);	/* type OID */
+							(void) pq_getmsgint(&msg, 2);	/* type length */
+							(void) pq_getmsgint(&msg, 4);	/* typmod */
+							(void) pq_getmsgint(&msg, 2);	/* format code */
+
+							if (type_id != tupdesc->attrs[i]->atttypid)
+								ereport(ERROR,
+										(errcode(ERRCODE_DATATYPE_MISMATCH),
+											errmsg("remote query result rowtype does not match "
+											"the specified FROM clause rowtype")));
+						}
+
+						pq_getmsgend(&msg);
+
+						break;
+					}
+				case 'D':
+					{
+						/* Handle DataRow message. */
+						HeapTuple	result;
+
+						result = form_result_tuple(resultState, tupdesc, &msg);
+						return result;
+					}
+				case 'C':
+					{
+						/*
+						 * Handle CommandComplete message. Ignore tags sent by
+						 * worker backend as we are anyway going to use tag of
+						 * master backend for sending the same to client.
+						 */
+						(void) pq_getmsgstring(&msg);
+						break;
+					}
+				case 'G':
+				case 'H':
+				case 'W':
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+									errmsg("COPY protocol not allowed in worker")));
+					}
+
+				case 'Z':
+					{
+						/* Handle ReadyForQuery message. */
+						resultState->complete = true;
+						break;
+					}
+				default:
+					elog(WARNING, "unknown message type: %c (%zu bytes)",
+							msg.data[0], nbytes);
+					break;
+			}
+		}
+
+		/* Check whether the connection was broken prematurely. */
+		if (!resultState->complete)
+			ereport(ERROR,
+					(errcode(ERRCODE_CONNECTION_FAILURE),
+					 errmsg("lost connection to worker process with PID %d",
+					 pid)));
+
+		/*
+		 * if we have exhausted data from all worker queues, then terminate
+		 * processing data from queues.
+		 */
+		if (shmScan->num_shm_queues <=0)
+			break;
+		else
+		{
+			++queueId;
+			--shmScan->num_shm_queues;
+			resultState->has_row_description = false;
+		}
+	}
+
+	return NULL;
+}
+
+/*
+ * form_result_tuple -
+ * Parse a DataRow message and form a result tuple.
+ */
+static HeapTuple
+form_result_tuple(worker_result resultState, TupleDesc tupdesc,
+				  StringInfo msg)
+{
+	/* Handle DataRow message. */
+	int16	natts = pq_getmsgint(msg, 2);
+	int16	i;
+	Datum  *values = NULL;
+	bool   *isnull = NULL;
+	StringInfoData	buf;
+
+	if (!resultState->has_row_description)
+		elog(ERROR, "DataRow not preceded by RowDescription");
+	if (natts != tupdesc->natts)
+		elog(ERROR, "malformed DataRow");
+	if (natts > 0)
+	{
+		values = palloc(natts * sizeof(Datum));
+		isnull = palloc(natts * sizeof(bool));
+	}
+	initStringInfo(&buf);
+
+	for (i = 0; i < natts; ++i)
+	{
+		int32	bytes = pq_getmsgint(msg, 4);
+
+		if (bytes < 0)
+		{
+			values[i] = ReceiveFunctionCall(&resultState->receive_functions[i],
+											NULL,
+											resultState->typioparams[i],
+											tupdesc->attrs[i]->atttypmod);
+			isnull[i] = true;
+		}
+		else
+		{
+			resetStringInfo(&buf);
+			appendBinaryStringInfo(&buf, pq_getmsgbytes(msg, bytes), bytes);
+			values[i] = ReceiveFunctionCall(&resultState->receive_functions[i],
+											&buf,
+											resultState->typioparams[i],
+											tupdesc->attrs[i]->atttypmod);
+			isnull[i] = false;
+		}
+	}
+
+	pq_getmsgend(msg);
+
+	return heap_form_tuple(tupdesc, values, isnull);
+}
\ No newline at end of file
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 332f04a..f158583 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -714,6 +714,7 @@ ExplainPreScanNode(PlanState *planstate, Bitmapset **rels_used)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_ParallelSeqScan:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -910,6 +911,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_SeqScan:
 			pname = sname = "Seq Scan";
 			break;
+		case T_ParallelSeqScan:
+			pname = sname = "Parallel Seq Scan";
+			break;
 		case T_IndexScan:
 			pname = sname = "Index Scan";
 			break;
@@ -1059,6 +1063,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_ParallelSeqScan:
 		case T_BitmapHeapScan:
 		case T_TidScan:
 		case T_SubqueryScan:
@@ -1325,6 +1330,16 @@ ExplainNode(PlanState *planstate, List *ancestors,
 				show_instrumentation_count("Rows Removed by Filter", 1,
 										   planstate, es);
 			break;
+		case T_ParallelSeqScan:
+			show_scan_qual(plan->qual, "Filter", planstate, ancestors, es);
+			if (plan->qual)
+				show_instrumentation_count("Rows Removed by Filter", 1,
+										   planstate, es);
+			ExplainPropertyInteger("Number of Workers",
+				((ParallelSeqScan *) plan)->num_workers, es);
+			ExplainPropertyInteger("Number of Blocks Per Worker",
+				((ParallelSeqScan *) plan)->num_blocks_per_worker, es);
+			break;
 		case T_FunctionScan:
 			if (es->verbose)
 			{
@@ -2142,6 +2157,7 @@ ExplainTargetRel(Plan *plan, Index rti, ExplainState *es)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_ParallelSeqScan:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index af707b0..9a8ca75 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -21,7 +21,7 @@ OBJS = execAmi.o execCurrent.o execGrouping.o execJunk.o execMain.o \
        nodeLimit.o nodeLockRows.o \
        nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
        nodeNestloop.o nodeFunctionscan.o nodeRecursiveunion.o nodeResult.o \
-       nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
+       nodeSeqscan.o nodeParallelSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
        nodeValuesscan.o nodeCtescan.o nodeWorktablescan.o \
        nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
        nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o spi.o
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index e27c062..a28e74e 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -100,6 +100,7 @@
 #include "executor/nodeMergejoin.h"
 #include "executor/nodeModifyTable.h"
 #include "executor/nodeNestloop.h"
+#include "executor/nodeParallelSeqscan.h"
 #include "executor/nodeRecursiveunion.h"
 #include "executor/nodeResult.h"
 #include "executor/nodeSeqscan.h"
@@ -190,6 +191,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												   estate, eflags);
 			break;
 
+		case T_ParallelSeqScan:
+			result = (PlanState *) ExecInitParallelSeqScan((ParallelSeqScan *) node,
+														   estate, eflags);
+			break;
+
 		case T_IndexScan:
 			result = (PlanState *) ExecInitIndexScan((IndexScan *) node,
 													 estate, eflags);
@@ -406,6 +412,10 @@ ExecProcNode(PlanState *node)
 			result = ExecSeqScan((SeqScanState *) node);
 			break;
 
+		case T_ParallelSeqScanState:
+			result = ExecParallelSeqScan((ParallelSeqScanState *) node);
+			break;
+
 		case T_IndexScanState:
 			result = ExecIndexScan((IndexScanState *) node);
 			break;
@@ -644,6 +654,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSeqScan((SeqScanState *) node);
 			break;
 
+		case T_ParallelSeqScanState:
+			ExecEndParallelSeqScan((ParallelSeqScanState *) node);
+			break;
+
 		case T_IndexScanState:
 			ExecEndIndexScan((IndexScanState *) node);
 			break;
diff --git a/src/backend/executor/execScan.c b/src/backend/executor/execScan.c
index 1319519..4f73a53 100644
--- a/src/backend/executor/execScan.c
+++ b/src/backend/executor/execScan.c
@@ -118,7 +118,7 @@ ExecScan(ScanState *node,
 	/*
 	 * Fetch data from node
 	 */
-	qual = node->ps.qual;
+	qual = node->ps.qualPushed ? NIL : node->ps.qual;
 	projInfo = node->ps.ps_ProjInfo;
 	econtext = node->ps.ps_ExprContext;
 
diff --git a/src/backend/executor/nodeParallelSeqscan.c b/src/backend/executor/nodeParallelSeqscan.c
new file mode 100644
index 0000000..b04fae1
--- /dev/null
+++ b/src/backend/executor/nodeParallelSeqscan.c
@@ -0,0 +1,288 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeParallelSeqscan.c
+ *	  Support routines for parallel sequential scans of relations.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeParallelSeqscan.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * INTERFACE ROUTINES
+ *		ExecParallelSeqScan				sequentially scans a relation.
+ *		ParallelSeqNext				retrieve next tuple from the worker queues.
+ *		ExecInitParallelSeqScan			creates and initializes a parallel seqscan node.
+ *		ExecEndParallelSeqScan			releases any storage allocated.
+ */
+#include "postgres.h"
+
+#include "access/relscan.h"
+#include "access/shmmqam.h"
+#include "commands/dbcommands.h"
+#include "executor/execdebug.h"
+#include "executor/nodeSeqscan.h"
+#include "executor/nodeParallelSeqscan.h"
+#include "postmaster/backendworker.h"
+#include "utils/rel.h"
+
+
+
+/* ----------------------------------------------------------------
+ *						Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ParallelSeqNext
+ *
+ *		This is a workhorse for ExecParallelSeqScan
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ParallelSeqNext(ParallelSeqScanState *node)
+{
+	HeapTuple	tuple;
+	HeapScanDesc scandesc;
+	EState	   *estate;
+	ScanDirection direction;
+	TupleTableSlot *slot;
+
+	/*
+	 * get information from the estate and scan state
+	 */
+	scandesc = node->ss.ss_currentScanDesc;
+	estate = node->ss.ps.state;
+	direction = estate->es_direction;
+	slot = node->ss.ss_ScanTupleSlot;
+
+	/*
+	 * get the next tuple from the table based on result tuple descriptor.
+	 */
+	tuple = shm_getnext(node->pss_currentShmScanDesc, node->pss_workerResult,
+						node->responseq,
+						node->ss.ps.ps_ResultTupleSlot->tts_tupleDescriptor);
+
+	/*
+	 * save the tuple and the buffer returned to us by the access methods in
+	 * our scan tuple slot and return the slot.  Note: we pass 'false' because
+	 * tuples returned by heap_getnext() are pointers onto disk pages and were
+	 * not created with palloc() and so should not be pfree()'d.  Note also
+	 * that ExecStoreTuple will increment the refcount of the buffer; the
+	 * refcount will not be dropped until the tuple table slot is cleared.
+	 */
+	if (tuple)
+		ExecStoreTuple(tuple,	/* tuple to store */
+					   slot,	/* slot to store in */
+					   scandesc->rs_cbuf,		/* buffer associated with this
+												 * tuple */
+					   false);	/* don't pfree this pointer */
+	else
+		ExecClearTuple(slot);
+
+	return slot;
+}
+
+/*
+ * ParallelSeqRecheck -- access method routine to recheck a tuple in EvalPlanQual
+ */
+static bool
+ParallelSeqRecheck(SeqScanState *node, TupleTableSlot *slot)
+{
+	/*
+	 * Note that unlike IndexScan, ParallelSeqScan never uses keys in
+	 * heap_beginscan (and this is very bad) - so here we do not check
+	 * whether the keys are ok or not.
+	 */
+	return true;
+}
+
+/* ----------------------------------------------------------------
+ *		InitParallelScanRelation
+ *
+ *		Set up to access the scan relation.
+ * ----------------------------------------------------------------
+ */
+static void
+InitParallelScanRelation(SeqScanState *node, EState *estate, int eflags)
+{
+	Relation	currentRelation;
+	HeapScanDesc currentScanDesc;
+
+	/*
+	 * get the relation object id from the relid'th entry in the range table,
+	 * open that relation and acquire appropriate lock on it.
+	 */
+	currentRelation = ExecOpenScanRelation(estate,
+									  ((SeqScan *) node->ps.plan)->scanrelid,
+										   eflags);
+
+	/* initialize a heapscan */
+	currentScanDesc = heap_beginscan(currentRelation,
+									 estate->es_snapshot,
+									 0,
+									 NULL);
+
+	node->ss_currentRelation = currentRelation;
+	node->ss_currentScanDesc = currentScanDesc;
+
+	/* and report the scan tuple slot's rowtype */
+	ExecAssignScanType(node, RelationGetDescr(currentRelation));
+}
+
+
+/* ----------------------------------------------------------------
+ *		ExecInitParallelSeqScan
+ * ----------------------------------------------------------------
+ */
+ParallelSeqScanState *
+ExecInitParallelSeqScan(ParallelSeqScan *node, EState *estate, int eflags)
+{
+	ParallelSeqScanState *parallelscanstate;
+	ShmScanDesc			 currentShmScanDesc;
+	worker_result		 workerResult;
+
+	/*
+	 * Once upon a time it was possible to have an outerPlan of a SeqScan, but
+	 * not any more.
+	 */
+	Assert(outerPlan(node) == NULL);
+	Assert(innerPlan(node) == NULL);
+
+	/*
+	 * create state structure
+	 */
+	parallelscanstate = makeNode(ParallelSeqScanState);
+	parallelscanstate->ss.ps.plan = (Plan *) node;
+	parallelscanstate->ss.ps.state = estate;
+
+	/*
+	 * for parallel seq scan, qual is always pushed to be
+	 * evaluated by backend worker.
+	 */
+	parallelscanstate->ss.ps.qualPushed = true;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * create expression context for node
+	 */
+	ExecAssignExprContext(estate, &parallelscanstate->ss.ps);
+
+	/*
+	 * initialize child expressions
+	 */
+	parallelscanstate->ss.ps.targetlist = (List *)
+		ExecInitExpr((Expr *) node->scan.plan.targetlist,
+					 (PlanState *) parallelscanstate);
+	parallelscanstate->ss.ps.qual = (List *)
+		ExecInitExpr((Expr *) node->scan.plan.qual,
+					 (PlanState *) parallelscanstate);
+
+	/*
+	 * tuple table initialization
+	 */
+	ExecInitResultTupleSlot(estate, &parallelscanstate->ss.ps);
+	ExecInitScanTupleSlot(estate, &parallelscanstate->ss);
+
+	/*
+	 * initialize scan relation
+	 */
+	InitParallelScanRelation(&parallelscanstate->ss, estate, eflags);
+
+	parallelscanstate->ss.ps.ps_TupFromTlist = false;
+
+	/*
+	 * Initialize result tuple type and projection info.
+	 */
+	ExecAssignResultTypeFromTL(&parallelscanstate->ss.ps);
+	ExecAssignScanProjectionInfo(&parallelscanstate->ss);
+
+	/* Initialize the workers required to perform parallel scan. */
+	InitiateWorkers(parallelscanstate->ss.ss_currentRelation->rd_id,
+					node->scan.plan.targetlist,
+					node->scan.plan.qual,
+					&parallelscanstate->responseq,
+					&parallelscanstate->seg,
+					node->num_blocks_per_worker,
+					node->num_workers);
+
+	
+	/*
+	 * Use the result tuple descriptor to fetch data from shared memory queues,
+	 * as the worker backends would have put the data after projection.
+	 * The number of queues must be equal to the number of worker backends.
+	 */
+	currentShmScanDesc = shm_beginscan(node->num_workers);
+	workerResult = ExecInitWorkerResult(parallelscanstate->ss.ps.ps_ResultTupleSlot->tts_tupleDescriptor);
+
+	parallelscanstate->pss_currentShmScanDesc = currentShmScanDesc;
+	parallelscanstate->pss_workerResult	= workerResult;
+
+	return parallelscanstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecParallelSeqScan(node)
+ *
+ *		Scans the relation sequentially from multiple workers and returns
+ *		the next qualifying tuple.
+ *		We call the ExecScan() routine and pass it the appropriate
+ *		access method functions.
+ * ----------------------------------------------------------------
+ */
+TupleTableSlot *
+ExecParallelSeqScan(ParallelSeqScanState *node)
+{
+	return ExecScan((ScanState *) &node->ss,
+					(ExecScanAccessMtd) ParallelSeqNext,
+					(ExecScanRecheckMtd) ParallelSeqRecheck);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndParallelSeqScan
+ *
+ *		frees any storage allocated through C routines.
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndParallelSeqScan(ParallelSeqScanState *node)
+{
+	Relation	relation;
+	HeapScanDesc scanDesc;
+
+	/*
+	 * get information from node
+	 */
+	relation = node->ss.ss_currentRelation;
+	scanDesc = node->ss.ss_currentScanDesc;
+
+	/*
+	 * Free the exprcontext
+	 */
+	ExecFreeExprContext(&node->ss.ps);
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+
+	/*
+	 * close heap scan
+	 */
+	heap_endscan(scanDesc);
+
+	/*
+	 * close the heap relation.
+	 */
+	ExecCloseScanRelation(relation);
+
+	/* detach from dynamic shared memory. */
+	dsm_detach(node->seg);
+}
+
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 53cfda5..131cfc5 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -139,6 +139,22 @@ InitScanRelation(SeqScanState *node, EState *estate, int eflags)
 									 0,
 									 NULL);
 
+	/*
+	 * set the scan limits, if requested by plan.  If the end block
+	 * is not specified, then scan all the blocks till end.
+	 */
+	if (((SeqScan *) node->ps.plan)->startblock != InvalidBlockNumber &&
+		((SeqScan *) node->ps.plan)->endblock != InvalidBlockNumber)
+		heap_setscanlimits(currentScanDesc,
+						   ((SeqScan *) node->ps.plan)->startblock,
+						   (((SeqScan *) node->ps.plan)->endblock -
+						   ((SeqScan *) node->ps.plan)->startblock));
+	else if (((SeqScan *) node->ps.plan)->startblock != InvalidBlockNumber)
+			 heap_setscanlimits(currentScanDesc,
+								((SeqScan *) node->ps.plan)->startblock,
+								(currentScanDesc->rs_nblocks -
+								((SeqScan *) node->ps.plan)->startblock));
+
 	node->ss_currentRelation = currentRelation;
 	node->ss_currentScanDesc = currentScanDesc;
 
diff --git a/src/backend/optimizer/path/Makefile b/src/backend/optimizer/path/Makefile
index 6864a62..6e462b1 100644
--- a/src/backend/optimizer/path/Makefile
+++ b/src/backend/optimizer/path/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = allpaths.o clausesel.o costsize.o equivclass.o indxpath.o \
-       joinpath.o joinrels.o pathkeys.o tidpath.o
+       joinpath.o joinrels.o pathkeys.o parallelpath.o tidpath.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 449fdc3..dfd3b52 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -410,6 +410,9 @@ set_plain_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
 	/* Consider sequential scan */
 	add_path(rel, create_seqscan_path(root, rel, required_outer));
 
+	/* Consider parallel scans */
+	create_parallelscan_paths(root, rel);
+
 	/* Consider index scans */
 	create_index_paths(root, rel);
 
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 659daa2..0296323 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -106,6 +106,8 @@ int			effective_cache_size = DEFAULT_EFFECTIVE_CACHE_SIZE;
 
 Cost		disable_cost = 1.0e10;
 
+int	parallel_seqscan_degree = 0;
+
 bool		enable_seqscan = true;
 bool		enable_indexscan = true;
 bool		enable_indexonlyscan = true;
@@ -219,6 +221,63 @@ cost_seqscan(Path *path, PlannerInfo *root,
 }
 
 /*
+ * cost_parallelseqscan
+ *	  Determines and returns the cost of scanning a relation in parallel.
+ *
+ * 'baserel' is the relation to be scanned
+ * 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
+ */
+void
+cost_parallelseqscan(ParallelSeqPath *path, PlannerInfo *root,
+			 RelOptInfo *baserel, ParamPathInfo *param_info, int nWorkers)
+{
+	Cost		startup_cost = 0;
+	Cost		run_cost = 0;
+	double		spc_seq_page_cost;
+	QualCost	qpqual_cost;
+	Cost		cpu_per_tuple;
+
+	/* Should only be applied to base relations */
+	Assert(baserel->relid > 0);
+	Assert(baserel->rtekind == RTE_RELATION);
+
+	/* Mark the path with the correct row estimate */
+	if (param_info)
+		path->path.rows = param_info->ppi_rows;
+	else
+		path->path.rows = baserel->rows;
+
+	if (!enable_seqscan)
+		startup_cost += disable_cost;
+
+	/* fetch estimated page cost for tablespace containing table */
+	get_tablespace_page_costs(baserel->reltablespace,
+							  NULL,
+							  &spc_seq_page_cost);
+
+	/*
+	 * disk costs
+	 */
+	run_cost += spc_seq_page_cost * baserel->pages;
+
+	/* CPU costs */
+	get_restriction_qual_cost(root, baserel, param_info, &qpqual_cost);
+
+	startup_cost += qpqual_cost.startup;
+	cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple;
+	run_cost += cpu_per_tuple * baserel->tuples;
+
+	/*
+	 * We simply assume that cost will be equally shared by parallel
+	 * workers which might not be true especially for doing disk access.
+	 * XXX - We would like to change these values based on some concrete
+	 * tests.
+	 */
+	path->path.startup_cost = startup_cost / nWorkers;
+	path->path.total_cost = (startup_cost + run_cost) / nWorkers;
+}
+
+/*
  * cost_index
  *	  Determines and returns the cost of scanning a relation using an index.
  *
diff --git a/src/backend/optimizer/path/parallelpath.c b/src/backend/optimizer/path/parallelpath.c
new file mode 100644
index 0000000..5245652
--- /dev/null
+++ b/src/backend/optimizer/path/parallelpath.c
@@ -0,0 +1,126 @@
+/*-------------------------------------------------------------------------
+ *
+ * parallelpath.c
+ *	  Routines to determine which conditions are usable for scanning
+ *	  a given relation, and create ParallelPaths accordingly.
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/optimizer/path/parallelpath.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "optimizer/cost.h"
+#include "optimizer/pathnode.h"
+#include "optimizer/paths.h"
+#include "optimizer/restrictinfo.h"
+#include "optimizer/clauses.h"
+
+
+/*
+ *	IsTargetListContainNonVars -
+ *		Check if target list contains non-var entries.
+ */
+static bool
+IsTargetListContainNonVars(List *targetlist)
+{
+	ListCell   *l;
+
+	foreach(l, targetlist)
+	{
+		TargetEntry *te = (TargetEntry *) lfirst(l);
+
+		if (!IsA(te, TargetEntry))
+			continue;			/* probably should never happen */
+		if (!IsA(te->expr, Var))
+			return true;
+	}
+	return false;
+}
+
+/*
+ *	check_simple_qual -
+ *		Check if qual is made only of simple things we can
+ *		hand out directly to backend worker for execution.
+ *
+ *		XXX - Currently we don't allow pushing down an expression
+ *		if it contains a volatile function, however eventually we
+ *		need a mechanism (proisparallel) with which we can distinguish
+ *		the functions that can be pushed for execution by a parallel
+ *		worker.
+ */
+static bool
+check_simple_qual(Node *node)
+{
+	if (node == NULL)
+		return TRUE;
+
+	if (contain_volatile_functions(node))
+		return FALSE;
+
+	return TRUE;
+}
+
+/*
+ * create_parallelscan_paths
+ *	  Create paths corresponding to parallel scans of the given rel.
+ *	  Currently we only support parallel sequential scan.
+ *
+ *	  Candidate paths are added to the rel's pathlist (using add_path).
+ */
+void
+create_parallelscan_paths(PlannerInfo *root, RelOptInfo *rel)
+{
+	int num_parallel_workers = 0;
+
+	/*
+	 * parallel scan is possible only if user has set
+	 * parallel_seqscan_degree to value greater than 0.
+	 */
+	if (parallel_seqscan_degree <= 0)
+		return;
+
+	/*
+	 * parallel scan is not supported for joins.
+	 */
+	if (root->simple_rel_array_size > 2)
+		return;
+
+	/* parallel scan is supported only for Select statements. */
+	if (root->parse->commandType != CMD_SELECT)
+		return;
+
+	/*
+	 * parallel scan is not supported for non-var target list.
+	 *
+	 * XXX - This is to keep the implementation simple, we can do this
+	 * in future.  Here we are checking by passing root->parse->targetList
+	 * instead of rel->reltargetlist because rel->reltargetlist always
+	 * contains Vars (refer build_base_rel_tlists).
+	 */
+	if (IsTargetListContainNonVars(root->parse->targetList))
+	   return;
+
+	/*
+	 * parallel scan is not supported for quals with volatile functions
+	 */
+	if (!check_simple_qual((Node*) extract_actual_clauses(rel->baserestrictinfo, false)))
+		return;
+
+	/*
+	 * There should be at least one page to scan for each worker.
+	 */
+	if (parallel_seqscan_degree <= rel->pages)
+		num_parallel_workers = parallel_seqscan_degree;
+	else
+		num_parallel_workers = rel->pages;
+
+	add_path(rel, (Path *) create_parallelseqscan_path(root, rel,
+													   num_parallel_workers));
+}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 8f9ae4f..91a38e2 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -58,6 +58,9 @@ static Material *create_material_plan(PlannerInfo *root, MaterialPath *best_path
 static Plan *create_unique_plan(PlannerInfo *root, UniquePath *best_path);
 static SeqScan *create_seqscan_plan(PlannerInfo *root, Path *best_path,
 					List *tlist, List *scan_clauses);
+static Scan *create_parallelseqscan_plan(PlannerInfo *root,
+										 ParallelSeqPath *best_path,
+										 List *tlist, List *scan_clauses);
 static Scan *create_indexscan_plan(PlannerInfo *root, IndexPath *best_path,
 					  List *tlist, List *scan_clauses, bool indexonly);
 static BitmapHeapScan *create_bitmap_scan_plan(PlannerInfo *root,
@@ -100,6 +103,9 @@ static List *order_qual_clauses(PlannerInfo *root, List *clauses);
 static void copy_path_costsize(Plan *dest, Path *src);
 static void copy_plan_costsize(Plan *dest, Plan *src);
 static SeqScan *make_seqscan(List *qptlist, List *qpqual, Index scanrelid);
+static ParallelSeqScan *make_parallelseqscan(List *qptlist, List *qpqual,
+											 Index scanrelid, int nworkers,
+											 BlockNumber nblocksperworker);
 static IndexScan *make_indexscan(List *qptlist, List *qpqual, Index scanrelid,
 			   Oid indexid, List *indexqual, List *indexqualorig,
 			   List *indexorderby, List *indexorderbyorig,
@@ -228,6 +234,7 @@ create_plan_recurse(PlannerInfo *root, Path *best_path)
 	switch (best_path->pathtype)
 	{
 		case T_SeqScan:
+		case T_ParallelSeqScan:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -343,6 +350,13 @@ create_scan_plan(PlannerInfo *root, Path *best_path)
 												scan_clauses);
 			break;
 
+		case T_ParallelSeqScan:
+			plan = (Plan *) create_parallelseqscan_plan(root,
+														(ParallelSeqPath *) best_path,
+														tlist,
+														scan_clauses);
+			break;
+
 		case T_IndexScan:
 			plan = (Plan *) create_indexscan_plan(root,
 												  (IndexPath *) best_path,
@@ -1133,6 +1147,71 @@ create_seqscan_plan(PlannerInfo *root, Path *best_path,
 }
 
 /*
+ * create_worker_seqscan_plan
+ *	 Returns a seqscan plan for the base relation scanned by worker
+ *	 with restriction clauses 'scan_clauses' and targetlist 'tlist'.
+ */
+SeqScan *
+create_worker_seqscan_plan(List *targetList, List *scan_clauses,
+						   BlockNumber startBlock, BlockNumber endBlock)
+{
+	SeqScan    *scan_plan;
+
+	/*
+	 * Pass scan_relid as 1; this is okay for now because a sequential scan
+	 * worker is allowed to operate on just one relation.
+	 * XXX - we should ideally get scanrelid from master backend.
+	 */
+	scan_plan = make_seqscan(targetList,
+							 scan_clauses,
+							 1);
+
+	scan_plan->startblock = startBlock;
+	scan_plan->endblock = endBlock;
+	return scan_plan;
+}
+
+/*
+ * create_parallelseqscan_plan
+ *	 Returns a parallel seqscan plan for the base relation scanned by 'best_path'
+ *	 with restriction clauses 'scan_clauses' and targetlist 'tlist'.
+ */
+static Scan *
+create_parallelseqscan_plan(PlannerInfo *root, ParallelSeqPath *best_path,
+					List *tlist, List *scan_clauses)
+{
+	Scan    *scan_plan;
+	Index		scan_relid = best_path->path.parent->relid;
+
+	/* it should be a base rel... */
+	Assert(scan_relid > 0);
+	Assert(best_path->path.parent->rtekind == RTE_RELATION);
+
+	/* Sort clauses into best execution order */
+	scan_clauses = order_qual_clauses(root, scan_clauses);
+
+	/* Reduce RestrictInfo list to bare expressions; ignore pseudoconstants */
+	scan_clauses = extract_actual_clauses(scan_clauses, false);
+
+	/* Replace any outer-relation variables with nestloop params */
+	if (best_path->path.param_info)
+	{
+		scan_clauses = (List *)
+			replace_nestloop_params(root, (Node *) scan_clauses);
+	}
+
+	scan_plan = (Scan *) make_parallelseqscan(tlist,
+											  scan_clauses,
+											  scan_relid,
+											  best_path->num_workers,
+											  best_path->num_blocks_per_worker);
+
+	copy_path_costsize(&scan_plan->plan, &best_path->path);
+
+	return scan_plan;
+}
+
+/*
  * create_indexscan_plan
  *	  Returns an indexscan plan for the base relation scanned by 'best_path'
  *	  with restriction clauses 'scan_clauses' and targetlist 'tlist'.
@@ -3314,6 +3393,30 @@ make_seqscan(List *qptlist,
 	plan->lefttree = NULL;
 	plan->righttree = NULL;
 	node->scanrelid = scanrelid;
+	node->startblock = InvalidBlockNumber;
+	node->endblock = InvalidBlockNumber;
+
+	return node;
+}
+
+static ParallelSeqScan *
+make_parallelseqscan(List *qptlist,
+			   List *qpqual,
+			   Index scanrelid,
+			   int nworkers,
+			   BlockNumber nblocksperworker)
+{
+	ParallelSeqScan *node = makeNode(ParallelSeqScan);
+	Plan	   *plan = &node->scan.plan;
+
+	/* cost should be inserted by caller */
+	plan->targetlist = qptlist;
+	plan->qual = qpqual;
+	plan->lefttree = NULL;
+	plan->righttree = NULL;
+	node->scan.scanrelid = scanrelid;
+	node->num_workers = nworkers;
+	node->num_blocks_per_worker = nblocksperworker;
 
 	return node;
 }
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index f752ecc..34cf588 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -260,6 +260,59 @@ standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
 	return result;
 }
 
+/*
+ * create_worker_seqscan_plannedstmt
+ *	Returns a planned statement to be used by worker for execution.
+ *	Ideally, master backend should form worker's planned statement
+ *	and pass the same to worker, however for now master backend
+ *	just passes the required information and PlannedStmt is then
+ *	constructed by worker.
+ */
+PlannedStmt	*
+create_worker_seqscan_plannedstmt(worker_stmt *workerstmt)
+{
+	AclMode		required_access = ACL_SELECT;
+	RangeTblEntry *rte;
+	SeqScan    *scan_plan;
+	PlannedStmt	*result;
+
+	rte = makeNode(RangeTblEntry);
+	rte->rtekind = RTE_RELATION;
+	rte->relid = workerstmt->relId;
+	rte->relkind = 'r';
+	rte->requiredPerms = required_access;
+
+	/* Fill in opfuncid values if missing */
+	fix_opfuncids((Node*) workerstmt->qual);
+
+	scan_plan = create_worker_seqscan_plan(workerstmt->targetList,
+										   workerstmt->qual,
+										   workerstmt->startBlock,
+										   workerstmt->endBlock);
+
+	/* build the PlannedStmt result */
+	result = makeNode(PlannedStmt);
+
+	result->commandType = CMD_SELECT;
+	result->queryId = 0;
+	result->hasReturning = false;
+	result->hasModifyingCTE = false;
+	result->canSetTag = true;
+	result->transientPlan = false;
+	result->planTree = (Plan*) scan_plan;
+	result->rtable = list_make1(rte);
+	result->resultRelations = NIL;
+	result->utilityStmt = NULL;
+	result->subplans = NIL;
+	result->rewindPlanIDs = NULL;
+	result->rowMarks = NIL;
+	result->relationOids = lappend_oid(result->relationOids, rte->relid);
+	result->invalItems = NIL;
+	result->nParamExec = 0;
+	result->hasRowSecurity = false;
+
+	return result;
+}
 
 /*--------------------
  * subquery_planner
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 4d3fbca..bb8af32 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -436,6 +436,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_ParallelSeqScan:
 			{
 				SeqScan    *splan = (SeqScan *) plan;
 
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 319e8b2..ce3df40 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -706,6 +706,37 @@ create_seqscan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer)
 }
 
 /*
+ * create_parallelseqscan_path
+ *	  Creates a path corresponding to a parallel sequential scan, returning the
+ *	  pathnode.
+ */
+ParallelSeqPath *
+create_parallelseqscan_path(PlannerInfo *root, RelOptInfo *rel, int nWorkers)
+{
+	ParallelSeqPath	   *pathnode = makeNode(ParallelSeqPath);
+
+	pathnode->path.pathtype = T_ParallelSeqScan;
+	pathnode->path.parent = rel;
+	pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
+													 false);
+	pathnode->path.pathkeys = NIL;	/* seqscan has unordered result */
+
+	pathnode->num_workers = nWorkers;
+	/*
+	 * Divide the work equally among all the workers.  When the division
+	 * is not exact (for example, with 10 blocks and 3 workers the
+	 * calculation below gives 3 blocks per worker), the last worker is
+	 * responsible for scanning the remaining blocks as well (refer to
+	 * exec_worker_message).
+	 */
+	pathnode->num_blocks_per_worker = rel->pages / nWorkers;
+
+	cost_parallelseqscan(pathnode, root, rel, pathnode->path.param_info, nWorkers);
+
+	return pathnode;
+}
+
+/*
  * create_index_path
  *	  Creates a path node for an index scan.
  *
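Note on the block division above (not part of the patch): the path stores rel->pages / nWorkers, and each worker later derives its own range from its 1-based worker number, with the last worker left open-ended. The sketch below reproduces that arithmetic in isolation. The WorkerRange struct and the worker_block_range helper are hypothetical names; BlockNumber and InvalidBlockNumber are redefined locally to mirror PostgreSQL's definitions. Reading the range as half-open, [start, end), is an assumption based on how start_block is derived from end_block.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t BlockNumber;	/* mirrors PostgreSQL's typedef */
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/* Hypothetical container for one worker's share of the relation. */
typedef struct WorkerRange
{
	BlockNumber start;
	BlockNumber end;	/* InvalidBlockNumber = all remaining blocks */
} WorkerRange;

/*
 * Hypothetical helper: compute the range for a 1-based worker number,
 * following the arithmetic used in the patch.
 */
static WorkerRange
worker_block_range(BlockNumber total_pages, int nworkers, int worker_number)
{
	WorkerRange r;
	BlockNumber per_worker;

	assert(nworkers > 0);
	assert(worker_number >= 1 && worker_number <= nworkers);

	per_worker = total_pages / (BlockNumber) nworkers;
	r.end = (BlockNumber) worker_number * per_worker;
	r.start = r.end - per_worker;

	/* The last worker also picks up the remainder of the division. */
	if (worker_number == nworkers)
		r.end = InvalidBlockNumber;

	return r;
}
```

With 10 blocks and 3 workers, workers 1 and 2 get blocks starting at 0 and 3 with three blocks each, and worker 3 starts at block 6 and scans through the end of the relation.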
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 71c2321..f056bd5 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -12,7 +12,8 @@ subdir = src/backend/postmaster
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = autovacuum.o bgworker.o bgwriter.o checkpointer.o fork_process.o \
-	pgarch.o pgstat.o postmaster.o startup.o syslogger.o walwriter.o
+OBJS = autovacuum.o backendworker.o bgworker.o bgwriter.o checkpointer.o \
+	fork_process.o pgarch.o pgstat.o postmaster.o startup.o syslogger.o \
+	walwriter.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/backendworker.c b/src/backend/postmaster/backendworker.c
new file mode 100644
index 0000000..89d9aa2
--- /dev/null
+++ b/src/backend/postmaster/backendworker.c
@@ -0,0 +1,607 @@
+/*-------------------------------------------------------------------------
+ *
+ * backendworker.c
+ *	  Support routines for setting up backend workers.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/postmaster/backendworker.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * INTERFACE ROUTINES
+ *		InitiateWorkers				Setup dynamic shared memory and parallel backend workers.
+ */
+#include "postgres.h"
+
+#include "access/xact.h"
+#include "commands/dbcommands.h"
+#include "commands/async.h"
+#include "executor/nodeParallelSeqscan.h"
+#include "miscadmin.h"
+#include "nodes/parsenodes.h"
+#include "postmaster/backendworker.h"
+#include "storage/ipc.h"
+#include "storage/procsignal.h"
+#include "storage/procarray.h"
+#include "storage/shm_toc.h"
+#include "storage/spin.h"
+#include "tcop/tcopprot.h"
+#include "utils/builtins.h"
+#include "utils/memutils.h"
+#include "utils/resowner.h"
+
+
+#define SHM_PARALLEL_SCAN_QUEUE_SIZE					65536
+
+/*
+ * This structure is stored in the dynamic shared memory segment.  We use
+ * it to determine whether all workers started up OK and successfully
+ * attached to their respective shared message queues.
+ */
+typedef struct
+{
+	slock_t		mutex;
+	int			workers_total;
+	int			workers_attached;
+	int			workers_ready;
+} shm_mq_header;
+
+/* Fixed-size data passed via our dynamic shared memory segment. */
+typedef struct worker_fixed_data
+{
+	Oid	database_id;
+	Oid	authenticated_user_id;
+	Oid	current_user_id;
+	int	sec_context;
+	NameData	database;
+	NameData	authenticated_user;
+} worker_fixed_data;
+
+/* Private state maintained by the launching backend for IPC. */
+typedef struct worker_info
+{
+	pid_t		pid;
+	Oid			current_user_id;
+	dsm_segment *seg;
+	BackgroundWorkerHandle *handle;
+	shm_mq_handle *responseq;
+	bool		consumed;
+} worker_info;
+
+typedef struct
+{
+	int			nworkers;
+	BackgroundWorkerHandle *handle[FLEXIBLE_ARRAY_MEMBER];
+} worker_state;
+
+
+/* Table-of-contents constants for our dynamic shared memory segment. */
+#define PG_WORKER_MAGIC				0x50674267
+#define PG_WORKER_KEY_HDR_DATA		0
+#define PG_WORKER_KEY_FIXED_DATA	1
+#define PG_WORKER_KEY_RELID			2
+#define PG_WORKER_KEY_TARGETLIST	3
+#define PG_WORKER_KEY_QUAL			4
+#define PG_WORKER_KEY_BLOCKS		5
+#define PG_WORKER_FIXED_NKEYS		6
+
+void
+exec_worker_message(Datum) __attribute__((noreturn));
+
+static void
+setup_dynamic_shared_memory(Oid relId, List *targetList, List *qual,
+							shm_mq_handle ***responseq,
+							dsm_segment **segp, shm_mq_header **hdrp,
+							BlockNumber numBlocksPerWorker, int nWorkers);
+static worker_state *setup_backend_workers(dsm_segment *seg, int nworkers);
+static void cleanup_background_workers(dsm_segment *seg, Datum arg);
+static void
+wait_for_workers_to_become_ready(worker_state *wstate,
+								 volatile shm_mq_header *hdr);
+static bool check_worker_status(worker_state *wstate);
+static void bkworker_sigterm_handler(SIGNAL_ARGS);
+
+
+/*
+ * InitiateWorkers
+ *		It sets up the required infrastructure for backend workers to
+ *	perform execution and return results to the main backend.
+ */
+void
+InitiateWorkers(Oid relId, List *targetList, List *qual,
+				shm_mq_handle ***responseqp, dsm_segment **segp,
+				BlockNumber numBlocksPerWorker, int nWorkers)
+{
+	shm_mq_header *hdr;
+	worker_state *wstate;
+	int			i;
+
+	/* Create dynamic shared memory and table of contents. */
+	setup_dynamic_shared_memory(relId, targetList, qual, responseqp,
+								segp, &hdr, numBlocksPerWorker, nWorkers);
+
+	/* Register backend workers. */
+	wstate = setup_backend_workers(*segp, nWorkers);
+
+	for (i = 0; i < nWorkers; ++i)
+		shm_mq_set_handle((*responseqp)[i], wstate->handle[i]);
+
+	/* Wait for workers to become ready. */
+	wait_for_workers_to_become_ready(wstate, hdr);
+}
+
+/*
+ * Set up a dynamic shared memory segment.
+ *
+ * The segment holds a shm_mq_header control region, the fixed worker data,
+ * the scan information (relation OID, target list, qual, and blocks per
+ * worker), and one message queue per worker.
+ */
+static void
+setup_dynamic_shared_memory(Oid relId, List *targetList, List *qual,
+							shm_mq_handle ***responseqp,
+							dsm_segment **segp, shm_mq_header **hdrp,
+							BlockNumber numBlocksPerWorker, int nWorkers)
+{
+	Size		segsize, targetlist_len, qual_len;
+	dsm_segment *seg;
+	shm_toc_estimator e;
+	shm_toc    *toc;
+	worker_fixed_data *fdata;
+	Oid		   *reliddata;
+	char	   *targetlistdata;
+	char	   *targetlist_str;
+	char	   *qualdata;
+	char	   *qual_str;
+	int		   i;
+	shm_mq	   *mq;
+	shm_mq_header *hdr;
+	BlockNumber	*num_blocks_per_worker;
+
+	/* Allocate memory for shared memory queue handles. */
+	*responseqp = (shm_mq_handle**) palloc(nWorkers * sizeof(shm_mq_handle*));
+
+	/* Create dynamic shared memory and table of contents. */
+	shm_toc_initialize_estimator(&e);
+
+	shm_toc_estimate_chunk(&e, sizeof(shm_mq_header));
+
+	shm_toc_estimate_chunk(&e, sizeof(worker_fixed_data));
+
+	shm_toc_estimate_chunk(&e, sizeof(relId));
+
+	targetlist_str = nodeToString(targetList);
+	targetlist_len = strlen(targetlist_str) + 1;
+	shm_toc_estimate_chunk(&e, targetlist_len);
+
+	qual_str = nodeToString(qual);
+	qual_len = strlen(qual_str) + 1;
+	shm_toc_estimate_chunk(&e, qual_len);
+
+	shm_toc_estimate_chunk(&e, sizeof(BlockNumber));
+
+	for (i = 0; i < nWorkers; ++i)
+		shm_toc_estimate_chunk(&e, (Size) SHM_PARALLEL_SCAN_QUEUE_SIZE);
+
+	shm_toc_estimate_keys(&e, PG_WORKER_FIXED_NKEYS + nWorkers);
+
+	segsize = shm_toc_estimate(&e);
+
+	seg = dsm_create(segsize);
+	toc = shm_toc_create(PG_WORKER_MAGIC, dsm_segment_address(seg),
+						 segsize);
+
+	/* Set up the header region. */
+	hdr = shm_toc_allocate(toc, sizeof(shm_mq_header));
+	SpinLockInit(&hdr->mutex);
+	hdr->workers_total = nWorkers;
+	hdr->workers_attached = 0;
+	hdr->workers_ready = 0;
+	shm_toc_insert(toc, PG_WORKER_KEY_HDR_DATA, hdr);
+
+	/* Store fixed-size data in dynamic shared memory. */
+	fdata = shm_toc_allocate(toc, sizeof(worker_fixed_data));
+	fdata->database_id = MyDatabaseId;
+	fdata->authenticated_user_id = GetAuthenticatedUserId();
+	GetUserIdAndSecContext(&fdata->current_user_id, &fdata->sec_context);
+	namestrcpy(&fdata->database, get_database_name(MyDatabaseId));
+	namestrcpy(&fdata->authenticated_user,
+			   GetUserNameFromId(fdata->authenticated_user_id));
+	shm_toc_insert(toc, PG_WORKER_KEY_FIXED_DATA, fdata);
+
+	/* Store scan relation id in dynamic shared memory. */
+	reliddata = shm_toc_allocate(toc, sizeof(Oid));
+	*reliddata = relId;
+	shm_toc_insert(toc, PG_WORKER_KEY_RELID, reliddata);
+
+	/* Store target list in dynamic shared memory. */
+	targetlistdata = shm_toc_allocate(toc, targetlist_len);
+	memcpy(targetlistdata, targetlist_str, targetlist_len);
+	shm_toc_insert(toc, PG_WORKER_KEY_TARGETLIST, targetlistdata);
+
+	/* Store qual list in dynamic shared memory. */
+	qualdata = shm_toc_allocate(toc, qual_len);
+	memcpy(qualdata, qual_str, qual_len);
+	shm_toc_insert(toc, PG_WORKER_KEY_QUAL, qualdata);
+
+	/* Store blocks to be scanned by each worker in dynamic shared memory. */
+	num_blocks_per_worker = shm_toc_allocate(toc, sizeof(BlockNumber));
+	*num_blocks_per_worker = numBlocksPerWorker;
+	shm_toc_insert(toc, PG_WORKER_KEY_BLOCKS, num_blocks_per_worker);
+
+	/* Establish one message queue per worker in dynamic shared memory. */
+	for (i = 1; i <= nWorkers; ++i)
+	{
+		mq = shm_mq_create(shm_toc_allocate(toc, (Size) SHM_PARALLEL_SCAN_QUEUE_SIZE),
+						   (Size) SHM_PARALLEL_SCAN_QUEUE_SIZE);
+		shm_toc_insert(toc, PG_WORKER_FIXED_NKEYS + i, mq);
+		shm_mq_set_receiver(mq, MyProc);
+
+		/*
+		 * Attach the queue before launching a worker, so that we'll automatically
+		 * detach the queue if we error out.  (Otherwise, the worker might sit
+		 * there trying to write the queue long after we've gone away.)
+		 */
+		(*responseqp)[i-1] = shm_mq_attach(mq, seg, NULL);
+	}
+
+	/* Return results to caller. */
+	*segp = seg;
+	*hdrp = hdr;
+}
+
+/*
+ * Register backend workers.
+ */
+static worker_state *
+setup_backend_workers(dsm_segment *seg, int nWorkers)
+{
+	MemoryContext oldcontext;
+	BackgroundWorker worker;
+	worker_state *wstate;
+	int			i;
+
+	/*
+	 * We need the worker_state object and the background worker handles to
+	 * which it points to be allocated in CurTransactionContext rather than
+	 * ExprContext; otherwise, they'll be destroyed before the on_dsm_detach
+	 * hooks run.
+	 */
+	oldcontext = MemoryContextSwitchTo(CurTransactionContext);
+
+	/* Create worker state object. */
+	wstate = MemoryContextAlloc(TopTransactionContext,
+								offsetof(worker_state, handle) +
+								sizeof(BackgroundWorkerHandle *) * nWorkers);
+	wstate->nworkers = 0;
+
+	/*
+	 * Arrange to kill all the workers if we abort before or after all workers
+	 * are finished hooking themselves up to the dynamic shared memory segment.
+	 *
+	 * XXX - For killing workers, we need a mechanism that lets us do so
+	 * before aborting the transaction.
+	 */
+
+	on_dsm_detach(seg, cleanup_background_workers,
+				  PointerGetDatum(wstate));
+
+	/* Configure a worker. */
+	worker.bgw_flags = 
+		BGWORKER_SHMEM_ACCESS | BGWORKER_BACKEND_DATABASE_CONNECTION;
+	worker.bgw_start_time = BgWorkerStart_ConsistentState;
+	worker.bgw_restart_time = BGW_NEVER_RESTART;
+	worker.bgw_main = exec_worker_message;
+	snprintf(worker.bgw_name, BGW_MAXLEN, "backend_worker");
+	worker.bgw_main_arg = UInt32GetDatum(dsm_segment_handle(seg));
+	/* set bgw_notify_pid, so we can detect if the worker stops */
+	worker.bgw_notify_pid = MyProcPid;
+
+	/* Register the workers. */
+	for (i = 0; i < nWorkers; ++i)
+	{
+		if (!RegisterDynamicBackgroundWorker(&worker, &wstate->handle[i]))
+			ereport(ERROR,
+					(errcode(ERRCODE_INSUFFICIENT_RESOURCES),
+					 errmsg("could not register background process"),
+				 errhint("You may need to increase max_worker_processes.")));
+		++wstate->nworkers;
+	}
+
+	/* All done. */
+	MemoryContextSwitchTo(oldcontext);
+	return wstate;
+}
+
+static void
+wait_for_workers_to_become_ready(worker_state *wstate,
+								 volatile shm_mq_header *hdr)
+{
+	bool		save_set_latch_on_sigusr1;
+	bool		result = false;
+
+	save_set_latch_on_sigusr1 = set_latch_on_sigusr1;
+	set_latch_on_sigusr1 = true;
+
+	PG_TRY();
+	{
+		for (;;)
+		{
+			int			workers_ready;
+
+			/* If all the workers are ready, we have succeeded. */
+			SpinLockAcquire(&hdr->mutex);
+			workers_ready = hdr->workers_ready;
+			SpinLockRelease(&hdr->mutex);
+			if (workers_ready >= wstate->nworkers)
+			{
+				result = true;
+				break;
+			}
+
+			/* If any workers (or the postmaster) have died, we have failed. */
+			if (!check_worker_status(wstate))
+			{
+				result = false;
+				break;
+			}
+
+			/* Wait to be signalled. */
+			WaitLatch(&MyProc->procLatch, WL_LATCH_SET, 0);
+
+			/* An interrupt may have occurred while we were waiting. */
+			CHECK_FOR_INTERRUPTS();
+
+			/* Reset the latch so we don't spin. */
+			ResetLatch(&MyProc->procLatch);
+		}
+	}
+	PG_CATCH();
+	{
+		set_latch_on_sigusr1 = save_set_latch_on_sigusr1;
+		PG_RE_THROW();
+	}
+	PG_END_TRY();
+
+	if (!result)
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_RESOURCES),
+				 errmsg("one or more background workers failed to start")));
+}
+
+static bool
+check_worker_status(worker_state *wstate)
+{
+	int			n;
+
+	/* If any workers (or the postmaster) have died, we have failed. */
+	for (n = 0; n < wstate->nworkers; ++n)
+	{
+		BgwHandleStatus status;
+		pid_t		pid;
+
+		status = GetBackgroundWorkerPid(wstate->handle[n], &pid);
+		/*if (status == BGWH_STOPPED || status == BGWH_POSTMASTER_DIED)*/
+		/*
+		 * XXX - Do we need to consider BGWH_STOPPED status?  If we directly
+		 * return false for BGWH_STOPPED, it is quite possible that the worker
+		 * exited after completing its work, in which case the caller won't
+		 * wait for the other workers' status and the main backend will raise
+		 * an error even though everything is normal.
+		 */
+		if (status == BGWH_POSTMASTER_DIED)
+			return false;
+	}
+
+	/* Otherwise, things still look OK. */
+	return true;
+}
+
+static void
+cleanup_background_workers(dsm_segment *seg, Datum arg)
+{
+	worker_state *wstate = (worker_state *) DatumGetPointer(arg);
+
+	while (wstate->nworkers > 0)
+	{
+		--wstate->nworkers;
+		TerminateBackgroundWorker(wstate->handle[wstate->nworkers]);
+	}
+}
+
+
+/*
+ * exec_worker_message
+ *
+ * Entry point for a backend worker: attach to shared memory and run the scan
+ */
+void
+exec_worker_message(Datum main_arg)
+{
+	dsm_segment *seg;
+	shm_toc     *toc;
+	worker_fixed_data *fdata;
+	char	    *targetlistdata;
+	char		*qualdata;
+	BlockNumber *num_blocks_per_worker;
+	BlockNumber  start_block;
+	BlockNumber  end_block;
+	shm_mq	    *mq;
+	shm_mq_handle *responseq;
+	int			myworkernumber;
+	volatile shm_mq_header *hdr;
+	Oid			*relId;
+	List		*targetList = NIL;
+	List		*qual = NIL;
+	PGPROC	    *registrant;
+	worker_stmt	*workerstmt;
+	ResourceOwner saveBackgroundWorkerResourceOwner;
+	MemoryContext saveBackgroundWorkerContext;
+
+	/* Establish signal handlers. */
+	pqsignal(SIGTERM, bkworker_sigterm_handler);
+	BackgroundWorkerUnblockSignals();
+
+	/* Set up a memory context and resource owner. */
+	Assert(CurrentResourceOwner == NULL);
+	CurrentResourceOwner = ResourceOwnerCreate(NULL, "backend_worker");
+	CurrentMemoryContext = AllocSetContextCreate(TopMemoryContext,
+												 "backend worker",
+												 ALLOCSET_DEFAULT_MINSIZE,
+												 ALLOCSET_DEFAULT_INITSIZE,
+												 ALLOCSET_DEFAULT_MAXSIZE);
+
+	/* Connect to the dynamic shared memory segment. */
+	seg = dsm_attach(DatumGetUInt32(main_arg));
+	if (seg == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("unable to map dynamic shared memory segment")));
+	toc = shm_toc_attach(PG_WORKER_MAGIC, dsm_segment_address(seg));
+	if (toc == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+			   errmsg("bad magic number in dynamic shared memory segment")));
+
+	/* Find data structures in dynamic shared memory. */
+	hdr = shm_toc_lookup(toc, PG_WORKER_KEY_HDR_DATA);
+	fdata = shm_toc_lookup(toc, PG_WORKER_KEY_FIXED_DATA);
+	relId = shm_toc_lookup(toc, PG_WORKER_KEY_RELID);
+	targetlistdata = shm_toc_lookup(toc, PG_WORKER_KEY_TARGETLIST);
+	qualdata = shm_toc_lookup(toc, PG_WORKER_KEY_QUAL);
+	num_blocks_per_worker = shm_toc_lookup(toc, PG_WORKER_KEY_BLOCKS);
+
+	/*
+	 * Acquire a worker number.
+	 *
+	 * Our worker number gives our identity: there may be just one
+	 * worker involved in this parallel operation, or there may be many.
+	 */
+	SpinLockAcquire(&hdr->mutex);
+	myworkernumber = ++hdr->workers_attached;
+	SpinLockRelease(&hdr->mutex);
+	if (myworkernumber > hdr->workers_total)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("too many backend workers already attached")));
+
+	mq = shm_toc_lookup(toc, PG_WORKER_FIXED_NKEYS + myworkernumber);
+	shm_mq_set_sender(mq, MyProc);
+	responseq = shm_mq_attach(mq, seg, NULL);
+
+	end_block = myworkernumber * (*num_blocks_per_worker);
+	start_block = end_block - (*num_blocks_per_worker);
+
+	/*
+	 * Indicate that we're fully initialized and ready to begin the main part
+	 * of the parallel operation.
+	 *
+	 * Once we signal that we're ready, the user backend is entitled to assume
+	 * that our on_dsm_detach callbacks will fire before we disconnect from
+	 * the shared memory segment and exit.  Generally, that means we must have
+	 * attached to all relevant dynamic shared memory data structures by now.
+	 */
+	SpinLockAcquire(&hdr->mutex);
+	++hdr->workers_ready;
+	SpinLockRelease(&hdr->mutex);
+	registrant = BackendPidGetProc(MyBgworkerEntry->bgw_notify_pid);
+	if (registrant == NULL)
+	{
+		elog(DEBUG1, "registrant backend has exited prematurely");
+		proc_exit(1);
+	}
+	SetLatch(&registrant->procLatch);
+
+
+	/* Redirect protocol messages to responseq. */
+	pq_redirect_to_shm_mq(mq, responseq);
+	
+	/*
+	 * Connection initialization will destroy the CurrentResourceOwner and
+	 * CurrentMemoryContext as part of internal commit.  This idea of
+	 * internally starting whole new transactions is not good, but done
+	 * elsewhere also, refer PortalRun.
+	 */
+	saveBackgroundWorkerResourceOwner = CurrentResourceOwner;
+	saveBackgroundWorkerContext = CurrentMemoryContext;
+
+	/*
+	 * Initialize our user and database ID based on the string versions of
+	 * the data, and then go back and check that we actually got the database
+	 * and user ID that we intended to get.  We do this because it's not
+	 * impossible for the process that started us to die before we get here,
+	 * and the user or database could be renamed in the meantime.  We don't
+	 * want to latch onto the wrong object by accident.  There should probably
+	 * be a variant of BackgroundWorkerInitializeConnection that accepts OIDs
+	 * rather than strings.
+	 */
+	BackgroundWorkerInitializeConnection(NameStr(fdata->database),
+										 NameStr(fdata->authenticated_user));
+	if (fdata->database_id != MyDatabaseId ||
+		fdata->authenticated_user_id != GetAuthenticatedUserId())
+		ereport(ERROR,
+				(errmsg("user or database renamed during backend worker startup")));
+
+	CurrentResourceOwner = saveBackgroundWorkerResourceOwner;
+	CurrentMemoryContext = saveBackgroundWorkerContext;
+
+	/* Restore targetList and qual from main backend. */
+	targetList = (List *) stringToNode(targetlistdata);
+	qual = (List *) stringToNode(qualdata);
+
+	/* Handle local_preload_libraries and session_preload_libraries. */
+	process_session_preload_libraries();
+
+	/* Restore user ID and security context. */
+	SetUserIdAndSecContext(fdata->current_user_id, fdata->sec_context);
+
+	workerstmt = palloc(sizeof(worker_stmt));
+
+	workerstmt->relId = *relId;
+	workerstmt->targetList = targetList;
+	workerstmt->qual = qual;
+	workerstmt->startBlock = start_block;
+
+	/* last worker should scan all the remaining blocks. */
+	if (myworkernumber == hdr->workers_total)
+		workerstmt->endBlock = InvalidBlockNumber;
+	else
+		workerstmt->endBlock = end_block;
+
+	/* Execute the worker command. */
+	exec_worker_stmt(workerstmt);
+
+	ProcessCompletedNotifies();
+
+	/* Signal that we are done. */
+	ReadyForQuery(DestRemote);
+
+	proc_exit(1);
+}
+
+/*
+ * When we receive a SIGTERM, we set InterruptPending and ProcDiePending just
+ * like a normal backend.  The next CHECK_FOR_INTERRUPTS() will do the right
+ * thing.
+ */
+static void
+bkworker_sigterm_handler(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	if (MyProc)
+		SetLatch(&MyProc->procLatch);
+
+	if (!proc_exit_inprogress)
+	{
+		InterruptPending = true;
+		ProcDiePending = true;
+	}
+
+	errno = save_errno;
+}
\ No newline at end of file
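Note on the queue numbering in backendworker.c (not part of the patch): worker numbers are 1-based, the queue for worker i is stored in the table of contents under key PG_WORKER_FIXED_NKEYS + i, and the master keeps the matching handle in responseqp[i - 1]. The sketch below spells out that mapping. The two helper names are hypothetical; only the constant value is taken from the patch. One consequence of this numbering is that key 6 itself is never used, since the fixed keys are 0 to 5 and the first queue key is 7.

```c
#include <assert.h>

/* Value copied from the patch: number of fixed table-of-contents keys. */
#define PG_WORKER_FIXED_NKEYS 6

/* Hypothetical helper: TOC key of the queue owned by a 1-based worker. */
static int
worker_queue_key(int worker_number)
{
	assert(worker_number >= 1);
	return PG_WORKER_FIXED_NKEYS + worker_number;
}

/* Hypothetical helper: index of that worker's handle in the master's array. */
static int
worker_queue_slot(int worker_number)
{
	assert(worker_number >= 1);
	return worker_number - 1;
}
```

Keeping both conversions in one place makes the off-by-one between the 1-based key and the 0-based array index easier to audit.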
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 5106f52..9d0c7c4 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -99,6 +99,7 @@
 #include "miscadmin.h"
 #include "pg_getopt.h"
 #include "pgstat.h"
+#include "optimizer/cost.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/bgworker_internals.h"
 #include "postmaster/fork_process.h"
@@ -831,6 +832,12 @@ PostmasterMain(int argc, char *argv[])
 		ereport(ERROR,
 				(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"archive\", \"hot_standby\", or \"logical\"")));
 
+	if (parallel_seqscan_degree >= MaxConnections)
+	{
+		write_stderr("%s: parallel_seqscan_degree must be less than max_connections\n", progname);
+		ExitPostmaster(1);
+	}
+
 	/*
 	 * Other one-time internal sanity checks can go here, if they are fast.
 	 * (Put any slow processing further down, after postmaster.pid creation.)
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index cc62b2c..7de5e0e 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -55,6 +55,7 @@
 #include "pg_getopt.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/postmaster.h"
+#include "postmaster/backendworker.h"
 #include "replication/slot.h"
 #include "replication/walsender.h"
 #include "rewrite/rewriteHandler.h"
@@ -1132,6 +1133,105 @@ exec_simple_query(const char *query_string)
 }
 
 /*
+ * exec_worker_stmt
+ *
+ * Execute the plan for backend worker.
+ */
+void
+exec_worker_stmt(worker_stmt *workerstmt)
+{
+	Portal		portal;
+	int16		format = 1;
+	DestReceiver *receiver;
+	bool		isTopLevel = true;
+	PlannedStmt	*planned_stmt;
+	MemoryContext oldcontext;
+	MemoryContext	plancontext;
+
+	set_ps_display("SELECT", false);
+	BeginCommand("SELECT", DestNone);
+
+	/* Make sure we are in a transaction command */
+	start_xact_command();
+
+	/*
+	 * Unlike exec_simple_query(), in backend worker we won't allow
+	 * transaction control statements, so we can allow plancontext
+	 * to be created in TopTransaction context.
+	 */
+	plancontext = AllocSetContextCreate(CurrentMemoryContext,
+										 "worker plan",
+										 ALLOCSET_DEFAULT_MINSIZE,
+										 ALLOCSET_DEFAULT_INITSIZE,
+										 ALLOCSET_DEFAULT_MAXSIZE);
+
+	oldcontext = MemoryContextSwitchTo(plancontext);
+
+	planned_stmt = create_worker_seqscan_plannedstmt(workerstmt);
+	/*
+	 * Create unnamed portal to run the query or queries in. If there
+	 * already is one, silently drop it.
+	 */
+	portal = CreatePortal("", true, true);
+	/* Don't display the portal in pg_cursors */
+	portal->visible = false;
+
+	/*
+	 * We don't have to copy anything into the portal, because everything
+	 * we are passing here is in MessageContext, which will outlive the
+	 * portal anyway.
+	 */
+	PortalDefineQuery(portal,
+					  NULL,
+					  "",
+					  "",
+					  list_make1(planned_stmt),
+					  NULL);
+
+	/*
+	 * Start the portal.  No parameters here.
+	 */
+	PortalStart(portal, NULL, 0, InvalidSnapshot);
+
+	/* We always use binary format, for efficiency. */
+	PortalSetResultFormat(portal, 1, &format);
+
+	receiver = CreateDestReceiver(DestRemote);
+	SetRemoteDestReceiverParams(receiver, portal);
+
+	/*
+	 * Only once the portal and destreceiver have been established can
+	 * we return to the transaction context.  All that stuff needs to
+	 * survive an internal commit inside PortalRun!
+	 */
+	MemoryContextSwitchTo(oldcontext);
+
+	/*
+	 * Run the portal to completion, and then drop it (and the receiver).
+	 */
+	(void) PortalRun(portal,
+					 FETCH_ALL,
+					 isTopLevel,
+					 receiver,
+					 receiver,
+					 NULL);
+
+	(*receiver->rDestroy) (receiver);
+
+	PortalDrop(portal, false);
+
+	finish_xact_command();
+
+	/*
+	 * Send appropriate CommandComplete to client.  There is no
+	 * need to send a completion tag from the worker, as it won't be
+	 * of any use: the completion tag of the master backend
+	 * will be used for sending to the client.
+	 */
+	EndCommand("", DestRemote);
+}
+
+/*
  * exec_parse_message
  *
  * Execute a "Parse" protocol message.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index b1bff7f..6d855e3 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -630,6 +630,8 @@ const char *const config_group_names[] =
 	gettext_noop("Statistics / Query and Index Statistics Collector"),
 	/* AUTOVACUUM */
 	gettext_noop("Autovacuum"),
+	/* PARALLEL_QUERY */
+	gettext_noop("Parallel Query"),
 	/* CLIENT_CONN */
 	gettext_noop("Client Connection Defaults"),
 	/* CLIENT_CONN_STATEMENT */
@@ -2445,6 +2447,16 @@ static struct config_int ConfigureNamesInt[] =
 	},
 
 	{
+		{"parallel_seqscan_degree", PGC_SUSET, PARALLEL_QUERY,
+			gettext_noop("Sets the maximum number of simultaneously running backend worker processes."),
+			NULL
+		},
+		&parallel_seqscan_degree,
+		0, 0, MAX_BACKENDS,
+		NULL, NULL, NULL
+	},
+
+	{
 		{"autovacuum_work_mem", PGC_SIGHUP, RESOURCES_MEM,
 			gettext_noop("Sets the maximum memory to be used by each autovacuum worker process."),
 			NULL,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index b053659..50f7a27 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -497,6 +497,11 @@
 					# autovacuum, -1 means use
 					# vacuum_cost_limit
 
+#------------------------------------------------------------------------------
+# PARALLEL_QUERY PARAMETERS
+#------------------------------------------------------------------------------
+
+#parallel_seqscan_degree = 0		# max number of worker backend subprocesses
 
 #------------------------------------------------------------------------------
 # CLIENT CONNECTION DEFAULTS
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index f2c7ca1..f88ef2e 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -20,7 +20,6 @@
 #include "access/itup.h"
 #include "access/tupdesc.h"
 
-
 typedef struct HeapScanDescData
 {
 	/* scan parameters */
@@ -105,4 +104,13 @@ typedef struct SysScanDescData
 	Snapshot	snapshot;		/* snapshot to unregister at end of scan */
 }	SysScanDescData;
 
+/* struct for scanning shared memory queues */
+typedef struct ShmScanDescData
+{
+	/* scan current state */
+	int			num_shm_queues;	/* number of shared memory queues used in scan. */
+	int			ss_cqueue;		/* current queue # in scan, if any */
+	bool		shmscan_inited;		/* false = scan not init'd yet */
+}	ShmScanDescData;
+
 #endif   /* RELSCAN_H */
diff --git a/src/include/access/shmmqam.h b/src/include/access/shmmqam.h
new file mode 100644
index 0000000..aa444bc
--- /dev/null
+++ b/src/include/access/shmmqam.h
@@ -0,0 +1,39 @@
+/*-------------------------------------------------------------------------
+ *
+ * shmmqam.h
+ *	  POSTGRES shared memory queue access method definitions.
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/shmmqam.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef SHMMQAM_H
+#define SHMMQAM_H
+
+#include "access/relscan.h"
+#include "libpq/pqmq.h"
+
+
+/* Private state maintained across calls to shm_getnext. */
+typedef struct worker_result_state
+{
+	FmgrInfo   *receive_functions;
+	Oid		   *typioparams;
+	bool		has_row_description;
+	bool		complete;
+} worker_result_state;
+
+typedef struct worker_result_state *worker_result;
+
+typedef struct ShmScanDescData *ShmScanDesc;
+
+extern worker_result ExecInitWorkerResult(TupleDesc tupdesc);
+extern ShmScanDesc shm_beginscan(int num_queues);
+extern HeapTuple shm_getnext(ShmScanDesc shmScan, worker_result resultState,
+							 shm_mq_handle **responseq, TupleDesc tupdesc);
+
+#endif   /* SHMMQAM_H */
diff --git a/src/include/executor/nodeParallelSeqscan.h b/src/include/executor/nodeParallelSeqscan.h
new file mode 100644
index 0000000..b638a24
--- /dev/null
+++ b/src/include/executor/nodeParallelSeqscan.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeParallelSeqscan.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeParallelSeqscan.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEPARALLELSEQSCAN_H
+#define NODEPARALLELSEQSCAN_H
+
+#include "nodes/execnodes.h"
+
+extern ParallelSeqScanState *ExecInitParallelSeqScan(ParallelSeqScan *node, EState *estate, int eflags);
+extern TupleTableSlot *ExecParallelSeqScan(ParallelSeqScanState *node);
+extern void ExecEndParallelSeqScan(ParallelSeqScanState *node);
+
+extern Size EstimateScanRelationIdSpace(Oid relId);
+extern void SerializeScanRelationId(Oid relId, Size maxsize,
+									char *start_address);
+extern void RestoreScanRelationId(Oid *relId, char *start_address);
+
+extern Size EstimateTargetListSpace(List *targetList);
+extern void SerializeTargetList(List *targetList, Size maxsize,
+								char *start_address);
+extern void RestoreTargetList(List **targetList, char *start_address);
+
+#endif   /* NODEPARALLELSEQSCAN_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 41b13b2..7a615bc 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -16,9 +16,11 @@
 
 #include "access/genam.h"
 #include "access/heapam.h"
+#include "access/shmmqam.h"
 #include "executor/instrument.h"
 #include "nodes/params.h"
 #include "nodes/plannodes.h"
+#include "storage/shm_mq.h"
 #include "utils/reltrigger.h"
 #include "utils/sortsupport.h"
 #include "utils/tuplestore.h"
@@ -1021,6 +1023,9 @@ typedef struct PlanState
 	ProjectionInfo *ps_ProjInfo;	/* info for doing tuple projection */
 	bool		ps_TupFromTlist;/* state flag for processing set-valued
 								 * functions in targetlist */
+	bool		qualPushed;		/* indicates that qual is pushed to backend
+								 * worker, so no need to evaluate it after
+								 * getting the tuple in main backend. */
 } PlanState;
 
 /* ----------------
@@ -1212,6 +1217,23 @@ typedef struct ScanState
 typedef ScanState SeqScanState;
 
 /*
+ * ParallelScanState extends ScanState by storing additional information
+ * related to parallel workers.
+ *		dsm_segment		dynamic shared memory segment to setup worker queues
+ *		responseq		shared memory queues to receive data from workers
+ */
+typedef struct ParallelScanState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	dsm_segment *seg;
+	shm_mq_handle **responseq;
+	ShmScanDesc pss_currentShmScanDesc;
+	worker_result	pss_workerResult;
+} ParallelScanState;
+
+typedef ParallelScanState ParallelSeqScanState;
+
+/*
  * These structs store information about index quals that don't have simple
  * constant right-hand sides.  See comments for ExecIndexBuildScanKeys()
  * for discussion.
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index bc71fea..c48df6c 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -51,6 +51,7 @@ typedef enum NodeTag
 	T_BitmapOr,
 	T_Scan,
 	T_SeqScan,
+	T_ParallelSeqScan,
 	T_IndexScan,
 	T_IndexOnlyScan,
 	T_BitmapIndexScan,
@@ -97,6 +98,7 @@ typedef enum NodeTag
 	T_BitmapOrState,
 	T_ScanState,
 	T_SeqScanState,
+	T_ParallelSeqScanState,
 	T_IndexScanState,
 	T_IndexOnlyScanState,
 	T_BitmapIndexScanState,
@@ -217,6 +219,7 @@ typedef enum NodeTag
 	T_IndexOptInfo,
 	T_ParamPathInfo,
 	T_Path,
+	T_ParallelSeqPath,
 	T_IndexPath,
 	T_BitmapHeapPath,
 	T_BitmapAndPath,
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 458eeb0..1ed9887 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -23,6 +23,7 @@
 #include "nodes/bitmapset.h"
 #include "nodes/primnodes.h"
 #include "nodes/value.h"
+#include "storage/block.h"
 #include "utils/lockwaitpolicy.h"
 
 /* Possible sources of a Query */
@@ -156,6 +157,15 @@ typedef struct Query
 								 * depends on to be semantically valid */
 } Query;
 
+/* worker statement required for execution. */
+typedef struct worker_stmt
+{
+	Oid			relId;
+	List		*targetList;
+	List		*qual;
+	BlockNumber startBlock;
+	BlockNumber endBlock;
+} worker_stmt;
 
 /****************************************************************************
  *	Supporting data structures for Parse Trees
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 48203a0..e57c2d4 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -18,6 +18,7 @@
 #include "lib/stringinfo.h"
 #include "nodes/bitmapset.h"
 #include "nodes/primnodes.h"
+#include "storage/block.h"
 #include "utils/lockwaitpolicy.h"
 
 
@@ -269,6 +270,8 @@ typedef struct Scan
 {
 	Plan		plan;
 	Index		scanrelid;		/* relid is index into the range table */
+	BlockNumber startblock;		/* block to start seq scan */
+	BlockNumber endblock;		/* block up to which scan has to be done */
 } Scan;
 
 /* ----------------
@@ -278,6 +281,17 @@ typedef struct Scan
 typedef Scan SeqScan;
 
 /* ----------------
+ *		parallel sequential scan node
+ * ----------------
+ */
+typedef struct ParallelSeqScan
+{
+	Scan		scan;
+	int			num_workers;
+	BlockNumber	num_blocks_per_worker;
+} ParallelSeqScan;
+
+/* ----------------
  *		index scan node
  *
  * indexqualorig is an implicitly-ANDed list of index qual expressions, each
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 7116496..09fb141 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -737,6 +737,13 @@ typedef struct Path
 	/* pathkeys is a List of PathKey nodes; see above */
 } Path;
 
+typedef struct ParallelSeqPath
+{
+	Path		path;
+	int			num_workers;
+	BlockNumber	num_blocks_per_worker;
+} ParallelSeqPath;
+
 /* Macro for extracting a path's parameterization relids; beware double eval */
 #define PATH_REQ_OUTER(path)  \
 	((path)->param_info ? (path)->param_info->ppi_req_outer : (Relids) NULL)
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 75e2afb..a738c54 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -50,6 +50,7 @@ extern PGDLLIMPORT double cpu_index_tuple_cost;
 extern PGDLLIMPORT double cpu_operator_cost;
 extern PGDLLIMPORT int effective_cache_size;
 extern Cost disable_cost;
+extern int	parallel_seqscan_degree;
 extern bool enable_seqscan;
 extern bool enable_indexscan;
 extern bool enable_indexonlyscan;
@@ -68,6 +69,8 @@ extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,
 					double index_pages, PlannerInfo *root);
 extern void cost_seqscan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 			 ParamPathInfo *param_info);
+extern void cost_parallelseqscan(ParallelSeqPath *path, PlannerInfo *root,
+			 RelOptInfo *baserel, ParamPathInfo *param_info, int nWorkers);
 extern void cost_index(IndexPath *path, PlannerInfo *root,
 		   double loop_count);
 extern void cost_bitmap_heap_scan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 26b17f5..901c792 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -32,6 +32,8 @@ extern bool add_path_precheck(RelOptInfo *parent_rel,
 
 extern Path *create_seqscan_path(PlannerInfo *root, RelOptInfo *rel,
 					Relids required_outer);
+extern ParallelSeqPath *create_parallelseqscan_path(PlannerInfo *root,
+					RelOptInfo *rel, int nWorkers);
 extern IndexPath *create_index_path(PlannerInfo *root,
 				  IndexOptInfo *index,
 				  List *indexclauses,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index afa5f9b..d2a2760 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -46,6 +46,13 @@ extern void debug_print_rel(PlannerInfo *root, RelOptInfo *rel);
 #endif
 
 /*
+ * parallelpath.c
+ *	  routines to generate parallel scan paths
+ */
+
+extern void create_parallelscan_paths(PlannerInfo *root, RelOptInfo *rel);
+
+/*
  * indxpath.c
  *	  routines to generate index paths
  */
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index 3fdc2cb..b382a27 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -41,6 +41,9 @@ extern Plan *optimize_minmax_aggregates(PlannerInfo *root, List *tlist,
  * prototypes for plan/createplan.c
  */
 extern Plan *create_plan(PlannerInfo *root, Path *best_path);
+extern SeqScan *
+create_worker_seqscan_plan(List *targetList, List *scan_clauses,
+						   BlockNumber startBlock, BlockNumber endBlock);
 extern SubqueryScan *make_subqueryscan(List *qptlist, List *qpqual,
 				  Index scanrelid, Plan *subplan);
 extern ForeignScan *make_foreignscan(List *qptlist, List *qpqual,
diff --git a/src/include/optimizer/planner.h b/src/include/optimizer/planner.h
index 1e942c5..752bd16 100644
--- a/src/include/optimizer/planner.h
+++ b/src/include/optimizer/planner.h
@@ -14,6 +14,7 @@
 #ifndef PLANNER_H
 #define PLANNER_H
 
+#include "nodes/parsenodes.h"
 #include "nodes/plannodes.h"
 #include "nodes/relation.h"
 
@@ -29,6 +30,8 @@ extern PlannedStmt *planner(Query *parse, int cursorOptions,
 		ParamListInfo boundParams);
 extern PlannedStmt *standard_planner(Query *parse, int cursorOptions,
 				 ParamListInfo boundParams);
+extern PlannedStmt *
+create_worker_seqscan_plannedstmt(worker_stmt *workerstmt);
 
 extern Plan *subquery_planner(PlannerGlobal *glob, Query *parse,
 				 PlannerInfo *parent_root,
diff --git a/src/include/postmaster/backendworker.h b/src/include/postmaster/backendworker.h
new file mode 100644
index 0000000..19d6182
--- /dev/null
+++ b/src/include/postmaster/backendworker.h
@@ -0,0 +1,30 @@
+/*--------------------------------------------------------------------
+ * backendworker.h
+ *		POSTGRES backend workers interface
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/postmaster/backendworker.h
+ *--------------------------------------------------------------------
+ */
+#ifndef BACKENDWORKER_H
+#define BACKENDWORKER_H
+
+/*---------------------------------------------------------------------
+ * External module API.
+ *---------------------------------------------------------------------
+ */
+
+#include "libpq/pqmq.h"
+
+extern int	parallel_seqscan_degree;
+extern void InitiateWorkers(Oid relId, List *targetList,
+							List *qual,
+							shm_mq_handle ***responseqp,
+							dsm_segment **segp,
+							BlockNumber numBlocksPerWorker,
+							int nWorkers);
+
+#endif   /* BACKENDWORKER_H */
diff --git a/src/include/tcop/tcopprot.h b/src/include/tcop/tcopprot.h
index 60f7532..6087b5e 100644
--- a/src/include/tcop/tcopprot.h
+++ b/src/include/tcop/tcopprot.h
@@ -83,5 +83,6 @@ extern void set_debug_options(int debug_flag,
 extern bool set_plan_disabling_options(const char *arg,
 						   GucContext context, GucSource source);
 extern const char *get_stats_option_name(const char *arg);
+extern void exec_worker_stmt(worker_stmt *workerstmt);
 
 #endif   /* TCOPPROT_H */
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 47ff880..532d2db 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -85,6 +85,7 @@ enum config_group
 	STATS_MONITORING,
 	STATS_COLLECTOR,
 	AUTOVACUUM,
+	PARALLEL_QUERY,
 	CLIENT_CONN,
 	CLIENT_CONN_STATEMENT,
 	CLIENT_CONN_LOCALE,
#22Stephen Frost
sfrost@snowman.net
In reply to: Amit Kapila (#20)
Re: Parallel Seq Scan

Amit,

* Amit Kapila (amit.kapila16@gmail.com) wrote:

1. Parallel workers help a lot when there is an expensive qualification
to evaluate; the more expensive the qualification, the better the
results.

I'd certainly hope so. ;)

2. It works well for low selectivity quals and as the selectivity increases,
the benefit tends to go down due to additional tuple communication cost
between workers and master backend.

I'm a bit sad to hear that the communication between workers and the
master backend is already a bottleneck. Now, that said, the box
you're playing with looks to be pretty beefy and therefore the i/o
subsystem might be particularly good, but generally speaking, it's a lot
faster to move data in memory than it is to pull it off disk, and so I
wouldn't expect the tuple communication between processes to really be
the bottleneck...

3. After a certain point, adding more workers won't help and instead
has a negative impact; refer to Test-4.

Yes, I see that too and it's also interesting- have you been able to
identify why? What is the overhead (specifically) which is causing
that?

I think as discussed previously we need to introduce 2 additional cost
variables (parallel_startup_cost, cpu_tuple_communication_cost) to
estimate the parallel seq scan cost so that when the tables are small
or selectivity is high, it should increase the cost of parallel plan.

I agree that we need to figure out a way to cost out parallel plans, but
I have doubts about these being the right way to do that. There has
been quite a bit of literature regarding parallel execution and
planning- have you had a chance to review anything along those lines?
We'd certainly like to draw on previous experiences and analysis rather
than trying to pave our own way.

With these additional costs comes the consideration that we're looking
for a wall-clock runtime proxy and therefore, while we need to add costs
for parallel startup and tuple communication, we have to reduce the
overall cost because of the parallelism or we'd never end up choosing a
parallel plan. Is the thought to simply add up all the costs and then
divide? Or perhaps to divide the cost of the actual plan but then add
in the parallel startup cost and the tuple communication cost?
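
The two alternatives asked about above can be written down side by side. This is only a sketch for comparison, not the costing used in the patch: the struct and function names are hypothetical, and the overhead terms stand in for the proposed parallel_startup_cost and cpu_tuple_communication_cost variables.

```c
#include <assert.h>

/* Hypothetical inputs; none of these names come from the patch. */
typedef struct ParallelCostInputs
{
	double		serial_run_cost;	/* run cost of the plain seq scan */
	double		startup_cost;		/* one-time cost of launching workers */
	double		tuple_comm_cost;	/* cost to ship one tuple to the master */
	double		tuples_returned;	/* tuples that pass the qual */
	int			nworkers;			/* must be >= 1 */
} ParallelCostInputs;

/* Model A: add up all the costs, then divide by the worker count. */
static double
cost_sum_then_divide(const ParallelCostInputs *in)
{
	assert(in->nworkers >= 1);
	return (in->serial_run_cost + in->startup_cost +
			in->tuple_comm_cost * in->tuples_returned) / in->nworkers;
}

/* Model B: divide only the scan work; the overheads are added undivided. */
static double
cost_divide_then_add(const ParallelCostInputs *in)
{
	assert(in->nworkers >= 1);
	return in->serial_run_cost / in->nworkers + in->startup_cost +
		in->tuple_comm_cost * in->tuples_returned;
}
```

With a serial run cost of 1000, a startup cost of 100, 200 returned tuples at 0.5 each, and 4 workers, model A yields 300 and model B yields 450. Under model B the overhead terms do not shrink as workers are added, so a small table or a qual that returns many tuples pushes the parallel cost up, which is the behaviour the two proposed variables are meant to produce.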

Perhaps there has been prior discussion on these points but I'm thinking
we need a README or similar which discusses all of this and includes any
references out to academic papers or similar as appropriate.

Thanks!

Stephen

#23Robert Haas
robertmhaas@gmail.com
In reply to: Stephen Frost (#22)
Re: Parallel Seq Scan

On Fri, Dec 19, 2014 at 7:51 AM, Stephen Frost <sfrost@snowman.net> wrote:

3. After a certain point, adding more workers won't help and instead
has a negative impact; refer to Test-4.

Yes, I see that too and it's also interesting- have you been able to
identify why? What is the overhead (specifically) which is causing
that?

Let's rewind. Amit's results show that, with a naive algorithm
(pre-distributing equal-sized chunks of the relation to every worker)
and a fairly-naive first cut at how to pass tuples around (I believe
largely from what I did in pg_background) he can sequential-scan a
table with 8 workers at 6.4 times the speed of a single process, and
you're complaining because it's not efficient enough? It's a first
draft! Be happy we got 6.4x, for crying out loud!
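
The "naive algorithm" mentioned above, equal-sized chunks handed out up front with the last worker taking any remainder, might look roughly like the following. It is a sketch under assumptions: the BlockRange type and worker_block_range function are hypothetical, and the half-open [start, end) convention is chosen here for clarity and may not match how the patch interprets startblock and endblock.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t BlockNumber;	/* PostgreSQL's BlockNumber is a uint32 */

/* Hypothetical helper type: the blocks assigned to one worker. */
typedef struct BlockRange
{
	BlockNumber start;			/* first block to scan */
	BlockNumber end;			/* one past the last block to scan */
} BlockRange;

/* Split nblocks evenly; the last worker also scans the leftover blocks. */
static BlockRange
worker_block_range(BlockNumber nblocks, int nworkers, int worker)
{
	BlockNumber per_worker;
	BlockRange	r;

	assert(nworkers >= 1 && worker >= 0 && worker < nworkers);

	per_worker = nblocks / (BlockNumber) nworkers;
	r.start = per_worker * (BlockNumber) worker;
	r.end = (worker == nworkers - 1) ? nblocks : r.start + per_worker;
	return r;
}
```

For a 10-block relation and 4 workers this gives workers 0 to 2 two blocks each and the last worker four (blocks 6 through 9), so the last worker can finish noticeably later than the others when the remainder is large relative to the chunk size.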

The barrier to getting parallel sequential scan (or any parallel
feature at all) committed is not going to be whether an 8-way scan is
6.4 times faster or 7.1 times faster or 7.8 times faster. It's going
to be whether it's robust and won't break things. We should be
focusing most of our effort here on identifying and fixing robustness
problems. I'd vote to commit a feature like this with a 3x
performance speedup if I thought it was robust enough.

I'm not saying we shouldn't try to improve the performance here - we
definitely should. But I don't think we should say, oh, an 8-way scan
isn't good enough, we need a 16-way or 32-way scan in order for this
to be efficient. That is getting your priorities quite mixed up.

I think as discussed previously we need to introduce 2 additional cost
variables (parallel_startup_cost, cpu_tuple_communication_cost) to
estimate the parallel seq scan cost so that when the tables are small
or selectivity is high, it should increase the cost of parallel plan.

I agree that we need to figure out a way to cost out parallel plans, but
I have doubts about these being the right way to do that. There has
been quite a bit of literature regarding parallel execution and
planning- have you had a chance to review anything along those lines?
We'd certainly like to draw on previous experiences and analysis rather
than trying to pave our own way.

I agree that it would be good to review the literature, but am not
aware of anything relevant. Could you (or can anyone) provide some
links?

With these additional costs comes the consideration that we're looking
for a wall-clock runtime proxy and therefore, while we need to add costs
for parallel startup and tuple communication, we have to reduce the
overall cost because of the parallelism or we'd never end up choosing a
parallel plan. Is the thought to simply add up all the costs and then
divide? Or perhaps to divide the cost of the actual plan but then add
in the parallel startup cost and the tuple communication cost?

This has been discussed, on this thread.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#24Stephen Frost
sfrost@snowman.net
In reply to: Robert Haas (#23)
Re: Parallel Seq Scan

Robert,

* Robert Haas (robertmhaas@gmail.com) wrote:

On Fri, Dec 19, 2014 at 7:51 AM, Stephen Frost <sfrost@snowman.net> wrote:

3. After a certain point, adding more workers won't help and instead
has a negative impact; refer to Test-4.

Yes, I see that too and it's also interesting- have you been able to
identify why? What is the overhead (specifically) which is causing
that?

Let's rewind. Amit's results show that, with a naive algorithm
(pre-distributing equal-sized chunks of the relation to every worker)
and a fairly-naive first cut at how to pass tuples around (I believe
largely from what I did in pg_background) he can sequential-scan a
table with 8 workers at 6.4 times the speed of a single process, and
you're complaining because it's not efficient enough? It's a first
draft! Be happy we got 6.4x, for crying out loud!

He also showed cases where parallelizing a query even with just two
workers caused a serious increase in the total runtime (Test 6). Four
workers were also slower in that case; eight gave a modest performance
improvement, and running with 16 showed no improvement beyond
that.

Being able to understand what's happening will inform how we cost this
to, hopefully, achieve the 6.4x gains where we can and avoid the
pitfalls of performing worse than a single thread in cases where
parallelism doesn't help. What would likely be very helpful in the
analysis would be CPU time information: when running with eight workers,
were we using 800% CPU (8x 100%), or something less (perhaps due to
locking, i/o, or other processes)?
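
As a rough way to read the 6.4x figure, the sketch below computes the parallel efficiency and, assuming Amdahl's law applies (an assumption made here, not something measured in the thread), the serial fraction that such a speedup would imply. The function names are hypothetical. This only describes wall-clock speedup; it does not answer the CPU-utilization question, which needs actual measurements.

```c
#include <assert.h>

/* Fraction of ideal linear speedup actually achieved. */
static double
parallel_efficiency(double speedup, int nworkers)
{
	assert(nworkers >= 1);
	return speedup / nworkers;
}

/*
 * Serial fraction f implied by Amdahl's law, S = 1 / (f + (1 - f) / N),
 * solved for f:  f = (N / S - 1) / (N - 1).
 */
static double
amdahl_serial_fraction(double speedup, int nworkers)
{
	assert(nworkers >= 2 && speedup > 0.0);
	return ((double) nworkers / speedup - 1.0) / (nworkers - 1);
}
```

For a speedup of 6.4 with 8 workers, the efficiency is 0.8 and the implied serial fraction is 0.25 / 7, or about 3.6%.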

Perhaps it's my fault for not being surprised that a naive first cut
gives us such gains as my experience with parallel operations and PG has
generally been very good (through the use of multiple connections to the
DB and therefore independent transactions, of course). I'm very excited
that we're making such great progress towards having parallel execution
in the DB as I've often used PG in data warehouse use-cases.

The barrier to getting parallel sequential scan (or any parallel
feature at all) committed is not going to be whether an 8-way scan is
6.4 times faster or 7.1 times faster or 7.8 times faster. It's going
to be whether it's robust and won't break things. We should be
focusing most of our effort here on identifying and fixing robustness
problems. I'd vote to commit a feature like this with a 3x
performance speedup if I thought it was robust enough.

I don't have any problem if an 8-way scan is 6.4x faster or if it's 7.1
times faster, but what if that 3x performance speedup is only achieved
when running with 8 CPUs at 100%? We'd have to coach our users to
constantly be tweaking the enable_parallel_query (or whatever) option
for the queries where it helps and turning it off for others. I'm not
so excited about that.

I'm not saying we shouldn't try to improve the performance here - we
definitely should. But I don't think we should say, oh, an 8-way scan
isn't good enough, we need a 16-way or 32-way scan in order for this
to be efficient. That is getting your priorities quite mixed up.

I don't think I said that. What I was getting at is that we need a cost
system which accounts for the costs accurately enough that we don't end
up with worse performance than single-threaded operation. In general, I
don't expect that to be very difficult and we can be conservative in the
initial releases to hopefully avoid regressions, but it absolutely needs
consideration.

I think as discussed previously we need to introduce 2 additional cost
variables (parallel_startup_cost, cpu_tuple_communication_cost) to
estimate the parallel seq scan cost so that when the tables are small
or selectivity is high, it should increase the cost of parallel plan.

I agree that we need to figure out a way to cost out parallel plans, but
I have doubts about these being the right way to do that. There has
been quite a bit of literature regarding parallel execution and
planning- have you had a chance to review anything along those lines?
We'd certainly like to draw on previous experiences and analysis rather
than trying to pave our own way.

I agree that it would be good to review the literature, but am not
aware of anything relevant. Could you (or can anyone) provide some
links?

There's certainly documentation available from the other RDBMS' which
already support parallel query, as one source. Other academic papers
exist (and once you've linked into one, the references and prior work
helps bring in others). Sadly, I don't currently have ACM access (might
have to change that..), but there are publicly available papers also,
such as:

http://i.stanford.edu/pub/cstr/reports/cs/tr/96/1570/CS-TR-96-1570.pdf
http://www.vldb.org/conf/1998/p251.pdf
http://www.cs.uiuc.edu/class/fa05/cs591han/sigmodpods04/sigmod/pdf/I-001c.pdf

With these additional costs comes the consideration that we're looking
for a wall-clock runtime proxy and therefore, while we need to add costs
for parallel startup and tuple communication, we have to reduce the
overall cost because of the parallelism or we'd never end up choosing a
parallel plan. Is the thought to simply add up all the costs and then
divide? Or perhaps to divide the cost of the actual plan but then add
in the parallel startup cost and the tuple communication cost?

This has been discussed, on this thread.

Fantastic. What I found in the patch was:

+   /*
+    * We simply assume that cost will be equally shared by parallel
+    * workers which might not be true especially for doing disk access.
+    * XXX - We would like to change these values based on some concrete
+    * tests.
+    */

What I asked for was:

----
I'm thinking we need a README or similar which discusses all of this and
includes any references out to academic papers or similar as appropriate.
----

Perhaps it doesn't deserve its own README, but we clearly need more.

Thanks!

Stephen

#25Marko Tiikkaja
marko@joh.to
In reply to: Stephen Frost (#24)
Re: Parallel Seq Scan

On 12/19/14 3:27 PM, Stephen Frost wrote:

We'd have to coach our users to
constantly be tweaking the enable_parallel_query (or whatever) option
for the queries where it helps and turning it off for others. I'm not
so excited about that.

I'd be perfectly (that means 100%) happy if it just defaulted to off,
but I could turn it up to 11 whenever I needed it. I don't believe I'm
the only one with this opinion, either.

.marko

#26Stephen Frost
sfrost@snowman.net
In reply to: Marko Tiikkaja (#25)
Re: Parallel Seq Scan

* Marko Tiikkaja (marko@joh.to) wrote:

On 12/19/14 3:27 PM, Stephen Frost wrote:

We'd have to coach our users to
constantly be tweaking the enable_parallel_query (or whatever) option
for the queries where it helps and turning it off for others. I'm not
so excited about that.

I'd be perfectly (that means 100%) happy if it just defaulted to
off, but I could turn it up to 11 whenever I needed it. I don't
believe I'm the only one with this opinion, either.

Perhaps we should reconsider our general position on hints then and
add them so users can define the plan to be used. For my part, I don't
see this as all that much different.

Consider if we were just adding HashJoin support today as an example.
Would we be happy if we had to default to enable_hashjoin = off? Or if
users had to do that regularly because our costing was horrid? It's bad
enough that we have to resort to those tweaks today in rare cases.

Thanks,

Stephen

#27Robert Haas
robertmhaas@gmail.com
In reply to: Stephen Frost (#26)
Re: Parallel Seq Scan

On Fri, Dec 19, 2014 at 9:39 AM, Stephen Frost <sfrost@snowman.net> wrote:

Perhaps we should reconsider our general position on hints then and
add them so users can define the plan to be used. For my part, I don't
see this as all that much different.

Consider if we were just adding HashJoin support today as an example.
Would we be happy if we had to default to enable_hashjoin = off? Or if
users had to do that regularly because our costing was horrid? It's bad
enough that we have to resort to those tweaks today in rare cases.

If you're proposing that it is not reasonable to have a GUC that
limits the degree of parallelism, then I think that's outright crazy:
that is probably the very first GUC we need to add. New query
processing capabilities can entail new controlling GUCs, and
parallelism, being as complex as it is, will probably add several of
them.

But the big picture here is that if you want to ever have parallelism
in PostgreSQL at all, you're going to have to live with the first
version being pretty crude. I think it's quite likely that the first
version of parallel sequential scan will be just as buggy as Hot
Standby was when we first added it, or as buggy as the multi-xact code
was when it went in, and probably subject to an even greater variety
of taxing limitations than any feature we've committed in the 6 years
I've been involved in the project. We get to pick between that and
not having it at all.

I'll take a look at the papers you sent about parallel query
optimization, but personally I think that's putting the cart not only
before the horse but also before the road. For V1, we need a query
optimization model that does not completely suck - no more. The key
criterion here is that this has to WORK. There will be time enough to
improve everything else once we reach that goal.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#28Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Stephen Frost (#26)
Re: Parallel Seq Scan

On 12/19/2014 04:39 PM, Stephen Frost wrote:

* Marko Tiikkaja (marko@joh.to) wrote:

On 12/19/14 3:27 PM, Stephen Frost wrote:

We'd have to coach our users to
constantly be tweaking the enable_parallel_query (or whatever) option
for the queries where it helps and turning it off for others. I'm not
so excited about that.

I'd be perfectly (that means 100%) happy if it just defaulted to
off, but I could turn it up to 11 whenever I needed it. I don't
believe I'm the only one with this opinion, either.

Perhaps we should reconsider our general position on hints then and
add them so users can define the plan to be used. For my part, I don't
see this as all that much different.

Consider if we were just adding HashJoin support today as an example.
Would we be happy if we had to default to enable_hashjoin = off? Or if
users had to do that regularly because our costing was horrid? It's bad
enough that we have to resort to those tweaks today in rare cases.

This is somewhat different. Imagine that we achieve perfect
parallelization, so that when you set enable_parallel_query=8, every
query runs exactly 8x faster on an 8-core system, by using all eight cores.

Now, you might still want to turn parallelization off, or at least set
it to a lower setting, on an OLTP system. You might not want a single
query to hog all CPUs to run one query faster; you'd want to leave some
for other queries. In particular, if you run a mix of short
transactions, and some background-like tasks that run for minutes or
hours, you do not want to starve the short transactions by giving all
eight CPUs to the background task.

Admittedly, this is a rather crude knob to tune for such things,
but it's quite intuitive to a DBA: how many CPU cores is one query
allowed to utilize? And we don't really have anything better.

In real life, there's always some overhead to parallelization, so that
even if you can make one query run faster by doing it, you might hurt
overall throughput. To some extent, it's a latency vs. throughput
tradeoff, and it's quite reasonable to have a GUC for that because
people have different priorities.
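
As a rough illustration of the latency vs. throughput tradeoff described above, consider the following toy model. It is only a sketch: the 8-second query, the 0.1 seconds of per-worker overhead and the function names are made up for the example and are not measurements from the patch.

```python
def parallel_query(serial_seconds, workers, overhead_seconds):
    """Toy model of one query split evenly across `workers` processes.

    Assumption (hypothetical): each worker does an equal share of the
    work and additionally pays a fixed overhead for startup and tuple
    communication.
    Returns (latency_seconds, total_cpu_seconds).
    """
    latency = serial_seconds / workers + overhead_seconds
    total_cpu = latency * workers
    return latency, total_cpu


def saturated_throughput(cores, cpu_seconds_per_query):
    """Queries per second when all cores are kept busy with such queries."""
    return cores / cpu_seconds_per_query


if __name__ == "__main__":
    for workers, overhead in ((1, 0.0), (8, 0.1)):
        latency, cpu = parallel_query(8.0, workers, overhead)
        qps = saturated_throughput(8, cpu)
        print(f"workers={workers} latency={latency:.2f}s "
              f"cpu={cpu:.2f}s throughput={qps:.3f} q/s")
```

Under these invented numbers a single query finishes in about 1.1 seconds instead of 8, but each query costs about 8.8 CPU-seconds instead of 8, so a fully loaded 8-core machine completes roughly 9% fewer queries per second.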

- Heikki

#29Gavin Flower
GavinFlower@archidevsys.co.nz
In reply to: Heikki Linnakangas (#28)
Re: Parallel Seq Scan

On 20/12/14 03:54, Heikki Linnakangas wrote:

On 12/19/2014 04:39 PM, Stephen Frost wrote:

* Marko Tiikkaja (marko@joh.to) wrote:

On 12/19/14 3:27 PM, Stephen Frost wrote:

We'd have to coach our users to
constantly be tweaking the enable_parallel_query (or whatever) option
for the queries where it helps and turning it off for others. I'm not
so excited about that.

I'd be perfectly (that means 100%) happy if it just defaulted to
off, but I could turn it up to 11 whenever I needed it. I don't
believe I'm the only one with this opinion, either.

Perhaps we should reconsider our general position on hints then and
add them so users can define the plan to be used. For my part, I don't
see this as all that much different.

Consider if we were just adding HashJoin support today as an example.
Would we be happy if we had to default to enable_hashjoin = off? Or if
users had to do that regularly because our costing was horrid? It's bad
enough that we have to resort to those tweaks today in rare cases.

This is somewhat different. Imagine that we achieve perfect
parallelization, so that when you set enable_parallel_query=8, every
query runs exactly 8x faster on an 8-core system, by using all eight
cores.

Now, you might still want to turn parallelization off, or at least set
it to a lower setting, on an OLTP system. You might not want a single
query to hog all CPUs to run one query faster; you'd want to leave
some for other queries. In particular, if you run a mix of short
transactions, and some background-like tasks that run for minutes or
hours, you do not want to starve the short transactions by giving all
eight CPUs to the background task.

Admittedly, this is a rather crude knob to tune for such things,
but it's quite intuitive to a DBA: how many CPU cores is one query
allowed to utilize? And we don't really have anything better.

In real life, there's always some overhead to parallelization, so that
even if you can make one query run faster by doing it, you might hurt
overall throughput. To some extent, it's a latency vs. throughput
tradeoff, and it's quite reasonable to have a GUC for that because
people have different priorities.

- Heikki

How about 3 numbers:

minCPUs # > 0
maxCPUs # >= minCPUs
fractionOfCPUs # rounded up

If you just have the *number* of CPUs, then a setting that is
appropriate for a quad-core processor may be too *small* for an
octo-core processor.

If you just have the *fraction* of CPUs, then a setting that is
appropriate for a quad-core processor may be too *large* for an
octo-core processor.
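
One way the three numbers could be combined is sketched below. The parameter names mirror minCPUs, maxCPUs and fractionOfCPUs from the list above; the clamping rule itself and the example values are assumptions made for illustration, not something specified in the proposal.

```python
import math


def cpus_for_query(total_cpus, min_cpus, max_cpus, fraction_of_cpus):
    """Combine the three proposed settings into one per-query CPU limit.

    Assumption of this sketch: take the fraction of the machine, round
    it up, then clamp the result into [min_cpus, max_cpus].
    """
    if not 0 < min_cpus <= max_cpus:
        raise ValueError("need 0 < min_cpus <= max_cpus")
    wanted = math.ceil(fraction_of_cpus * total_cpus)  # "rounded up"
    return max(min_cpus, min(max_cpus, wanted))


if __name__ == "__main__":
    for total in (4, 8, 16):
        print(total, cpus_for_query(total, min_cpus=2, max_cpus=6,
                                    fraction_of_cpus=0.5))
```

With the same three settings a quad core gets 2 CPUs, an octo core gets 4, and a 16-core machine is capped at 6. The sketch does not check that min_cpus fits within total_cpus; a real setting would need to decide what happens in that case.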

Cheers,
Gavin

#30Stephen Frost
sfrost@snowman.net
In reply to: Robert Haas (#27)
Re: Parallel Seq Scan

Robert,

* Robert Haas (robertmhaas@gmail.com) wrote:

On Fri, Dec 19, 2014 at 9:39 AM, Stephen Frost <sfrost@snowman.net> wrote:

Perhaps we should reconsider our general position on hints then and
add them so users can define the plan to be used. For my part, I don't
see this as all that much different.

Consider if we were just adding HashJoin support today as an example.
Would we be happy if we had to default to enable_hashjoin = off? Or if
users had to do that regularly because our costing was horrid? It's bad
enough that we have to resort to those tweaks today in rare cases.

If you're proposing that it is not reasonable to have a GUC that
limits the degree of parallelism, then I think that's outright crazy:

I'm pretty sure that I didn't say anything along those lines. I'll try
to be clearer.

What I'd like is such a GUC that we can set at a reasonable default of,
say, 4, and trust that our planner will generally do the right thing.
Clearly, this may be something which admins have to tweak but what I
would really like to avoid is users having to set this GUC explicitly
for each of their queries.

that is probably the very first GUC we need to add. New query
processing capabilities can entail new controlling GUCs, and
parallelism, being as complex as it is, will probably add several of
them.

That's fine if they're intended for debugging issues or dealing with
unexpected bugs or issues, but let's not go into this thinking we should
add GUCs which are geared with the expectation of users tweaking them
regularly.

But the big picture here is that if you want to ever have parallelism
in PostgreSQL at all, you're going to have to live with the first
version being pretty crude. I think it's quite likely that the first
version of parallel sequential scan will be just as buggy as Hot
Standby was when we first added it, or as buggy as the multi-xact code
was when it went in, and probably subject to an even greater variety
of taxing limitations than any feature we've committed in the 6 years
I've been involved in the project. We get to pick between that and
not having it at all.

If it's disabled by default then I'm worried it won't really improve
until it is. Perhaps that's setting a higher bar than you feel is
necessary but, for my part at least, it doesn't feel like a very high
level.

I'll take a look at the papers you sent about parallel query
optimization, but personally I think that's putting the cart not only
before the horse but also before the road. For V1, we need a query
optimization model that does not completely suck - no more. The key
criterion here is that this has to WORK. There will be time enough to
improve everything else once we reach that goal.

I agree that it's got to work, but it also needs to be generally well
designed, and have the expectation of being on by default.

Thanks,

Stephen

#31Stephen Frost
sfrost@snowman.net
In reply to: Heikki Linnakangas (#28)
Re: Parallel Seq Scan

* Heikki Linnakangas (hlinnakangas@vmware.com) wrote:

On 12/19/2014 04:39 PM, Stephen Frost wrote:

* Marko Tiikkaja (marko@joh.to) wrote:

I'd be perfectly (that means 100%) happy if it just defaulted to
off, but I could turn it up to 11 whenever I needed it. I don't
believe I'm the only one with this opinion, either.

Perhaps we should reconsider our general position on hints then and
add them so users can define the plan to be used. For my part, I don't
see this as all that much different.

Consider if we were just adding HashJoin support today as an example.
Would we be happy if we had to default to enable_hashjoin = off? Or if
users had to do that regularly because our costing was horrid? It's bad
enough that we have to resort to those tweaks today in rare cases.

This is somewhat different. Imagine that we achieve perfect
parallelization, so that when you set enable_parallel_query=8, every
query runs exactly 8x faster on an 8-core system, by using all eight
cores.

To be clear, as I mentioned to Robert just now, I'm not objecting to a
GUC being added to turn off or control parallelization. I don't want
such a GUC to be a crutch for us to lean on when it comes to questions
about the optimizer though. We need to work through the optimizer
questions of "should this be parallelized?" and, perhaps later, "how
many ways is it sensible to parallelize this?" I'm worried we'll take
such a GUC as a directive along the lines of "we are being told to
parallelize to exactly this level every time and for every query which
can be." The GUC should be an input into the planner/optimizer much the
way enable_hashjoin is, unless it's being done as a *limiting* factor
for the administrator to be able to control, but we've generally avoided
doing that (see: work_mem) and, if we're going to start, we should
probably come up with an approach that addresses the considerations for
other resources too.

Thanks,

Stephen

#32Amit Kapila
amit.kapila16@gmail.com
In reply to: Stephen Frost (#22)
Re: Parallel Seq Scan

On Fri, Dec 19, 2014 at 6:21 PM, Stephen Frost <sfrost@snowman.net> wrote:

Amit,

* Amit Kapila (amit.kapila16@gmail.com) wrote:

1. Parallel workers help a lot when there is an expensive qualification
to be evaluated; the more expensive the qualification, the better the
results.

I'd certainly hope so. ;)

2. It works well for low selectivity quals and as the selectivity
increases, the benefit tends to go down due to additional tuple
communication cost between workers and master backend.

I'm a bit sad to hear that the communication between workers and the
master backend is already being a bottleneck. Now, that said, the box
you're playing with looks to be pretty beefy and therefore the i/o
subsystem might be particularly good, but generally speaking, it's a lot
faster to move data in memory than it is to pull it off disk, and so I
wouldn't expect the tuple communication between processes to really be
the bottleneck...

The main reason for the higher cost of tuple communication is that, at
this moment, I have used an approach to pass the tuples which is
comparatively less error prone and could be used as per the existing
FE/BE protocol.

To explain in brief, what is happening here is that currently the worker
backend gets the tuple from the page, deforms it, and sends it to the
master backend via a message queue. The master backend then forms the
tuple and sends it to the upper layer, which before sending it to the
frontend deforms it again via slot_getallattrs(slot). The benefit of
this approach is that it works as per the current protocol message ('D')
and as per our current executor code.

Now there could be a couple of ways in which we can reduce the tuple
communication overhead.

a. Instead of passing value array, just pass tuple id, but retain the
buffer pin till master backend reads the tuple based on tupleid.
This has side effect that we have to retain buffer pin for longer
period of time, but again that might not have any problem in
real world usage of parallel query.

b. Instead of passing value array, pass directly the tuple which could
be directly propagated by master backend to upper layer or otherwise
in master backend change some code such that it could propagate the
tuple array received via shared memory queue directly to frontend.
Basically save the one extra cycle of form/deform tuple.

Both these need some new message type and handling for same in
Executor code.

Having said above, I think we can try to optimize this in multiple
ways, however we need additional mechanism and changes in Executor
code which is error prone and doesn't seem to be important at this
stage where we want the basic feature to work.

3. After a certain point, having more workers won't help and will
rather have a negative impact; refer to Test-4.

Yes, I see that too and it's also interesting- have you been able to
identify why? What is the overhead (specifically) which is causing
that?

I think there are mainly two things which can lead to benefit
by employing parallel workers:
a. Better use of available I/O bandwidth
b. Better use of available CPUs by doing expression evaluation
in multiple workers.

The simple theory here is that there has to be a certain limit
(in terms of number of parallel workers) up to which there can
be benefit due to both of the above points, and after which there
will be overhead: setting up so many workers even though they
are not required, some additional wait by the master backend
for non-helping workers to finish their work, not having enough
CPUs available, and possibly other factors as well, such as
overusing the I/O channel, which might degrade the performance
rather than improve it.

In the above tests, it seems to me that the maximum benefit due to
'a' is realized with up to 4~8 workers and the maximum benefit due to
'b' depends upon the complexity (time to evaluate) of the expression.
That is the reason why we can see benefits in Test-1 ~ Test-3 above
8 parallel workers as well, whereas for Test-4 ~ Test-6 the benefit
peaks at 8 workers and after that there is either no improvement or
degradation due to one or more of the reasons explained in the previous
paragraph.
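
The theory above can be sketched as a toy model in which I/O work stops scaling once the I/O channel is saturated, CPU work stops scaling once the cores run out, and every extra worker adds a fixed setup cost. All parameter names and values below (io_channels, cpus, startup_per_worker and so on) are hypothetical and chosen only to reproduce the shape of the behaviour, not the actual test results.

```python
def scan_time(workers, io_work, cpu_work, io_channels, cpus, startup_per_worker):
    """Estimated elapsed time of a scan under a simple saturation model."""
    io_time = io_work / min(workers, io_channels)   # benefit 'a' stops at io_channels
    cpu_time = cpu_work / min(workers, cpus)        # benefit 'b' stops at cpus
    return io_time + cpu_time + startup_per_worker * workers


def best_worker_count(max_workers, **params):
    """Worker count in 1..max_workers with the lowest modelled time."""
    return min(range(1, max_workers + 1),
               key=lambda n: scan_time(n, **params))


if __name__ == "__main__":
    machine = dict(io_work=100.0, io_channels=8, cpus=32, startup_per_worker=0.5)
    print("cheap qual:", best_worker_count(64, cpu_work=10.0, **machine))
    print("expensive qual:", best_worker_count(64, cpu_work=1000.0, **machine))
```

In this model the cheap qualification is best served by 8 workers, while the expensive one keeps benefiting until the assumed 32 cores are used up. Beyond those points the per-worker setup cost makes the modelled time grow again.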

I think the important point, which you mentioned as well, is
that there should be a reasonably good cost model which can
account for some or all of these things, so that by using parallel
query the user can achieve the benefit it provides and won't have
to pay the cost in cases where there is little or no benefit.
I am not sure that in the first cut we can come up with a highly
robust cost model, but it should not be so weak that most
of the time the user has to find the right tuning based on the
parameters we are going to add. Based on my understanding and by
referring to existing literature, I will try to come up with the cost
model and then we can have a discussion, if required, on whether it is
good enough for the first cut or not.

I think as discussed previously we need to introduce 2 additional cost
variables (parallel_startup_cost, cpu_tuple_communication_cost) to
estimate the parallel seq scan cost so that when the tables are small
or selectivity is high, it should increase the cost of parallel plan.

I agree that we need to figure out a way to cost out parallel plans, but
I have doubts about these being the right way to do that. There has
been quite a bit of literature regarding parallel execution and
planning- have you had a chance to review anything along those lines?

Not now, but some time back I read quite a few papers on parallelism.
I will refer to some of them again before deciding the exact cost model
and might as well discuss them.

We certainly like to draw on previous experiences and analysis rather
than trying to pave our own way.

With these additional costs comes the consideration that we're looking
for a wall-clock runtime proxy and therefore, while we need to add costs
for parallel startup and tuple communication, we have to reduce the
overall cost because of the parallelism or we'd never end up choosing a
parallel plan. Is the thought to simply add up all the costs and then
divide? Or perhaps to divide the cost of the actual plan but then add
in the parallel startup cost and the tuple communication cost?
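
The two alternatives raised in the question above can be put side by side in a small sketch. The names parallel_startup_cost and cpu_tuple_communication_cost come from the earlier proposal; the numeric values, the row count and the function names are invented for illustration and say nothing about what the patch actually does.

```python
def add_then_divide(scan_cost, startup_cost, comm_cost, workers):
    """Option 1: sum every cost component, then divide by worker count."""
    return (scan_cost + startup_cost + comm_cost) / workers


def divide_then_add(scan_cost, startup_cost, comm_cost, workers):
    """Option 2: divide only the scan cost, then add the parallel overheads."""
    return scan_cost / workers + startup_cost + comm_cost


if __name__ == "__main__":
    workers = 4
    parallel_startup_cost = 100.0        # hypothetical value
    cpu_tuple_communication_cost = 0.1   # hypothetical value
    rows_returned = 2000                 # hypothetical value
    comm_cost = cpu_tuple_communication_cost * rows_returned

    for label, scan_cost in (("large table", 1000.0), ("small table", 200.0)):
        opt1 = add_then_divide(scan_cost, parallel_startup_cost, comm_cost, workers)
        opt2 = divide_then_add(scan_cost, parallel_startup_cost, comm_cost, workers)
        print(f"{label}: serial={scan_cost} option1={opt1} option2={opt2}")
```

With these invented numbers both options prefer the parallel plan for the large table, but for the small table option 1 still reports a cost below the serial cost of 200 while option 2 does not. Only the second form lets the startup and communication overheads count in full, which seems closer to the stated goal of raising the cost of the parallel plan when tables are small or many rows are returned.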

Perhaps there has been prior discussion on these points but I'm thinking
we need a README or similar which discusses all of this and includes any
references out to academic papers or similar as appropriate.

Got the point. I think we need to mention this somewhere, either in a
README or in some file header.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#33Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Amit Kapila (#32)
Re: Parallel Seq Scan

On 12/21/14, 12:42 AM, Amit Kapila wrote:

On Fri, Dec 19, 2014 at 6:21 PM, Stephen Frost <sfrost@snowman.net> wrote:
a. Instead of passing value array, just pass tuple id, but retain the
buffer pin till master backend reads the tuple based on tupleid.
This has side effect that we have to retain buffer pin for longer
period of time, but again that might not have any problem in
real world usage of parallel query.

b. Instead of passing value array, pass directly the tuple which could
be directly propagated by master backend to upper layer or otherwise
in master backend change some code such that it could propagate the
tuple array received via shared memory queue directly to frontend.
Basically save the one extra cycle of form/deform tuple.

Both these need some new message type and handling for same in
Executor code.

Having said above, I think we can try to optimize this in multiple
ways, however we need additional mechanism and changes in Executor
code which is error prone and doesn't seem to be important at this
stage where we want the basic feature to work.

Would b require some means of ensuring we didn't try and pass raw tuples to frontends? Other than that potential wrinkle, it seems like less work than a.

...

I think there are mainly two things which can lead to benefit
by employing parallel workers
a. Better use of available I/O bandwidth
b. Better use of available CPU's by doing expression evaluation
by multiple workers.

...

In the above tests, it seems to me that the maximum benefit due to
'a' is realized upto 4~8 workers

I'd think a good first estimate here would be to just use effective_io_concurrency.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

#34Amit Kapila
amit.kapila16@gmail.com
In reply to: Jim Nasby (#33)
Re: Parallel Seq Scan

On Mon, Dec 22, 2014 at 7:34 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:

On 12/21/14, 12:42 AM, Amit Kapila wrote:

On Fri, Dec 19, 2014 at 6:21 PM, Stephen Frost <sfrost@snowman.net> wrote:

a. Instead of passing value array, just pass tuple id, but retain the
buffer pin till master backend reads the tuple based on tupleid.
This has side effect that we have to retain buffer pin for longer
period of time, but again that might not have any problem in
real world usage of parallel query.

b. Instead of passing value array, pass directly the tuple which could
be directly propagated by master backend to upper layer or otherwise
in master backend change some code such that it could propagate the
tuple array received via shared memory queue directly to frontend.
Basically save the one extra cycle of form/deform tuple.

Both these need some new message type and handling for same in
Executor code.

Having said above, I think we can try to optimize this in multiple
ways, however we need additional mechanism and changes in Executor
code which is error prone and doesn't seem to be important at this
stage where we want the basic feature to work.

Would b require some means of ensuring we didn't try and pass raw tuples
to frontends?

That seems to be there already: before sending the tuple
to the frontend, we ensure that it is deformed (refer to
printtup() -> slot_getallattrs()).

Other than that potential wrinkle, it seems like less work than a.

Here, I am assuming that you are referring to the *pass the tuple
directly* approach. We would also need to devise a new protocol message
and a mechanism to pass the tuple directly via shared memory queues.
Also, I think currently we can send via shared memory queues only the
things which we can send via the FE/BE protocol, and we don't send
tuples directly to the frontend. Apart from this, I am not sure how much
benefit it can give, because it will reduce one part of tuple
communication, but the amount of data transferred will still be almost
the same.

This is an area of improvement which needs more investigation, and even
without this we can get benefit in many cases as shown upthread. I think
after that we can try to parallelize the aggregation (Simon Riggs and
David Rowley have already worked out some infrastructure for the same),
which will surely give us good benefits. So I suggest it's better to
focus on the remaining things needed to get this patch into a shape (in
terms of robustness/stability) where it can be accepted, rather than
trying to optimize tuple communication, which we can do later as well.

...

I think there are mainly two things which can lead to benefit
by employing parallel workers
a. Better use of available I/O bandwidth
b. Better use of available CPU's by doing expression evaluation
by multiple workers.

...

In the above tests, it seems to me that the maximum benefit due to
'a' is realized upto 4~8 workers

I'd think a good first estimate here would be to just use
effective_io_concurrency.

One thing we should be cautious about with this parameter is that
currently it is mapped to the number of pages that need to be
prefetched, and using it for deciding the degree of parallelism could be
slightly tricky; however, I will consider it while working on the cost
model.

Thanks for your suggestions.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#35Thom Brown
thom@linux.com
In reply to: Amit Kapila (#21)
Re: Parallel Seq Scan

On 18 December 2014 at 16:03, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Dec 18, 2014 at 9:22 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Dec 8, 2014 at 10:40 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sat, Dec 6, 2014 at 5:37 PM, Stephen Frost <sfrost@snowman.net> wrote:

So to summarize my understanding, below are the set of things
which I should work on and in the order they are listed.

1. Push down qualification
2. Performance Data
3. Improve the way to push down the information related to worker.
4. Dynamic allocation of work for workers.

I have worked on the patch to accomplish above mentioned points
1, 2 and partly 3 and would like to share the progress with community.

Sorry forgot to attach updated patch in last mail, attaching it now.

When attempting to recreate the plan in your example, I get an error:

➤ psql://thom@[local]:5488/pgbench

# create table t1(c1 int, c2 char(500)) with (fillfactor=10);
CREATE TABLE
Time: 13.653 ms

➤ psql://thom@[local]:5488/pgbench

# insert into t1 values(generate_series(1,100),'amit');
INSERT 0 100
Time: 4.796 ms

➤ psql://thom@[local]:5488/pgbench

# explain select c1 from t1;
ERROR: could not register background process
HINT: You may need to increase max_worker_processes.
Time: 1.659 ms

➤ psql://thom@[local]:5488/pgbench

# show max_worker_processes ;
max_worker_processes
----------------------
8
(1 row)

Time: 0.199 ms

# show parallel_seqscan_degree ;
parallel_seqscan_degree
-------------------------
10
(1 row)

Should I really need to increase max_worker_processes to >=
parallel_seqscan_degree? If so, shouldn't there be a hint here along with
the error message pointing this out? And should the error be produced when
only a *plan* is being requested?

Also, I noticed that where a table is partitioned, the plan isn't
parallelised:

# explain select distinct bid from pgbench_accounts;

                                       QUERY PLAN
----------------------------------------------------------------------------------------
 HashAggregate (cost=1446639.00..1446643.99 rows=499 width=4)
   Group Key: pgbench_accounts.bid
   -> Append (cost=0.00..1321639.00 rows=50000001 width=4)
         -> Seq Scan on pgbench_accounts (cost=0.00..0.00 rows=1 width=4)
         -> Seq Scan on pgbench_accounts_1 (cost=0.00..4279.00 rows=100000 width=4)
         -> Seq Scan on pgbench_accounts_2 (cost=0.00..2640.00 rows=100000 width=4)
         -> Seq Scan on pgbench_accounts_3 (cost=0.00..2640.00 rows=100000 width=4)
         -> Seq Scan on pgbench_accounts_4 (cost=0.00..2640.00 rows=100000 width=4)
         -> Seq Scan on pgbench_accounts_5 (cost=0.00..2640.00 rows=100000 width=4)
         -> Seq Scan on pgbench_accounts_6 (cost=0.00..2640.00 rows=100000 width=4)
         -> Seq Scan on pgbench_accounts_7 (cost=0.00..2640.00 rows=100000 width=4)
 ...
         -> Seq Scan on pgbench_accounts_498 (cost=0.00..2640.00 rows=100000 width=4)
         -> Seq Scan on pgbench_accounts_499 (cost=0.00..2640.00 rows=100000 width=4)
         -> Seq Scan on pgbench_accounts_500 (cost=0.00..2640.00 rows=100000 width=4)
(504 rows)

Is this expected?

Thom

#36Thom Brown
thom@linux.com
In reply to: Thom Brown (#35)
Re: Parallel Seq Scan

On 31 December 2014 at 14:20, Thom Brown <thom@linux.com> wrote:

On 18 December 2014 at 16:03, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Dec 18, 2014 at 9:22 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Dec 8, 2014 at 10:40 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sat, Dec 6, 2014 at 5:37 PM, Stephen Frost <sfrost@snowman.net> wrote:

So to summarize my understanding, below are the set of things
which I should work on and in the order they are listed.

1. Push down qualification
2. Performance Data
3. Improve the way to push down the information related to worker.
4. Dynamic allocation of work for workers.

I have worked on the patch to accomplish above mentioned points
1, 2 and partly 3 and would like to share the progress with community.

Sorry forgot to attach updated patch in last mail, attaching it now.

When attempting to recreate the plan in your example, I get an error:

➤ psql://thom@[local]:5488/pgbench

# create table t1(c1 int, c2 char(500)) with (fillfactor=10);
CREATE TABLE
Time: 13.653 ms

➤ psql://thom@[local]:5488/pgbench

# insert into t1 values(generate_series(1,100),'amit');
INSERT 0 100
Time: 4.796 ms

➤ psql://thom@[local]:5488/pgbench

# explain select c1 from t1;
ERROR: could not register background process
HINT: You may need to increase max_worker_processes.
Time: 1.659 ms

➤ psql://thom@[local]:5488/pgbench

# show max_worker_processes ;
max_worker_processes
----------------------
8
(1 row)

Time: 0.199 ms

# show parallel_seqscan_degree ;
parallel_seqscan_degree
-------------------------
10
(1 row)

Should I really need to increase max_worker_processes to >=
parallel_seqscan_degree? If so, shouldn't there be a hint here along with
the error message pointing this out? And should the error be produced when
only a *plan* is being requested?

Also, I noticed that where a table is partitioned, the plan isn't
parallelised:

# explain select distinct bid from pgbench_accounts;

                                       QUERY PLAN
----------------------------------------------------------------------------------------
 HashAggregate (cost=1446639.00..1446643.99 rows=499 width=4)
   Group Key: pgbench_accounts.bid
   -> Append (cost=0.00..1321639.00 rows=50000001 width=4)
         -> Seq Scan on pgbench_accounts (cost=0.00..0.00 rows=1 width=4)
         -> Seq Scan on pgbench_accounts_1 (cost=0.00..4279.00 rows=100000 width=4)
         -> Seq Scan on pgbench_accounts_2 (cost=0.00..2640.00 rows=100000 width=4)
         -> Seq Scan on pgbench_accounts_3 (cost=0.00..2640.00 rows=100000 width=4)
         -> Seq Scan on pgbench_accounts_4 (cost=0.00..2640.00 rows=100000 width=4)
         -> Seq Scan on pgbench_accounts_5 (cost=0.00..2640.00 rows=100000 width=4)
         -> Seq Scan on pgbench_accounts_6 (cost=0.00..2640.00 rows=100000 width=4)
         -> Seq Scan on pgbench_accounts_7 (cost=0.00..2640.00 rows=100000 width=4)
 ...
         -> Seq Scan on pgbench_accounts_498 (cost=0.00..2640.00 rows=100000 width=4)
         -> Seq Scan on pgbench_accounts_499 (cost=0.00..2640.00 rows=100000 width=4)
         -> Seq Scan on pgbench_accounts_500 (cost=0.00..2640.00 rows=100000 width=4)
(504 rows)

Is this expected?

Another issue (FYI, pgbench2 initialised with: pgbench -i -s 100 -F 10
pgbench2):

➤ psql://thom@[local]:5488/pgbench2

# explain select distinct bid from pgbench_accounts;
                                        QUERY PLAN
-------------------------------------------------------------------------------------------
 HashAggregate (cost=245833.38..245834.38 rows=100 width=4)
   Group Key: bid
   -> Parallel Seq Scan on pgbench_accounts (cost=0.00..220833.38 rows=10000000 width=4)
         Number of Workers: 8
         Number of Blocks Per Workers: 208333
(5 rows)

Time: 7.476 ms

➤ psql://thom@[local]:5488/pgbench2

# explain (analyse, buffers, verbose) select distinct bid from pgbench_accounts;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
Time: 14897.991 ms

The logs say:

2014-12-31 15:21:42 GMT [9164]: [240-1] user=,db=,client= LOG: registering background worker "backend_worker"
2014-12-31 15:21:42 GMT [9164]: [241-1] user=,db=,client= LOG: registering background worker "backend_worker"
2014-12-31 15:21:42 GMT [9164]: [242-1] user=,db=,client= LOG: registering background worker "backend_worker"
2014-12-31 15:21:42 GMT [9164]: [243-1] user=,db=,client= LOG: registering background worker "backend_worker"
2014-12-31 15:21:42 GMT [9164]: [244-1] user=,db=,client= LOG: registering background worker "backend_worker"
2014-12-31 15:21:42 GMT [9164]: [245-1] user=,db=,client= LOG: registering background worker "backend_worker"
2014-12-31 15:21:42 GMT [9164]: [246-1] user=,db=,client= LOG: registering background worker "backend_worker"
2014-12-31 15:21:42 GMT [9164]: [247-1] user=,db=,client= LOG: registering background worker "backend_worker"
2014-12-31 15:21:42 GMT [9164]: [248-1] user=,db=,client= LOG: starting background worker process "backend_worker"
2014-12-31 15:21:42 GMT [9164]: [249-1] user=,db=,client= LOG: starting background worker process "backend_worker"
2014-12-31 15:21:42 GMT [9164]: [250-1] user=,db=,client= LOG: starting background worker process "backend_worker"
2014-12-31 15:21:42 GMT [9164]: [251-1] user=,db=,client= LOG: starting background worker process "backend_worker"
2014-12-31 15:21:42 GMT [9164]: [252-1] user=,db=,client= LOG: starting background worker process "backend_worker"
2014-12-31 15:21:42 GMT [9164]: [253-1] user=,db=,client= LOG: starting background worker process "backend_worker"
2014-12-31 15:21:42 GMT [9164]: [254-1] user=,db=,client= LOG: starting background worker process "backend_worker"
2014-12-31 15:21:42 GMT [9164]: [255-1] user=,db=,client= LOG: starting background worker process "backend_worker"
2014-12-31 15:21:46 GMT [9164]: [256-1] user=,db=,client= LOG: worker process: backend_worker (PID 10887) exited with exit code 1
2014-12-31 15:21:46 GMT [9164]: [257-1] user=,db=,client= LOG: unregistering background worker "backend_worker"
2014-12-31 15:21:50 GMT [9164]: [258-1] user=,db=,client= LOG: worker process: backend_worker (PID 10888) exited with exit code 1
2014-12-31 15:21:50 GMT [9164]: [259-1] user=,db=,client= LOG: unregistering background worker "backend_worker"
2014-12-31 15:21:57 GMT [9164]: [260-1] user=,db=,client= LOG: server process (PID 10869) was terminated by signal 9: Killed
2014-12-31 15:21:57 GMT [9164]: [261-1] user=,db=,client= DETAIL: Failed process was running: explain (analyse, buffers, verbose) select distinct bid from pgbench_accounts;
2014-12-31 15:21:57 GMT [9164]: [262-1] user=,db=,client= LOG: terminating any other active server processes

Running it again, I get the same issue. This is with
parallel_seqscan_degree set to 8, and the crash occurs with 4 and 2 too.

This doesn't happen if I set the pgbench scale to 50. I suspect this is
an OOM issue. My laptop has 16GB RAM, the table is around 13GB at scale
100, and I don't have swap enabled. But I'm concerned it crashes the
whole instance.
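
As a cross-check of the table size, the figures printed in the parallel plan above can be multiplied out. This is only a back-of-the-envelope sketch; it assumes the default 8 kB block size, which is not stated in the thread, and that the per-worker block count shown by EXPLAIN is an even split of the whole table.

```python
BLOCK_SIZE = 8192            # assumed: default PostgreSQL block size in bytes
workers = 8                  # "Number of Workers" from the plan
blocks_per_worker = 208333   # "Number of Blocks Per Workers" from the plan

total_blocks = workers * blocks_per_worker
table_bytes = total_blocks * BLOCK_SIZE

print(total_blocks)                    # 1666664
print(round(table_bytes / 10**9, 2))   # 13.65 (GB)
print(round(table_bytes / 2**30, 2))   # 12.72 (GiB)
```

That is consistent with the "around 13GB" figure and is roughly four fifths of the machine's 16GB of RAM. It confirms only the size of the table being scanned; it does not by itself show which process consumed the memory before the kill.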

I also notice that requesting BUFFERS in a parallel EXPLAIN output yields
no such information. Is that not possible to report?
--
Thom

#37Amit Kapila
amit.kapila16@gmail.com
In reply to: Thom Brown (#35)
Re: Parallel Seq Scan

On Wed, Dec 31, 2014 at 7:50 PM, Thom Brown <thom@linux.com> wrote:

When attempting to recreate the plan in your example, I get an error:

➤ psql://thom@[local]:5488/pgbench

# create table t1(c1 int, c2 char(500)) with (fillfactor=10);
CREATE TABLE
Time: 13.653 ms

➤ psql://thom@[local]:5488/pgbench

# insert into t1 values(generate_series(1,100),'amit');
INSERT 0 100
Time: 4.796 ms

➤ psql://thom@[local]:5488/pgbench

# explain select c1 from t1;
ERROR: could not register background process
HINT: You may need to increase max_worker_processes.
Time: 1.659 ms

➤ psql://thom@[local]:5488/pgbench

# show max_worker_processes ;
max_worker_processes
----------------------
8
(1 row)

Time: 0.199 ms

# show parallel_seqscan_degree ;
parallel_seqscan_degree
-------------------------
10
(1 row)

Should I really need to increase max_worker_processes to >=
parallel_seqscan_degree?

Yes. The parallel workers are implemented as dynamic bgworkers, so
their number is limited by max_worker_processes.

If so, shouldn't there be a hint here along with the error message
pointing this out? And should the error be produced when only a *plan* is
being requested?

I think one thing we could do to minimize the chance of such an
error is to set the number of parallel workers used for the plan equal
to max_worker_processes if parallel_seqscan_degree is greater
than max_worker_processes. Even if we do this, such an error can
still occur if the user has registered other bgworkers before we
start executing the parallel plan.
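
A minimal sketch of the clamping idea described above. The function name and its parameters are hypothetical and do not come from the patch; the two arguments stand in for the values of parallel_seqscan_degree and max_worker_processes.

```c
#include <assert.h>

/*
 * Hypothetical helper: cap the number of workers the planner assumes.
 * requested_degree stands in for parallel_seqscan_degree and
 * max_workers for max_worker_processes.
 * Returns the worker count to plan with; 0 means a plain seq scan.
 */
int
clamp_planned_workers(int requested_degree, int max_workers)
{
	assert(requested_degree >= 0 && max_workers >= 0);

	if (requested_degree > max_workers)
		return max_workers;
	return requested_degree;
}
```

As noted in the mail, this only narrows the window for the error: slots can still be taken by other bgworkers between planning and execution.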

Also, I noticed that where a table is partitioned, the plan isn't
parallelised:

Is this expected?

Yes. To keep the initial implementation simple, it allows a
parallel plan only when there is a single table in the query.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#38Amit Kapila
amit.kapila16@gmail.com
In reply to: Thom Brown (#36)
Re: Parallel Seq Scan

On Wed, Dec 31, 2014 at 9:46 PM, Thom Brown <thom@linux.com> wrote:

Another issue (FYI, pgbench2 initialised with: pgbench -i -s 100 -F 10 pgbench2):

➤ psql://thom@[local]:5488/pgbench2

# explain (analyse, buffers, verbose) select distinct bid from pgbench_accounts;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
Time: 14897.991 ms

2014-12-31 15:21:57 GMT [9164]: [260-1] user=,db=,client= LOG: server process (PID 10869) was terminated by signal 9: Killed
2014-12-31 15:21:57 GMT [9164]: [261-1] user=,db=,client= DETAIL: Failed process was running: explain (analyse, buffers, verbose) select distinct bid from pgbench_accounts;
2014-12-31 15:21:57 GMT [9164]: [262-1] user=,db=,client= LOG: terminating any other active server processes

Running it again, I get the same issue. This is with
parallel_seqscan_degree set to 8, and the crash occurs with 4 and 2 too.

This doesn't happen if I set the pgbench scale to 50. I suspect this is
an OOM issue. My laptop has 16GB RAM, the table is around 13GB at scale
100, and I don't have swap enabled. But I'm concerned it crashes the whole
instance.

Isn't this a backend crash due to OOM?
After that, the server will restart automatically.

I also notice that requesting BUFFERS in a parallel EXPLAIN output yields
no such information.

Yes. The reason is that all the BUFFERS-related work is done by the
backend workers; the master backend doesn't read any pages, so it is not
able to accumulate this information.

Is that not possible to report?

It is not impossible to report such information; we can develop some
way to share it between the master backend and the workers.
I think we can do this, if required, once the patch is more stabilized.
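
A rough sketch of what sharing this information could look like, assuming each worker publishes its own counters somewhere the master can read them (for example in the dynamic shared memory segment). The struct and function names are hypothetical and are not part of the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-process counters, loosely modelled on the BUFFERS line. */
typedef struct BufferCounts
{
	long	shared_hit;		/* pages found in shared buffers */
	long	shared_read;	/* pages read in from outside shared buffers */
} BufferCounts;

/*
 * Add the counters published by each worker to the master's own totals
 * and return the combined result.
 */
BufferCounts
accumulate_worker_buffers(BufferCounts master,
						  const BufferCounts *workers, size_t nworkers)
{
	size_t	i;

	assert(workers != NULL || nworkers == 0);

	for (i = 0; i < nworkers; i++)
	{
		master.shared_hit += workers[i].shared_hit;
		master.shared_read += workers[i].shared_read;
	}
	return master;
}
```

The master would call something like this once all workers have finished, before EXPLAIN prints the node's statistics.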

Thanks for looking into patch and reporting the issues.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#39Fabrízio de Royes Mello
fabriziomello@gmail.com
In reply to: Amit Kapila (#37)
Re: Parallel Seq Scan

I think one thing we could do to minimize the chance of such an
error is to set the number of parallel workers used for the plan equal
to max_worker_processes if parallel_seqscan_degree is greater
than max_worker_processes. Even if we do this, such an error can
still occur if the user has registered other bgworkers before we
start executing the parallel plan.

Can we check the number of free bgworkers slots to set the max workers?

Regards,

Fabrízio Mello

--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL

Timbira: http://www.timbira.com.br
Blog: http://fabriziomello.github.io
Linkedin: http://br.linkedin.com/in/fabriziomello
Twitter: http://twitter.com/fabriziomello
Github: http://github.com/fabriziomello

#40Robert Haas
robertmhaas@gmail.com
In reply to: Fabrízio de Royes Mello (#39)
Re: Parallel Seq Scan

On Thu, Jan 1, 2015 at 12:00 PM, Fabrízio de Royes Mello
<fabriziomello@gmail.com> wrote:

Can we check the number of free bgworkers slots to set the max workers?

The real solution here is that this patch can't throw an error if it's
unable to obtain the desired number of background workers. It needs
to be able to smoothly degrade to a smaller number of background
workers, or none at all. I think a lot of this work will fall out
quite naturally if this patch is reworked to use the parallel
mode/parallel context stuff, the latest version of which includes an
example of how to set up a parallel scan in such a manner that it can
run with any number of workers.
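
To make the idea of degrading to fewer workers concrete, here is a minimal sketch of dividing a relation's blocks among however many workers were actually launched, including none. It is an illustration only: the type and function names are hypothetical, and the equal split with the remainder going to the last worker follows the scheme described at the start of this thread rather than the parallel context example mentioned above.

```c
#include <assert.h>

/* Hypothetical description of the block range one participant scans. */
typedef struct BlockRange
{
	unsigned	start;		/* first block to scan */
	unsigned	count;		/* number of blocks to scan */
} BlockRange;

/*
 * Split nblocks among the nlaunched workers that actually started.
 * participant runs from 0 to nlaunched - 1.  With nlaunched == 0 the
 * caller (the master backend) scans the whole relation itself.
 * The last participant also takes the remainder blocks.
 */
BlockRange
block_range_for(unsigned nblocks, unsigned nlaunched, unsigned participant)
{
	BlockRange	r;
	unsigned	per_worker;

	if (nlaunched == 0)
	{
		r.start = 0;
		r.count = nblocks;
		return r;
	}

	assert(participant < nlaunched);
	per_worker = nblocks / nlaunched;
	r.start = participant * per_worker;
	r.count = per_worker;
	if (participant == nlaunched - 1)
		r.count += nblocks % nlaunched;
	return r;
}
```

Because the split is computed from the number of workers launched rather than the number planned, the same code path works whether eight workers start, three start, or none do.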

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#41Thom Brown
thom@linux.com
In reply to: Robert Haas (#40)
Re: Parallel Seq Scan

On 1 January 2015 at 17:59, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jan 1, 2015 at 12:00 PM, Fabrízio de Royes Mello
<fabriziomello@gmail.com> wrote:

Can we check the number of free bgworkers slots to set the max workers?

The real solution here is that this patch can't throw an error if it's
unable to obtain the desired number of background workers. It needs
to be able to smoothly degrade to a smaller number of background
workers, or none at all. I think a lot of this work will fall out
quite naturally if this patch is reworked to use the parallel
mode/parallel context stuff, the latest version of which includes an
example of how to set up a parallel scan in such a manner that it can
run with any number of workers.

+1

That sounds like exactly what's needed.

Thom

#42Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#40)
Re: Parallel Seq Scan

On Thu, Jan 1, 2015 at 11:29 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jan 1, 2015 at 12:00 PM, Fabrízio de Royes Mello
<fabriziomello@gmail.com> wrote:

Can we check the number of free bgworkers slots to set the max workers?

The real solution here is that this patch can't throw an error if it's
unable to obtain the desired number of background workers. It needs
to be able to smoothly degrade to a smaller number of background
workers, or none at all.

I think handling it this way can have one side effect: if we
degrade to a smaller number of workers, then the cost of the plan
(which the optimizer decided based on the number of parallel workers)
could be more than that of a non-parallel scan.
Ideally, before finalizing the parallel plan we should reserve the
bgworkers required to execute that plan, but I think as of now
we can work out a solution without it.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#43Thom Brown
thom@linux.com
In reply to: Amit Kapila (#38)
Re: Parallel Seq Scan

On 1 January 2015 at 10:34, Amit Kapila <amit.kapila16@gmail.com> wrote:

Running it again, I get the same issue. This is with
parallel_seqscan_degree set to 8, and the crash occurs with 4 and 2 too.

This doesn't happen if I set the pgbench scale to 50. I suspect this is
an OOM issue. My laptop has 16GB RAM, the table is around 13GB at scale
100, and I don't have swap enabled. But I'm concerned it crashes the whole
instance.

Isn't this a backend crash due to OOM?
After that, the server will restart automatically.

Yes, I'm fairly sure it is. I guess what I'm confused about is that 8
parallel sequential scans in separate sessions (1 per session) don't cause
the server to crash, but in a single session (8 in 1 session), they do.

I also notice that requesting BUFFERS in a parallel EXPLAIN output
yields no such information.

Yes. The reason is that all the BUFFERS-related work is done by the
backend workers; the master backend doesn't read any pages, so it is not
able to accumulate this information.

Is that not possible to report?

It is not impossible to report such information; we can develop some
way to share it between the master backend and the workers.
I think we can do this, if required, once the patch is more stabilized.

Ah great, as I think losing such information to this feature would be
unfortunate.

Will there be a GUC to influence parallel scan cost? Or does it take into
account effective_io_concurrency in the costs?

And will the planner be able to decide whether or not it'll choose to use
background workers or not? For example:

# explain (analyse, buffers, verbose) select distinct bid from pgbench_accounts;
                                                              QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=89584.00..89584.05 rows=5 width=4) (actual time=228.222..228.224 rows=5 loops=1)
   Output: bid
   Group Key: pgbench_accounts.bid
   Buffers: shared hit=83334
   ->  Seq Scan on public.pgbench_accounts  (cost=0.00..88334.00 rows=500000 width=4) (actual time=0.008..136.522 rows=500000 loops=1)
         Output: bid
         Buffers: shared hit=83334
 Planning time: 0.071 ms
 Execution time: 228.265 ms
(9 rows)

This is a quick plan, but if we tell it that it's allowed 8 background
workers:

# set parallel_seqscan_degree = 8;
SET
Time: 0.187 ms

# explain (analyse, buffers, verbose) select distinct bid from pgbench_accounts;
                                                                   QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=12291.75..12291.80 rows=5 width=4) (actual time=603.042..603.042 rows=1 loops=1)
   Output: bid
   Group Key: pgbench_accounts.bid
   ->  Parallel Seq Scan on public.pgbench_accounts  (cost=0.00..11041.75 rows=500000 width=4) (actual time=2.445..529.284 rows=500000 loops=1)
         Output: bid
         Number of Workers: 8
         Number of Blocks Per Workers: 10416
 Planning time: 0.049 ms
 Execution time: 663.103 ms
(9 rows)

Time: 663.437 ms

It's significantly slower. I'd hope the planner would anticipate this and
decide, "I'm just gonna perform a single scan in this instance as it'll be
a lot quicker for this simple case." So at the moment
parallel_seqscan_degree seems to mean "You *must* use this many threads if
you can parallelise." Ideally we'd be saying "can use up to if necessary".

Thanks

Thom

#44Amit Kapila
amit.kapila16@gmail.com
In reply to: Thom Brown (#43)
Re: Parallel Seq Scan

On Fri, Jan 2, 2015 at 4:09 PM, Thom Brown <thom@linux.com> wrote:

On 1 January 2015 at 10:34, Amit Kapila <amit.kapila16@gmail.com> wrote:

Running it again, I get the same issue. This is with
parallel_seqscan_degree set to 8, and the crash occurs with 4 and 2 too.

This doesn't happen if I set the pgbench scale to 50. I suspect this
is an OOM issue. My laptop has 16GB RAM, the table is around 13GB at scale
100, and I don't have swap enabled. But I'm concerned it crashes the whole
instance.

Isn't this a backend crash due to OOM?
After that, the server will restart automatically.

Yes, I'm fairly sure it is. I guess what I'm confused about is that 8
parallel sequential scans in separate sessions (1 per session) don't cause
the server to crash, but in a single session (8 in 1 session), they do.

It is possible that the master backend retains some memory for a
longer period, which causes it to hit the OOM error. By the way,
in your test, is it always the master backend that hits OOM, or is it
random (either master or worker)?

Will there be a GUC to influence parallel scan cost? Or does it take
into account effective_io_concurrency in the costs?

And will the planner be able to decide whether or not it'll choose to use
background workers or not? For example:

Yes, we are planning to introduce a cost model for parallel
communication (there is some discussion about this
upthread), but it's not there yet, and that's why you
are seeing it choose a parallel plan when it shouldn't.
Currently in the patch, if you set parallel_seqscan_degree, it
will most probably choose the parallel plan.
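
For illustration, here is a small sketch contrasting the simple cost model the patch currently uses (scan cost divided by the number of workers) with one that also charges for worker startup and for passing tuples back to the master. The overhead terms, their names and any values plugged into them are assumptions made up for this example; they are not parameters of the patch.

```c
#include <assert.h>

/* Simple model: divide the sequential scan cost by the number of workers. */
double
naive_parallel_cost(double seq_cost, int nworkers)
{
	assert(nworkers > 0);
	return seq_cost / nworkers;
}

/*
 * Hypothetical refinement: charge a startup cost per worker and a transfer
 * cost per tuple sent back to the master.  Both cost parameters are
 * invented for illustration; they are not GUCs from the patch.
 */
double
parallel_cost_with_overhead(double seq_cost, int nworkers, double ntuples,
							double worker_startup_cost,
							double tuple_transfer_cost)
{
	return naive_parallel_cost(seq_cost, nworkers)
		+ nworkers * worker_startup_cost
		+ ntuples * tuple_transfer_cost;
}

/* Returns 1 if the parallel path is estimated to be cheaper, else 0. */
int
prefer_parallel(double seq_cost, double parallel_cost)
{
	return parallel_cost < seq_cost;
}
```

With the simple model, a scan costed at 88334.00 drops to 11041.75 for 8 workers, so the parallel path always wins. Once a per-tuple transfer charge is included, a query that returns every row to the master can come out more expensive than the plain scan, which is the behaviour the planner would need in order to reject the parallel plan in such cases.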

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#45Thom Brown
thom@linux.com
In reply to: Amit Kapila (#44)
Re: Parallel Seq Scan

On 2 January 2015 at 11:13, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jan 2, 2015 at 4:09 PM, Thom Brown <thom@linux.com> wrote:

On 1 January 2015 at 10:34, Amit Kapila <amit.kapila16@gmail.com> wrote:

Running it again, I get the same issue. This is with
parallel_seqscan_degree set to 8, and the crash occurs with 4 and 2 too.

This doesn't happen if I set the pgbench scale to 50. I suspect this
is an OOM issue. My laptop has 16GB RAM, the table is around 13GB at scale
100, and I don't have swap enabled. But I'm concerned it crashes the whole
instance.

Isn't this a backend crash due to OOM?
After that, the server will restart automatically.

Yes, I'm fairly sure it is. I guess what I'm confused about is that 8
parallel sequential scans in separate sessions (1 per session) don't cause
the server to crash, but in a single session (8 in 1 session), they do.

It is possible that the master backend retains some memory for a
longer period, which causes it to hit the OOM error. By the way,
in your test, is it always the master backend that hits OOM, or is it
random (either master or worker)?

Just ran a few tests, and it appears to always be the master that hits OOM,
or at least I don't seem to be able to get an example of the worker hitting
it.

Will there be a GUC to influence parallel scan cost? Or does it take
into account effective_io_concurrency in the costs?

And will the planner be able to decide whether or not it'll choose to
use background workers or not? For example:

Yes, we are planning to introduce a cost model for parallel
communication (there is some discussion about this
upthread), but it's not there yet, and that's why you
are seeing it choose a parallel plan when it shouldn't.
Currently in the patch, if you set parallel_seqscan_degree, it
will most probably choose the parallel plan.

Ah, okay. Great.

Thanks.

Thom

#46Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#42)
Re: Parallel Seq Scan

On Fri, Jan 2, 2015 at 5:36 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Jan 1, 2015 at 11:29 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jan 1, 2015 at 12:00 PM, Fabrízio de Royes Mello
<fabriziomello@gmail.com> wrote:

Can we check the number of free bgworkers slots to set the max workers?

The real solution here is that this patch can't throw an error if it's
unable to obtain the desired number of background workers. It needs
to be able to smoothly degrade to a smaller number of background
workers, or none at all.

I think handling it this way can have one side effect: if we
degrade to a smaller number of workers, then the cost of the plan
(which the optimizer decided based on the number of parallel workers)
could be more than that of a non-parallel scan.
Ideally, before finalizing the parallel plan we should reserve the
bgworkers required to execute that plan, but I think as of now
we can work out a solution without it.

I don't think this is very practical. When cached plans are in use,
we can have a bunch of plans sitting around that may or may not get
reused at some point in the future, possibly far in the future. The
current situation, which I think we want to maintain, is that such
plans hold no execution-time resources (e.g. locks) and, generally,
don't interfere with other things people might want to execute on the
system. Nailing down a bunch of background workers just in case we
might want to use them in the future would be pretty unfriendly.

I think it's right to view this in the same way we view work_mem. We
plan on the assumption that an amount of memory equal to work_mem will
be available at execution time, without actually reserving it. If the
plan happens to need that amount of memory and if it actually isn't
available when needed, then performance will suck; conceivably, the
OOM killer might trigger. But it's the user's job to avoid this by
not setting work_mem too high in the first place. Whether this system
is for the best is arguable: one can certainly imagine a system where,
if there's not enough memory at execution time, we consider
alternatives like (a) replanning with a lower memory target, (b)
waiting until more memory is available, or (c) failing outright in
lieu of driving the machine into swap. But devising such a system is
complicated -- for example, replanning with a lower memory target
might latch onto a far more expensive plan, such that we would have
been better off waiting for more memory to be available; yet trying to
wait until more memory is available might result in waiting
forever. And that's why we don't have such a system.

We don't need to do any better here. The GUC should tell us how many
parallel workers we should anticipate being able to obtain. If other
settings on the system, or the overall system load, preclude us from
obtaining that number of parallel workers, then the query will take
longer to execute; and the plan might be sub-optimal. If that happens
frequently, the user should lower the planner GUC to a level that
reflects the resources actually likely to be available at execution
time.

By the way, another area where this kind of effect crops up is with
the presence of particular disk blocks in shared_buffers or the system
buffer cache. Right now, the planner makes no attempt to cost a scan
of a frequently-used, fully-cached relation differently than a
rarely-used, probably-not-cached relation; and that sometimes leads to
bad plans. But if it did try to do that, then we'd have the same kind
of problem discussed here -- things might change between planning and
execution, or even after the beginning of execution. Also, we might
get nasty feedback effects: since the relation isn't cached, we view a
plan that would involve reading it in as very expensive, and avoid
such a plan. However, we might be better off picking the "slow" plan
anyway, because it might be that once we've read the data once it will
stay cached and run much more quickly than some plan that seems better
starting from a cold cache.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#47Stephen Frost
sfrost@snowman.net
In reply to: Robert Haas (#46)
Re: Parallel Seq Scan

* Robert Haas (robertmhaas@gmail.com) wrote:

I think it's right to view this in the same way we view work_mem. We
plan on the assumption that an amount of memory equal to work_mem will
be available at execution time, without actually reserving it.

Agreed- this seems like a good approach for how to address this. We
should still be able to end up with plans which use less than the max
possible parallel workers though, as I pointed out somewhere up-thread.
This is also similar to work_mem- we certainly have plans which don't
expect to use all of work_mem and others that expect to use all of it
(per node, of course).

Thanks,

Stephen

#48Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#46)
Re: Parallel Seq Scan

On Mon, Jan 5, 2015 at 8:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Jan 2, 2015 at 5:36 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Jan 1, 2015 at 11:29 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jan 1, 2015 at 12:00 PM, Fabrízio de Royes Mello
<fabriziomello@gmail.com> wrote:

Can we check the number of free bgworkers slots to set the max workers?

The real solution here is that this patch can't throw an error if it's
unable to obtain the desired number of background workers. It needs
to be able to smoothly degrade to a smaller number of background
workers, or none at all.

I think handling it this way can have one side effect: if we
degrade to a smaller number of workers, then the cost of the plan
(which the optimizer decided based on the number of parallel workers)
could be more than that of a non-parallel scan.
Ideally, before finalizing the parallel plan we should reserve the
bgworkers required to execute that plan, but I think as of now
we can work out a solution without it.

I don't think this is very practical. When cached plans are in use,
we can have a bunch of plans sitting around that may or may not get
reused at some point in the future, possibly far in the future. The
current situation, which I think we want to maintain, is that such
plans hold no execution-time resources (e.g. locks) and, generally,
don't interfere with other things people might want to execute on the
system. Nailing down a bunch of background workers just in case we
might want to use them in the future would be pretty unfriendly.

I think it's right to view this in the same way we view work_mem. We
plan on the assumption that an amount of memory equal to work_mem will
be available at execution time, without actually reserving it.

Are we sure that in such cases we will consume work_mem during
execution? In the case of parallel workers we are sure, to an extent,
that if we reserve the workers then we will use them during execution.
Nonetheless, I have proceeded and integrated the parallel_seqscan
patch with v0.3 of the parallel_mode patch posted by you at the link below:
/messages/by-id/CA+TgmoYmp_=XcJEhvJZt9P8drBgW-pDpjHxBhZA79+M4o-CZQA@mail.gmail.com

A few things to note about this integrated patch:
1. In this new patch, I have just integrated it with Robert's parallel_mode
patch and not done any further development or fixed known issues
like changes in the optimizer, prepared queries, etc. You might notice
that the new patch is smaller than the previous patch; the
reason is that there was some duplicate code between the previous
version of the parallel_seqscan patch and parallel_mode, which I have
eliminated.

2. To enable two types of shared memory queues (error queue and
tuple queue), we need to ensure that we switch to the appropriate queue
when sending various messages from a parallel worker
to the master backend. There are two ways to do it:
a. Save the information about the error queue during startup of the parallel
worker (ParallelMain()) and then, on error, set the same (switch
to the error queue in errstart(), and switch back to the tuple queue in
errfinish(), or in errstart() in case errstart() doesn't need to
propagate the error).
b. Do something similar to (a) for the tuple queue in printtup, or any
other place that sends non-error messages.
I think approach (a) is slightly better than approach (b), as with (b)
we need to switch many times for the tuple queue (once for each tuple) and
there could be multiple places where we need to do the same. For now,
I have used approach (a) in the patch, which needs some more work if we
agree on it.

3. The current implementation of parallel_seqscan needs to use
some information from parallel.c which was not exposed, so I have
exposed it by moving it to parallel.h. The information that is required
is as follows:
ParallelWorkerNumber, FixedParallelState and shm keys -
This is used to decide the blocks that need to be scanned.
We might change the way parallel scan/work distribution
is done in the future, but I don't see any harm in exposing this information.

4. Sending ReadyForQuery

If the
plan happens to need that amount of memory and if it actually isn't
available when needed, then performance will suck; conceivably, the
OOM killer might trigger. But it's the user's job to avoid this by
not setting work_mem too high in the first place. Whether this system
is for the best is arguable: one can certainly imagine a system where,
if there's not enough memory at execution time, we consider
alternatives like (a) replanning with a lower memory target, (b)
waiting until more memory is available, or (c) failing outright in
lieu of driving the machine into swap. But devising such a system is
complicated -- for example, replanning with a lower memory target
might latch onto a far more expensive plan, such that we would have
been better off waiting for more memory to be available; yet trying to
wait until more memory is available might result in waiting
forever. And that's why we don't have such a system.

We don't need to do any better here. The GUC should tell us how many
parallel workers we should anticipate being able to obtain. If other
settings on the system, or the overall system load, preclude us from
obtaining that number of parallel workers, then the query will take
longer to execute; and the plan might be sub-optimal. If that happens
frequently, the user should lower the planner GUC to a level that
reflects the resources actually likely to be available at execution
time.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#49Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#48)
1 attachment(s)
Re: Parallel Seq Scan

On Thu, Jan 8, 2015 at 5:12 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jan 5, 2015 at 8:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Sorry for the incomplete mail sent prior to this; I hit the send button
by mistake.

4. Sending ReadyForQuery() after all the tuples have been sent,
as that is required to know that all the tuples have been received. I think
we should send it on the tuple queue rather than on the error queue.
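
As a toy illustration of why an end-of-stream message on the tuple queue is useful, the sketch below drains several worker queues and stops only once each one has delivered its ReadyForQuery ('Z') message. The queues here are plain strings of protocol message type bytes ('D' for DataRow); the ToyQueue type and drain_queues() are hypothetical stand-ins, not the shm_mq API or code from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for one worker's tuple queue: a string of message types. */
typedef struct ToyQueue
{
	const char *msgs;		/* e.g. "DDZ": two data rows, then ReadyForQuery */
	size_t		pos;		/* index of the next message to read */
	int			done;		/* set once 'Z' has been seen */
} ToyQueue;

/*
 * Read from all queues round-robin until every queue has delivered its
 * 'Z' message.  Returns the number of 'D' (DataRow) messages seen.
 */
size_t
drain_queues(ToyQueue *queues, size_t nqueues)
{
	size_t	nrows = 0;
	size_t	remaining = nqueues;
	size_t	i = 0;

	while (remaining > 0)
	{
		ToyQueue   *q = &queues[i];

		if (!q->done)
		{
			char	msgtype = q->msgs[q->pos];

			assert(msgtype != '\0');	/* each stream must end with 'Z' */
			q->pos++;
			if (msgtype == 'D')
				nrows++;
			else if (msgtype == 'Z')
			{
				q->done = 1;
				remaining--;
			}
			/* other message types are ignored in this sketch */
		}
		i = (i + 1) % nqueues;
	}
	return nrows;
}
```

Without the per-queue 'Z' marker the reader could not distinguish a worker that has finished from one that has simply not produced its next tuple yet.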

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

parallel_seqscan_v3.patch (application/octet-stream)
diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index 21721b4..823d5c3 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/access
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS	    = brin common gin gist hash heap index nbtree rmgrdesc spgist transam
+SUBDIRS	    = brin common gin gist hash heap index nbtree rmgrdesc shmmq spgist transam
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/shmmq/Makefile b/src/backend/access/shmmq/Makefile
new file mode 100644
index 0000000..aeae8d9
--- /dev/null
+++ b/src/backend/access/shmmq/Makefile
@@ -0,0 +1,17 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for access/shmmq
+#
+# IDENTIFICATION
+#    src/backend/access/shmmq/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/shmmq
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = shmmqam.o 
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/shmmq/shmmqam.c b/src/backend/access/shmmq/shmmqam.c
new file mode 100644
index 0000000..91fbea5
--- /dev/null
+++ b/src/backend/access/shmmq/shmmqam.c
@@ -0,0 +1,359 @@
+/*-------------------------------------------------------------------------
+ *
+ * shmmqam.c
+ *	  shared memory queue access method code
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/shmmq/shmmqam.c
+ *
+ *
+ * INTERFACE ROUTINES
+ *		shm_getnext	- retrieve next tuple in queue
+ *
+ * NOTES
+ *	  This file contains the shmmq_ routines which implement
+ *	  the POSTGRES shared memory access method used for all POSTGRES
+ *	  relations.
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup.h"
+#include "access/htup_details.h"
+#include "access/shmmqam.h"
+#include "access/tupdesc.h"
+#include "fmgr.h"
+#include "libpq/libpq.h"
+#include "libpq/pqformat.h"
+#include "utils/lsyscache.h"
+
+
+static HeapTuple
+form_result_tuple(worker_result resultState, TupleDesc tupdesc,
+				  StringInfo msg);
+
+/*
+ * Indicate that an error came from a particular worker.
+ */
+static void
+worker_error_callback(void *arg)
+{
+	pid_t	pid = * (pid_t *) arg;
+
+	errcontext("worker backend, pid %d", pid);
+}
+
+/*
+ * shm_beginscan -
+ *		Initializes the shared memory scan descriptor to retrieve tuples
+ *		from worker backends. 
+ */
+ShmScanDesc
+shm_beginscan(int num_queues)
+{
+	ShmScanDesc		shmscan;
+
+	shmscan = palloc(sizeof(ShmScanDescData));
+
+	shmscan->num_shm_queues = num_queues;
+	shmscan->ss_cqueue = -1;
+	shmscan->shmscan_inited	= false;
+
+	return shmscan;
+}
+
+/*
+ * ExecInitWorkerResult -
+ *		Initializes the result state to retrieve tuples from worker backends. 
+ */
+worker_result
+ExecInitWorkerResult(TupleDesc tupdesc)
+{
+	worker_result	workerResult;
+	int				i;
+	int	natts = tupdesc->natts;
+
+	workerResult = palloc0(sizeof(worker_result_state));
+	workerResult->receive_functions = palloc(sizeof(FmgrInfo) * natts);
+	workerResult->typioparams = palloc(sizeof(Oid) * natts);
+
+	for (i = 0;	i < natts; ++i)
+	{
+		Oid	receive_function_id;
+
+		getTypeBinaryInputInfo(tupdesc->attrs[i]->atttypid,
+							   &receive_function_id,
+							   &workerResult->typioparams[i]);
+		fmgr_info(receive_function_id, &workerResult->receive_functions[i]);
+	}
+
+	return workerResult;
+}
+
+
+/*
+ * shm_getnext -
+ *		Get the next tuple from shared memory queue.  This function
+ *	is responsible for fetching tuples from all the queues associated
+ *	with worker backends used in parallel sequential scan.
+ */
+HeapTuple
+shm_getnext(ShmScanDesc shmScan, worker_result resultState,
+			shm_mq_handle **responseq, TupleDesc tupdesc)
+{
+	shm_mq_result	res;
+	char			msgtype;
+	Size			nbytes;
+	void			*data;
+	StringInfoData	msg;
+	int32			pid = 1234;
+	int				queueId = 0;
+
+	/*
+	 * calculate next starting queue used for fetching tuples
+	 */
+	if(!shmScan->shmscan_inited)
+	{
+		shmScan->shmscan_inited = true;
+		Assert(shmScan->num_shm_queues > 0);
+		queueId = 0;
+		--shmScan->num_shm_queues;
+	}
+	else
+		queueId = shmScan->ss_cqueue;
+
+	/* Initialize message buffer. */
+	initStringInfo(&msg);
+
+	/* Read and processes messages from the shared memory queues. */
+	for(;;)
+	{
+		for (;;)
+		{
+			/*
+			 * mark current queue used for fetching tuples, this is used
+			 * to fetch consecutive tuples from queue used in previous
+			 * fetch.
+			 */
+			shmScan->ss_cqueue = queueId;
+
+			/* Get next message. */
+			res = shm_mq_receive(responseq[queueId], &nbytes, &data, false);
+			if (res != SHM_MQ_SUCCESS)
+				break;
+
+			/*
+			 * Message-parsing routines operate on a null-terminated StringInfo,
+			 * so we must construct one.
+			 */
+			resetStringInfo(&msg);
+			enlargeStringInfo(&msg, nbytes);
+			msg.len = nbytes;
+			memcpy(msg.data, data, nbytes);
+			msg.data[nbytes] = '\0';
+			msgtype = pq_getmsgbyte(&msg);
+
+			/* Dispatch on message type. */
+			switch (msgtype)
+			{
+				case 'E':
+				case 'N':
+					{
+						ErrorData	edata;
+						ErrorContextCallback context;
+
+						/* Parse ErrorResponse or NoticeResponse. */
+						pq_parse_errornotice(&msg, &edata);
+
+						/*
+						 * Limit the maximum error level to ERROR.  We don't want
+						 * a FATAL inside the backend worker to kill the user
+						 * session.
+						 */
+						if (edata.elevel > ERROR)
+							edata.elevel = ERROR;
+
+						/*
+						 * Rethrow the error with an appropriate context method.
+						 * On error, we need to ensure that master backend stop
+						 * all other workers before propagating the error, so
+						 * we need to pass the pid's of all workers, so that same
+						 * can be done in error callback.
+						 * XXX - For now, I am just sending some random number, this
+						 * needs to be fixed.
+						 */
+						context.callback = worker_error_callback;
+						context.arg = (void *) &pid;
+						context.previous = error_context_stack;
+						error_context_stack = &context;
+						ThrowErrorData(&edata);
+						error_context_stack = context.previous;
+
+						break;
+					}
+				case 'A':
+					{
+						/* Propagate NotifyResponse. */
+						pq_putmessage(msg.data[0], &msg.data[1], nbytes - 1);
+						break;
+					}
+				case 'T':
+					{
+						int16	natts = pq_getmsgint(&msg, 2);
+						int16	i;
+
+						if (resultState->has_row_description)
+							elog(ERROR, "multiple RowDescription messages");
+						resultState->has_row_description = true;
+						if (natts != tupdesc->natts)
+							ereport(ERROR,
+									(errcode(ERRCODE_DATATYPE_MISMATCH),
+										errmsg("worker result rowtype does not match "
+										"the specified FROM clause rowtype")));
+
+						for (i = 0; i < natts; ++i)
+						{
+							Oid		type_id;
+
+							(void) pq_getmsgstring(&msg);	/* name */
+							(void) pq_getmsgint(&msg, 4);	/* table OID */
+							(void) pq_getmsgint(&msg, 2);	/* table attnum */
+							type_id = pq_getmsgint(&msg, 4);	/* type OID */
+							(void) pq_getmsgint(&msg, 2);	/* type length */
+							(void) pq_getmsgint(&msg, 4);	/* typmod */
+							(void) pq_getmsgint(&msg, 2);	/* format code */
+
+							if (type_id != tupdesc->attrs[i]->atttypid)
+								ereport(ERROR,
+										(errcode(ERRCODE_DATATYPE_MISMATCH),
+											errmsg("remote query result rowtype does not match "
+											"the specified FROM clause rowtype")));
+						}
+
+						pq_getmsgend(&msg);
+
+						break;
+					}
+				case 'D':
+					{
+						/* Handle DataRow message. */
+						HeapTuple	result;
+
+						result = form_result_tuple(resultState, tupdesc, &msg);
+						return result;
+					}
+				case 'C':
+					{
+						/*
+						 * Handle CommandComplete message.  Ignore the tag sent by
+						 * the worker backend; the master backend's own tag is the
+						 * one that gets sent to the client.
+						 */
+						(void) pq_getmsgstring(&msg);
+						break;
+					}
+				case 'G':
+				case 'H':
+				case 'W':
+					{
+						ereport(ERROR,
+								(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+									errmsg("COPY protocol not allowed in worker")));
+					}
+
+				case 'Z':
+					{
+						/* Handle ReadyForQuery message. */
+						resultState->complete = true;
+						break;
+					}
+				default:
+					elog(WARNING, "unknown message type: %c (%zu bytes)",
+							msg.data[0], nbytes);
+					break;
+			}
+		}
+
+		/* Check whether the connection was broken prematurely. */
+		if (!resultState->complete)
+			ereport(ERROR,
+					(errcode(ERRCODE_CONNECTION_FAILURE),
+					 errmsg("lost connection to worker process with PID %d",
+					 pid)));
+
+		/*
+		 * If we have exhausted the data from all worker queues, stop
+		 * processing; otherwise move on to the next queue.
+		 */
+		if (shmScan->num_shm_queues <= 0)
+			break;
+		else
+		{
+			++queueId;
+			--shmScan->num_shm_queues;
+			resultState->has_row_description = false;
+		}
+	}
+
+	return NULL;
+}
+
+/*
+ * form_result_tuple -
+ * Parse a DataRow message and form a result tuple.
+ */
+static HeapTuple
+form_result_tuple(worker_result resultState, TupleDesc tupdesc,
+				  StringInfo msg)
+{
+	/* Handle DataRow message. */
+	int16	natts = pq_getmsgint(msg, 2);
+	int16	i;
+	Datum  *values = NULL;
+	bool   *isnull = NULL;
+	StringInfoData	buf;
+
+	if (!resultState->has_row_description)
+		elog(ERROR, "DataRow not preceded by RowDescription");
+	if (natts != tupdesc->natts)
+		elog(ERROR, "malformed DataRow");
+	if (natts > 0)
+	{
+		values = palloc(natts * sizeof(Datum));
+		isnull = palloc(natts * sizeof(bool));
+	}
+	initStringInfo(&buf);
+
+	for (i = 0; i < natts; ++i)
+	{
+		int32	bytes = pq_getmsgint(msg, 4);
+
+		if (bytes < 0)
+		{
+			values[i] = ReceiveFunctionCall(&resultState->receive_functions[i],
+											NULL,
+											resultState->typioparams[i],
+											tupdesc->attrs[i]->atttypmod);
+			isnull[i] = true;
+		}
+		else
+		{
+			resetStringInfo(&buf);
+			appendBinaryStringInfo(&buf, pq_getmsgbytes(msg, bytes), bytes);
+			values[i] = ReceiveFunctionCall(&resultState->receive_functions[i],
+											&buf,
+											resultState->typioparams[i],
+											tupdesc->attrs[i]->atttypmod);
+			isnull[i] = false;
+		}
+	}
+
+	pq_getmsgend(msg);
+
+	return heap_form_tuple(tupdesc, values, isnull);
+}
\ No newline at end of file
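A note on the fetch order implemented by shm_getnext() above: the master reads one worker queue until that worker is done and only then moves to the next queue. The following is a minimal, self-contained sketch of that order, not the patch's code. SketchQueue, SketchScan and sketch_getnext are hypothetical names, and plain int arrays stand in for the shm_mq queues and the tuples.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one worker's response queue. */
typedef struct SketchQueue
{
    const int  *items;      /* "tuples" the worker produced */
    size_t      nitems;     /* how many there are */
    size_t      next;       /* next item to hand out */
} SketchQueue;

/* Hypothetical stand-in for the scan state kept between calls. */
typedef struct SketchScan
{
    SketchQueue *queues;    /* one queue per worker */
    size_t       nqueues;
    size_t       current;   /* queue used by the previous fetch */
} SketchScan;

/*
 * Return 1 and store the next item in *out, or return 0 once every
 * queue has been drained.  Queues are consumed strictly in order:
 * queue 0 until it is empty, then queue 1, and so on.
 */
static int
sketch_getnext(SketchScan *scan, int *out)
{
    assert(scan != NULL && out != NULL);

    while (scan->current < scan->nqueues)
    {
        SketchQueue *q = &scan->queues[scan->current];

        if (q->next < q->nitems)
        {
            *out = q->items[q->next++];
            return 1;
        }

        /* Current queue is exhausted; resume from the next one. */
        scan->current++;
    }

    return 0;
}
```

One consequence of this order, assuming the queues are bounded, is that a worker whose queue is not yet being read may have to wait once its queue fills up.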
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 71374cc..f46a1a3 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -39,43 +39,6 @@
  */
 #define	PARALLEL_ERROR_QUEUE_SIZE			16384
 
-/* Magic number for parallel context TOC. */
-#define PARALLEL_MAGIC						0x50477c7c
-
-/*
- * Magic numbers for parallel state sharing.  Higher-level code should use
- * smaller values, leaving these very large ones for use by this module.
- */
-#define PARALLEL_KEY_FIXED					UINT64CONST(0xFFFFFFFFFFFF0001)
-#define PARALLEL_KEY_ERROR_QUEUE			UINT64CONST(0xFFFFFFFFFFFF0002)
-#define PARALLEL_KEY_GUC					UINT64CONST(0xFFFFFFFFFFFF0003)
-#define PARALLEL_KEY_COMBO_CID				UINT64CONST(0xFFFFFFFFFFFF0004)
-#define PARALLEL_KEY_TRANSACTION_SNAPSHOT	UINT64CONST(0xFFFFFFFFFFFF0005)
-#define PARALLEL_KEY_ACTIVE_SNAPSHOT		UINT64CONST(0xFFFFFFFFFFFF0006)
-#define PARALLEL_KEY_TRANSACTION_STATE		UINT64CONST(0xFFFFFFFFFFFF0007)
-#define PARALLEL_KEY_EXTENSION_TRAMPOLINE	UINT64CONST(0xFFFFFFFFFFFF0008)
-
-/* Fixed-size parallel state. */
-typedef struct FixedParallelState
-{
-	/* Fixed-size state that workers must restore. */
-	Oid			database_id;
-	Oid			authenticated_user_id;
-	Oid			current_user_id;
-	int			sec_context;
-	PGPROC	   *parallel_master_pgproc;
-	pid_t		parallel_master_pid;
-	BackendId	parallel_master_backend_id;
-
-	/* Entrypoint for parallel workers. */
-	parallel_worker_main_type	entrypoint;
-
-	/* Track whether workers have attached. */
-	slock_t		mutex;
-	int			workers_expected;
-	int			workers_attached;
-} FixedParallelState;
-
 /*
  * Our parallel worker number.  We initialize this to -1, meaning that we are
  * not a parallel worker.  In parallel workers, it will be set to a value >= 0
@@ -713,7 +676,7 @@ ParallelMain(Datum main_arg)
 	 * Now that we have a resource owner, we can attach to the dynamic
 	 * shared memory segment and read the table of contents.
 	 */
-	seg = dsm_attach(DatumGetInt32(main_arg));
+	seg = dsm_attach(DatumGetUInt32(main_arg));
 	if (seg == NULL)
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
@@ -748,9 +711,12 @@ ParallelMain(Datum main_arg)
 		ParallelWorkerNumber * PARALLEL_ERROR_QUEUE_SIZE);
 	shm_mq_set_sender(mq, MyProc);
 	mqh = shm_mq_attach(mq, seg, NULL);
-	pq_redirect_to_shm_mq(mq, mqh);
+	pq_save_shm_mq_info(mq, mqh);
+	pq_save_parallel_master_info(fps->parallel_master_pid,
+								 fps->parallel_master_backend_id);
+	/*pq_redirect_to_shm_mq(mq, mqh);
 	pq_set_parallel_master(fps->parallel_master_pid,
-						   fps->parallel_master_backend_id);
+						   fps->parallel_master_backend_id);*/
 
 	/* Install an error-context callback. */
 	errctx.callback = ParallelErrorContext;
@@ -823,7 +789,7 @@ ParallelMain(Datum main_arg)
 	EndParallelWorkerTransaction();
 
 	/* Report success. */
-	ReadyForQuery(DestRemote);
+	/*ReadyForQuery(DestRemote);*/
 }
 
 /*
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 8a0be5d..560b0d7 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -713,6 +713,7 @@ ExplainPreScanNode(PlanState *planstate, Bitmapset **rels_used)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_ParallelSeqScan:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -909,6 +910,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_SeqScan:
 			pname = sname = "Seq Scan";
 			break;
+		case T_ParallelSeqScan:
+			pname = sname = "Parallel Seq Scan";
+			break;
 		case T_IndexScan:
 			pname = sname = "Index Scan";
 			break;
@@ -1058,6 +1062,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_ParallelSeqScan:
 		case T_BitmapHeapScan:
 		case T_TidScan:
 		case T_SubqueryScan:
@@ -1324,6 +1329,16 @@ ExplainNode(PlanState *planstate, List *ancestors,
 				show_instrumentation_count("Rows Removed by Filter", 1,
 										   planstate, es);
 			break;
+		case T_ParallelSeqScan:
+			show_scan_qual(plan->qual, "Filter", planstate, ancestors, es);
+			if (plan->qual)
+				show_instrumentation_count("Rows Removed by Filter", 1,
+										   planstate, es);
+			ExplainPropertyInteger("Number of Workers",
+				((ParallelSeqScan *) plan)->num_workers, es);
+			ExplainPropertyInteger("Number of Blocks Per Worker",
+				((ParallelSeqScan *) plan)->num_blocks_per_worker, es);
+			break;
 		case T_FunctionScan:
 			if (es->verbose)
 			{
@@ -2141,6 +2156,7 @@ ExplainTargetRel(Plan *plan, Index rti, ExplainState *es)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_ParallelSeqScan:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index af707b0..9a8ca75 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -21,7 +21,7 @@ OBJS = execAmi.o execCurrent.o execGrouping.o execJunk.o execMain.o \
        nodeLimit.o nodeLockRows.o \
        nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
        nodeNestloop.o nodeFunctionscan.o nodeRecursiveunion.o nodeResult.o \
-       nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
+       nodeSeqscan.o nodeParallelSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
        nodeValuesscan.o nodeCtescan.o nodeWorktablescan.o \
        nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
        nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o spi.o
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 9892499..f77a77f 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -100,6 +100,7 @@
 #include "executor/nodeMergejoin.h"
 #include "executor/nodeModifyTable.h"
 #include "executor/nodeNestloop.h"
+#include "executor/nodeParallelSeqscan.h"
 #include "executor/nodeRecursiveunion.h"
 #include "executor/nodeResult.h"
 #include "executor/nodeSeqscan.h"
@@ -190,6 +191,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												   estate, eflags);
 			break;
 
+		case T_ParallelSeqScan:
+			result = (PlanState *) ExecInitParallelSeqScan((ParallelSeqScan *) node,
+														   estate, eflags);
+			break;
+
 		case T_IndexScan:
 			result = (PlanState *) ExecInitIndexScan((IndexScan *) node,
 													 estate, eflags);
@@ -406,6 +412,10 @@ ExecProcNode(PlanState *node)
 			result = ExecSeqScan((SeqScanState *) node);
 			break;
 
+		case T_ParallelSeqScanState:
+			result = ExecParallelSeqScan((ParallelSeqScanState *) node);
+			break;
+
 		case T_IndexScanState:
 			result = ExecIndexScan((IndexScanState *) node);
 			break;
@@ -644,6 +654,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSeqScan((SeqScanState *) node);
 			break;
 
+		case T_ParallelSeqScanState:
+			ExecEndParallelSeqScan((ParallelSeqScanState *) node);
+			break;
+
 		case T_IndexScanState:
 			ExecEndIndexScan((IndexScanState *) node);
 			break;
diff --git a/src/backend/executor/execScan.c b/src/backend/executor/execScan.c
index 3f0d809..67eda93 100644
--- a/src/backend/executor/execScan.c
+++ b/src/backend/executor/execScan.c
@@ -118,7 +118,7 @@ ExecScan(ScanState *node,
 	/*
 	 * Fetch data from node
 	 */
-	qual = node->ps.qual;
+	qual = node->ps.qualPushed ? NIL : node->ps.qual;
 	projInfo = node->ps.ps_ProjInfo;
 	econtext = node->ps.ps_ExprContext;
 
diff --git a/src/backend/executor/nodeParallelSeqscan.c b/src/backend/executor/nodeParallelSeqscan.c
new file mode 100644
index 0000000..30570c9
--- /dev/null
+++ b/src/backend/executor/nodeParallelSeqscan.c
@@ -0,0 +1,291 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeParallelSeqscan.c
+ *	  Support routines for parallel sequential scans of relations.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeParallelSeqscan.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * INTERFACE ROUTINES
+ *		ExecParallelSeqScan				scans a relation using worker backends.
+ *		ParallelSeqNext					retrieve next tuple from the worker queues.
+ *		ExecInitParallelSeqScan			creates and initializes a parallel seqscan node.
+ *		ExecEndParallelSeqScan			releases any storage allocated.
+ */
+#include "postgres.h"
+
+#include "access/relscan.h"
+#include "access/shmmqam.h"
+#include "access/xact.h"
+#include "commands/dbcommands.h"
+#include "executor/execdebug.h"
+#include "executor/nodeSeqscan.h"
+#include "executor/nodeParallelSeqscan.h"
+#include "postmaster/backendworker.h"
+#include "utils/rel.h"
+
+
+
+/* ----------------------------------------------------------------
+ *						Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ParallelSeqNext
+ *
+ *		This is a workhorse for ExecParallelSeqScan
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ParallelSeqNext(ParallelSeqScanState *node)
+{
+	HeapTuple	tuple;
+	HeapScanDesc scandesc;
+	EState	   *estate;
+	ScanDirection direction;
+	TupleTableSlot *slot;
+
+	/*
+	 * get information from the estate and scan state
+	 */
+	scandesc = node->ss.ss_currentScanDesc;
+	estate = node->ss.ps.state;
+	direction = estate->es_direction;
+	slot = node->ss.ss_ScanTupleSlot;
+
+	/*
+	 * get the next tuple from the worker queues using the result descriptor.
+	 */
+	tuple = shm_getnext(node->pss_currentShmScanDesc, node->pss_workerResult,
+						node->responseq,
+						node->ss.ps.ps_ResultTupleSlot->tts_tupleDescriptor);
+
+	/*
+	 * save the tuple in our scan tuple slot and return the slot.  Note: the
+	 * buffer and 'false' arguments below are carried over from SeqNext,
+	 * where tuples returned by heap_getnext() are pointers onto disk pages
+	 * and must not be pfree()'d.  Here the tuple is built by
+	 * form_result_tuple() using heap_form_tuple(), so
+	 * XXX - these arguments probably need to be revisited.
+	 */
+	if (tuple)
+		ExecStoreTuple(tuple,	/* tuple to store */
+					   slot,	/* slot to store in */
+					   scandesc->rs_cbuf,		/* buffer associated with this
+												 * tuple */
+					   false);	/* don't pfree this pointer */
+	else
+		ExecClearTuple(slot);
+
+	return slot;
+}
+
+/*
+ * ParallelSeqRecheck -- access method routine to recheck a tuple in EvalPlanQual
+ */
+static bool
+ParallelSeqRecheck(SeqScanState *node, TupleTableSlot *slot)
+{
+	/*
+	 * Note that unlike IndexScan, ParallelSeqScan never uses keys in
+	 * heap_beginscan (and this is very bad), so here we do not check
+	 * whether the keys are OK.
+	 */
+	return true;
+}
+
+/* ----------------------------------------------------------------
+ *		InitParallelScanRelation
+ *
+ *		Set up to access the scan relation.
+ * ----------------------------------------------------------------
+ */
+static void
+InitParallelScanRelation(SeqScanState *node, EState *estate, int eflags)
+{
+	Relation	currentRelation;
+	HeapScanDesc currentScanDesc;
+
+	/*
+	 * get the relation object id from the relid'th entry in the range table,
+	 * open that relation and acquire appropriate lock on it.
+	 */
+	currentRelation = ExecOpenScanRelation(estate,
+									  ((SeqScan *) node->ps.plan)->scanrelid,
+										   eflags);
+
+	/* initialize a heapscan */
+	currentScanDesc = heap_beginscan(currentRelation,
+									 estate->es_snapshot,
+									 0,
+									 NULL);
+
+	node->ss_currentRelation = currentRelation;
+	node->ss_currentScanDesc = currentScanDesc;
+
+	/* and report the scan tuple slot's rowtype */
+	ExecAssignScanType(node, RelationGetDescr(currentRelation));
+}
+
+
+/* ----------------------------------------------------------------
+ *		ExecInitParallelSeqScan
+ * ----------------------------------------------------------------
+ */
+ParallelSeqScanState *
+ExecInitParallelSeqScan(ParallelSeqScan *node, EState *estate, int eflags)
+{
+	ParallelSeqScanState *parallelscanstate;
+	ShmScanDesc			 currentShmScanDesc;
+	worker_result		 workerResult;
+
+	/*
+	 * Once upon a time it was possible to have an outerPlan of a SeqScan, but
+	 * not any more.
+	 */
+	Assert(outerPlan(node) == NULL);
+	Assert(innerPlan(node) == NULL);
+
+	/*
+	 * create state structure
+	 */
+	parallelscanstate = makeNode(ParallelSeqScanState);
+	parallelscanstate->ss.ps.plan = (Plan *) node;
+	parallelscanstate->ss.ps.state = estate;
+
+	/*
+	 * for parallel seq scan, qual is always pushed to be
+	 * evaluated by backend worker.
+	 */
+	parallelscanstate->ss.ps.qualPushed = true;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * create expression context for node
+	 */
+	ExecAssignExprContext(estate, &parallelscanstate->ss.ps);
+
+	/*
+	 * initialize child expressions
+	 */
+	parallelscanstate->ss.ps.targetlist = (List *)
+		ExecInitExpr((Expr *) node->scan.plan.targetlist,
+					 (PlanState *) parallelscanstate);
+	parallelscanstate->ss.ps.qual = (List *)
+		ExecInitExpr((Expr *) node->scan.plan.qual,
+					 (PlanState *) parallelscanstate);
+
+	/*
+	 * tuple table initialization
+	 */
+	ExecInitResultTupleSlot(estate, &parallelscanstate->ss.ps);
+	ExecInitScanTupleSlot(estate, &parallelscanstate->ss);
+
+	/*
+	 * initialize scan relation
+	 */
+	InitParallelScanRelation(&parallelscanstate->ss, estate, eflags);
+
+	parallelscanstate->ss.ps.ps_TupFromTlist = false;
+
+	/*
+	 * Initialize result tuple type and projection info.
+	 */
+	ExecAssignResultTypeFromTL(&parallelscanstate->ss.ps);
+	ExecAssignScanProjectionInfo(&parallelscanstate->ss);
+
+	/* Initialize the workers required to perform parallel scan. */
+	InitiateWorkers(parallelscanstate->ss.ss_currentRelation->rd_id,
+					node->scan.plan.targetlist,
+					node->scan.plan.qual,
+					&parallelscanstate->responseq,
+					&parallelscanstate->pcxt,
+					node->num_blocks_per_worker,
+					node->num_workers);
+
+	
+	/*
+	 * Use the result tuple descriptor to fetch data from the shared memory
+	 * queues, since the worker backends put data there after projection.
+	 * The number of queues must equal the number of worker backends.
+	 */
+	currentShmScanDesc = shm_beginscan(node->num_workers);
+	workerResult = ExecInitWorkerResult(parallelscanstate->ss.ps.ps_ResultTupleSlot->tts_tupleDescriptor);
+
+	parallelscanstate->pss_currentShmScanDesc = currentShmScanDesc;
+	parallelscanstate->pss_workerResult	= workerResult;
+
+	return parallelscanstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecParallelSeqScan(node)
+ *
+ *		Scans the relation sequentially from multiple workers and returns
+ *		the next qualifying tuple.
+ *		We call the ExecScan() routine and pass it the appropriate
+ *		access method functions.
+ * ----------------------------------------------------------------
+ */
+TupleTableSlot *
+ExecParallelSeqScan(ParallelSeqScanState *node)
+{
+	return ExecScan((ScanState *) &node->ss,
+					(ExecScanAccessMtd) ParallelSeqNext,
+					(ExecScanRecheckMtd) ParallelSeqRecheck);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndParallelSeqScan
+ *
+ *		frees any storage allocated through C routines.
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndParallelSeqScan(ParallelSeqScanState *node)
+{
+	Relation	relation;
+	HeapScanDesc scanDesc;
+
+	/*
+	 * get information from node
+	 */
+	relation = node->ss.ss_currentRelation;
+	scanDesc = node->ss.ss_currentScanDesc;
+
+	/*
+	 * Free the exprcontext
+	 */
+	ExecFreeExprContext(&node->ss.ps);
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+
+	/*
+	 * close heap scan
+	 */
+	heap_endscan(scanDesc);
+
+	/*
+	 * close the heap relation.
+	 */
+	ExecCloseScanRelation(relation);
+
+	/* destroy parallel context. */
+	DestroyParallelContext(node->pcxt);
+
+	ExitParallelMode();
+}
+
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 3cb81fc..5780df0 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -139,6 +139,22 @@ InitScanRelation(SeqScanState *node, EState *estate, int eflags)
 									 0,
 									 NULL);
 
+	/*
+	 * set the scan limits, if requested by plan.  If the end block
+	 * is not specified, then scan all the blocks till end.
+	 */
+	if (((SeqScan *) node->ps.plan)->startblock != InvalidBlockNumber &&
+		((SeqScan *) node->ps.plan)->endblock != InvalidBlockNumber)
+		heap_setscanlimits(currentScanDesc,
+						   ((SeqScan *) node->ps.plan)->startblock,
+						   (((SeqScan *) node->ps.plan)->endblock -
+						   ((SeqScan *) node->ps.plan)->startblock));
+	else if (((SeqScan *) node->ps.plan)->startblock != InvalidBlockNumber)
+			 heap_setscanlimits(currentScanDesc,
+								((SeqScan *) node->ps.plan)->startblock,
+								(currentScanDesc->rs_nblocks -
+								((SeqScan *) node->ps.plan)->startblock));
+
 	node->ss_currentRelation = currentRelation;
 	node->ss_currentScanDesc = currentScanDesc;
 
diff --git a/src/backend/libpq/pqmq.c b/src/backend/libpq/pqmq.c
index f12f2d5..6998e00 100644
--- a/src/backend/libpq/pqmq.c
+++ b/src/backend/libpq/pqmq.c
@@ -22,9 +22,13 @@
 
 static shm_mq *pq_mq;
 static shm_mq_handle *pq_mq_handle;
+static shm_mq *err_pq_mq = NULL;
+static shm_mq_handle *err_pq_mq_handle = NULL;
 static bool pq_mq_busy = false;
 static pid_t pq_mq_parallel_master_pid = 0;
 static pid_t pq_mq_parallel_master_backend_id = InvalidBackendId;
+static pid_t save_pq_mq_parallel_master_pid = 0;
+static pid_t save_pq_mq_parallel_master_backend_id = InvalidBackendId;
 
 static void mq_comm_reset(void);
 static int	mq_flush(void);
@@ -60,6 +64,30 @@ pq_redirect_to_shm_mq(shm_mq *mq, shm_mq_handle *mqh)
 	FrontendProtocol = PG_PROTOCOL_LATEST;
 }
 
+void
+pq_save_shm_mq_info(shm_mq *mq, shm_mq_handle *mqh)
+{
+	err_pq_mq = mq;
+	err_pq_mq_handle = mqh;
+}
+
+void
+pq_redirect_to_err_shm_mq(void)
+{
+	Assert(err_pq_mq != NULL);
+	PqCommMethods = &PqCommMqMethods;
+	pq_mq = err_pq_mq;
+	pq_mq_handle = err_pq_mq_handle;
+	whereToSendOutput = DestRemote;
+	FrontendProtocol = PG_PROTOCOL_LATEST;
+}
+
+bool
+is_err_shm_mq_enabled(void)
+{
+	return err_pq_mq ? true : false;
+}
+
 /*
  * Arrange to SendProcSignal() to the parallel master each time we transmit
  * message data via the shm_mq.
@@ -72,6 +100,21 @@ pq_set_parallel_master(pid_t pid, BackendId backend_id)
 	pq_mq_parallel_master_backend_id = backend_id;
 }
 
+void
+pq_save_parallel_master_info(pid_t pid, BackendId backend_id)
+{
+	save_pq_mq_parallel_master_pid = pid;
+	save_pq_mq_parallel_master_backend_id = backend_id;
+}
+
+void
+pq_set_parallel_master_from_info(void)
+{
+	Assert(PqCommMethods == &PqCommMqMethods);
+	pq_mq_parallel_master_pid = save_pq_mq_parallel_master_pid;
+	pq_mq_parallel_master_backend_id = save_pq_mq_parallel_master_backend_id;
+}
+
 static void
 mq_comm_reset(void)
 {
diff --git a/src/backend/optimizer/path/Makefile b/src/backend/optimizer/path/Makefile
index 6864a62..6e462b1 100644
--- a/src/backend/optimizer/path/Makefile
+++ b/src/backend/optimizer/path/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = allpaths.o clausesel.o costsize.o equivclass.o indxpath.o \
-       joinpath.o joinrels.o pathkeys.o tidpath.o
+       joinpath.o joinrels.o pathkeys.o parallelpath.o tidpath.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 58d78e6..528727c 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -410,6 +410,9 @@ set_plain_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
 	/* Consider sequential scan */
 	add_path(rel, create_seqscan_path(root, rel, required_outer));
 
+	/* Consider parallel scans */
+	create_parallelscan_paths(root, rel);
+
 	/* Consider index scans */
 	create_index_paths(root, rel);
 
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 020558b..dedce1f 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -106,6 +106,8 @@ int			effective_cache_size = DEFAULT_EFFECTIVE_CACHE_SIZE;
 
 Cost		disable_cost = 1.0e10;
 
+int	parallel_seqscan_degree = 0;
+
 bool		enable_seqscan = true;
 bool		enable_indexscan = true;
 bool		enable_indexonlyscan = true;
@@ -219,6 +221,63 @@ cost_seqscan(Path *path, PlannerInfo *root,
 }
 
 /*
+ * cost_parallelseqscan
+ *	  Determines and returns the cost of scanning a relation in parallel.
+ *
+ * 'baserel' is the relation to be scanned
+ * 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
+ */
+void
+cost_parallelseqscan(ParallelSeqPath *path, PlannerInfo *root,
+			 RelOptInfo *baserel, ParamPathInfo *param_info, int nWorkers)
+{
+	Cost		startup_cost = 0;
+	Cost		run_cost = 0;
+	double		spc_seq_page_cost;
+	QualCost	qpqual_cost;
+	Cost		cpu_per_tuple;
+
+	/* Should only be applied to base relations */
+	Assert(baserel->relid > 0);
+	Assert(baserel->rtekind == RTE_RELATION);
+
+	/* Mark the path with the correct row estimate */
+	if (param_info)
+		path->path.rows = param_info->ppi_rows;
+	else
+		path->path.rows = baserel->rows;
+
+	if (!enable_seqscan)
+		startup_cost += disable_cost;
+
+	/* fetch estimated page cost for tablespace containing table */
+	get_tablespace_page_costs(baserel->reltablespace,
+							  NULL,
+							  &spc_seq_page_cost);
+
+	/*
+	 * disk costs
+	 */
+	run_cost += spc_seq_page_cost * baserel->pages;
+
+	/* CPU costs */
+	get_restriction_qual_cost(root, baserel, param_info, &qpqual_cost);
+
+	startup_cost += qpqual_cost.startup;
+	cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple;
+	run_cost += cpu_per_tuple * baserel->tuples;
+
+	/*
+	 * We simply assume that the cost is shared equally by the parallel
+	 * workers, which might not hold, especially for disk access.
+	 * XXX - We would like to adjust this model based on some concrete
+	 * tests.
+	 */
+	path->path.startup_cost = startup_cost / nWorkers;
+	path->path.total_cost = (startup_cost + run_cost) / nWorkers;
+}
+
+/*
  * cost_index
  *	  Determines and returns the cost of scanning a relation using an index.
  *
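The cost model in cost_parallelseqscan() above can be restated in isolation: compute the serial startup and run cost, then divide both by the worker count. The sketch below is only an illustration of that arithmetic. SketchCost and sketch_parallel_cost are hypothetical names, and the clamp of n_workers to at least 1 is an addition of this sketch (the patch divides by nWorkers directly, which would be a division by zero if the caller ever passed 0).

```c
#include <assert.h>

/* Stand-in for the planner's Cost type, which is a double. */
typedef double Cost;

/* Hypothetical result pair mirroring path.startup_cost / path.total_cost. */
typedef struct SketchCost
{
    Cost        startup_cost;
    Cost        total_cost;
} SketchCost;

/*
 * Divide a serial scan estimate evenly among n_workers.
 *
 * Assumption of this sketch, not of the patch: a worker count below 1
 * is treated as 1, so the result degrades to the serial estimate
 * instead of dividing by zero.
 */
static SketchCost
sketch_parallel_cost(Cost startup_cost, Cost run_cost, int n_workers)
{
    SketchCost  result;

    if (n_workers < 1)
        n_workers = 1;

    result.startup_cost = startup_cost / n_workers;
    result.total_cost = (startup_cost + run_cost) / n_workers;

    return result;
}
```

For example, a serial estimate of startup 10 and run cost 990 split over 4 workers gives a startup cost of 2.5 and a total cost of 250. Worker startup and dynamic shared memory setup are not modelled here, matching the simple model described at the top of the thread.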
diff --git a/src/backend/optimizer/path/parallelpath.c b/src/backend/optimizer/path/parallelpath.c
new file mode 100644
index 0000000..5245652
--- /dev/null
+++ b/src/backend/optimizer/path/parallelpath.c
@@ -0,0 +1,126 @@
+/*-------------------------------------------------------------------------
+ *
+ * parallelpath.c
+ *	  Routines to determine which conditions are usable for scanning
+ *	  a given relation, and create ParallelPaths accordingly.
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/optimizer/path/parallelpath.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "optimizer/cost.h"
+#include "optimizer/pathnode.h"
+#include "optimizer/paths.h"
+#include "optimizer/restrictinfo.h"
+#include "optimizer/clauses.h"
+
+
+/*
+ *	IsTargetListContainNonVars -
+ *		Check whether the target list contains non-Var entries.
+ */
+static bool
+IsTargetListContainNonVars(List *targetlist)
+{
+	ListCell   *l;
+
+	foreach(l, targetlist)
+	{
+		TargetEntry *te = (TargetEntry *) lfirst(l);
+
+		if (!IsA(te, TargetEntry))
+			continue;			/* probably should never happen */
+		if (!IsA(te->expr, Var))
+			return true;
+	}
+	return false;
+}
+
+/*
+ *	check_simple_qual -
+ *		Check if qual is made only of simple things we can
+ *		hand out directly to backend worker for execution.
+ *
+ *		XXX - Currently we don't push down an expression if it
+ *		contains a volatile function; eventually we need a
+ *		mechanism (proisparallel) with which we can distinguish
+ *		the functions that can be pushed down for execution by a
+ *		parallel worker.
+ */
+static bool
+check_simple_qual(Node *node)
+{
+	if (node == NULL)
+		return TRUE;
+
+	if (contain_volatile_functions(node))
+		return FALSE;
+
+	return TRUE;
+}
+
+/*
+ * create_parallelscan_paths
+ *	  Create paths corresponding to parallel scans of the given rel.
+ *	  Currently we only support parallel sequential scan.
+ *
+ *	  Candidate paths are added to the rel's pathlist (using add_path).
+ */
+void
+create_parallelscan_paths(PlannerInfo *root, RelOptInfo *rel)
+{
+	int num_parallel_workers = 0;
+
+	/*
+	 * parallel scan is possible only if user has set
+	 * parallel_seqscan_degree to value greater than 0.
+	 */
+	if (parallel_seqscan_degree <= 0)
+		return;
+
+	/*
+	 * parallel scan is not supported for joins.
+	 */
+	if (root->simple_rel_array_size > 2)
+		return;
+
+	/* parallel scan is supported only for SELECT statements. */
+	if (root->parse->commandType != CMD_SELECT)
+		return;
+
+	/*
+	 * parallel scan is not supported for non-var target list.
+	 *
+	 * XXX - This keeps the implementation simple; we can lift it in
+	 * future.  We check root->parse->targetList instead of
+	 * rel->reltargetlist because rel->reltargetlist always contains
+	 * Vars (refer build_base_rel_tlists).
+	 */
+	if (IsTargetListContainNonVars(root->parse->targetList))
+	   return;
+
+	/*
+	 * parallel scan is not supported for quals with volatile functions
+	 */
+	if (!check_simple_qual((Node*) extract_actual_clauses(rel->baserestrictinfo, false)))
+		return;
+
+	/*
+	 * There should be at least one page to scan for each worker.
+	 */
+	if (parallel_seqscan_degree <= rel->pages)
+		num_parallel_workers = parallel_seqscan_degree;
+	else
+		num_parallel_workers = rel->pages;
+
+	add_path(rel, (Path *) create_parallelseqscan_path(root, rel,
+													   num_parallel_workers));
+}
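The worker count chosen in create_parallelscan_paths() above, combined with the equal division of blocks described at the top of the thread (the last worker takes whatever is left over), can be sketched as follows. This is an illustration under stated assumptions, not the patch's code: sketch_choose_workers, sketch_worker_range and BlockRange are hypothetical names, and BlockNumber is redeclared here as a 32-bit unsigned integer so the block stands alone.

```c
#include <assert.h>
#include <stdint.h>

/* Redeclared for this sketch; a 32-bit unsigned block index. */
typedef uint32_t BlockNumber;

/* Hypothetical: first block and block count assigned to one worker. */
typedef struct BlockRange
{
    BlockNumber start;
    BlockNumber nblocks;
} BlockRange;

/*
 * Use at most one worker per page: min(degree, pages).  A degree of
 * zero or less means parallel scan is disabled and yields 0 workers.
 */
static int
sketch_choose_workers(int degree, BlockNumber pages)
{
    if (degree <= 0)
        return 0;

    return ((BlockNumber) degree <= pages) ? degree : (int) pages;
}

/*
 * Split total_blocks evenly among n_workers.  Every worker gets
 * total_blocks / n_workers blocks, except the last one, which also
 * scans the remainder up to the end of the relation.
 */
static BlockRange
sketch_worker_range(BlockNumber total_blocks, int n_workers, int worker_no)
{
    BlockNumber per_worker;
    BlockRange  range;

    assert(n_workers > 0);
    assert(worker_no >= 0 && worker_no < n_workers);

    per_worker = total_blocks / (BlockNumber) n_workers;
    range.start = per_worker * (BlockNumber) worker_no;

    if (worker_no == n_workers - 1)
        range.nblocks = total_blocks - range.start;
    else
        range.nblocks = per_worker;

    return range;
}
```

With 10 blocks and 3 workers this gives ranges starting at blocks 0, 3 and 6, with 3, 3 and 4 blocks respectively. A start block and block count in this form are what heap_setscanlimits() is called with in the nodeSeqscan.c hunk of this patch.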
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 655be81..1c7f640 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -58,6 +58,9 @@ static Material *create_material_plan(PlannerInfo *root, MaterialPath *best_path
 static Plan *create_unique_plan(PlannerInfo *root, UniquePath *best_path);
 static SeqScan *create_seqscan_plan(PlannerInfo *root, Path *best_path,
 					List *tlist, List *scan_clauses);
+static Scan *create_parallelseqscan_plan(PlannerInfo *root,
+										 ParallelSeqPath *best_path,
+										 List *tlist, List *scan_clauses);
 static Scan *create_indexscan_plan(PlannerInfo *root, IndexPath *best_path,
 					  List *tlist, List *scan_clauses, bool indexonly);
 static BitmapHeapScan *create_bitmap_scan_plan(PlannerInfo *root,
@@ -100,6 +103,9 @@ static List *order_qual_clauses(PlannerInfo *root, List *clauses);
 static void copy_path_costsize(Plan *dest, Path *src);
 static void copy_plan_costsize(Plan *dest, Plan *src);
 static SeqScan *make_seqscan(List *qptlist, List *qpqual, Index scanrelid);
+static ParallelSeqScan *make_parallelseqscan(List *qptlist, List *qpqual,
+											 Index scanrelid, int nworkers,
+											 BlockNumber nblocksperworker);
 static IndexScan *make_indexscan(List *qptlist, List *qpqual, Index scanrelid,
 			   Oid indexid, List *indexqual, List *indexqualorig,
 			   List *indexorderby, List *indexorderbyorig,
@@ -228,6 +234,7 @@ create_plan_recurse(PlannerInfo *root, Path *best_path)
 	switch (best_path->pathtype)
 	{
 		case T_SeqScan:
+		case T_ParallelSeqScan:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -343,6 +350,13 @@ create_scan_plan(PlannerInfo *root, Path *best_path)
 												scan_clauses);
 			break;
 
+		case T_ParallelSeqScan:
+			plan = (Plan *) create_parallelseqscan_plan(root,
+														(ParallelSeqPath *) best_path,
+														tlist,
+														scan_clauses);
+			break;
+
 		case T_IndexScan:
 			plan = (Plan *) create_indexscan_plan(root,
 												  (IndexPath *) best_path,
@@ -1133,6 +1147,71 @@ create_seqscan_plan(PlannerInfo *root, Path *best_path,
 }
 
 /*
+ * create_worker_seqscan_plan
+ *	 Returns a seqscan plan for the base relation scanned by worker
+ *	 with restriction clauses 'scan_clauses' and targetlist 'tlist'.
+ */
+SeqScan *
+create_worker_seqscan_plan(List *targetList, List *scan_clauses,
+						   BlockNumber startBlock, BlockNumber endBlock)
+{
+	SeqScan    *scan_plan;
+
+	/*
+	 * Pass scanrelid as 1; this is okay for now, as a sequential scan worker
+	 * is allowed to operate on just one relation.
+	 * XXX - we should ideally get scanrelid from the master backend.
+	 */
+	scan_plan = make_seqscan(targetList,
+							 scan_clauses,
+							 1);
+
+	scan_plan->startblock = startBlock;
+	scan_plan->endblock = endBlock;
+	return scan_plan;
+}
+
+/*
+ * create_parallelseqscan_plan
+ *	 Returns a parallel seqscan plan for the base relation scanned by 'best_path'
+ *	 with restriction clauses 'scan_clauses' and targetlist 'tlist'.
+ */
+static Scan *
+create_parallelseqscan_plan(PlannerInfo *root, ParallelSeqPath *best_path,
+					List *tlist, List *scan_clauses)
+{
+	Scan    *scan_plan;
+	Index		scan_relid = best_path->path.parent->relid;
+
+	/* it should be a base rel... */
+	Assert(scan_relid > 0);
+	Assert(best_path->path.parent->rtekind == RTE_RELATION);
+
+	/* Sort clauses into best execution order */
+	scan_clauses = order_qual_clauses(root, scan_clauses);
+
+	/* Reduce RestrictInfo list to bare expressions; ignore pseudoconstants */
+	scan_clauses = extract_actual_clauses(scan_clauses, false);
+
+	/* Replace any outer-relation variables with nestloop params */
+	if (best_path->path.param_info)
+	{
+		scan_clauses = (List *)
+			replace_nestloop_params(root, (Node *) scan_clauses);
+	}
+
+	scan_plan = (Scan *) make_parallelseqscan(tlist,
+											  scan_clauses,
+											  scan_relid,
+											  best_path->num_workers,
+											  best_path->num_blocks_per_worker);
+
+	copy_path_costsize(&scan_plan->plan, &best_path->path);
+
+	return scan_plan;
+}
+
+/*
  * create_indexscan_plan
  *	  Returns an indexscan plan for the base relation scanned by 'best_path'
  *	  with restriction clauses 'scan_clauses' and targetlist 'tlist'.
@@ -3314,6 +3393,30 @@ make_seqscan(List *qptlist,
 	plan->lefttree = NULL;
 	plan->righttree = NULL;
 	node->scanrelid = scanrelid;
+	node->startblock = InvalidBlockNumber;
+	node->endblock = InvalidBlockNumber;
+
+	return node;
+}
+
+static ParallelSeqScan *
+make_parallelseqscan(List *qptlist,
+			   List *qpqual,
+			   Index scanrelid,
+			   int nworkers,
+			   BlockNumber nblocksperworker)
+{
+	ParallelSeqScan *node = makeNode(ParallelSeqScan);
+	Plan	   *plan = &node->scan.plan;
+
+	/* cost should be inserted by caller */
+	plan->targetlist = qptlist;
+	plan->qual = qpqual;
+	plan->lefttree = NULL;
+	plan->righttree = NULL;
+	node->scan.scanrelid = scanrelid;
+	node->num_workers = nworkers;
+	node->num_blocks_per_worker = nblocksperworker;
 
 	return node;
 }
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 9cbbcfb..6c8c3f0 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -260,6 +260,59 @@ standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
 	return result;
 }
 
+/*
+ * create_worker_seqscan_plannedstmt
+ *	Returns a planned statement to be used by worker for execution.
+ *	Ideally, the master backend should form the worker's planned statement
+ *	and pass it to the worker; for now, however, the master backend
+ *	just passes the required information and the PlannedStmt is then
+ *	constructed by the worker.
+ */
+PlannedStmt	*
+create_worker_seqscan_plannedstmt(worker_stmt *workerstmt)
+{
+	AclMode		required_access = ACL_SELECT;
+	RangeTblEntry *rte;
+	SeqScan    *scan_plan;
+	PlannedStmt	*result;
+
+	rte = makeNode(RangeTblEntry);
+	rte->rtekind = RTE_RELATION;
+	rte->relid = workerstmt->relId;
+	rte->relkind = 'r';
+	rte->requiredPerms = required_access;
+
+	/* Fill in opfuncid values if missing */
+	fix_opfuncids((Node*) workerstmt->qual);
+
+	scan_plan = create_worker_seqscan_plan(workerstmt->targetList,
+										   workerstmt->qual,
+										   workerstmt->startBlock,
+										   workerstmt->endBlock);
+
+	/* build the PlannedStmt result */
+	result = makeNode(PlannedStmt);
+
+	result->commandType = CMD_SELECT;
+	result->queryId = 0;
+	result->hasReturning = 0;
+	result->hasModifyingCTE = 0;
+	result->canSetTag = 1;
+	result->transientPlan = 0;
+	result->planTree = (Plan*) scan_plan;
+	result->rtable = list_make1(rte);
+	result->resultRelations = NIL;
+	result->utilityStmt = NULL;
+	result->subplans = NIL;
+	result->rewindPlanIDs = NULL;
+	result->rowMarks = NIL;
+	result->relationOids = lappend_oid(result->relationOids, rte->relid);
+	result->invalItems = NIL;
+	result->nParamExec = 0;
+	result->hasRowSecurity = false;
+
+	return result;
+}
 
 /*--------------------
  * subquery_planner
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 7703946..3a44aef 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -436,6 +436,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_ParallelSeqScan:
 			{
 				SeqScan    *splan = (SeqScan *) plan;
 
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 1395a21..2ca1707 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -706,6 +706,37 @@ create_seqscan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer)
 }
 
 /*
+ * create_parallelseqscan_path
+ *	  Creates a path corresponding to a parallel sequential scan, returning the
+ *	  pathnode.
+ */
+ParallelSeqPath *
+create_parallelseqscan_path(PlannerInfo *root, RelOptInfo *rel, int nWorkers)
+{
+	ParallelSeqPath	   *pathnode = makeNode(ParallelSeqPath);
+
+	pathnode->path.pathtype = T_ParallelSeqScan;
+	pathnode->path.parent = rel;
+	pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
+													 false);
+	pathnode->path.pathkeys = NIL;	/* seqscan has unordered result */
+
+	pathnode->num_workers = nWorkers;
+	/*
+	 * Divide the work equally among all the workers.  When the division
+	 * is not exact (for example, with 10 blocks and 3 workers the
+	 * calculation below gives each worker 3 blocks), the last worker is
+	 * responsible for scanning the remaining blocks as well (refer to
+	 * exec_worker_message).
+	 */
+	pathnode->num_blocks_per_worker = rel->pages / nWorkers;
+
+	cost_parallelseqscan(pathnode, root, rel, pathnode->path.param_info, nWorkers);
+
+	return pathnode;
+}
+
+/*
  * create_index_path
  *	  Creates a path node for an index scan.
  *
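Reviewer's sketch (not part of the patch): the worker-count clamp in the allpaths.c hunk and the per-worker block count computed in create_parallelseqscan_path() can be read together as the small helper below. The names plan_worker_split and WorkerSplit are hypothetical, and BlockNumber is a local stand-in for the PostgreSQL typedef. The zero-page guard is an assumption on my part: if rel->pages were 0 the clamp would yield zero workers and the division would be undefined, unless an earlier check that is not visible in this part of the patch already rules that case out.

```c
#include <stdint.h>

typedef uint32_t BlockNumber;	/* stand-in for PostgreSQL's BlockNumber */

/* Hypothetical result type, used only for this illustration. */
typedef struct WorkerSplit
{
	int			nworkers;			/* workers that will take part */
	BlockNumber blocks_per_worker;	/* blocks assigned to each worker */
} WorkerSplit;

/*
 * Clamp the configured degree to the number of pages, so that every
 * worker has at least one page, then divide the pages among the workers.
 * Returns a zero split when parallelism is disabled or the relation
 * has no pages (assumed guard, see the note above).
 */
static WorkerSplit
plan_worker_split(int degree, BlockNumber pages)
{
	WorkerSplit split = {0, 0};

	if (degree <= 0 || pages == 0)
		return split;

	split.nworkers = ((BlockNumber) degree <= pages) ? degree : (int) pages;
	split.blocks_per_worker = pages / (BlockNumber) split.nworkers;
	return split;
}
```

With degree 3 and 10 pages this gives 3 workers and 3 blocks each, which matches the example in the comment of create_parallelseqscan_path(); the leftover block is picked up by the last worker.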
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 71c2321..f056bd5 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -12,7 +12,8 @@ subdir = src/backend/postmaster
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = autovacuum.o bgworker.o bgwriter.o checkpointer.o fork_process.o \
-	pgarch.o pgstat.o postmaster.o startup.o syslogger.o walwriter.o
+OBJS = autovacuum.o backendworker.o bgworker.o bgwriter.o checkpointer.o \
+	fork_process.o pgarch.o pgstat.o postmaster.o startup.o syslogger.o \
+	walwriter.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/backendworker.c b/src/backend/postmaster/backendworker.c
new file mode 100644
index 0000000..028f34e
--- /dev/null
+++ b/src/backend/postmaster/backendworker.c
@@ -0,0 +1,226 @@
+/*-------------------------------------------------------------------------
+ *
+ * backendworker.c
+ *	  Support routines for setting up backend workers.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/postmaster/backendworker.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * INTERFACE ROUTINES
+ *		InitiateWorkers				Setup dynamic shared memory and parallel backend workers.
+ */
+#include "postgres.h"
+
+#include "access/xact.h"
+#include "access/parallel.h"
+#include "commands/dbcommands.h"
+#include "commands/async.h"
+#include "executor/nodeParallelSeqscan.h"
+#include "miscadmin.h"
+#include "nodes/parsenodes.h"
+#include "postmaster/backendworker.h"
+#include "storage/ipc.h"
+#include "storage/procsignal.h"
+#include "storage/procarray.h"
+#include "storage/shm_toc.h"
+#include "storage/spin.h"
+#include "tcop/tcopprot.h"
+#include "utils/builtins.h"
+#include "utils/memutils.h"
+#include "utils/resowner.h"
+
+
+#define PARALLEL_TUPLE_QUEUE_SIZE					65536
+
+
+/* Table-of-contents constants for our dynamic shared memory segment. */
+#define PG_WORKER_KEY_RELID			0
+#define PG_WORKER_KEY_TARGETLIST	1
+#define PG_WORKER_KEY_QUAL			2
+#define PG_WORKER_KEY_BLOCKS		3
+#define PARALLEL_KEY_TUPLE_QUEUE	4
+
+void exec_worker_message(dsm_segment *seg, shm_toc *toc);
+
+/*
+ * InitiateWorkers
+ *		It sets up the required infrastructure for backend workers to
+ *	perform execution and return results to the main backend.
+ */
+void
+InitiateWorkers(Oid relId, List *targetList, List *qual,
+				shm_mq_handle ***responseqp, ParallelContext **pcxtp,
+				BlockNumber numBlocksPerWorker, int nWorkers)
+{
+	bool		already_in_parallel_mode = IsInParallelMode();
+	int			i;
+	Size		targetlist_len, qual_len;
+	BlockNumber	*num_blocks_per_worker;
+	Oid		   *reliddata;
+	char	   *targetlistdata;
+	char	   *targetlist_str;
+	char	   *qualdata;
+	char	   *qual_str;
+	char	   *tuple_queue_space;
+	ParallelContext *pcxt;
+	shm_mq	   *mq;
+
+	if (!already_in_parallel_mode)
+		EnterParallelMode();
+
+	pcxt = CreateParallelContext(exec_worker_message, nWorkers);
+
+	/* Estimate space for parallel seq. scan specific contents. */
+	shm_toc_estimate_chunk(&pcxt->estimator, sizeof(relId));
+
+	targetlist_str = nodeToString(targetList);
+	targetlist_len = strlen(targetlist_str) + 1;
+	shm_toc_estimate_chunk(&pcxt->estimator, targetlist_len);
+
+	qual_str = nodeToString(qual);
+	qual_len = strlen(qual_str) + 1;
+	shm_toc_estimate_chunk(&pcxt->estimator, qual_len);
+
+	shm_toc_estimate_chunk(&pcxt->estimator, sizeof(BlockNumber));
+
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   (Size) PARALLEL_TUPLE_QUEUE_SIZE * nWorkers);
+
+	/* 5 keys for parallel seq. scan specific data. */
+	shm_toc_estimate_keys(&pcxt->estimator, 5);
+
+	InitializeParallelDSM(pcxt);
+
+	/* Store scan relation id in dynamic shared memory. */
+	reliddata = shm_toc_allocate(pcxt->toc, sizeof(Oid));
+	*reliddata = relId;
+	shm_toc_insert(pcxt->toc, PG_WORKER_KEY_RELID, reliddata);
+
+	/* Store target list in dynamic shared memory. */
+	targetlistdata = shm_toc_allocate(pcxt->toc, targetlist_len);
+	memcpy(targetlistdata, targetlist_str, targetlist_len);
+	shm_toc_insert(pcxt->toc, PG_WORKER_KEY_TARGETLIST, targetlistdata);
+
+	/* Store qual list in dynamic shared memory. */
+	qualdata = shm_toc_allocate(pcxt->toc, qual_len);
+	memcpy(qualdata, qual_str, qual_len);
+	shm_toc_insert(pcxt->toc, PG_WORKER_KEY_QUAL, qualdata);
+
+	/* Store blocks to be scanned by each worker in dynamic shared memory. */
+	num_blocks_per_worker = shm_toc_allocate(pcxt->toc, sizeof(BlockNumber));
+	*num_blocks_per_worker = numBlocksPerWorker;
+	shm_toc_insert(pcxt->toc, PG_WORKER_KEY_BLOCKS, num_blocks_per_worker);
+
+	/* Allocate memory for shared memory queue handles. */
+	*responseqp = (shm_mq_handle**) palloc(nWorkers * sizeof(shm_mq_handle*));
+
+	/*
+	 * Establish one message queue per worker in dynamic shared memory.
+	 * These queues should be used to transmit tuple data.
+	 */
+	tuple_queue_space =
+	   shm_toc_allocate(pcxt->toc, PARALLEL_TUPLE_QUEUE_SIZE * pcxt->nworkers);
+	for (i = 0; i < pcxt->nworkers; ++i)
+	{
+		mq = shm_mq_create(tuple_queue_space + i * PARALLEL_TUPLE_QUEUE_SIZE,
+						   (Size) PARALLEL_TUPLE_QUEUE_SIZE);
+		
+		shm_mq_set_receiver(mq, MyProc);
+
+		/*
+		 * Attach the queue before launching a worker, so that we'll automatically
+		 * detach the queue if we error out.  (Otherwise, the worker might sit
+		 * there trying to write the queue long after we've gone away.)
+		 */
+		(*responseqp)[i] = shm_mq_attach(mq, pcxt->seg, NULL);
+	}
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_TUPLE_QUEUE, tuple_queue_space);
+
+	/* Register backend workers. */
+	LaunchParallelWorkers(pcxt);
+
+	for (i = 0; i < pcxt->nworkers; ++i)
+		shm_mq_set_handle((*responseqp)[i], pcxt->worker[i].bgwhandle);
+
+	/* Return results to caller. */
+	*pcxtp = pcxt;
+}
+
+
+/*
+ * exec_worker_message
+ *
+ * Execute the work assigned to a worker by master backend.
+ */
+void
+exec_worker_message(dsm_segment *seg, shm_toc *toc)
+{
+	char	    *targetlistdata;
+	char		*qualdata;
+	char		*tuple_queue_space;
+	BlockNumber *num_blocks_per_worker;
+	BlockNumber  start_block;
+	BlockNumber  end_block;
+	shm_mq	    *mq;
+	shm_mq_handle *responseq;
+	FixedParallelState *fps;
+	Oid			*relId;
+	List		*targetList = NIL;
+	List		*qual = NIL;
+	worker_stmt	*workerstmt;
+	
+	fps = shm_toc_lookup(toc, PARALLEL_KEY_FIXED);
+	relId = shm_toc_lookup(toc, PG_WORKER_KEY_RELID);
+	targetlistdata = shm_toc_lookup(toc, PG_WORKER_KEY_TARGETLIST);
+	qualdata = shm_toc_lookup(toc, PG_WORKER_KEY_QUAL);
+	num_blocks_per_worker = shm_toc_lookup(toc, PG_WORKER_KEY_BLOCKS);
+
+	tuple_queue_space = shm_toc_lookup(toc, PARALLEL_KEY_TUPLE_QUEUE);
+	mq = (shm_mq *) (tuple_queue_space +
+		ParallelWorkerNumber * PARALLEL_TUPLE_QUEUE_SIZE);
+
+	shm_mq_set_sender(mq, MyProc);
+	responseq = shm_mq_attach(mq, seg, NULL);
+
+	end_block = (ParallelWorkerNumber + 1) * (*num_blocks_per_worker);
+	start_block = end_block - (*num_blocks_per_worker);
+
+	/* Redirect protocol messages to responseq. */
+	pq_redirect_to_shm_mq(mq, responseq);
+
+	/* Restore targetList and qual from main backend. */
+	targetList = (List *) stringToNode(targetlistdata);
+	qual = (List *) stringToNode(qualdata);
+
+	workerstmt = palloc(sizeof(worker_stmt));
+
+	workerstmt->relId = *relId;
+	workerstmt->targetList = targetList;
+	workerstmt->qual = qual;
+	workerstmt->startBlock = start_block;
+
+	/*
+	 * The last worker should scan all the remaining blocks.
+	 *
+	 * XXX - It is possible that the expected number of workers
+	 * won't get started, so to handle such cases the master
+	 * backend should scan the remaining blocks.
+	 */
+	if ((ParallelWorkerNumber + 1) == fps->workers_expected)
+		workerstmt->endBlock = InvalidBlockNumber;
+	else
+		workerstmt->endBlock = end_block;
+
+	/* Execute the worker command. */
+	exec_worker_stmt(workerstmt);
+
+	/* Report success. */
+	ReadyForQuery(DestRemote);
+}
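Reviewer's sketch (not part of the patch): the block-range arithmetic in exec_worker_message() can be isolated as below to check which blocks each worker gets. worker_block_range is a hypothetical name, and the typedef and macro are local stand-ins for the PostgreSQL definitions. I am assuming the end block is treated as exclusive by the scan code, which this part of the patch does not show; InvalidBlockNumber for the last worker means "scan to the end of the relation", as in the patch.

```c
#include <stdint.h>

typedef uint32_t BlockNumber;	/* stand-in for PostgreSQL's BlockNumber */
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/*
 * Compute the block range for worker 'worker_no' (0-based), mirroring
 * the calculation in exec_worker_message():
 *   end   = (worker_no + 1) * blocks_per_worker
 *   start = end - blocks_per_worker
 * The last worker gets InvalidBlockNumber as its end so that it also
 * scans whatever is left over after the integer division.
 */
static void
worker_block_range(int worker_no, int nworkers,
				   BlockNumber blocks_per_worker,
				   BlockNumber *start, BlockNumber *end)
{
	BlockNumber range_end = (BlockNumber) (worker_no + 1) * blocks_per_worker;

	*start = range_end - blocks_per_worker;
	*end = (worker_no + 1 == nworkers) ? InvalidBlockNumber : range_end;
}
```

For 10 blocks and 3 workers (3 blocks per worker) this yields [0, 3), [3, 6) and, for the last worker, a range starting at block 6 with no upper bound, so block 9 is not lost. Note that the patch compares against fps->workers_expected, so if fewer workers actually start, the tail of the relation is not scanned by anyone; that is the case the XXX comment refers to.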
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 65d5fac..6c7d89a 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -103,6 +103,7 @@
 #include "miscadmin.h"
 #include "pg_getopt.h"
 #include "pgstat.h"
+#include "optimizer/cost.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/bgworker_internals.h"
 #include "postmaster/fork_process.h"
@@ -835,6 +836,12 @@ PostmasterMain(int argc, char *argv[])
 		ereport(ERROR,
 				(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"archive\", \"hot_standby\", or \"logical\"")));
 
+	if (parallel_seqscan_degree >= MaxConnections)
+	{
+		write_stderr("%s: parallel_seqscan_degree must be less than max_connections\n", progname);
+		ExitPostmaster(1);
+	}
+
 	/*
 	 * Other one-time internal sanity checks can go here, if they are fast.
 	 * (Put any slow processing further down, after postmaster.pid creation.)
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 7b1e8f6..d345d4c 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -56,6 +56,7 @@
 #include "pg_getopt.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/postmaster.h"
+#include "postmaster/backendworker.h"
 #include "replication/slot.h"
 #include "replication/walsender.h"
 #include "rewrite/rewriteHandler.h"
@@ -1133,6 +1134,100 @@ exec_simple_query(const char *query_string)
 }
 
 /*
+ * exec_worker_stmt
+ *
+ * Execute the plan for backend worker.
+ */
+void
+exec_worker_stmt(worker_stmt *workerstmt)
+{
+	Portal		portal;
+	int16		format = 1;
+	DestReceiver *receiver;
+	bool		isTopLevel = true;
+	PlannedStmt	*planned_stmt;
+	MemoryContext oldcontext;
+	MemoryContext	plancontext;
+
+	set_ps_display("SELECT", false);
+	BeginCommand("SELECT", DestNone);
+
+	/*
+	 * Unlike exec_simple_query(), in backend worker we won't allow
+	 * transaction control statements, so we can allow plancontext
+	 * to be created in TopTransaction context.
+	 */
+	plancontext = AllocSetContextCreate(CurrentMemoryContext,
+										 "worker plan",
+										 ALLOCSET_DEFAULT_MINSIZE,
+										 ALLOCSET_DEFAULT_INITSIZE,
+										 ALLOCSET_DEFAULT_MAXSIZE);
+
+	oldcontext = MemoryContextSwitchTo(plancontext);
+
+	planned_stmt = create_worker_seqscan_plannedstmt(workerstmt);
+	/*
+	 * Create unnamed portal to run the query or queries in. If there
+	 * already is one, silently drop it.
+	 */
+	portal = CreatePortal("", true, true);
+	/* Don't display the portal in pg_cursors */
+	portal->visible = false;
+
+	/*
+	 * We don't have to copy anything into the portal, because everything
+	 * we are passing here is in MessageContext, which will outlive the
+	 * portal anyway.
+	 */
+	PortalDefineQuery(portal,
+					  NULL,
+					  "",
+					  "",
+					  list_make1(planned_stmt),
+					  NULL);
+
+	/*
+	 * Start the portal.  No parameters here.
+	 */
+	PortalStart(portal, NULL, 0, InvalidSnapshot);
+
+	/* We always use binary format, for efficiency. */
+	PortalSetResultFormat(portal, 1, &format);
+
+	receiver = CreateDestReceiver(DestRemote);
+	SetRemoteDestReceiverParams(receiver, portal);
+
+	/*
+	 * Only once the portal and destreceiver have been established can
+	 * we return to the transaction context.  All that stuff needs to
+	 * survive an internal commit inside PortalRun!
+	 */
+	MemoryContextSwitchTo(oldcontext);
+
+	/*
+	 * Run the portal to completion, and then drop it (and the receiver).
+	 */
+	(void) PortalRun(portal,
+					 FETCH_ALL,
+					 isTopLevel,
+					 receiver,
+					 receiver,
+					 NULL);
+
+	(*receiver->rDestroy) (receiver);
+
+	PortalDrop(portal, false);
+
+	/*
+	 * Send appropriate CommandComplete to client.  There is no
+	 * need to send a completion tag from the worker, as it won't be
+	 * of any use: the completion tag of the master backend is the
+	 * one that will be sent to the client.
+	 */
+	EndCommand("", DestRemote);
+}
+
+/*
  * exec_parse_message
  *
  * Execute a "Parse" protocol message.
diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
index 13395e3..a373f6b 100644
--- a/src/backend/utils/error/elog.c
+++ b/src/backend/utils/error/elog.c
@@ -67,6 +67,7 @@
 #include "access/xact.h"
 #include "libpq/libpq.h"
 #include "libpq/pqformat.h"
+#include "libpq/pqmq.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "postmaster/postmaster.h"
@@ -236,6 +237,13 @@ errstart(int elevel, const char *filename, int lineno,
 	bool		output_to_client = false;
 	int			i;
 
+	/* redirect errors to error shared memory queue. */
+	if (is_err_shm_mq_enabled() && elevel >= ERROR)
+	{
+		pq_redirect_to_err_shm_mq();
+		pq_set_parallel_master_from_info();
+	}
+
 	/*
 	 * Check some cases in which we want to promote an error into a more
 	 * severe error.  None of this logic applies for non-error messages.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index d9bfa25..9319f65 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -630,6 +630,8 @@ const char *const config_group_names[] =
 	gettext_noop("Statistics / Query and Index Statistics Collector"),
 	/* AUTOVACUUM */
 	gettext_noop("Autovacuum"),
+	/* PARALLEL_QUERY */
+	gettext_noop("Parallel Query"),
 	/* CLIENT_CONN */
 	gettext_noop("Client Connection Defaults"),
 	/* CLIENT_CONN_STATEMENT */
@@ -2445,6 +2447,16 @@ static struct config_int ConfigureNamesInt[] =
 	},
 
 	{
+		{"parallel_seqscan_degree", PGC_SUSET, PARALLEL_QUERY,
+			gettext_noop("Sets the maximum number of simultaneously running backend worker processes."),
+			NULL
+		},
+		&parallel_seqscan_degree,
+		0, 0, MAX_BACKENDS,
+		NULL, NULL, NULL
+	},
+
+	{
 		{"autovacuum_work_mem", PGC_SIGHUP, RESOURCES_MEM,
 			gettext_noop("Sets the maximum memory to be used by each autovacuum worker process."),
 			NULL,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index b053659..50f7a27 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -497,6 +497,11 @@
 					# autovacuum, -1 means use
 					# vacuum_cost_limit
 
+#------------------------------------------------------------------------------
+# PARALLEL_QUERY PARAMETERS
+#------------------------------------------------------------------------------
+
+#parallel_seqscan_degree = 0		# max number of worker backend subprocesses
 
 #------------------------------------------------------------------------------
 # CLIENT CONNECTION DEFAULTS
diff --git a/src/include/access/parallel.h b/src/include/access/parallel.h
index b651218..c50dd7b 100644
--- a/src/include/access/parallel.h
+++ b/src/include/access/parallel.h
@@ -41,8 +41,47 @@ typedef struct ParallelContext
 	ParallelWorkerInfo *worker;
 } ParallelContext;
 
+/* Magic number for parallel context TOC. */
+#define PARALLEL_MAGIC						0x50477c7c
+
+/*
+ * Magic numbers for parallel state sharing.  Higher-level code should use
+ * smaller values, leaving these very large ones for use by this module.
+ */
+#define PARALLEL_KEY_FIXED					UINT64CONST(0xFFFFFFFFFFFF0001)
+#define PARALLEL_KEY_ERROR_QUEUE			UINT64CONST(0xFFFFFFFFFFFF0002)
+#define PARALLEL_KEY_GUC					UINT64CONST(0xFFFFFFFFFFFF0003)
+#define PARALLEL_KEY_COMBO_CID				UINT64CONST(0xFFFFFFFFFFFF0004)
+#define PARALLEL_KEY_TRANSACTION_SNAPSHOT	UINT64CONST(0xFFFFFFFFFFFF0005)
+#define PARALLEL_KEY_ACTIVE_SNAPSHOT		UINT64CONST(0xFFFFFFFFFFFF0006)
+#define PARALLEL_KEY_TRANSACTION_STATE		UINT64CONST(0xFFFFFFFFFFFF0007)
+#define PARALLEL_KEY_EXTENSION_TRAMPOLINE	UINT64CONST(0xFFFFFFFFFFFF0008)
+
+/* Fixed-size parallel state. */
+typedef struct FixedParallelState
+{
+	/* Fixed-size state that workers must restore. */
+	Oid			database_id;
+	Oid			authenticated_user_id;
+	Oid			current_user_id;
+	int			sec_context;
+	PGPROC	   *parallel_master_pgproc;
+	pid_t		parallel_master_pid;
+	BackendId	parallel_master_backend_id;
+
+	/* Entrypoint for parallel workers. */
+	parallel_worker_main_type	entrypoint;
+
+	/* Track whether workers have attached. */
+	slock_t		mutex;
+	int			workers_expected;
+	int			workers_attached;
+} FixedParallelState;
+
 extern bool ParallelMessagePending;
 
+extern int ParallelWorkerNumber;
+
 extern ParallelContext *CreateParallelContext(parallel_worker_main_type entrypoint, int nworkers);
 extern ParallelContext *CreateParallelContextForExtension(char *library_name,
 								  char *function_name, int nworkers);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 9bb6362..bde6df0 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -20,7 +20,6 @@
 #include "access/itup.h"
 #include "access/tupdesc.h"
 
-
 typedef struct HeapScanDescData
 {
 	/* scan parameters */
@@ -105,4 +104,13 @@ typedef struct SysScanDescData
 	Snapshot	snapshot;		/* snapshot to unregister at end of scan */
 }	SysScanDescData;
 
+/* struct for scanning shared memory queues */
+typedef struct ShmScanDescData
+{
+	/* scan current state */
+	int			num_shm_queues;	/* number of shared memory queues used in scan. */
+	int			ss_cqueue;		/* current queue # in scan, if any */
+	bool		shmscan_inited;		/* false = scan not init'd yet */
+}	ShmScanDescData;
+
 #endif   /* RELSCAN_H */
diff --git a/src/include/access/shmmqam.h b/src/include/access/shmmqam.h
new file mode 100644
index 0000000..aa444bc
--- /dev/null
+++ b/src/include/access/shmmqam.h
@@ -0,0 +1,39 @@
+/*-------------------------------------------------------------------------
+ *
+ * shmmqam.h
+ *	  POSTGRES shared memory queue access method definitions.
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/shmmqam.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef SHMMQAM_H
+#define SHMMQAM_H
+
+#include "access/relscan.h"
+#include "libpq/pqmq.h"
+
+
+/* Private state maintained across calls to shm_getnext. */
+typedef struct worker_result_state
+{
+	FmgrInfo   *receive_functions;
+	Oid		   *typioparams;
+	bool		has_row_description;
+	bool		complete;
+} worker_result_state;
+
+typedef struct worker_result_state *worker_result;
+
+typedef struct ShmScanDescData *ShmScanDesc;
+
+extern worker_result ExecInitWorkerResult(TupleDesc tupdesc);
+extern ShmScanDesc shm_beginscan(int num_queues);
+extern HeapTuple shm_getnext(ShmScanDesc shmScan, worker_result resultState,
+							 shm_mq_handle **responseq, TupleDesc tupdesc);
+
+#endif   /* SHMMQAM_H */
diff --git a/src/include/executor/nodeParallelSeqscan.h b/src/include/executor/nodeParallelSeqscan.h
new file mode 100644
index 0000000..b638a24
--- /dev/null
+++ b/src/include/executor/nodeParallelSeqscan.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeParallelSeqscan.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeParallelSeqscan.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEPARALLELSEQSCAN_H
+#define NODEPARALLELSEQSCAN_H
+
+#include "nodes/execnodes.h"
+
+extern ParallelSeqScanState *ExecInitParallelSeqScan(ParallelSeqScan *node, EState *estate, int eflags);
+extern TupleTableSlot *ExecParallelSeqScan(ParallelSeqScanState *node);
+extern void ExecEndParallelSeqScan(ParallelSeqScanState *node);
+
+extern Size EstimateScanRelationIdSpace(Oid relId);
+extern void SerializeScanRelationId(Oid relId, Size maxsize,
+									char *start_address);
+extern void RestoreScanRelationId(Oid *relId, char *start_address);
+
+extern Size EstimateTargetListSpace(List *targetList);
+extern void SerializeTargetList(List *targetList, Size maxsize,
+								char *start_address);
+extern void RestoreTargetList(List **targetList, char *start_address);
+
+#endif   /* NODEPARALLELSEQSCAN_H */
diff --git a/src/include/libpq/pqmq.h b/src/include/libpq/pqmq.h
index ad7589d..2186d60 100644
--- a/src/include/libpq/pqmq.h
+++ b/src/include/libpq/pqmq.h
@@ -17,7 +17,13 @@
 #include "storage/shm_mq.h"
 
 extern void	pq_redirect_to_shm_mq(shm_mq *, shm_mq_handle *);
+extern void pq_save_shm_mq_info(shm_mq *mq, shm_mq_handle *mqh);
+extern void pq_redirect_to_err_shm_mq(void);
+extern bool is_err_shm_mq_enabled(void);
+
 extern void pq_set_parallel_master(pid_t pid, BackendId backend_id);
+extern void pq_save_parallel_master_info(pid_t pid, BackendId backend_id);
+extern void pq_set_parallel_master_from_info(void);
 
 extern void pq_parse_errornotice(StringInfo str, ErrorData *edata);
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 41288ed..a7263bd 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -16,9 +16,12 @@
 
 #include "access/genam.h"
 #include "access/heapam.h"
+#include "access/parallel.h"
+#include "access/shmmqam.h"
 #include "executor/instrument.h"
 #include "nodes/params.h"
 #include "nodes/plannodes.h"
+#include "storage/shm_mq.h"
 #include "utils/reltrigger.h"
 #include "utils/sortsupport.h"
 #include "utils/tuplestore.h"
@@ -1021,6 +1024,9 @@ typedef struct PlanState
 	ProjectionInfo *ps_ProjInfo;	/* info for doing tuple projection */
 	bool		ps_TupFromTlist;/* state flag for processing set-valued
 								 * functions in targetlist */
+	bool		qualPushed;		/* indicates that qual is pushed to backend
+								 * worker, so no need to evaluate it after
+								 * getting the tuple in main backend. */
 } PlanState;
 
 /* ----------------
@@ -1212,6 +1218,23 @@ typedef struct ScanState
 typedef ScanState SeqScanState;
 
 /*
+ * ParallelScanState extends ScanState by storing additional information
+ * related to parallel workers.
+ *		pcxt			parallel context whose dynamic shared memory segment holds the worker queues
+ *		responseq		shared memory queues to receive data from workers
+ */
+typedef struct ParallelScanState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	ParallelContext *pcxt;
+	shm_mq_handle **responseq;
+	ShmScanDesc pss_currentShmScanDesc;
+	worker_result	pss_workerResult;
+} ParallelScanState;
+
+typedef ParallelScanState ParallelSeqScanState;
+
+/*
  * These structs store information about index quals that don't have simple
  * constant right-hand sides.  See comments for ExecIndexBuildScanKeys()
  * for discussion.
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 97ef0fc..b6f1493 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -51,6 +51,7 @@ typedef enum NodeTag
 	T_BitmapOr,
 	T_Scan,
 	T_SeqScan,
+	T_ParallelSeqScan,
 	T_IndexScan,
 	T_IndexOnlyScan,
 	T_BitmapIndexScan,
@@ -97,6 +98,7 @@ typedef enum NodeTag
 	T_BitmapOrState,
 	T_ScanState,
 	T_SeqScanState,
+	T_ParallelSeqScanState,
 	T_IndexScanState,
 	T_IndexOnlyScanState,
 	T_BitmapIndexScanState,
@@ -217,6 +219,7 @@ typedef enum NodeTag
 	T_IndexOptInfo,
 	T_ParamPathInfo,
 	T_Path,
+	T_ParallelSeqPath,
 	T_IndexPath,
 	T_BitmapHeapPath,
 	T_BitmapAndPath,
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index b1dfa85..5777271 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -23,6 +23,7 @@
 #include "nodes/bitmapset.h"
 #include "nodes/primnodes.h"
 #include "nodes/value.h"
+#include "storage/block.h"
 #include "utils/lockwaitpolicy.h"
 
 /* Possible sources of a Query */
@@ -156,6 +157,15 @@ typedef struct Query
 								 * depends on to be semantically valid */
 } Query;
 
+/* worker statement required for execution. */
+typedef struct worker_stmt
+{
+	Oid			relId;
+	List		*targetList;
+	List		*qual;
+	BlockNumber startBlock;
+	BlockNumber endBlock;
+} worker_stmt;
 
 /****************************************************************************
  *	Supporting data structures for Parse Trees
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 316c9ce..3354398 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -18,6 +18,7 @@
 #include "lib/stringinfo.h"
 #include "nodes/bitmapset.h"
 #include "nodes/primnodes.h"
+#include "storage/block.h"
 #include "utils/lockwaitpolicy.h"
 
 
@@ -269,6 +270,8 @@ typedef struct Scan
 {
 	Plan		plan;
 	Index		scanrelid;		/* relid is index into the range table */
+	BlockNumber startblock;		/* block to start seq scan */
+	BlockNumber endblock;		/* block upto which scan has to be done */
 } Scan;
 
 /* ----------------
@@ -278,6 +281,17 @@ typedef struct Scan
 typedef Scan SeqScan;
 
 /* ----------------
+ *		parallel sequential scan node
+ * ----------------
+ */
+typedef struct ParallelSeqScan
+{
+	Scan		scan;
+	int			num_workers;
+	BlockNumber	num_blocks_per_worker;
+} ParallelSeqScan;
+
+/* ----------------
  *		index scan node
  *
  * indexqualorig is an implicitly-ANDed list of index qual expressions, each
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 6845a40..576add5 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -737,6 +737,13 @@ typedef struct Path
 	/* pathkeys is a List of PathKey nodes; see above */
 } Path;
 
+typedef struct ParallelSeqPath
+{
+	Path		path;
+	int			num_workers;
+	BlockNumber	num_blocks_per_worker;
+} ParallelSeqPath;
+
 /* Macro for extracting a path's parameterization relids; beware double eval */
 #define PATH_REQ_OUTER(path)  \
 	((path)->param_info ? (path)->param_info->ppi_req_outer : (Relids) NULL)
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 9c2000b..b1161bd 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -50,6 +50,7 @@ extern PGDLLIMPORT double cpu_index_tuple_cost;
 extern PGDLLIMPORT double cpu_operator_cost;
 extern PGDLLIMPORT int effective_cache_size;
 extern Cost disable_cost;
+extern int	parallel_seqscan_degree;
 extern bool enable_seqscan;
 extern bool enable_indexscan;
 extern bool enable_indexonlyscan;
@@ -68,6 +69,8 @@ extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,
 					double index_pages, PlannerInfo *root);
 extern void cost_seqscan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 			 ParamPathInfo *param_info);
+extern void cost_parallelseqscan(ParallelSeqPath *path, PlannerInfo *root,
+			 RelOptInfo *baserel, ParamPathInfo *param_info, int nWorkers);
 extern void cost_index(IndexPath *path, PlannerInfo *root,
 		   double loop_count);
 extern void cost_bitmap_heap_scan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 9923f0e..32c3e0d 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -32,6 +32,8 @@ extern bool add_path_precheck(RelOptInfo *parent_rel,
 
 extern Path *create_seqscan_path(PlannerInfo *root, RelOptInfo *rel,
 					Relids required_outer);
+extern ParallelSeqPath *create_parallelseqscan_path(PlannerInfo *root,
+					RelOptInfo *rel, int nWorkers);
 extern IndexPath *create_index_path(PlannerInfo *root,
 				  IndexOptInfo *index,
 				  List *indexclauses,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 6cad92e..391d519 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -46,6 +46,13 @@ extern void debug_print_rel(PlannerInfo *root, RelOptInfo *rel);
 #endif
 
 /*
+ * parallelpath.c
+ *	  routines to generate parallel scan paths
+ */
+
+extern void create_parallelscan_paths(PlannerInfo *root, RelOptInfo *rel);
+
+/*
  * indxpath.c
  *	  routines to generate index paths
  */
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index 082f7d7..ef5a320 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -41,6 +41,9 @@ extern Plan *optimize_minmax_aggregates(PlannerInfo *root, List *tlist,
  * prototypes for plan/createplan.c
  */
 extern Plan *create_plan(PlannerInfo *root, Path *best_path);
+extern SeqScan *
+create_worker_seqscan_plan(List *targetList, List *scan_clauses,
+						   BlockNumber startBlock, BlockNumber endBlock);
 extern SubqueryScan *make_subqueryscan(List *qptlist, List *qpqual,
 				  Index scanrelid, Plan *subplan);
 extern ForeignScan *make_foreignscan(List *qptlist, List *qpqual,
diff --git a/src/include/optimizer/planner.h b/src/include/optimizer/planner.h
index cd62aec..91ddffe 100644
--- a/src/include/optimizer/planner.h
+++ b/src/include/optimizer/planner.h
@@ -14,6 +14,7 @@
 #ifndef PLANNER_H
 #define PLANNER_H
 
+#include "nodes/parsenodes.h"
 #include "nodes/plannodes.h"
 #include "nodes/relation.h"
 
@@ -29,6 +30,8 @@ extern PlannedStmt *planner(Query *parse, int cursorOptions,
 		ParamListInfo boundParams);
 extern PlannedStmt *standard_planner(Query *parse, int cursorOptions,
 				 ParamListInfo boundParams);
+extern PlannedStmt *
+create_worker_seqscan_plannedstmt(worker_stmt *workerstmt);
 
 extern Plan *subquery_planner(PlannerGlobal *glob, Query *parse,
 				 PlannerInfo *parent_root,
diff --git a/src/include/postmaster/backendworker.h b/src/include/postmaster/backendworker.h
new file mode 100644
index 0000000..8813b6d
--- /dev/null
+++ b/src/include/postmaster/backendworker.h
@@ -0,0 +1,30 @@
+/*--------------------------------------------------------------------
+ * backendworker.h
+ *		POSTGRES backend workers interface
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/postmaster/backendworker.h
+ *--------------------------------------------------------------------
+ */
+#ifndef BACKENDWORKER_H
+#define BACKENDWORKER_H
+
+/*---------------------------------------------------------------------
+ * External module API.
+ *---------------------------------------------------------------------
+ */
+
+#include "libpq/pqmq.h"
+
+extern int	parallel_seqscan_degree;
+extern void InitiateWorkers(Oid relId, List *targetList,
+							List *qual,
+							shm_mq_handle ***responseqp,
+							ParallelContext **pcxtp,
+							BlockNumber numBlocksPerWorker,
+							int nWorkers);
+
+#endif   /* BACKENDWORKER_H */
diff --git a/src/include/tcop/tcopprot.h b/src/include/tcop/tcopprot.h
index 0a350fd..02cf518 100644
--- a/src/include/tcop/tcopprot.h
+++ b/src/include/tcop/tcopprot.h
@@ -83,5 +83,6 @@ extern void set_debug_options(int debug_flag,
 extern bool set_plan_disabling_options(const char *arg,
 						   GucContext context, GucSource source);
 extern const char *get_stats_option_name(const char *arg);
+extern void exec_worker_stmt(worker_stmt *workerstmt);
 
 #endif   /* TCOPPROT_H */
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index cf319af..38855e5 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -85,6 +85,7 @@ enum config_group
 	STATS_MONITORING,
 	STATS_COLLECTOR,
 	AUTOVACUUM,
+	PARALLEL_QUERY,
 	CLIENT_CONN,
 	CLIENT_CONN_STATEMENT,
 	CLIENT_CONN_LOCALE,
#50Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Stephen Frost (#47)
Re: Parallel Seq Scan

On 1/5/15, 9:21 AM, Stephen Frost wrote:

* Robert Haas (robertmhaas@gmail.com) wrote:

I think it's right to view this in the same way we view work_mem. We
plan on the assumption that an amount of memory equal to work_mem will
be available at execution time, without actually reserving it.

Agreed- this seems like a good approach for how to address this. We
should still be able to end up with plans which use less than the max
possible parallel workers though, as I pointed out somewhere up-thread.
This is also similar to work_mem- we certainly have plans which don't
expect to use all of work_mem and others that expect to use all of it
(per node, of course).

I agree, but we should try and warn the user if they set parallel_seqscan_degree close to max_worker_processes, or at least give some indication of what's going on. This is something you could end up beating your head on wondering why it's not working.

Perhaps we could have EXPLAIN throw a warning if a plan is likely to get less than parallel_seqscan_degree number of workers.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#51Stephen Frost
sfrost@snowman.net
In reply to: Jim Nasby (#50)
Re: Parallel Seq Scan

* Jim Nasby (Jim.Nasby@BlueTreble.com) wrote:

On 1/5/15, 9:21 AM, Stephen Frost wrote:

* Robert Haas (robertmhaas@gmail.com) wrote:

I think it's right to view this in the same way we view work_mem. We
plan on the assumption that an amount of memory equal to work_mem will
be available at execution time, without actually reserving it.

Agreed- this seems like a good approach for how to address this. We
should still be able to end up with plans which use less than the max
possible parallel workers though, as I pointed out somewhere up-thread.
This is also similar to work_mem- we certainly have plans which don't
expect to use all of work_mem and others that expect to use all of it
(per node, of course).

I agree, but we should try and warn the user if they set parallel_seqscan_degree close to max_worker_processes, or at least give some indication of what's going on. This is something you could end up beating your head on wondering why it's not working.

Perhaps we could have EXPLAIN throw a warning if a plan is likely to get less than parallel_seqscan_degree number of workers.

Yeah, if we come up with a plan for X workers and end up not being able
to spawn that many then I could see that being worth a warning or notice
or something. Not sure what EXPLAIN has to do with it.

Thanks,

Stephen

#52Amit Kapila
amit.kapila16@gmail.com
In reply to: Jim Nasby (#50)
Re: Parallel Seq Scan

On Fri, Jan 9, 2015 at 1:02 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:

On 1/5/15, 9:21 AM, Stephen Frost wrote:

* Robert Haas (robertmhaas@gmail.com) wrote:

I think it's right to view this in the same way we view work_mem. We
plan on the assumption that an amount of memory equal to work_mem will
be available at execution time, without actually reserving it.

Agreed- this seems like a good approach for how to address this. We
should still be able to end up with plans which use less than the max
possible parallel workers though, as I pointed out somewhere up-thread.
This is also similar to work_mem- we certainly have plans which don't
expect to use all of work_mem and others that expect to use all of it
(per node, of course).

I agree, but we should try and warn the user if they set
parallel_seqscan_degree close to max_worker_processes, or at least give
some indication of what's going on. This is something you could end up
beating your head on wondering why it's not working.

Yet another way to handle the case when enough workers are not
available is to let the user specify the desired minimum percentage of
requested parallel workers with a parameter like
PARALLEL_QUERY_MIN_PERCENT. For example, if you specify
50 for this parameter, then at least 50% of the parallel workers
requested for any parallel operation must be available in order for
the operation to succeed; otherwise it will give an error. If the value
is set to null, then all parallel operations will proceed as long as at
least two parallel workers are available for processing.

This is how other commercial databases handle such a
situation.
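
To make the proposed rule concrete, here is a minimal sketch of the check in C. The function name is hypothetical and not part of the patch, and a negative min_percent is only my stand-in for the "null" setting described above.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper (not from the patch): may a parallel operation
 * proceed, given how many workers were requested and how many are
 * actually available?  min_percent < 0 stands in for "null".
 */
bool
sketch_min_percent_ok(int requested, int available, int min_percent)
{
    assert(requested > 0 && available >= 0 && min_percent <= 100);

    /* "null": proceed as long as at least two workers are available */
    if (min_percent < 0)
        return available >= 2;

    /* integer form of: available / requested >= min_percent / 100 */
    return (long) available * 100 >= (long) requested * min_percent;
}
```

With min_percent = 50 and 16 workers requested, 8 available workers pass the check and 7 do not; the caller would raise the error in the latter case.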

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#53Amit Kapila
amit.kapila16@gmail.com
In reply to: Stephen Frost (#24)
Re: Parallel Seq Scan

On Fri, Dec 19, 2014 at 7:57 PM, Stephen Frost <sfrost@snowman.net> wrote:

There's certainly documentation available from the other RDBMS' which
already support parallel query, as one source. Other academic papers
exist (and once you've linked into one, the references and prior work
helps bring in others). Sadly, I don't currently have ACM access (might
have to change that..), but there are publicly available papers also,

I have gone through a couple of papers and what some other databases
do in the case of parallel sequential scan; here is a brief summary
of that and of how I am planning to handle it in the patch:

Costing:
In one of the papers [1] suggested by you, below is the summarisation:
a. Startup costs are negligible if processes can be reused
rather than created afresh.
b. Communication cost consists of the CPU cost of sending
and receiving messages.
c. Communication costs can exceed the cost of operators such
as scanning, joining or grouping
These findings lead to the important conclusion that
Query optimization should be concerned with communication costs
but not with startup costs.

In our case, as we currently don't have a mechanism to reuse parallel
workers, we need to account for that cost as well. Based on that,
I am planning to add three new parameters: cpu_tuple_comm_cost,
parallel_setup_cost, and parallel_startup_cost.
* cpu_tuple_comm_cost - Cost of CPU time to pass a tuple from worker
to master backend, with default value DEFAULT_CPU_TUPLE_COMM_COST
as 0.1; this will be multiplied by the number of tuples expected to
be selected.
* parallel_setup_cost - Cost of setting up shared memory for parallelism,
with default value as 100.0.
* parallel_startup_cost - Cost of starting up parallel workers, with
default value as 1000.0, multiplied by the number of workers decided
for the scan.

I will do some experiments to finalise the default values, but in general
I feel developing the cost model on the above parameters is good.
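
As a rough illustration of how the three parameters might combine, here is a sketch in C. The way the terms are summed, and dividing the serial run cost evenly by the number of workers, are assumptions made for illustration; the function name is hypothetical and the defaults are simply the proposed values.

```c
#include <assert.h>

typedef double Cost;

/* Proposed defaults from the message above; still to be tuned. */
static const Cost cpu_tuple_comm_cost = 0.1;
static const Cost parallel_setup_cost = 100.0;
static const Cost parallel_startup_cost = 1000.0;

/*
 * Hypothetical cost formula (an assumption, not the patch's code):
 * one-time shared memory setup, per-worker startup, the serial run
 * cost split across workers, and per-tuple communication cost.
 */
Cost
sketch_parallel_seqscan_cost(Cost serial_run_cost, double tuples_selected,
                             int n_workers)
{
    assert(n_workers > 0);

    Cost startup = parallel_setup_cost + parallel_startup_cost * n_workers;
    Cost comm = cpu_tuple_comm_cost * tuples_selected;

    return startup + serial_run_cost / n_workers + comm;
}
```

For example, a scan with serial run cost 10000 that returns 1000 tuples through 2 workers would be costed at 100 + 2000 + 5000 + 100 = 7200 under these assumptions.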

Execution:
Most other databases do a partition-level scan, with each individual
parallel worker scanning a partition on a different disk. However,
it seems amazon dynamodb [2] also works on something
similar to what I have used in the patch, which means on fixed
blocks. I think this kind of strategy seems better than dividing
the blocks at runtime, because dividing the blocks randomly
among workers could lead to a random scan for a parallel
sequential scan.
Also, I find in whatever I have read (Oracle, dynamodb) that most
databases divide work among workers and the master backend acts
as coordinator; at least that's what I could understand.
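
The fixed-block strategy can be sketched as below: every worker gets an equal contiguous range, and the last worker also takes the remainder. The struct and function names are hypothetical, and treating the end block as exclusive is an assumption; the patch's endBlock may be defined differently.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t BlockNumber;   /* same width as PostgreSQL's BlockNumber */

/* Hypothetical half-open range [start, end) of blocks for one worker. */
typedef struct BlockRange
{
    BlockNumber start;          /* first block to scan */
    BlockNumber end;            /* one past the last block to scan */
} BlockRange;

/*
 * Give each worker an equal contiguous share; the last worker also
 * scans whatever is left over when the division is not exact.
 */
BlockRange
sketch_worker_block_range(BlockNumber total_blocks, int n_workers,
                          int worker_idx)
{
    assert(n_workers > 0 && worker_idx >= 0 && worker_idx < n_workers);

    BlockNumber per_worker = total_blocks / (BlockNumber) n_workers;
    BlockRange  r;

    r.start = per_worker * (BlockNumber) worker_idx;
    r.end = (worker_idx == n_workers - 1) ? total_blocks
                                          : r.start + per_worker;
    return r;
}
```

With 10 blocks and 3 workers this yields [0, 3), [3, 6) and [6, 10), so each worker still reads its blocks sequentially.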

Let me know your opinion on this.

I am planning to proceed with above ideas to strengthen the patch
in absence of any objection or better ideas.

[1]: http://i.stanford.edu/pub/cstr/reports/cs/tr/96/1570/CS-TR-96-1570.pdf
[2]: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/QueryAndScan.html#QueryAndScanParallelScan

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#54Stephen Frost
sfrost@snowman.net
In reply to: Amit Kapila (#53)
Re: Parallel Seq Scan

Amit,

* Amit Kapila (amit.kapila16@gmail.com) wrote:

On Fri, Dec 19, 2014 at 7:57 PM, Stephen Frost <sfrost@snowman.net> wrote:

There's certainly documentation available from the other RDBMS' which
already support parallel query, as one source. Other academic papers
exist (and once you've linked into one, the references and prior work
helps bring in others). Sadly, I don't currently have ACM access (might
have to change that..), but there are publicly available papers also,

I have gone through couple of papers and what some other databases
do in case of parallel sequential scan and here is brief summarization
of same and how I am planning to handle in the patch:

Great, thanks!

Costing:
In one of the paper's [1] suggested by you, below is the summarisation:
a. Startup costs are negligible if processes can be reused
rather than created afresh.
b. Communication cost consists of the CPU cost of sending
and receiving messages.
c. Communication costs can exceed the cost of operators such
as scanning, joining or grouping
These findings lead to the important conclusion that
Query optimization should be concerned with communication costs
but not with startup costs.

In our case as currently we don't have a mechanism to reuse parallel
workers, so we need to account for that cost as well. So based on that,
I am planing to add three new parameters cpu_tuple_comm_cost,
parallel_setup_cost, parallel_startup_cost
* cpu_tuple_comm_cost - Cost of CPU time to pass a tuple from worker
to master backend with default value
DEFAULT_CPU_TUPLE_COMM_COST as 0.1, this will be multiplied
with tuples expected to be selected
* parallel_setup_cost - Cost of setting up shared memory for parallelism
with default value as 100.0
* parallel_startup_cost - Cost of starting up parallel workers with
default
value as 1000.0 multiplied by number of workers decided for scan.

I will do some experiments to finalise the default values, but in general,
I feel developing cost model on above parameters is good.

The parameters sound reasonable but I'm a bit worried about the way
you're describing the implementation. Specifically this comment:

"Cost of starting up parallel workers with default value as 1000.0
multiplied by number of workers decided for scan."

That appears to imply that we'll decide on the number of workers, figure
out the cost, and then consider "parallel" as one path and
"not-parallel" as another. I'm worried that if I end up setting the max
parallel workers to 32 for my big, beefy, mostly-single-user system then
I'll actually end up not getting parallel execution because we'll always
be including the full startup cost of 32 threads. For huge queries,
it'll probably be fine, but there's a lot of room to parallelize things
at levels less than 32 which we won't even consider.

What I was advocating for up-thread was to consider multiple "parallel"
paths and to pick whichever ends up being the lowest overall cost. The
flip-side to that is increased planning time. Perhaps we can come up
with an efficient way of working out where the break-point is based on
the non-parallel cost and go at it from that direction instead of
building out whole paths for each increment of parallelism.

I'd really like to be able to set the 'max parallel' high and then have
the optimizer figure out how many workers should actually be spawned for
a given query.
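
One brute-force way to picture "set the max high and let the optimizer choose" is to cost every worker count up to the maximum and keep the cheapest, with zero workers meaning the serial path. This is only a sketch: the cost formula reuses the proposed default values in an assumed combination, all names are hypothetical, and a real planner would want something cheaper than trying every count.

```c
#include <assert.h>

/*
 * Assumed cost shape, using the proposed defaults (setup 100.0,
 * startup 1000.0 per worker, 0.1 per tuple passed to the master).
 * n == 0 means a plain serial scan with no parallel overhead.
 */
static double
sketch_cost_for_workers(double serial_run_cost, double tuples, int n)
{
    if (n == 0)
        return serial_run_cost;
    return 100.0 + 1000.0 * n + serial_run_cost / n + 0.1 * tuples;
}

/* Return the worker count (0..max_workers) with the lowest total cost. */
int
sketch_choose_worker_count(double serial_run_cost, double tuples,
                           int max_workers)
{
    assert(max_workers >= 0);

    int     best = 0;
    double  best_cost = sketch_cost_for_workers(serial_run_cost, tuples, 0);

    for (int n = 1; n <= max_workers; n++)
    {
        double  c = sketch_cost_for_workers(serial_run_cost, tuples, n);

        if (c < best_cost)
        {
            best_cost = c;
            best = n;
        }
    }
    return best;
}
```

Under these assumptions, a scan with serial cost 100000 returning 1000 tuples picks 10 workers even when the maximum is 32, while a scan with serial cost 1000 stays serial.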

Execution:
Most other databases does partition level scan for partition on
different disks by each individual parallel worker. However,
it seems amazon dynamodb [2] also works on something
similar to what I have used in patch which means on fixed
blocks. I think this kind of strategy seems better than dividing
the blocks at runtime because dividing randomly the blocks
among workers could lead to random scan for a parallel
sequential scan.

Yeah, we also need to consider the i/o side of this, which will
definitely be tricky. There are i/o systems out there which are faster
than a single CPU and ones where a single CPU can manage multiple i/o
channels. There are also cases where the i/o system handles sequential
access nearly as fast as random and cases where sequential is much
faster than random. Where we can get an idea of that distinction is
with seq_page_cost vs. random_page_cost as folks running on SSDs tend to
lower random_page_cost from the default to indicate that.

Also I find in whatever I have read (Oracle, dynamodb) that most
databases divide work among workers and master backend acts
as coordinator, atleast that's what I could understand.

Yeah, I agree that's more typical. Robert's point that the master
backend should participate is interesting but, as I recall, it was based
on the idea that the master could finish faster than the worker- but if
that's the case then we've planned it out wrong from the beginning.

Thanks!

Stephen

#55Stephen Frost
sfrost@snowman.net
In reply to: Amit Kapila (#52)
Re: Parallel Seq Scan

Amit,

* Amit Kapila (amit.kapila16@gmail.com) wrote:

On Fri, Jan 9, 2015 at 1:02 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:

I agree, but we should try and warn the user if they set
parallel_seqscan_degree close to max_worker_processes, or at least give
some indication of what's going on. This is something you could end up
beating your head on wondering why it's not working.

Yet another way to handle the case when enough workers are not
available is to let user specify the desired minimum percentage of
requested parallel workers with parameter like
PARALLEL_QUERY_MIN_PERCENT. For example, if you specify
50 for this parameter, then at least 50% of the parallel workers
requested for any parallel operation must be available in order for
the operation to succeed else it will give error. If the value is set to
null, then all parallel operations will proceed as long as at least two
parallel workers are available for processing.

Ugh. I'm not a fan of this.. Based on how we're talking about modeling
this, if we decide to parallelize at all, then we expect it to be a win.
I don't like the idea of throwing an error if, at execution time, we end
up not being able to actually get the number of workers we want-
instead, we should degrade gracefully all the way back to serial, if
necessary. Perhaps we should send a NOTICE or something along those
lines to let the user know we weren't able to get the level of
parallelization that the plan originally asked for, but I really don't
like just throwing an error.

Now, for debugging purposes, I could see such a parameter being
available but it should default to 'off/never-fail'.
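
The graceful-degradation behaviour could look roughly like the following sketch. The function name is hypothetical, and writing to stderr merely stands in for whatever NOTICE mechanism the backend would actually use.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: use as many workers as we could get, down to
 * zero (serial execution), and tell the user when that is fewer than
 * the plan asked for.  Never fails the query.
 */
int
sketch_degrade_workers(int planned, int available)
{
    assert(planned >= 0 && available >= 0);

    int granted = (available < planned) ? available : planned;

    if (granted < planned)
        fprintf(stderr, "NOTICE: planned %d parallel workers, got %d%s\n",
                planned, granted,
                granted == 0 ? "; running serially" : "");

    return granted;
}
```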

Thanks,

Stephen

#56Stefan Kaltenbrunner
stefan@kaltenbrunner.cc
In reply to: Stephen Frost (#55)
Re: Parallel Seq Scan

On 01/09/2015 08:01 PM, Stephen Frost wrote:

Amit,

* Amit Kapila (amit.kapila16@gmail.com) wrote:

On Fri, Jan 9, 2015 at 1:02 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:

I agree, but we should try and warn the user if they set
parallel_seqscan_degree close to max_worker_processes, or at least give
some indication of what's going on. This is something you could end up
beating your head on wondering why it's not working.

Yet another way to handle the case when enough workers are not
available is to let user specify the desired minimum percentage of
requested parallel workers with parameter like
PARALLEL_QUERY_MIN_PERCENT. For example, if you specify
50 for this parameter, then at least 50% of the parallel workers
requested for any parallel operation must be available in order for
the operation to succeed else it will give error. If the value is set to
null, then all parallel operations will proceed as long as at least two
parallel workers are available for processing.

Ugh. I'm not a fan of this.. Based on how we're talking about modeling
this, if we decide to parallelize at all, then we expect it to be a win.
I don't like the idea of throwing an error if, at execution time, we end
up not being able to actually get the number of workers we want-
instead, we should degrade gracefully all the way back to serial, if
necessary. Perhaps we should send a NOTICE or something along those
lines to let the user know we weren't able to get the level of
parallelization that the plan originally asked for, but I really don't
like just throwing an error.

yeah this seems like the behaviour I would expect, if we can't get
enough parallel workers we should just use as many as we can get.
Everything else and especially erroring out will just cause random
application failures and easy DoS vectors.
I think all we need initially is being able to specify a "maximum number
of workers per query" as well as a "maximum number of workers in total
for parallel operations".
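
A minimal sketch of how those two limits might interact is below. The parameter and function names are hypothetical; no such settings exist in the patch under discussion.

```c
#include <assert.h>

static int
sketch_min_int(int a, int b)
{
    return a < b ? a : b;
}

/*
 * Hypothetical helper: a query gets at most max_per_query workers, and
 * never more than the slots still free under the system-wide max_total.
 */
int
sketch_workers_allowed(int wanted, int max_per_query, int max_total,
                       int in_use)
{
    assert(wanted >= 0 && max_per_query >= 0 &&
           max_total >= 0 && in_use >= 0);

    int free_slots = (max_total > in_use) ? max_total - in_use : 0;

    return sketch_min_int(wanted, sketch_min_int(max_per_query, free_slots));
}
```

For instance, with a per-query limit of 4 and a total limit of 16, a query asking for 8 workers gets 4 on an idle system and 2 when 14 workers are already busy.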

Now, for debugging purposes, I could see such a parameter being
available but it should default to 'off/never-fail'.

not sure what it really would be useful for - if I execute a query I
would truly expect it to get answered - if it can be made faster if
done in parallel that's nice but why would I want it to fail?

Stefan

#57Stephen Frost
sfrost@snowman.net
In reply to: Stefan Kaltenbrunner (#56)
Re: Parallel Seq Scan

* Stefan Kaltenbrunner (stefan@kaltenbrunner.cc) wrote:

On 01/09/2015 08:01 PM, Stephen Frost wrote:

Now, for debugging purposes, I could see such a parameter being
available but it should default to 'off/never-fail'.

not sure what it really would be useful for - if I execute a query I
would truely expect it to get answered - if it can be made faster if
done in parallel thats nice but why would I want it to fail?

I was thinking for debugging only, though I'm not really sure why you'd
need it if you get a NOTICE when you don't end up with all the workers
you expect.

Thanks,

Stephen

#58Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Stephen Frost (#57)
Re: Parallel Seq Scan

On 1/9/15, 3:34 PM, Stephen Frost wrote:

* Stefan Kaltenbrunner (stefan@kaltenbrunner.cc) wrote:

On 01/09/2015 08:01 PM, Stephen Frost wrote:

Now, for debugging purposes, I could see such a parameter being
available but it should default to 'off/never-fail'.

not sure what it really would be useful for - if I execute a query I
would truely expect it to get answered - if it can be made faster if
done in parallel thats nice but why would I want it to fail?

I was thinking for debugging only, though I'm not really sure why you'd
need it if you get a NOTICE when you don't end up with all the workers
you expect.

Yeah, debugging is my concern as well. You're working on a query, you expect it to be using parallelism, and EXPLAIN is showing it's not. Now you're scratching your head.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

#59Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Stephen Frost (#54)
Re: Parallel Seq Scan

On 1/9/15, 11:24 AM, Stephen Frost wrote:

What I was advocating for up-thread was to consider multiple "parallel"
paths and to pick whichever ends up being the lowest overall cost. The
flip-side to that is increased planning time. Perhaps we can come up
with an efficient way of working out where the break-point is based on
the non-parallel cost and go at it from that direction instead of
building out whole paths for each increment of parallelism.

I think at some point we'll need the ability to stop planning part-way through for queries producing really small estimates. If the first estimate you get is 1000 units, does it really make sense to do something like try every possible join permutation, or attempt to parallelize?
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

#60Amit Kapila
amit.kapila16@gmail.com
In reply to: Stephen Frost (#54)
Re: Parallel Seq Scan

On Fri, Jan 9, 2015 at 10:54 PM, Stephen Frost <sfrost@snowman.net> wrote:

* Amit Kapila (amit.kapila16@gmail.com) wrote:

In our case as currently we don't have a mechanism to reuse parallel
workers, so we need to account for that cost as well. So based on that,
I am planing to add three new parameters cpu_tuple_comm_cost,
parallel_setup_cost, parallel_startup_cost
* cpu_tuple_comm_cost - Cost of CPU time to pass a tuple from worker
to master backend with default value
DEFAULT_CPU_TUPLE_COMM_COST as 0.1, this will be multiplied
with tuples expected to be selected
* parallel_setup_cost - Cost of setting up shared memory for parallelism
with default value as 100.0
* parallel_startup_cost - Cost of starting up parallel workers with
default
value as 1000.0 multiplied by number of workers decided for scan.

I will do some experiments to finalise the default values, but in general,
I feel developing cost model on above parameters is good.

The parameters sound reasonable but I'm a bit worried about the way
you're describing the implementation. Specifically this comment:

"Cost of starting up parallel workers with default value as 1000.0
multiplied by number of workers decided for scan."

That appears to imply that we'll decide on the number of workers, figure
out the cost, and then consider "parallel" as one path and
"not-parallel" as another. I'm worried that if I end up setting the max
parallel workers to 32 for my big, beefy, mostly-single-user system then
I'll actually end up not getting parallel execution because we'll always
be including the full startup cost of 32 threads. For huge queries,
it'll probably be fine, but there's a lot of room to parallelize things
at levels less than 32 which we won't even consider.

Actually, the main factors deciding whether a parallel plan is
selected will be selectivity and cpu_tuple_comm_cost;
parallel_startup_cost is mainly there to prevent cases where the user
has set parallel_seqscan_degree but the table is small enough
(let us say 10,000 tuples) that it doesn't need parallelism. If you are
worried about the default cost parameters, I think those still need
to be decided based on experiments.

What I was advocating for up-thread was to consider multiple "parallel"
paths and to pick whichever ends up being the lowest overall cost. The
flip-side to that is increased planning time.

The main idea behind providing a parameter like parallel_seqscan_degree
is that it will try to use that many workers for a single
parallel operation (intra-node parallelism), and in case we have to perform
inter-node parallelism, having such a parameter means that each
node can use that many parallel workers. For example, if we have
to parallelize a scan as well as a sort (Select * from t1 order by c1) and
parallel_degree is specified as 2, then the scan and the sort can use
2 parallel workers each.

This is somewhat similar to how degree of parallelism (DOP)
works in other databases. Refer to the case of Oracle [1] (Setting Degree of
Parallelism).

I don't deny that it would be an idea worth exploring to make the
optimizer smarter at deciding parallel plans, but it seems to me an
advanced topic which will be more valuable when we try to parallelize
joins or other similar stuff; even most papers talk about it in that
regard only. At this moment, ensuring that a parallel plan is not selected
for cases where it will perform poorly is more than enough, considering
we have lots of other work left to even make any parallel operation work.

Perhaps we can come up
with an efficient way of working out where the break-point is based on
the non-parallel cost and go at it from that direction instead of
building out whole paths for each increment of parallelism.

I'd really like to be able to set the 'max parallel' high and then have
the optimizer figure out how many workers should actually be spawned for
a given query.

Execution:
Most other databases does partition level scan for partition on
different disks by each individual parallel worker. However,
it seems amazon dynamodb [2] also works on something
similar to what I have used in patch which means on fixed
blocks. I think this kind of strategy seems better than dividing
the blocks at runtime because dividing randomly the blocks
among workers could lead to random scan for a parallel
sequential scan.

Yeah, we also need to consider the i/o side of this, which will
definitely be tricky. There are i/o systems out there which are faster
than a single CPU and ones where a single CPU can manage multiple i/o
channels. There are also cases where the i/o system handles sequential
access nearly as fast as random and cases where sequential is much
faster than random. Where we can get an idea of that distinction is
with seq_page_cost vs. random_page_cost as folks running on SSDs tend to
lower random_page_cost from the default to indicate that.

I am not clear, do you expect anything different in execution strategy
than what I have mentioned or does that sound reasonable to you?

[1]: http://docs.oracle.com/cd/A57673_01/DOC/server/doc/A48506/pqoconce.htm

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#61Amit Kapila
amit.kapila16@gmail.com
In reply to: Stefan Kaltenbrunner (#56)
Re: Parallel Seq Scan

On Sat, Jan 10, 2015 at 2:45 AM, Stefan Kaltenbrunner
<stefan@kaltenbrunner.cc> wrote:

On 01/09/2015 08:01 PM, Stephen Frost wrote:

Amit,

* Amit Kapila (amit.kapila16@gmail.com) wrote:

On Fri, Jan 9, 2015 at 1:02 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:

I agree, but we should try and warn the user if they set
parallel_seqscan_degree close to max_worker_processes, or at least give
some indication of what's going on. This is something you could end up
beating your head on wondering why it's not working.

Yet another way to handle the case when enough workers are not
available is to let the user specify the desired minimum percentage of
requested parallel workers with a parameter like
PARALLEL_QUERY_MIN_PERCENT. For example, if you specify
50 for this parameter, then at least 50% of the parallel workers
requested for any parallel operation must be available in order for
the operation to succeed; otherwise it will give an error. If the value is
set to null, then all parallel operations will proceed as long as at least
two parallel workers are available for processing.
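
As a rough illustration of the behaviour described above, here is a minimal Python sketch. The function name and its arguments are hypothetical and only model the rule as stated in this message; nothing here is taken from the patch or from any other database's implementation.

```python
def workers_to_use(requested, available, min_percent=None):
    """Return how many workers the operation runs with, or raise an error.

    min_percent=None models the "null" setting described above: proceed
    as long as at least two parallel workers are available.
    """
    granted = min(requested, available)
    if min_percent is None:
        if granted < 2:
            raise RuntimeError("fewer than two parallel workers available")
        return granted
    # Smallest whole number of workers that meets the percentage (ceiling).
    needed = -((-requested * min_percent) // 100)
    if granted < needed:
        raise RuntimeError(
            "only %d of %d requested workers available, need at least %d"
            % (granted, requested, needed))
    return granted
```

With 16 workers requested and a setting of 50, the operation proceeds with 8 available workers and fails with 7.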

Now, for debugging purposes, I could see such a parameter being
available but it should default to 'off/never-fail'.

Not sure what it really would be useful for - if I execute a query I
would truly expect it to get answered - if it can be made faster if
done in parallel that's nice, but why would I want it to fail?

One use case where I could imagine it being useful is when the
query is going to take many hours if run sequentially and could
be finished in minutes if run with 16 parallel workers. Now let us
say that during execution fewer than 30% of the parallel workers are
available; that might not be acceptable to the user, who would
rather wait for some time and run the query again. If he wants
to run the query even if only 2 workers are available, he can choose not
to set such a parameter.

Having said that, I also feel this doesn't seem to be an important enough
case to introduce a new parameter and such a behaviour. I mentioned it
because I came across how some other databases handle
such a situation. Let's forget this suggestion if we can't imagine any
use for such a parameter.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#62Stephen Frost
sfrost@snowman.net
In reply to: Amit Kapila (#60)
Re: Parallel Seq Scan

* Amit Kapila (amit.kapila16@gmail.com) wrote:

At this moment if we can ensure that parallel plan should not be selected
for cases where it will perform poorly is more than enough considering
we have lots of other work left to even make any parallel operation work.

The problem with this approach is that it doesn't consider any options
between 'serial' and 'parallelize by factor X'. If the startup cost is
1000 and the factor is 32, then a seqscan which costs 31000 won't ever
be parallelized, even though a factor of 8 would have parallelized it.
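
The arithmetic behind this example can be sketched in a few lines of Python. This assumes the simple cost model discussed in the thread (a per-worker startup cost plus the serial cost divided by the number of workers); the function names are hypothetical and this is not the patch's costing code.

```python
def parallel_cost(serial_cost, workers, startup_cost=1000.0):
    """Assumed model: pay startup per worker, split the scan cost evenly."""
    return startup_cost * workers + serial_cost / workers


def best_degree(serial_cost, max_workers, startup_cost=1000.0):
    """Try every degree up to max_workers; 0 means stay serial."""
    best_n, best_cost = 0, serial_cost
    for n in range(1, max_workers + 1):
        cost = parallel_cost(serial_cost, n, startup_cost)
        if cost < best_cost:
            best_n, best_cost = n, cost
    return best_n, best_cost


serial = 31000.0
cost_at_32 = parallel_cost(serial, 32)   # 32000 + 968.75, worse than serial
cost_at_8 = parallel_cost(serial, 8)     # 8000 + 3875, much better than serial
chosen, chosen_cost = best_degree(serial, 32)
```

Under this model a fixed factor of 32 loses to the serial plan, a factor of 8 wins comfortably, and searching the degrees picks something smaller still. Looping over every degree is only for illustration; a closed-form break-point would avoid building a path per degree.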

You could forget about the per-process startup cost entirely, in fact,
and simply say "only parallelize if it's more than X".

Again, I don't like the idea of designing this with the assumption that
the user dictates the right level of parallelization for each and every
query. I'd love to go out and tell users "set the factor to the number
of CPUs you have and we'll just use what makes sense."

The same goes for max number of backends. If we set the parallel level
to the number of CPUs and set the max backends to the same, then we end
up with only one parallel query running at a time, ever. That's
terrible. Now, we could set the parallel level lower or set the max
backends higher, but either way we're going to end up either using less
than we could or over-subscribing, neither of which is good.

I agree that this makes it a bit different from work_mem, but in this
case there's an overall max in the form of the maximum number of
background workers. If we had something similar for work_mem, then we
could set that higher and still trust the system to only use the amount
of memory necessary (eg: a hashjoin doesn't use all available work_mem
and neither does a sort, unless the set is larger than available
memory).

Execution:
Most other databases do partition-level scans, with each individual
parallel worker scanning a partition on a different disk. However,
it seems Amazon DynamoDB [2] also works on something
similar to what I have used in the patch, which means on fixed
blocks. I think this kind of strategy seems better than dividing
the blocks at runtime because dividing the blocks randomly
among workers could lead to random scans for a parallel
sequential scan.

Yeah, we also need to consider the i/o side of this, which will
definitely be tricky. There are i/o systems out there which are faster
than a single CPU and ones where a single CPU can manage multiple i/o
channels. There are also cases where the i/o system handles sequential
access nearly as fast as random and cases where sequential is much
faster than random. Where we can get an idea of that distinction is
with seq_page_cost vs. random_page_cost as folks running on SSDs tend to
lower random_page_cost from the default to indicate that.

I am not clear, do you expect anything different in execution strategy
than what I have mentioned or does that sound reasonable to you?

What I'd like is a way to figure out the right amount of CPU for each
tablespace (0.25, 1, 2, 4, etc) and then use that many. Using a single
CPU for each tablespace is likely to starve the CPU or starve the I/O
system and I'm not sure if there's a way to address that.

Note that I intentionally said tablespace there because that's how users
can tell us what the different i/o channels are. I realize this ends up
going beyond the current scope, but the parallel seqscan at the per
relation level will only ever be using one i/o channel. It'd be neat if
we could work out how fast that i/o channel is vs. the CPUs and
determine how many CPUs are necessary to keep up with the i/o channel
and then use more-or-less exactly that many for the scan.
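
One way to picture this sizing idea is the Python sketch below. The throughput figures are made-up inputs, the function names are hypothetical, and it assumes that both the i/o channel and a single CPU can be described by one steady MB/s number, which real systems will not strictly satisfy.

```python
import math


def cpus_for_channel(io_mb_per_sec, cpu_mb_per_sec):
    """Fractional number of CPUs needed to keep up with one i/o channel."""
    return io_mb_per_sec / cpu_mb_per_sec


def workers_for_tablespace(io_mb_per_sec, cpu_mb_per_sec, max_workers):
    """Whole workers to launch: at least one, never above the configured cap."""
    needed = math.ceil(cpus_for_channel(io_mb_per_sec, cpu_mb_per_sec))
    return max(1, min(needed, max_workers))
```

For example, a channel delivering 50 MB/s against a CPU that can process 200 MB/s needs 0.25 of a CPU, while a 400 MB/s channel against a 100 MB/s CPU needs 4.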

I agree that some of this can come later but I worry that starting out
with a design that expects to always be told exactly how many CPUs to
use when running a parallel query will be difficult to move away from
later.

Thanks,

Stephen

#63Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#48)
Re: Parallel Seq Scan

On Thu, Jan 8, 2015 at 6:42 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Are we sure that in such cases we will consume work_mem during
execution? In cases of parallel_workers we are sure to an extent
that if we reserve the workers then we will use it during execution.
Nonetheless, I have proceeded and integrated the parallel_seq scan
patch with v0.3 of parallel_mode patch posted by you at below link:
/messages/by-id/CA+TgmoYmp_=XcJEhvJZt9P8drBgW-pDpjHxBhZA79+M4o-CZQA@mail.gmail.com

That depends on the costing model. It makes no sense to do a parallel
sequential scan on a small relation, because the user backend can scan
the whole thing itself faster than the workers can start up. I
suspect it may also be true that the useful amount of parallelism
increases the larger the relation gets (but maybe not).

2. To enable two types of shared memory queues (error queue and
tuple queue), we need to ensure that we switch to the appropriate queue
during communication of various messages from parallel worker
to master backend. There are two ways to do it:
a. Save the information about the error queue during startup of the parallel
worker (ParallelMain()) and then during error, set the same (switch
to error queue in errstart() and switch back to tuple queue in
errfinish() and errstart() in case errstart() doesn't need to
propagate error).
b. Do something similar to (a) for the tuple queue in printtup or other
places, if any, for non-error messages.
I think approach (a) is slightly better compared to approach (b) as
we need to switch many times for the tuple queue (for each tuple) and
there could be multiple places where we need to do the same. For now,
I have used approach (a) in the patch, which needs some more work if we
agree on the same.

I don't think you should be "switching" queues. The tuples should be
sent to the tuple queue, and errors and notices to the error queue.

3. As per the current implementation of Parallel_seqscan, it needs to use
some information from parallel.c which was not exposed, so I have
exposed the same by moving it to parallel.h. The information that is required
is as follows:
ParallelWorkerNumber, FixedParallelState and shm keys -
This is used to decide the blocks that need to be scanned.
We might change the way parallel scan/work distribution is done
in future, but I don't see any harm in exposing this information.

Hmm. I can see why ParallelWorkerNumber might need to be exposed, but
the other stuff seems like it shouldn't be.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#64Robert Haas
robertmhaas@gmail.com
In reply to: Stephen Frost (#51)
Re: Parallel Seq Scan

On Thu, Jan 8, 2015 at 2:46 PM, Stephen Frost <sfrost@snowman.net> wrote:

Yeah, if we come up with a plan for X workers and end up not being able
to spawn that many then I could see that being worth a warning or notice
or something. Not sure what EXPLAIN has to do anything with it..

That seems mighty odd to me. If there are 8 background worker
processes available, and you allow each session to use at most 4, then
when there are >2 sessions trying to do parallelism at the same time,
they might not all get their workers. Emitting a notice for that
seems like it would be awfully chatty.
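
The scenario above can be modelled with a few lines of Python. WorkerPool is a hypothetical stand-in for the shared pool of background worker slots, not a PostgreSQL structure.

```python
class WorkerPool:
    """Hypothetical pool of background worker slots shared by all sessions."""

    def __init__(self, total):
        self.free = total

    def acquire(self, wanted):
        """Grant as many workers as are free, up to the number wanted."""
        granted = min(wanted, self.free)
        self.free -= granted
        return granted

    def release(self, count):
        self.free += count


pool = WorkerPool(8)
# Three sessions each ask for the per-session maximum of 4 workers.
grants = [pool.acquire(4) for _ in range(3)]
```

Here the third session gets no workers at all, which is routine under this configuration rather than exceptional.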

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#65Robert Haas
robertmhaas@gmail.com
In reply to: Stephen Frost (#54)
Re: Parallel Seq Scan

On Fri, Jan 9, 2015 at 12:24 PM, Stephen Frost <sfrost@snowman.net> wrote:

The parameters sound reasonable but I'm a bit worried about the way
you're describing the implementation. Specifically this comment:

"Cost of starting up parallel workers with default value as 1000.0
multiplied by number of workers decided for scan."

That appears to imply that we'll decide on the number of workers, figure
out the cost, and then consider "parallel" as one path and
"not-parallel" as another. [...]
I'd really like to be able to set the 'max parallel' high and then have
the optimizer figure out how many workers should actually be spawned for
a given query.

+1.

Yeah, we also need to consider the i/o side of this, which will
definitely be tricky. There are i/o systems out there which are faster
than a single CPU and ones where a single CPU can manage multiple i/o
channels. There are also cases where the i/o system handles sequential
access nearly as fast as random and cases where sequential is much
faster than random. Where we can get an idea of that distinction is
with seq_page_cost vs. random_page_cost as folks running on SSDs tend to
lower random_page_cost from the default to indicate that.

On my MacOS X system, I've already seen cases where my parallel_count
module runs incredibly slowly some of the time. I believe that this
is because having multiple workers reading the relation block-by-block
at the same time causes the OS to fail to realize that it needs to do
aggressive readahead. I suspect we're going to need to account for
this somehow.

Yeah, I agree that's more typical. Robert's point that the master
backend should participate is interesting but, as I recall, it was based
on the idea that the master could finish faster than the worker- but if
that's the case then we've planned it out wrong from the beginning.

So, if the workers have been started but aren't keeping up, the master
should do nothing until they produce tuples rather than participating?
That doesn't seem right.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#66Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#63)
Re: Parallel Seq Scan

On Sun, Jan 11, 2015 at 9:09 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jan 8, 2015 at 6:42 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

2. To enable two types of shared memory queues (error queue and
tuple queue), we need to ensure that we switch to the appropriate queue
during communication of various messages from parallel worker
to master backend. There are two ways to do it:
a. Save the information about the error queue during startup of the parallel
worker (ParallelMain()) and then during error, set the same (switch
to error queue in errstart() and switch back to tuple queue in
errfinish() and errstart() in case errstart() doesn't need to
propagate error).
b. Do something similar to (a) for the tuple queue in printtup or other
places, if any, for non-error messages.
I think approach (a) is slightly better compared to approach (b) as
we need to switch many times for the tuple queue (for each tuple) and
there could be multiple places where we need to do the same. For now,
I have used approach (a) in the patch, which needs some more work if we
agree on the same.

I don't think you should be "switching" queues. The tuples should be
sent to the tuple queue, and errors and notices to the error queue.

To achieve what you said (The tuples should be sent to the tuple
queue, and errors and notices to the error queue.), we need to
switch the queues.
The difficulty here is that once we set the queue (using
pq_redirect_to_shm_mq()) through which the communication has to
happen, it will use the same unless we change again the queue
using pq_redirect_to_shm_mq(). For example, assume we have
initially set error queue (using pq_redirect_to_shm_mq()) then to
send tuples, we need to call pq_redirect_to_shm_mq() to
set the tuple queue as the queue that needs to be used for communication
and again if error happens then we need to do the same for error
queue.
Do you have any other idea to achieve the same?

3. As per the current implementation of Parallel_seqscan, it needs to use
some information from parallel.c which was not exposed, so I have
exposed the same by moving it to parallel.h. The information that is required
is as follows:
ParallelWorkerNumber, FixedParallelState and shm keys -
This is used to decide the blocks that need to be scanned.
We might change the way parallel scan/work distribution is done
in future, but I don't see any harm in exposing this information.

Hmm. I can see why ParallelWorkerNumber might need to be exposed, but
the other stuff seems like it shouldn't be.

It depends upon how we decide to achieve the scan of blocks
by backend worker. In current form, the patch needs to know
if myworker is the last worker (and I have used workers_expected
to achieve the same, I know that is not the right thing but I need
something similar if we decide to do in the way I have proposed),
so that it can scan all the remaining blocks.
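
To make the dependency concrete, here is a minimal Python sketch of the fixed division described above. The function and argument names are hypothetical; it only illustrates why each worker needs both its own number and the total worker count under this scheme.

```python
def block_range(nblocks, nworkers, worker_number):
    """Return the half-open range [start, end) of blocks for one worker.

    Blocks are split evenly; the last worker also takes whatever is left
    over when nblocks is not a multiple of nworkers.
    """
    per_worker = nblocks // nworkers
    start = worker_number * per_worker
    if worker_number == nworkers - 1:
        end = nblocks              # last worker scans all remaining blocks
    else:
        end = start + per_worker
    return start, end


ranges = [block_range(10, 3, w) for w in range(3)]
```

With 10 blocks and 3 workers the last worker gets 4 blocks and the others 3 each. A worker cannot tell that it is the last one without knowing how many workers there are in total.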

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#67Stephen Frost
sfrost@snowman.net
In reply to: Robert Haas (#64)
Re: Parallel Seq Scan

* Robert Haas (robertmhaas@gmail.com) wrote:

On Thu, Jan 8, 2015 at 2:46 PM, Stephen Frost <sfrost@snowman.net> wrote:

Yeah, if we come up with a plan for X workers and end up not being able
to spawn that many then I could see that being worth a warning or notice
or something. Not sure what EXPLAIN has to do anything with it..

That seems mighty odd to me. If there are 8 background worker
processes available, and you allow each session to use at most 4, then
when there are >2 sessions trying to do parallelism at the same time,
they might not all get their workers. Emitting a notice for that
seems like it would be awfully chatty.

Yeah, agreed, it could get quite noisy. Did you have another thought
for how to address the concern raised? Specifically, that you might not
get as many workers as you thought you would?

Thanks,

Stephen

#68Stephen Frost
sfrost@snowman.net
In reply to: Robert Haas (#65)
Re: Parallel Seq Scan

* Robert Haas (robertmhaas@gmail.com) wrote:

On Fri, Jan 9, 2015 at 12:24 PM, Stephen Frost <sfrost@snowman.net> wrote:

Yeah, we also need to consider the i/o side of this, which will
definitely be tricky. There are i/o systems out there which are faster
than a single CPU and ones where a single CPU can manage multiple i/o
channels. There are also cases where the i/o system handles sequential
access nearly as fast as random and cases where sequential is much
faster than random. Where we can get an idea of that distinction is
with seq_page_cost vs. random_page_cost as folks running on SSDs tend to
lower random_page_cost from the default to indicate that.

On my MacOS X system, I've already seen cases where my parallel_count
module runs incredibly slowly some of the time. I believe that this
is because having multiple workers reading the relation block-by-block
at the same time causes the OS to fail to realize that it needs to do
aggressive readahead. I suspect we're going to need to account for
this somehow.

So, for my 2c, I've long expected us to parallelize at the relation-file
level for these kinds of operations. This goes back to my other
thoughts on how we should be thinking about parallelizing inbound data
for bulk data loads but it seems appropriate to consider it here also.
One of the issues there is that 1G still feels like an awful lot for a
minimum work size for each worker and it would mean we don't parallelize
for relations less than that size.

On a random VM on my personal server, an uncached 1G read takes over
10s. Cached it's less than half that, of course. This is all spinning
rust (and only 7200 RPM at that) and there's a lot of other stuff going
on but that still seems like too much of a chunk to give to one worker
unless the overall data set to go through is really large.

There's other issues in there too, of course, if we're dumping data in
like this then we have to either deal with jagged relation files somehow
or pad the file out to 1G, and that doesn't even get into the issues
around how we'd have to redesign the interfaces for relation access and
how this thinking is an utter violation of the modularity we currently
have there.

Yeah, I agree that's more typical. Robert's point that the master
backend should participate is interesting but, as I recall, it was based
on the idea that the master could finish faster than the worker- but if
that's the case then we've planned it out wrong from the beginning.

So, if the workers have been started but aren't keeping up, the master
should do nothing until they produce tuples rather than participating?
That doesn't seem right.

Having the master jump in and start working could screw things up also
though. Perhaps we need the master to start working as a fail-safe but
not plan on having things go that way? Having more processes trying to
do X doesn't always result in things getting better and the master needs
to keep up with all the tuples being thrown at it from the workers.

Thanks,

Stephen

#69Stephen Frost
sfrost@snowman.net
In reply to: Amit Kapila (#66)
Re: Parallel Seq Scan

Amit,

* Amit Kapila (amit.kapila16@gmail.com) wrote:

On Sun, Jan 11, 2015 at 9:09 AM, Robert Haas <robertmhaas@gmail.com> wrote:

I don't think you should be "switching" queues. The tuples should be
sent to the tuple queue, and errors and notices to the error queue.

Agreed.

To achieve what you said (The tuples should be sent to the tuple
queue, and errors and notices to the error queue.), we need to
switch the queues.
The difficulty here is that once we set the queue (using
pq_redirect_to_shm_mq()) through which the communication has to
happen, it will use the same unless we change again the queue
using pq_redirect_to_shm_mq(). For example, assume we have
initially set error queue (using pq_redirect_to_shm_mq()) then to
send tuples, we need to call pq_redirect_to_shm_mq() to
set the tuple queue as the queue that needs to be used for communication
and again if error happens then we need to do the same for error
queue.
Do you have any other idea to achieve the same?

I think what Robert's getting at here is that pq_redirect_to_shm_mq()
might be fine for the normal data heading back, but we need something
separate for errors and notices. Switching everything back and forth
between the normal and error queues definitely doesn't sound right to
me- they need to be independent.

In other words, you need to be able to register a "normal data" queue
and then you need to also register a "error/notice" queue and have
errors and notices sent there directly. Going off of what I recall,
can't this be done by having the callbacks which are registered for
sending data back look at what they're being asked to send and then
decide which queue it's appropriate for out of the set which have been
registered so far?

Thanks,

Stephen

#70Stefan Kaltenbrunner
stefan@kaltenbrunner.cc
In reply to: Stephen Frost (#67)
Re: Parallel Seq Scan

On 01/11/2015 11:27 AM, Stephen Frost wrote:

* Robert Haas (robertmhaas@gmail.com) wrote:

On Thu, Jan 8, 2015 at 2:46 PM, Stephen Frost <sfrost@snowman.net> wrote:

Yeah, if we come up with a plan for X workers and end up not being able
to spawn that many then I could see that being worth a warning or notice
or something. Not sure what EXPLAIN has to do anything with it..

That seems mighty odd to me. If there are 8 background worker
processes available, and you allow each session to use at most 4, then
when there are >2 sessions trying to do parallelism at the same time,
they might not all get their workers. Emitting a notice for that
seems like it would be awfully chatty.

Yeah, agreed, it could get quite noisy. Did you have another thought
for how to address the concern raised? Specifically, that you might not
get as many workers as you thought you would?

Wild idea: What about dealing with it as some sort of statistic - ie
track some global counts in the stats collector or on a per-query basis
in pg_stat_activity and/or through pg_stat_statements?

Not sure why it is that important to get it on a per-query basis, imho it
is simply a configuration limit we have set (similar to work_mem or
when switching to geqo) - we don't report "per query" through
notice/warning there either (though the effect is kind of visible in explain).

Stefan

#71Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#66)
Re: Parallel Seq Scan

On Sat, Jan 10, 2015 at 11:14 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I don't think you should be "switching" queues. The tuples should be
sent to the tuple queue, and errors and notices to the error queue.

To achieve what you said (The tuples should be sent to the tuple
queue, and errors and notices to the error queue.), we need to
switch the queues.
The difficulty here is that once we set the queue (using
pq_redirect_to_shm_mq()) through which the communication has to
happen, it will use the same unless we change again the queue
using pq_redirect_to_shm_mq(). For example, assume we have
initially set error queue (using pq_redirect_to_shm_mq()) then to
send tuples, we need to call pq_redirect_to_shm_mq() to
set the tuple queue as the queue that needs to be used for communication
and again if error happens then we need to do the same for error
queue.
Do you have any other idea to achieve the same?

Yeah, you need two separate global variables pointing to shm_mq
objects, one of which gets used by pqmq.c for errors and the other of
which gets used by printtup.c for tuples.
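
The shape of this suggestion can be modelled in Python as below. This is only an analogy: queue.Queue stands in for a shm_mq, and the two functions stand in for the tuple-sending and error-reporting code paths. None of the names are PostgreSQL APIs.

```python
import queue

# Two independent module-level queues, set once at worker startup.
tuple_queue = queue.Queue()   # stands in for the shm_mq used for tuples
error_queue = queue.Queue()   # stands in for the shm_mq used for errors/notices


def send_tuple(row):
    """Tuple path: always writes to the tuple queue."""
    tuple_queue.put(row)


def report(level, message):
    """Error/notice path: always writes to the error queue."""
    error_queue.put((level, message))


send_tuple((1, "a"))
report("NOTICE", "something worth mentioning")
send_tuple((2, "b"))
```

Because each path holds its own reference, nothing needs to be redirected back and forth when a notice is raised between two tuples.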

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#72Robert Haas
robertmhaas@gmail.com
In reply to: Stephen Frost (#67)
Re: Parallel Seq Scan

On Sun, Jan 11, 2015 at 5:27 AM, Stephen Frost <sfrost@snowman.net> wrote:

* Robert Haas (robertmhaas@gmail.com) wrote:

On Thu, Jan 8, 2015 at 2:46 PM, Stephen Frost <sfrost@snowman.net> wrote:

Yeah, if we come up with a plan for X workers and end up not being able
to spawn that many then I could see that being worth a warning or notice
or something. Not sure what EXPLAIN has to do anything with it..

That seems mighty odd to me. If there are 8 background worker
processes available, and you allow each session to use at most 4, then
when there are >2 sessions trying to do parallelism at the same time,
they might not all get their workers. Emitting a notice for that
seems like it would be awfully chatty.

Yeah, agreed, it could get quite noisy. Did you have another thought
for how to address the concern raised? Specifically, that you might not
get as many workers as you thought you would?

I'm not sure why that's a condition in need of special reporting.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#73Robert Haas
robertmhaas@gmail.com
In reply to: Stephen Frost (#68)
Re: Parallel Seq Scan

On Sun, Jan 11, 2015 at 6:01 AM, Stephen Frost <sfrost@snowman.net> wrote:

So, for my 2c, I've long expected us to parallelize at the relation-file
level for these kinds of operations. This goes back to my other
thoughts on how we should be thinking about parallelizing inbound data
for bulk data loads but it seems appropriate to consider it here also.
One of the issues there is that 1G still feels like an awful lot for a
minimum work size for each worker and it would mean we don't parallelize
for relations less than that size.

Yes, I think that's a killer objection.

[ .. ] and
how this thinking is an utter violation of the modularity we currently
have there.

As is that.

My thinking is more along the lines that we might need to issue
explicit prefetch requests when doing a parallel sequential scan, to
make up for any failure of the OS to do that for us.
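
A minimal sketch of what explicit prefetching could look like, written in Python on top of os.posix_fadvise (available on Linux and some other Unix platforms, so the sketch checks for it). The function name, the prefetch distance, and the demo file are assumptions for illustration, not part of any patch.

```python
import os
import tempfile

BLOCK_SIZE = 8192  # PostgreSQL's default block size


def read_blocks_with_prefetch(path, block_numbers, distance=2):
    """Read the given blocks, hinting the OS about blocks needed soon."""
    blocks = list(block_numbers)
    can_hint = hasattr(os, "posix_fadvise")  # not present on every platform
    data = []
    fd = os.open(path, os.O_RDONLY)
    try:
        for i, blkno in enumerate(blocks):
            if can_hint and i + distance < len(blocks):
                ahead = blocks[i + distance]
                try:
                    os.posix_fadvise(fd, ahead * BLOCK_SIZE, BLOCK_SIZE,
                                     os.POSIX_FADV_WILLNEED)
                except OSError:
                    pass  # the hint is advisory; reads work without it
            data.append(os.pread(fd, BLOCK_SIZE, blkno * BLOCK_SIZE))
    finally:
        os.close(fd)
    return data


# Demo on a throwaway 6-block file: this "worker" reads blocks 0, 2 and 4.
with tempfile.NamedTemporaryFile(delete=False) as f:
    for blkno in range(6):
        f.write(bytes([blkno]) * BLOCK_SIZE)
    demo_path = f.name
demo_blocks = read_blocks_with_prefetch(demo_path, [0, 2, 4])
os.unlink(demo_path)
```

The point is that each process announces the blocks it will want shortly instead of relying on the kernel to detect a sequential pattern that interleaved readers may obscure.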

So, if the workers have been started but aren't keeping up, the master
should do nothing until they produce tuples rather than participating?
That doesn't seem right.

Having the master jump in and start working could screw things up also
though.

I don't think there's any reason why that should screw things up.
There's no reason why the master's participation should look any
different from one more worker. Look at my parallel_count code on the
other thread to see what I mean: the master and all the workers are
running the same code, and if fewer workers show up than expected, or
run unduly slowly, it's easily tolerated.
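
The idea of every participant running the same loop can be sketched in Python with threads and a shared block counter. This is a model of the behaviour described here, not the parallel_count module itself; the names and the use of threads are assumptions made for illustration.

```python
import threading


def parallel_count(nblocks, nworkers, rows_in_block):
    """Count rows using nworkers helper threads plus the calling thread."""
    next_block = 0
    lock = threading.Lock()
    totals = []

    def participant():
        nonlocal next_block
        count = 0
        while True:
            with lock:                    # claim the next unscanned block
                if next_block >= nblocks:
                    break
                blkno = next_block
                next_block += 1
            count += rows_in_block(blkno)
        with lock:
            totals.append(count)

    workers = [threading.Thread(target=participant) for _ in range(nworkers)]
    for w in workers:
        w.start()
    participant()                         # the master runs the same loop
    for w in workers:
        w.join()
    return sum(totals)
```

Because blocks are claimed one at a time from a shared counter, the result is the same whether four workers show up or none; the master simply ends up claiming more of the blocks itself.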

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#74Robert Haas
robertmhaas@gmail.com
In reply to: Stephen Frost (#69)
Re: Parallel Seq Scan

On Sun, Jan 11, 2015 at 6:09 AM, Stephen Frost <sfrost@snowman.net> wrote:

I think what Robert's getting at here is that pq_redirect_to_shm_mq()
might be fine for the normal data heading back, but we need something
separate for errors and notices. Switching everything back and forth
between the normal and error queues definitely doesn't sound right to
me- they need to be independent.

You've got that backwards. pq_redirect_to_shm_mq() handles errors and
notices, but we need something separate for the tuple stream.

In other words, you need to be able to register a "normal data" queue
and then you need to also register a "error/notice" queue and have
errors and notices sent there directly. Going off of what I recall,
can't this be done by having the callbacks which are registered for
sending data back look at what they're being asked to send and then
decide which queue it's appropriate for out of the set which have been
registered so far?

It's pretty simple, really. The functions that need to use the tuple
queue are in printtup.c; those, and only those, need to be modified to
write to the other queue.

Or, possibly, we should pass the tuples around in their native format
instead of translating them into binary form and then reconstituting
them on the other end.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#75Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#73)
Re: Parallel Seq Scan

On Mon, Jan 12, 2015 at 3:30 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Sun, Jan 11, 2015 at 6:01 AM, Stephen Frost <sfrost@snowman.net> wrote:

So, if the workers have been started but aren't keeping up, the master
should do nothing until they produce tuples rather than participating?
That doesn't seem right.

Having the master jump in and start working could screw things up also
though.

I don't think there's any reason why that should screw things up.

Consider the case of inter-node parallelism, in such cases master
backend will have 4 responsibilities (scan relation, receive tuples
from other workers, send tuples to other workers, send tuples to
frontend) if we make it act like a worker.

For example
Select * from t1 Order By c1;

Now here it first needs to perform a parallel sequential scan and then
feed the tuples from the scan to another parallel worker which is doing the sort.

It seems to me that the master backend could be starved of resources while
trying to do all of this work in an optimized way. As an example, one case
could be that the master backend reads one page into memory (shared buffers),
then reads one tuple and applies the qualification, and in the meantime the
queues on which it needs to receive fill up and it becomes busy
fetching tuples from those queues. Now the page which it has read from
disk will be pinned in shared buffers for a longer time, and even if we
release such a page, it has to be read again. OTOH, if the master backend
chooses to read all the tuples from a page before checking the status
of the queues, it can lead to a lot of data piling up in the queues.
I think there can be more such scenarios where getting many things
done by the master backend can turn out to have a negative impact.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#76Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#72)
Re: Parallel Seq Scan

On Mon, Jan 12, 2015 at 3:27 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Sun, Jan 11, 2015 at 5:27 AM, Stephen Frost <sfrost@snowman.net> wrote:

* Robert Haas (robertmhaas@gmail.com) wrote:

On Thu, Jan 8, 2015 at 2:46 PM, Stephen Frost <sfrost@snowman.net> wrote:

Yeah, if we come up with a plan for X workers and end up not being able
to spawn that many then I could see that being worth a warning or notice
or something. Not sure what EXPLAIN has to do anything with it..

That seems mighty odd to me. If there are 8 background worker
processes available, and you allow each session to use at most 4, then
when there are >2 sessions trying to do parallelism at the same time,
they might not all get their workers. Emitting a notice for that
seems like it would be awfully chatty.

Yeah, agreed, it could get quite noisy. Did you have another thought
for how to address the concern raised? Specifically, that you might not
get as many workers as you thought you would?

I'm not sure why that's a condition in need of special reporting.

So what should happen if no workers are available?
I don't think we can change the plan to a non-parallel one at that
stage.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#77Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Robert Haas (#72)
Re: Parallel Seq Scan

On 1/11/15 3:57 PM, Robert Haas wrote:

On Sun, Jan 11, 2015 at 5:27 AM, Stephen Frost <sfrost@snowman.net> wrote:

* Robert Haas (robertmhaas@gmail.com) wrote:

On Thu, Jan 8, 2015 at 2:46 PM, Stephen Frost <sfrost@snowman.net> wrote:

Yeah, if we come up with a plan for X workers and end up not being able
to spawn that many then I could see that being worth a warning or notice
or something. Not sure what EXPLAIN has to do anything with it..

That seems mighty odd to me. If there are 8 background worker
processes available, and you allow each session to use at most 4, then
when there are >2 sessions trying to do parallelism at the same time,
they might not all get their workers. Emitting a notice for that
seems like it would be awfully chatty.

Yeah, agreed, it could get quite noisy. Did you have another thought
for how to address the concern raised? Specifically, that you might not
get as many workers as you thought you would?

I'm not sure why that's a condition in need of special reporting.

The case raised before (that I think is valid) is: what if you have a query that is massively parallel? You expect it to get 60 cores on the server and take 10 minutes. Instead it gets 10 and takes an hour (or worse, 1 and takes 10 hours).

Maybe it's not worth dealing with that in the first version, but I expect it will come up very quickly. We had better make sure we're not painting ourselves into a corner.
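
To put numbers on the example: if runtime is assumed to scale inversely with the number of workers actually obtained (no startup or coordination overhead, which a real query will not achieve), the 60-core, 10-minute query amounts to 600 core-minutes of work. A minimal sketch under that assumption; the function name is hypothetical:

```c
#include <assert.h>

/*
 * Estimated wall-clock minutes for a job that needs `core_minutes` of
 * total CPU work when it actually gets `workers` workers.
 * Assumption: perfectly linear scaling.
 */
double
estimated_minutes(double core_minutes, int workers)
{
    assert(workers > 0);
    return core_minutes / workers;
}
```

With 600 core-minutes of work this gives 10 minutes for 60 workers, 60 minutes for 10 workers, and 600 minutes (10 hours) for a single worker, matching the figures above.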
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#78John Gorman
johngorman2@gmail.com
In reply to: Robert Haas (#73)
Re: Parallel Seq Scan

On Sun, Jan 11, 2015 at 6:00 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Sun, Jan 11, 2015 at 6:01 AM, Stephen Frost <sfrost@snowman.net> wrote:

So, for my 2c, I've long expected us to parallelize at the relation-file
level for these kinds of operations. This goes back to my other
thoughts on how we should be thinking about parallelizing inbound data
for bulk data loads but it seems appropriate to consider it here also.
One of the issues there is that 1G still feels like an awful lot for a
minimum work size for each worker and it would mean we don't parallelize
for relations less than that size.

Yes, I think that's a killer objection.

One approach that has worked well for me is to break big jobs into much
smaller bite size tasks. Each task is small enough to complete quickly.

We add the tasks to a task queue and spawn a generic worker pool which eats
through the task queue items.

This solves a lot of problems.

- Small to medium jobs can be parallelized efficiently.
- No need to split big jobs perfectly.
- We don't get into a situation where we are waiting around for a worker to
finish chugging through a huge task while the other workers sit idle.
- Worker memory footprint is tiny so we can afford many of them.
- Worker pool management is a well known problem.
- Worker spawn time disappears as a cost factor.
- The worker pool becomes a shared resource that can be managed and
reported on and becomes considerably more predictable.

#79John Gorman
johngorman2@gmail.com
In reply to: John Gorman (#78)
Re: Parallel Seq Scan

On Tue, Jan 13, 2015 at 7:25 AM, John Gorman <johngorman2@gmail.com> wrote:

On Sun, Jan 11, 2015 at 6:00 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Sun, Jan 11, 2015 at 6:01 AM, Stephen Frost <sfrost@snowman.net> wrote:

So, for my 2c, I've long expected us to parallelize at the relation-file
level for these kinds of operations. This goes back to my other
thoughts on how we should be thinking about parallelizing inbound data
for bulk data loads but it seems appropriate to consider it here also.
One of the issues there is that 1G still feels like an awful lot for a
minimum work size for each worker and it would mean we don't parallelize
for relations less than that size.

Yes, I think that's a killer objection.

One approach that has worked well for me is to break big jobs into much
smaller bite size tasks. Each task is small enough to complete quickly.

We add the tasks to a task queue and spawn a generic worker pool which
eats through the task queue items.

This solves a lot of problems.

- Small to medium jobs can be parallelized efficiently.
- No need to split big jobs perfectly.
- We don't get into a situation where we are waiting around for a worker
to finish chugging through a huge task while the other workers sit idle.
- Worker memory footprint is tiny so we can afford many of them.
- Worker pool management is a well known problem.
- Worker spawn time disappears as a cost factor.
- The worker pool becomes a shared resource that can be managed and
reported on and becomes considerably more predictable.

I forgot to mention that a running task queue can provide metrics such as
current utilization, current average throughput, current queue length and
estimated queue wait time. These can become dynamic cost factors in
deciding whether to parallelize.

#80Amit Kapila
amit.kapila16@gmail.com
In reply to: John Gorman (#78)
Re: Parallel Seq Scan

On Tue, Jan 13, 2015 at 4:55 PM, John Gorman <johngorman2@gmail.com> wrote:

On Sun, Jan 11, 2015 at 6:00 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Sun, Jan 11, 2015 at 6:01 AM, Stephen Frost <sfrost@snowman.net> wrote:

So, for my 2c, I've long expected us to parallelize at the relation-file
level for these kinds of operations. This goes back to my other
thoughts on how we should be thinking about parallelizing inbound data
for bulk data loads but it seems appropriate to consider it here also.
One of the issues there is that 1G still feels like an awful lot for a
minimum work size for each worker and it would mean we don't parallelize
for relations less than that size.

Yes, I think that's a killer objection.

One approach that has worked well for me is to break big jobs into much
smaller bite size tasks. Each task is small enough to complete quickly.

Here we have to decide what the strategy should be and how much
each worker should scan. As an example, one strategy could be that
if the table size is X MB and there are 8 workers, we divide the
work as X/8 MB for each worker (which is what I have currently used
in the patch). Another could be that each worker scans 1 block at a
time and then checks some global structure to see which block it
needs to scan next; in my view this could lead to random scans. I have
read that some other databases also divide the work based on
partitions or segments (the size of a segment is not very clear).
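
The first strategy (an equal static split, with the last worker taking whatever does not divide evenly, as described at the top of the thread) can be sketched as follows. This is an illustration only: it works in blocks rather than MB, and the type and function names are hypothetical, not taken from the patch.

```c
#include <assert.h>

/* A contiguous range of blocks assigned to one worker. */
typedef struct BlockRange
{
    unsigned start;   /* first block of the range */
    unsigned count;   /* number of blocks in the range */
} BlockRange;

/*
 * Static split: every worker gets nblocks / nworkers blocks; the last
 * worker also takes the remainder.
 */
BlockRange
static_partition(unsigned nblocks, unsigned nworkers, unsigned worker)
{
    BlockRange  r;
    unsigned    per_worker;

    assert(nworkers > 0 && worker < nworkers);
    per_worker = nblocks / nworkers;

    r.start = worker * per_worker;
    r.count = per_worker;
    if (worker == nworkers - 1)
        r.count = nblocks - r.start;   /* remaining blocks */
    return r;
}
```

For 100 blocks and 8 workers, workers 0 to 6 each get 12 blocks and worker 7 gets blocks 84 to 99 (16 blocks).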

We add the tasks to a task queue and spawn a generic worker pool which
eats through the task queue items.

This solves a lot of problems.

- Small to medium jobs can be parallelized efficiently.
- No need to split big jobs perfectly.
- We don't get into a situation where we are waiting around for a worker
to finish chugging through a huge task while the other workers sit idle.
- Worker memory footprint is tiny so we can afford many of them.
- Worker pool management is a well known problem.
- Worker spawn time disappears as a cost factor.
- The worker pool becomes a shared resource that can be managed and
reported on and becomes considerably more predictable.

Yeah, it is a good idea to maintain a shared worker pool, but it seems
to me that for the initial version it is still meaningful to make
parallel sequential scan work even if the workers are not shared.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#81Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Amit Kapila (#80)
Re: Parallel Seq Scan

On Wed, Jan 14, 2015 at 9:12 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Tue, Jan 13, 2015 at 4:55 PM, John Gorman <johngorman2@gmail.com> wrote:

On Sun, Jan 11, 2015 at 6:00 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Sun, Jan 11, 2015 at 6:01 AM, Stephen Frost <sfrost@snowman.net> wrote:

So, for my 2c, I've long expected us to parallelize at the relation-file
level for these kinds of operations. This goes back to my other
thoughts on how we should be thinking about parallelizing inbound data
for bulk data loads but it seems appropriate to consider it here also.
One of the issues there is that 1G still feels like an awful lot for a
minimum work size for each worker and it would mean we don't parallelize
for relations less than that size.

Yes, I think that's a killer objection.

One approach that has worked well for me is to break big jobs into much
smaller bite size tasks. Each task is small enough to complete quickly.

Here we have to decide what should be the strategy and how much
each worker should scan. As an example one of the strategies
could be if the table size is X MB and there are 8 workers, then
divide the work as X/8 MB for each worker (which I have currently
used in patch) and another could be each worker does scan
1 block at a time and then check some global structure to see which
next block it needs to scan, according to me this could lead to random
scan. I have read that some other databases also divide the work
based on partitions or segments (size of segment is not very clear).

A block can contain useful tuples, i.e. tuples which are visible and fulfil
the quals, and useless tuples, i.e. tuples which are dead, invisible or that
do not fulfil the quals. Depending upon the contents of these blocks, esp.
the ratio of (useful tuples)/(useless tuples), each worker may take a
different amount of time even though we divide the relation into
equal-sized runs. So, instead of dividing the relation into number of
runs = number of workers, it might be better to divide it into fixed-size
runs with size < (total number of blocks / number of workers), and let a
worker pick up a run after it finishes with the previous one. The smaller
the runs, the better the load balancing but the higher the cost of
starting each run. So, we have to strike a balance.
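
The trade-off can be illustrated with a small single-process simulation. It is only a sketch under stated assumptions: per-block processing costs are given up front, each run carries a fixed start-up cost, and every run goes to whichever worker becomes free first (lowest index on ties). The function name and the cost units are hypothetical.

```c
#include <assert.h>

#define MAX_DEMO_WORKERS 64

/*
 * Time at which the last worker finishes when the relation is cut into
 * runs of `run_size` blocks and each run is handed to the worker that
 * becomes idle first. With run_size = nblocks / nworkers this reduces
 * to the equal static split.
 */
double
dynamic_makespan(const double *block_cost, unsigned nblocks,
                 unsigned nworkers, unsigned run_size,
                 double run_start_cost)
{
    double      busy_until[MAX_DEMO_WORKERS] = {0.0};
    double      makespan = 0.0;
    unsigned    start, w;

    assert(nworkers > 0 && nworkers <= MAX_DEMO_WORKERS && run_size > 0);

    for (start = 0; start < nblocks; start += run_size)
    {
        unsigned end = (nblocks - start < run_size) ? nblocks : start + run_size;
        unsigned idle = 0;
        unsigned b;

        /* pick the worker that frees up earliest */
        for (w = 1; w < nworkers; w++)
            if (busy_until[w] < busy_until[idle])
                idle = w;

        busy_until[idle] += run_start_cost;
        for (b = start; b < end; b++)
            busy_until[idle] += block_cost[b];
    }

    for (w = 0; w < nworkers; w++)
        if (busy_until[w] > makespan)
            makespan = busy_until[w];
    return makespan;
}
```

For example, with block costs {4,4,4,4,1,1,1,1} and 2 workers, an equal split (run size 4) finishes at time 16, while run sizes 1 or 2 finish at 10 when starting a run is free. With a start cost of 0.5 per run, run size 2 (11) beats run size 1 (12), which is the balance mentioned above.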

We add the tasks to a task queue and spawn a generic worker pool which
eats through the task queue items.

This solves a lot of problems.

- Small to medium jobs can be parallelized efficiently.
- No need to split big jobs perfectly.
- We don't get into a situation where we are waiting around for a worker
to finish chugging through a huge task while the other workers sit idle.
- Worker memory footprint is tiny so we can afford many of them.
- Worker pool management is a well known problem.
- Worker spawn time disappears as a cost factor.
- The worker pool becomes a shared resource that can be managed and
reported on and becomes considerably more predictable.

Yeah, it is good idea to maintain shared worker pool, but it seems
to me that for initial version even if the workers are not shared,
then also it is meaningful to make parallel sequential scan work.


--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

#82Robert Haas
robertmhaas@gmail.com
In reply to: John Gorman (#78)
Re: Parallel Seq Scan

On Tue, Jan 13, 2015 at 6:25 AM, John Gorman <johngorman2@gmail.com> wrote:

One approach that has worked well for me is to break big jobs into much
smaller bite size tasks. Each task is small enough to complete quickly.

We add the tasks to a task queue and spawn a generic worker pool which eats
through the task queue items.

This solves a lot of problems.

- Small to medium jobs can be parallelized efficiently.
- No need to split big jobs perfectly.
- We don't get into a situation where we are waiting around for a worker to
finish chugging through a huge task while the other workers sit idle.
- Worker memory footprint is tiny so we can afford many of them.
- Worker pool management is a well known problem.
- Worker spawn time disappears as a cost factor.
- The worker pool becomes a shared resource that can be managed and reported
on and becomes considerably more predictable.

I think this is a good idea, but for now I would like to keep our
goals somewhat more modest: let's see if we can get parallel
sequential scan, and only parallel sequential scan, working and
committed. Ultimately, I think we may need something like what you're
talking about, because if you have a query with three or six or twelve
different parallelizable operations in it, you want the available CPU
resources to switch between those as their respective needs may
dictate. You certainly don't want to spawn a separate pool of workers
for each scan.

But I think getting that all working in the first version is probably
harder than what we should attempt. We have a bunch of problems to
solve here just around parallel sequential scan and the parallel mode
infrastructure: heavyweight locking, prefetching, the cost model, and
so on. Trying to add to that all of the problems that might attend on
a generic task queueing infrastructure fills me with no small amount
of fear.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#83Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Amit Kapila (#80)
Re: Parallel Seq Scan

On 1/13/15 9:42 PM, Amit Kapila wrote:

As an example one of the strategies
could be if the table size is X MB and there are 8 workers, then
divide the work as X/8 MB for each worker (which I have currently
used in patch) and another could be each worker does scan
1 block at a time and then check some global structure to see which
next block it needs to scan, according to me this could lead to random
scan. I have read that some other databases also divide the work
based on partitions or segments (size of segment is not very clear).

Long-term I think we'll want a mix between the two approaches. Simply doing something like blkno % num_workers is going to cause imbalances, but trying to do this on a per-block basis seems like too much overhead.

Also long-term, I think we need to look at a more specialized version of parallelism at the IO layer. For example, during an index scan you'd really like to get IO requests for heap blocks started in the background while the backend is focused on the mechanics of the index scan itself. While this could be done with the stuff Robert has written, I have to wonder if it'd be a lot more efficient to use fadvise or AIO. Or perhaps it would just be better to deal with an entire index page (remembering TIDs) and then hit the heap.

But I agree with Robert; there's a lot yet to be done just to get *any* kind of parallel execution working before we start thinking about how to optimize it.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


#84Amit Kapila
amit.kapila16@gmail.com
In reply to: Ashutosh Bapat (#81)
Re: Parallel Seq Scan

On Wed, Jan 14, 2015 at 9:30 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:

On Wed, Jan 14, 2015 at 9:12 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Here we have to decide what should be the strategy and how much
each worker should scan. As an example one of the strategies
could be if the table size is X MB and there are 8 workers, then
divide the work as X/8 MB for each worker (which I have currently
used in patch) and another could be each worker does scan
1 block at a time and then check some global structure to see which
next block it needs to scan, according to me this could lead to random
scan. I have read that some other databases also divide the work
based on partitions or segments (size of segment is not very clear).

A block can contain useful tuples, i.e. tuples which are visible and fulfil
the quals + useless tuples i.e. tuples which are dead, invisible or that do
not fulfil the quals. Depending upon the contents of these blocks, esp. the
ratio of (useful tuples)/(useless tuples), even though we divide the
relation into equal sized runs, each worker may take different time. So,
instead of dividing the relation into number of runs = number of workers, it
might be better to divide them into fixed sized runs with size < (total
number of blocks/ number of workers), and let a worker pick up a run after
it finishes with the previous one. The smaller the size of runs the better
load balancing but higher cost of starting with the run. So, we have to
strike a balance.

I think your suggestion is good and it somewhat falls in line
with what Robert has suggested, but instead of going block-by-block,
you seem to be suggesting doing it in chunks (where the chunk size
is not clear). The only point against this is that such a
strategy for parallel sequential scan could lead to random scans,
which can hurt the operation badly. Nonetheless, I will think more
along these lines of making the work distribution dynamic so that we
can ensure that all workers are kept busy.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#85Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#71)
Re: Parallel Seq Scan

On Mon, Jan 12, 2015 at 3:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Sat, Jan 10, 2015 at 11:14 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I don't think you should be "switching" queues. The tuples should be
sent to the tuple queue, and errors and notices to the error queue.

To achieve what you said (the tuples should be sent to the tuple
queue, and errors and notices to the error queue), we need to
switch the queues.
The difficulty here is that once we set the queue through which the
communication has to happen (using pq_redirect_to_shm_mq()), it will
keep using that queue until we change it again with
pq_redirect_to_shm_mq(). For example, assume we have initially set the
error queue (using pq_redirect_to_shm_mq()); then to send tuples, we
need to call pq_redirect_to_shm_mq() to set the tuple queue as the
queue to be used for communication, and if an error happens we need to
do the same again for the error queue.
Do you have any other idea to achieve the same?

Yeah, you need two separate global variables pointing to shm_mq
objects, one of which gets used by pqmq.c for errors and the other of
which gets used by printtup.c for tuples.

Okay, I will try to change it as suggested, without switching queues,
but this way we need to handle the 'T', 'D', and 'C' messages
separately.
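
The idea of two separate destinations, rather than one queue that is switched back and forth, can be sketched as below. This is a simplified stand-in and not the pqmq.c or printtup.c code: DemoQueue is a hypothetical placeholder for an shm_mq handle, and the single routing function is an assumption made to keep the example small (in the suggestion above, each code path would simply write to its own global). The message type bytes are the protocol's 'T' (RowDescription), 'D' (DataRow) and 'C' (CommandComplete).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical placeholder for a shared memory queue handle. */
typedef struct DemoQueue
{
    const char *name;
} DemoQueue;

/* Two independent globals: nothing is ever "switched". */
static DemoQueue *tuple_queue = NULL;   /* used for result tuples */
static DemoQueue *error_queue = NULL;   /* used for errors and notices */

void
set_worker_queues(DemoQueue *tuples, DemoQueue *errors)
{
    assert(tuples != NULL && errors != NULL);
    tuple_queue = tuples;
    error_queue = errors;
}

/* Pick the destination for one protocol message by its type byte. */
DemoQueue *
queue_for_message(char msgtype)
{
    switch (msgtype)
    {
        case 'T':   /* RowDescription */
        case 'D':   /* DataRow */
        case 'C':   /* CommandComplete */
            return tuple_queue;
        default:    /* e.g. 'E' ErrorResponse, 'N' NoticeResponse */
            return error_queue;
    }
}
```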

I have moved this patch to the next CF because, apart from the above, I
still have to work on the execution strategy and the optimizer-related
changes discussed in this thread.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#86Robert Haas
robertmhaas@gmail.com
In reply to: Jim Nasby (#83)
Re: Parallel Seq Scan

On Wed, Jan 14, 2015 at 9:00 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:

Simply doing
something like blkno % num_workers is going to cause imbalances,

Yes.

but trying
to do this on a per-block basis seems like too much overhead.

...but no. Or at least, I doubt it. The cost of handing out blocks
one at a time is that, for each block, a worker's got to grab a
spinlock, increment and record the block number counter, and release
the spinlock. Or, use an atomic add. Now, it's true that spinlock
cycles and atomic ops can sometimes impose severe overhead, but
you have to look at it as a percentage of the overall work being done.
In this case, the backend has to read, pin, and lock the page and
process every tuple on the page. Processing every tuple on the page
may involve de-TOASTing the tuple (leading to many more page
accesses), or evaluating a complex expression, or hitting CLOG to
check visibility, but even if it doesn't, I think the amount of work
that it takes to process all the tuples on the page will be far larger
than the cost of one atomic increment operation per block.

As mentioned downthread, a far bigger consideration is the I/O pattern
we create. A sequential scan is so-called because it reads the
relation sequentially. If we destroy that property, we will be more
than slightly sad. It might be OK to do sequential scans of, say,
each 1GB segment separately, but I'm pretty sure it would be a real
bad idea to read 8kB at a time at blocks 0, 64, 128, 1, 65, 129, ...
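
One way a pattern like that can arise, offered here as an illustrative assumption rather than something stated above, is several workers each scanning its own contiguous range and advancing in lockstep. With 3 workers and 64 blocks per worker, the order in which the storage sees requests is then:

```c
#include <assert.h>

/*
 * Block requested at global step `step` when `nworkers` workers each own
 * a contiguous range of `blocks_per_worker` blocks and take turns issuing
 * one read each. Hypothetical helper for illustration only.
 */
unsigned
lockstep_request(unsigned step, unsigned nworkers, unsigned blocks_per_worker)
{
    unsigned worker;
    unsigned offset;

    assert(nworkers > 0);
    worker = step % nworkers;   /* whose turn it is */
    offset = step / nworkers;   /* how far each worker has advanced */
    return worker * blocks_per_worker + offset;
}
```

Steps 0 through 5 yield blocks 0, 64, 128, 1, 65, 129: every worker is sequential within its own range, but the combined stream jumps around.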

What I'm thinking about is that we might have something like this:

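struct this_lives_in_dynamic_shared_memory
{
    BlockNumber last_block;          /* upper bound for next_prefetch_block */
    Size        prefetch_distance;   /* how far prefetching may run ahead of the scan */
    Size        prefetch_increment;  /* number of blocks to prefetch at a time */
    slock_t     mutex;               /* protects the two counters below */
    BlockNumber next_prefetch_block;
    BlockNumber next_scan_block;
};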

Each worker takes the mutex and checks whether next_prefetch_block -
next_scan_block < prefetch_distance and also whether
next_prefetch_block < last_block. If both are true, it prefetches
some number of additional blocks, as specified by prefetch_increment.
Otherwise, it increments next_scan_block and scans the block
corresponding to the old value.
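
A minimal sketch of that check, with several assumptions: a C11 atomic_flag spinlock stands in for slock_t, BlockNumber is a plain unsigned typedef, last_block is treated as the number of blocks in the relation, and the names scan_shared_init, next_step and ScanStep are hypothetical. The prefetch counter is advanced while the lock is held, so each range of blocks is handed to only one caller.

```c
#include <assert.h>
#include <stdatomic.h>

typedef unsigned BlockNumber;   /* stand-in for PostgreSQL's BlockNumber */

typedef struct ParallelScanShared
{
    BlockNumber last_block;          /* assumed: number of blocks */
    unsigned    prefetch_distance;
    unsigned    prefetch_increment;
    atomic_flag mutex;               /* stand-in for slock_t */
    BlockNumber next_prefetch_block;
    BlockNumber next_scan_block;
} ParallelScanShared;

typedef enum { ACTION_PREFETCH, ACTION_SCAN, ACTION_DONE } ScanAction;

typedef struct ScanStep
{
    ScanAction  action;
    BlockNumber first;   /* first block to prefetch, or block to scan */
    BlockNumber count;   /* number of blocks covered by this step */
} ScanStep;

void
scan_shared_init(ParallelScanShared *s, BlockNumber nblocks,
                 unsigned distance, unsigned increment)
{
    assert(increment > 0);
    s->last_block = nblocks;
    s->prefetch_distance = distance;
    s->prefetch_increment = increment;
    atomic_flag_clear(&s->mutex);
    s->next_prefetch_block = 0;
    s->next_scan_block = 0;
}

/* Decide, under the lock, what the calling worker should do next. */
ScanStep
next_step(ParallelScanShared *s)
{
    ScanStep step = {ACTION_DONE, 0, 0};

    while (atomic_flag_test_and_set_explicit(&s->mutex, memory_order_acquire))
        ;   /* spin until the lock is free */

    if (s->next_prefetch_block - s->next_scan_block < s->prefetch_distance &&
        s->next_prefetch_block < s->last_block)
    {
        BlockNumber remaining = s->last_block - s->next_prefetch_block;

        step.action = ACTION_PREFETCH;
        step.first = s->next_prefetch_block;
        step.count = remaining < s->prefetch_increment
            ? remaining : s->prefetch_increment;
        s->next_prefetch_block += step.count;   /* claim the range */
    }
    else if (s->next_scan_block < s->last_block)
    {
        step.action = ACTION_SCAN;
        step.first = s->next_scan_block++;      /* scan the old value */
        step.count = 1;
    }

    atomic_flag_clear_explicit(&s->mutex, memory_order_release);
    return step;
}
```

The caller would issue the prefetch or read the block after the lock has been released, so the critical section covers only the counter arithmetic.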

So in this way, the prefetching runs ahead of the scan by a
configurable amount (prefetch_distance), which should be chosen so
that the prefetches have time to complete before the scan actually
reaches those blocks. Right now, of course, we rely on the operating
system to prefetch for sequential scans, but I have a strong hunch
that may not work on all systems if there are multiple processes doing
the reads.

Now, what of other strategies like dividing up the relation into 1GB
chunks and reading each one in a separate process? We could certainly
DO that, but what advantage does it have over this? The only benefit
I can see is that you avoid accessing a data structure of the type
shown above for every block, but that only matters if that cost is
material, and I tend to think it won't be. On the flip side, it means
that the granularity for dividing up work between processes is now
very coarse - when there are less than 6GB of data left in a relation,
at most 6 processes can work on it. That might be OK if the data is
being read in from disk anyway, but it's certainly not the best we can
do when the data is in memory.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#87Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#86)
Re: Parallel Seq Scan

On Fri, Jan 16, 2015 at 11:49 PM, Robert Haas <robertmhaas@gmail.com> wrote:

As mentioned downthread, a far bigger consideration is the I/O pattern
we create. A sequential scan is so-called because it reads the
relation sequentially. If we destroy that property, we will be more
than slightly sad. It might be OK to do sequential scans of, say,
each 1GB segment separately, but I'm pretty sure it would be a real
bad idea to read 8kB at a time at blocks 0, 64, 128, 1, 65, 129, ...

What I'm thinking about is that we might have something like this:

struct this_lives_in_dynamic_shared_memory
{
    BlockNumber last_block;
    Size        prefetch_distance;
    Size        prefetch_increment;
    slock_t     mutex;
    BlockNumber next_prefetch_block;
    BlockNumber next_scan_block;
};

Each worker takes the mutex and checks whether next_prefetch_block -
next_scan_block < prefetch_distance and also whether
next_prefetch_block < last_block. If both are true, it prefetches
some number of additional blocks, as specified by prefetch_increment.
Otherwise, it increments next_scan_block and scans the block
corresponding to the old value.

Assuming we will increment next_prefetch_block only after prefetching
the blocks (prefetch_increment of them), couldn't 2 workers
simultaneously see the same value for next_prefetch_block and try to
prefetch the same blocks?

What will be the value of prefetch_increment?
Will it be equal to prefetch_distance or prefetch_distance/2 or
prefetch_distance/4 or .. or will it be totally unrelated
to prefetch_distance?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#88Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#87)
Re: Parallel Seq Scan

On Fri, Jan 16, 2015 at 11:27 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jan 16, 2015 at 11:49 PM, Robert Haas <robertmhaas@gmail.com> wrote:

As mentioned downthread, a far bigger consideration is the I/O pattern
we create. A sequential scan is so-called because it reads the
relation sequentially. If we destroy that property, we will be more
than slightly sad. It might be OK to do sequential scans of, say,
each 1GB segment separately, but I'm pretty sure it would be a real
bad idea to read 8kB at a time at blocks 0, 64, 128, 1, 65, 129, ...

What I'm thinking about is that we might have something like this:

struct this_lives_in_dynamic_shared_memory
{
    BlockNumber last_block;
    Size        prefetch_distance;
    Size        prefetch_increment;
    slock_t     mutex;
    BlockNumber next_prefetch_block;
    BlockNumber next_scan_block;
};

Each worker takes the mutex and checks whether next_prefetch_block -
next_scan_block < prefetch_distance and also whether
next_prefetch_block < last_block. If both are true, it prefetches
some number of additional blocks, as specified by prefetch_increment.
Otherwise, it increments next_scan_block and scans the block
corresponding to the old value.

Assuming we will increment next_prefetch_block only after prefetching
the blocks (prefetch_increment of them), couldn't 2 workers
simultaneously see the same value for next_prefetch_block and try to
prefetch the same blocks?

The idea is that you can only examine and modify next_prefetch_block
or next_scan_block while holding the mutex.

What will be the value of prefetch_increment?
Will it be equal to prefetch_distance or prefetch_distance/2 or
prefetch_distance/4 or .. or will it be totally unrelated to
prefetch_distance?

I dunno, that might take some experimentation. prefetch_distance/2
doesn't sound stupid.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#89Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#88)
Re: Parallel Seq Scan

On Sat, Jan 17, 2015 at 10:09 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Jan 16, 2015 at 11:27 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Assuming we will increment next_prefetch_block only after prefetching
the blocks (prefetch_increment of them), couldn't 2 workers
simultaneously see the same value for next_prefetch_block and try to
prefetch the same blocks?

The idea is that you can only examine and modify next_prefetch_block
or next_scan_block while holding the mutex.

What will be the value of prefetch_increment?
Will it be equal to prefetch_distance or prefetch_distance/2 or
prefetch_distance/4 or .. or will it be totally unrelated to
prefetch_distance?

I dunno, that might take some experimentation. prefetch_distance/2
doesn't sound stupid.

Okay, I think I got the idea of what you want to achieve via
prefetching. So assuming prefetch_distance = 100 and
prefetch_increment = 50 (prefetch_distance / 2), it seems to me
that as soon as there are fewer than 100 blocks in the prefetch quota,
it will prefetch the next 50 blocks, which means the system will always
be approximately 50 blocks ahead. That will ensure that this algorithm
always performs a sequential scan; however, it eventually turns into
a system where one worker is reading from disk and the other
workers are reading from OS buffers into shared buffers and then getting
the tuples. The only downside I can see in this approach is that
there could be times during execution where some or all workers have
to wait on the worker doing the prefetching. However, I think we should
try this approach and see how it works.
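
As a check on these numbers, the rule can be traced in a single process. The sketch below assumes a literal reading of the rule quoted earlier (prefetch whenever next_prefetch_block - next_scan_block < prefetch_distance and blocks remain, otherwise scan one block); the function and type names are hypothetical. It records how far the prefetch position is ahead of the scan position each time a block is scanned while prefetching is still in progress.

```c
#include <assert.h>
#include <limits.h>

typedef struct LeadRange
{
    unsigned min_lead;   /* smallest lead seen at scan time */
    unsigned max_lead;   /* largest lead seen at scan time */
} LeadRange;

/*
 * Trace the prefetch/scan rule over `nblocks` blocks and report the range
 * of (next_prefetch - next_scan) observed whenever a block is scanned
 * before prefetching has reached the end of the relation. If no such scan
 * happens, min_lead stays UINT_MAX.
 */
LeadRange
trace_prefetch_lead(unsigned nblocks, unsigned distance, unsigned increment)
{
    LeadRange   r = {UINT_MAX, 0};
    unsigned    next_prefetch = 0;
    unsigned    next_scan = 0;

    assert(distance > 0 && increment > 0);

    while (next_scan < nblocks)
    {
        unsigned lead = next_prefetch - next_scan;

        if (lead < distance && next_prefetch < nblocks)
        {
            unsigned remaining = nblocks - next_prefetch;

            next_prefetch += remaining < increment ? remaining : increment;
        }
        else
        {
            if (next_prefetch < nblocks)
            {
                if (lead < r.min_lead)
                    r.min_lead = lead;
                if (lead > r.max_lead)
                    r.max_lead = lead;
            }
            next_scan++;
        }
    }
    return r;
}
```

With 1000 blocks, prefetch_distance = 100 and prefetch_increment = 50, the lead at scan time stays between 100 and 149 blocks under this reading, that is, between prefetch_distance and prefetch_distance + prefetch_increment - 1.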

Another thing is that I think prefetching is not supported on all platforms
(e.g. Windows), and on such systems the above algorithm would need to
rely on the block-by-block method.
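
A platform-conditional prefetch hint might be sketched as follows. It assumes POSIX posix_fadvise() with POSIX_FADV_WILLNEED where the headers provide it and degrades to a no-op elsewhere; prefetch_hint and DEMO_BLCKSZ are hypothetical names, and this is not the code PostgreSQL itself uses.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 600       /* ask <fcntl.h> for posix_fadvise() */
#endif
#include <fcntl.h>
#include <sys/types.h>

#define DEMO_BLCKSZ 8192        /* 8kB blocks, as discussed in the thread */

/*
 * Ask the OS to start reading block `blkno` of the file behind `fd`.
 * Returns 1 if a hint was issued, 0 if the platform has no such call
 * (the caller then just reads block by block), and -1 if the call failed.
 */
int
prefetch_hint(int fd, unsigned blkno)
{
#if defined(POSIX_FADV_WILLNEED)
    off_t offset = (off_t) blkno * DEMO_BLCKSZ;

    /* posix_fadvise() returns 0 on success, an error number otherwise */
    return posix_fadvise(fd, offset, DEMO_BLCKSZ, POSIX_FADV_WILLNEED) == 0
        ? 1 : -1;
#else
    (void) fd;
    (void) blkno;
    return 0;
#endif
}
```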

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#90Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#89)
Re: Parallel Seq Scan

On Mon, Jan 19, 2015 at 2:24 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Okay, I think I got the idea of what you want to achieve via
prefetching. So assuming prefetch_distance = 100 and
prefetch_increment = 50 (prefetch_distance / 2), it seems to me
that as soon as there are fewer than 100 blocks in the prefetch quota,
it will prefetch the next 50 blocks, which means the system will always
be approximately 50 blocks ahead. That will ensure that this algorithm
always performs a sequential scan; however, it eventually turns into
a system where one worker is reading from disk and the other
workers are reading from OS buffers into shared buffers and then getting
the tuples. The only downside I can see in this approach is that
there could be times during execution where some or all workers have
to wait on the worker doing the prefetching. However, I think we should
try this approach and see how it works.

Right. We probably want to make prefetch_distance a GUC. After all,
we currently rely on the operating system for prefetching, and the
operating system has a setting for this, at least on Linux (blockdev
--getra). It's possible, however, that we don't need this at all,
because the OS might be smart enough to figure it out for us. It's
probably worth testing, though.

Another thing is that I think prefetching is not supported on all platforms
(e.g. Windows), and on such systems the above algorithm would need to
rely on the block-by-block method.

Well, I think we should try to set up a test to see if this is hurting
us. First, do a sequential scan of a relation at least twice
as large as RAM. Then, do a parallel sequential scan of the same
relation with 2 workers. Repeat these in alternation several times.
If the operating system is accomplishing meaningful readahead, and the
parallel sequential scan is breaking it, then since the test is
I/O-bound I would expect to see the parallel scan actually being
slower than the normal way.

Or perhaps there is some other test that would be better (ideas
welcome) but the point is we may need something like this, but we
should try to figure out whether we need it before spending too much
time on it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#91Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#85)
1 attachment(s)
Re: Parallel Seq Scan

On Thu, Jan 15, 2015 at 6:57 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jan 12, 2015 at 3:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Yeah, you need two separate global variables pointing to shm_mq
objects, one of which gets used by pqmq.c for errors and the other of
which gets used by printtup.c for tuples.

Okay, I will try to change it as suggested, without switching between
the queues, but this way we need to handle 'T', 'D', and 'C' messages
separately.

I have taken care of integrating the parallel sequence scan with the
latest patch posted (parallel-mode-v1.patch) by Robert at below
location:
/messages/by-id/CA+TgmoZdUK4K3XHBxc9vM-82khourEZdvQWTfgLhWsd2R2aAGQ@mail.gmail.com

Changes in this version
-----------------------------------------------
1. As mentioned previously, I have exposed one parameter,
ParallelWorkerNumber, as used in the parallel-mode patch.
2. Enabled a tuple queue for passing tuples from the worker
backends to the master backend, alongside the error queue,
as suggested by Robert in the mail above.
3. Made the master backend scan the heap directly when no
tuples are available in any shared memory tuple queue.
4. Introduced 3 new parameters (cpu_tuple_comm_cost,
parallel_setup_cost, parallel_startup_cost) for deciding the cost
of a parallel plan. Currently, I have kept the default values of
parallel_setup_cost and parallel_startup_cost at 0.0, as choosing
them requires some experiments.
5. Fixed some issues (the memory increase reported upthread by
Thom Brown, and general feature issues found during testing).
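
As a rough illustration of how the blocks end up divided once the master also scans (item 3), the sketch below computes a start block and block count per participant. The master's range follows the arithmetic in ExecInitParallelSeqScan in the attached patch (start at nworkers * num_blocks_per_worker and scan to the end of the relation). The per-worker range is an assumption, since the worker-side code is not shown in this part of the patch, and BlockNo, BlockRange and sketch_block_range are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t BlockNo;		/* stand-in for PostgreSQL's BlockNumber */

typedef struct BlockRange
{
	BlockNo		start;			/* first block to scan */
	BlockNo		count;			/* number of blocks to scan */
} BlockRange;

/*
 * Participants 0 .. nworkers-1 are workers; participant == nworkers is the
 * master backend, which takes whatever is left after the workers' shares.
 */
static BlockRange
sketch_block_range(BlockNo nblocks, BlockNo blocks_per_worker,
				   int nworkers, int participant)
{
	BlockRange	r;

	assert(participant >= 0 && participant <= nworkers);
	/* the workers' shares must fit inside the relation */
	assert((BlockNo) nworkers * blocks_per_worker <= nblocks);

	r.start = (BlockNo) participant * blocks_per_worker;
	if (participant == nworkers)
		r.count = nblocks - r.start;	/* master: scan to the end */
	else
		r.count = blocks_per_worker;	/* assumed equal worker share */
	return r;
}
```

For a 1000-block relation with 2 workers and 300 blocks per worker, this gives the workers blocks 0-299 and 300-599, and the master the remaining 400 blocks starting at 600.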

Note - I have yet to handle the new node types introduced at some
of the places and need to verify prepared queries and some other
things, however I think it will be good if I can get some feedback
at current stage.
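
For reviewers who want the fetch order in one place, here is a simplified, single-process model of what the master does in shm_getnext: poll the tuple queues round-robin starting from the queue used last time, and fall back to its own heap blocks when no queue has a tuple ready. It is only a sketch under stated assumptions. Queues are plain arrays rather than shm_mq handles, a queue is treated as detached once everything it will ever send has been read, and all Sketch* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum
{
	SKETCH_TUPLE,				/* *out holds a tuple */
	SKETCH_WOULD_BLOCK,			/* nothing ready now, workers still running */
	SKETCH_DONE					/* queues drained and heap share exhausted */
} SketchResult;

typedef struct SketchQueue
{
	const int  *items;			/* everything the worker will eventually send */
	int			total;			/* number of entries in items */
	int			ready;			/* how many the worker has sent so far */
	int			next;			/* next entry the master will read */
} SketchQueue;

typedef struct SketchScan
{
	SketchQueue *queues;
	int			nqueues;
	int			current;		/* queue used by the previous fetch */
	const int  *heap;			/* master's own share of the relation */
	int			nheap;
	int			heap_next;
} SketchScan;

static SketchResult
sketch_getnext(SketchScan *scan, int *out, bool *fromheap)
{
	bool		pending = false;
	int			i;

	/* Round-robin over the queues, starting with the one used last time. */
	for (i = 0; i < scan->nqueues; i++)
	{
		int			q = (scan->current + i) % scan->nqueues;
		SketchQueue *queue = &scan->queues[q];

		if (queue->next < queue->ready)
		{
			scan->current = q;
			*out = queue->items[queue->next++];
			*fromheap = false;
			return SKETCH_TUPLE;
		}
		if (queue->next < queue->total)
			pending = true;		/* "would block": worker has more to send */
	}

	/* No queue had a tuple ready: scan the master's own blocks instead. */
	if (scan->heap_next < scan->nheap)
	{
		*out = scan->heap[scan->heap_next++];
		*fromheap = true;
		return SKETCH_TUPLE;
	}

	return pending ? SKETCH_WOULD_BLOCK : SKETCH_DONE;
}
```

The fromheap flag plays the same role as in the patch: it tells the caller whether the tuple still needs the qualification applied locally or was already filtered by a worker.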

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

parallel_seqscan_v4.patch (application/octet-stream)
diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index 21721b4..823d5c3 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/access
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS	    = brin common gin gist hash heap index nbtree rmgrdesc spgist transam
+SUBDIRS	    = brin common gin gist hash heap index nbtree rmgrdesc shmmq spgist transam
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/common/printtup.c b/src/backend/access/common/printtup.c
index baed981..1afac59 100644
--- a/src/backend/access/common/printtup.c
+++ b/src/backend/access/common/printtup.c
@@ -243,7 +243,19 @@ SendRowDescriptionMessage(TupleDesc typeinfo, List *targetlist, int16 *formats)
 				pq_sendint(&buf, 0, 2);
 		}
 	}
-	pq_endmessage(&buf);
+
+	/*
+	 * Send the message via shared-memory tuple queue, if the same
+	 * is enabled.
+	 */
+	if (is_tuple_shm_mq_enabled())
+	{
+		mq_putmessage_direct(buf.cursor, buf.data, buf.len);
+		pfree(buf.data);
+		buf.data = NULL;
+	}
+	else
+		pq_endmessage(&buf);
 }
 
 /*
@@ -371,7 +383,18 @@ printtup(TupleTableSlot *slot, DestReceiver *self)
 		}
 	}
 
-	pq_endmessage(&buf);
+	/*
+	 * Send the message via shared-memory tuple queue, if the same
+	 * is enabled.
+	 */
+	if (is_tuple_shm_mq_enabled())
+	{
+		mq_putmessage_direct(buf.cursor, buf.data, buf.len);
+		pfree(buf.data);
+		buf.data = NULL;
+	}
+	else
+		pq_endmessage(&buf);
 
 	/* Return to caller's context, and flush row's temporary memory */
 	MemoryContextSwitchTo(oldcontext);
diff --git a/src/backend/access/shmmq/Makefile b/src/backend/access/shmmq/Makefile
new file mode 100644
index 0000000..aeae8d9
--- /dev/null
+++ b/src/backend/access/shmmq/Makefile
@@ -0,0 +1,17 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for access/shmmq
+#
+# IDENTIFICATION
+#    src/backend/access/shmmq/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/shmmq
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = shmmqam.o 
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/shmmq/shmmqam.c b/src/backend/access/shmmq/shmmqam.c
new file mode 100644
index 0000000..8c7dedb
--- /dev/null
+++ b/src/backend/access/shmmq/shmmqam.c
@@ -0,0 +1,373 @@
+/*-------------------------------------------------------------------------
+ *
+ * shmmqam.c
+ *	  shared memory queue access method code
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/shmmq/shmmqam.c
+ *
+ *
+ * INTERFACE ROUTINES
+ *		shm_getnext	- retrieve next tuple in queue
+ *
+ * NOTES
+ *	  This file contains the shmmq_ routines which implement
+ *	  the POSTGRES shared memory access method used for all POSTGRES
+ *	  relations.
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup.h"
+#include "access/htup_details.h"
+#include "access/shmmqam.h"
+#include "access/tupdesc.h"
+#include "fmgr.h"
+#include "libpq/libpq.h"
+#include "libpq/pqformat.h"
+#include "utils/lsyscache.h"
+
+
+static bool
+HandleParallelTupleMessage(worker_result resultState, TupleDesc tupdesc,
+						   StringInfo msg, int queueId);
+static HeapTuple
+form_result_tuple(worker_result resultState, TupleDesc tupdesc,
+				  StringInfo msg, int queueId);
+
+/*
+ * shm_beginscan -
+ *		Initializes the shared memory scan descriptor to retrieve tuples
+ *		from worker backends. 
+ */
+ShmScanDesc
+shm_beginscan(int num_queues)
+{
+	ShmScanDesc		shmscan;
+
+	shmscan = palloc(sizeof(ShmScanDescData));
+
+	shmscan->num_shm_queues = num_queues;
+	shmscan->ss_cqueue = -1;
+	shmscan->shmscan_inited	= false;
+
+	return shmscan;
+}
+
+/*
+ * ExecInitWorkerResult -
+ *		Initializes the result state to retrieve tuples from worker backends. 
+ */
+worker_result
+ExecInitWorkerResult(TupleDesc tupdesc, int nWorkers)
+{
+	worker_result	workerResult;
+	int				i;
+	int	natts = tupdesc->natts;
+
+	workerResult = palloc0(sizeof(worker_result_state));
+	workerResult->receive_functions = palloc(sizeof(FmgrInfo) * natts);
+	workerResult->typioparams = palloc(sizeof(Oid) * natts);
+	workerResult->num_shm_queues = nWorkers;
+	workerResult->has_row_description = palloc0(sizeof(bool) * nWorkers);
+	workerResult->queue_detached = palloc0(sizeof(bool) * nWorkers);
+
+	for (i = 0;	i < natts; ++i)
+	{
+		Oid	receive_function_id;
+
+		getTypeBinaryInputInfo(tupdesc->attrs[i]->atttypid,
+							   &receive_function_id,
+							   &workerResult->typioparams[i]);
+		fmgr_info(receive_function_id, &workerResult->receive_functions[i]);
+	}
+
+	return workerResult;
+}
+
+
+/*
+ * shm_getnext -
+ *		Get the next tuple from shared memory queue.  This function
+ *	is responsible for fetching tuples from all the queues associated
+ *	with worker backends used in parallel sequential scan.
+ */
+HeapTuple
+shm_getnext(HeapScanDesc scanDesc, ShmScanDesc shmScan,
+			worker_result resultState, shm_mq_handle **responseq,
+			TupleDesc tupdesc, ScanDirection direction, bool *fromheap)
+{
+	shm_mq_result	res;
+	Size			nbytes;
+	void			*data;
+	StringInfoData	msg;
+	int				queueId = 0;
+
+	/*
+	 * calculate next starting queue used for fetching tuples
+	 */
+	if(!shmScan->shmscan_inited)
+	{
+		shmScan->shmscan_inited = true;
+		Assert(shmScan->num_shm_queues > 0);
+		queueId = 0;
+	}
+	else
+		queueId = shmScan->ss_cqueue;
+
+	/* Read and process messages from the shared memory queues. */
+	for(;;)
+	{
+		if (!resultState->all_queues_detached)
+		{
+			if (queueId == shmScan->num_shm_queues)
+				queueId = 0;
+
+			/*
+			 * Don't fetch from detached queue.  This loop could continue
+			 * forever, if we reach a situation such that all queue's are
+			 * detached, however we won't reach here if that is the case.
+			 */
+			while (resultState->queue_detached[queueId])
+			{
+				++queueId;
+				if (queueId == shmScan->num_shm_queues)
+					queueId = 0;
+			}
+
+			for (;;)
+			{
+				/*
+				 * mark current queue used for fetching tuples, this is used
+				 * to fetch consecutive tuples from queue used in previous
+				 * fetch.
+				 */
+				shmScan->ss_cqueue = queueId;
+
+				/* Get next message. */
+				res = shm_mq_receive(responseq[queueId], &nbytes, &data, true);
+				if (res == SHM_MQ_DETACHED)
+				{
+					/*
+					 * mark the queue that got detached, so that we don't
+					 * try to fetch from it again.
+					 */
+					resultState->queue_detached[queueId] = true;
+					resultState->has_row_description[queueId] = false;
+					--resultState->num_shm_queues;
+					/*
+					 * if we have exhausted data from all worker queues, then don't
+					 * process data from queues.
+					 */
+					if (resultState->num_shm_queues <= 0)
+						resultState->all_queues_detached = true;
+					break;
+				}
+				else if (res == SHM_MQ_WOULD_BLOCK)
+					break;
+				else if (res == SHM_MQ_SUCCESS)
+				{
+					bool rettuple;
+					initStringInfo(&msg);
+					appendBinaryStringInfo(&msg, data, nbytes);
+					rettuple = HandleParallelTupleMessage(resultState, tupdesc, &msg, queueId);
+					pfree(msg.data);
+					if (rettuple)
+					{
+						*fromheap = false;
+						return resultState->tuple;
+					}
+				}
+			}
+		}
+
+		/*
+		 * if we have checked all the message queue's and didn't find
+		 * any message or we have already fetched all the data from queue's,
+		 * then it's time to fetch directly from heap.  Reset the current
+		 * queue as the first queue from which we need to receive tuples.
+		 */
+		if ((queueId == shmScan->num_shm_queues - 1 ||
+			 resultState->all_queues_detached) &&
+			 !resultState->all_heap_fetched)
+		{
+			HeapTuple	tuple;
+			shmScan->ss_cqueue = 0;
+			tuple = heap_getnext(scanDesc, direction);
+			if (tuple)
+			{
+				*fromheap = true;
+				return tuple;
+			}
+			else if (tuple == NULL && resultState->all_queues_detached)
+				break;
+			else
+				resultState->all_heap_fetched = true;
+		}
+		else if (resultState->all_queues_detached &&
+				 resultState->all_heap_fetched)
+			break;
+
+		/* check the data in next queue. */
+		++queueId;
+	}
+
+	return NULL;
+}
+
+/*
+ * HandleParallelTupleMessage -
+ * Handle a single tuple related protocol message received from
+ * a single parallel worker.
+ */
+static bool
+HandleParallelTupleMessage(worker_result resultState, TupleDesc tupdesc,
+						   StringInfo msg, int queueId)
+{
+	char	msgtype;
+	bool	rettuple = false;
+
+	msgtype = pq_getmsgbyte(msg);
+
+	/* Dispatch on message type. */
+	switch (msgtype)
+	{
+		case 'T':
+			{
+				int16	natts = pq_getmsgint(msg, 2);
+				int16	i;
+
+				if (resultState->has_row_description[queueId])
+					elog(ERROR, "multiple RowDescription messages");
+				resultState->has_row_description[queueId] = true;
+				if (natts != tupdesc->natts)
+					ereport(ERROR,
+							(errcode(ERRCODE_DATATYPE_MISMATCH),
+								errmsg("worker result rowtype does not match "
+								"the specified FROM clause rowtype")));
+
+				for (i = 0; i < natts; ++i)
+				{
+					Oid		type_id;
+
+					(void) pq_getmsgstring(msg);	/* name */
+					(void) pq_getmsgint(msg, 4);	/* table OID */
+					(void) pq_getmsgint(msg, 2);	/* table attnum */
+					type_id = pq_getmsgint(msg, 4);	/* type OID */
+					(void) pq_getmsgint(msg, 2);	/* type length */
+					(void) pq_getmsgint(msg, 4);	/* typmod */
+					(void) pq_getmsgint(msg, 2);	/* format code */
+
+					if (type_id != tupdesc->attrs[i]->atttypid)
+						ereport(ERROR,
+								(errcode(ERRCODE_DATATYPE_MISMATCH),
+								 errmsg("remote query result rowtype does not match "
+										"the specified FROM clause rowtype")));
+				}
+
+				pq_getmsgend(msg);
+
+				break;
+			}
+		case 'D':
+			{
+				/* Handle DataRow message. */
+				resultState->tuple = form_result_tuple(resultState, tupdesc, msg, queueId);
+				rettuple = true;
+				break;
+			}
+		case 'C':
+			{
+				/*
+					* Handle CommandComplete message. Ignore tags sent by
+					* worker backend as we are anyway going to use tag of
+					* master backend for sending the same to client.
+					*/
+				(void) pq_getmsgstring(msg);
+				break;
+			}
+		case 'G':
+		case 'H':
+		case 'W':
+			{
+				ereport(ERROR,
+						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						 errmsg("COPY protocol not allowed in worker")));
+			}
+		default:
+			elog(WARNING, "unknown message type: %c", msg->data[0]);
+			break;
+	}
+
+	return rettuple;
+}
+
+/*
+ * form_result_tuple -
+ * Parse a DataRow message and form a result tuple.
+ */
+static HeapTuple
+form_result_tuple(worker_result resultState, TupleDesc tupdesc,
+				  StringInfo msg, int queueId)
+{
+	/* Handle DataRow message. */
+	int16	natts = pq_getmsgint(msg, 2);
+	int16	i;
+	Datum  *values = NULL;
+	bool   *isnull = NULL;
+	HeapTuple	tuple;
+	StringInfoData	buf;
+
+	if (!resultState->has_row_description[queueId])
+		elog(ERROR, "DataRow not preceded by RowDescription");
+	if (natts != tupdesc->natts)
+		elog(ERROR, "malformed DataRow");
+	if (natts > 0)
+	{
+		values = palloc(natts * sizeof(Datum));
+		isnull = palloc(natts * sizeof(bool));
+	}
+	initStringInfo(&buf);
+
+	for (i = 0; i < natts; ++i)
+	{
+		int32	bytes = pq_getmsgint(msg, 4);
+
+		if (bytes < 0)
+		{
+			values[i] = ReceiveFunctionCall(&resultState->receive_functions[i],
+											NULL,
+											resultState->typioparams[i],
+											tupdesc->attrs[i]->atttypmod);
+			isnull[i] = true;
+		}
+		else
+		{
+			resetStringInfo(&buf);
+			appendBinaryStringInfo(&buf, pq_getmsgbytes(msg, bytes), bytes);
+			values[i] = ReceiveFunctionCall(&resultState->receive_functions[i],
+											&buf,
+											resultState->typioparams[i],
+											tupdesc->attrs[i]->atttypmod);
+			isnull[i] = false;
+		}
+	}
+
+	pq_getmsgend(msg);
+
+	tuple = heap_form_tuple(tupdesc, values, isnull);
+
+	/*
+	 * Release locally palloc'd space.  XXX would probably be good to pfree
+	 * values of pass-by-reference datums, as well.
+	 */
+	pfree(values);
+	pfree(isnull);
+
+	return tuple;
+}
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 8a0be5d..bb581a8 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -713,6 +713,7 @@ ExplainPreScanNode(PlanState *planstate, Bitmapset **rels_used)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_ParallelSeqScan:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -909,6 +910,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_SeqScan:
 			pname = sname = "Seq Scan";
 			break;
+		case T_ParallelSeqScan:
+			pname = sname = "Parallel Seq Scan";
+			break;
 		case T_IndexScan:
 			pname = sname = "Index Scan";
 			break;
@@ -1058,6 +1062,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_ParallelSeqScan:
 		case T_BitmapHeapScan:
 		case T_TidScan:
 		case T_SubqueryScan:
@@ -1324,6 +1329,16 @@ ExplainNode(PlanState *planstate, List *ancestors,
 				show_instrumentation_count("Rows Removed by Filter", 1,
 										   planstate, es);
 			break;
+		case T_ParallelSeqScan:
+			show_scan_qual(plan->qual, "Filter", planstate, ancestors, es);
+			if (plan->qual)
+				show_instrumentation_count("Rows Removed by Filter", 1,
+										   planstate, es);
+			ExplainPropertyInteger("Number of Workers",
+				((ParallelSeqScan *) plan)->num_workers, es);
+			ExplainPropertyInteger("Number of Blocks Per Worker",
+				((ParallelSeqScan *) plan)->num_blocks_per_worker, es);
+			break;
 		case T_FunctionScan:
 			if (es->verbose)
 			{
@@ -2141,6 +2156,7 @@ ExplainTargetRel(Plan *plan, Index rti, ExplainState *es)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_ParallelSeqScan:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index af707b0..9a8ca75 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -21,7 +21,7 @@ OBJS = execAmi.o execCurrent.o execGrouping.o execJunk.o execMain.o \
        nodeLimit.o nodeLockRows.o \
        nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
        nodeNestloop.o nodeFunctionscan.o nodeRecursiveunion.o nodeResult.o \
-       nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
+       nodeSeqscan.o nodeParallelSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
        nodeValuesscan.o nodeCtescan.o nodeWorktablescan.o \
        nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
        nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o spi.o
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 9892499..f77a77f 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -100,6 +100,7 @@
 #include "executor/nodeMergejoin.h"
 #include "executor/nodeModifyTable.h"
 #include "executor/nodeNestloop.h"
+#include "executor/nodeParallelSeqscan.h"
 #include "executor/nodeRecursiveunion.h"
 #include "executor/nodeResult.h"
 #include "executor/nodeSeqscan.h"
@@ -190,6 +191,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												   estate, eflags);
 			break;
 
+		case T_ParallelSeqScan:
+			result = (PlanState *) ExecInitParallelSeqScan((ParallelSeqScan *) node,
+														   estate, eflags);
+			break;
+
 		case T_IndexScan:
 			result = (PlanState *) ExecInitIndexScan((IndexScan *) node,
 													 estate, eflags);
@@ -406,6 +412,10 @@ ExecProcNode(PlanState *node)
 			result = ExecSeqScan((SeqScanState *) node);
 			break;
 
+		case T_ParallelSeqScanState:
+			result = ExecParallelSeqScan((ParallelSeqScanState *) node);
+			break;
+
 		case T_IndexScanState:
 			result = ExecIndexScan((IndexScanState *) node);
 			break;
@@ -644,6 +654,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSeqScan((SeqScanState *) node);
 			break;
 
+		case T_ParallelSeqScanState:
+			ExecEndParallelSeqScan((ParallelSeqScanState *) node);
+			break;
+
 		case T_IndexScanState:
 			ExecEndIndexScan((IndexScanState *) node);
 			break;
diff --git a/src/backend/executor/execScan.c b/src/backend/executor/execScan.c
index 3f0d809..39c624d 100644
--- a/src/backend/executor/execScan.c
+++ b/src/backend/executor/execScan.c
@@ -191,8 +191,17 @@ ExecScan(ScanState *node,
 		 * check for non-nil qual here to avoid a function call to ExecQual()
 		 * when the qual is nil ... saves only a few cycles, but they add up
 		 * ...
+		 *
+		 * check for non-heap tuples (can get such tuples from shared memory
+		 * message queue's in case of parallel query), for such tuples no need
+		 * to perform qualification as for them the same is done by backend
+		 * worker.  This case will happen only for parallel query where we push
+		 * down the qualification.
+		 * XXX - We can do this optimization for projection as well, but for
+		 * now it is okay, as we don't allow parallel query if there are
+		 * expressions involved in target list.
 		 */
-		if (!qual || ExecQual(qual, econtext, false))
+		if (!slot->tts_fromheap || !qual || ExecQual(qual, econtext, false))
 		{
 			/*
 			 * Found a satisfactory scan tuple.
diff --git a/src/backend/executor/execTuples.c b/src/backend/executor/execTuples.c
index 753754d..4c5bd88 100644
--- a/src/backend/executor/execTuples.c
+++ b/src/backend/executor/execTuples.c
@@ -123,6 +123,7 @@ MakeTupleTableSlot(void)
 	slot->tts_values = NULL;
 	slot->tts_isnull = NULL;
 	slot->tts_mintuple = NULL;
+	slot->tts_fromheap	= true;
 
 	return slot;
 }
@@ -473,6 +474,8 @@ ExecClearTuple(TupleTableSlot *slot)	/* slot in which to store tuple */
 	slot->tts_isempty = true;
 	slot->tts_nvalid = 0;
 
+	slot->tts_fromheap = true;
+
 	return slot;
 }
 
diff --git a/src/backend/executor/nodeParallelSeqscan.c b/src/backend/executor/nodeParallelSeqscan.c
new file mode 100644
index 0000000..5200df5
--- /dev/null
+++ b/src/backend/executor/nodeParallelSeqscan.c
@@ -0,0 +1,318 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeParallelSeqscan.c
+ *	  Support routines for parallel sequential scans of relations.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeParallelSeqscan.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * INTERFACE ROUTINES
+ *		ExecParallelSeqScan				sequentially scans a relation.
+ *		ParallelSeqNext				retrieve next tuple in sequential order.
+ *		ExecInitParallelSeqScan			creates and initializes a parallel seqscan node.
+ *		ExecEndParallelSeqScan			releases any storage allocated.
+ */
+#include "postgres.h"
+
+#include "access/relscan.h"
+#include "access/shmmqam.h"
+#include "access/xact.h"
+#include "commands/dbcommands.h"
+#include "executor/execdebug.h"
+#include "executor/nodeSeqscan.h"
+#include "executor/nodeParallelSeqscan.h"
+#include "postmaster/backendworker.h"
+#include "utils/rel.h"
+
+
+
+/* ----------------------------------------------------------------
+ *						Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ParallelSeqNext
+ *
+ *		This is a workhorse for ExecParallelSeqScan
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ParallelSeqNext(ParallelSeqScanState *node)
+{
+	HeapTuple	tuple;
+	HeapScanDesc scandesc;
+	EState	   *estate;
+	ScanDirection direction;
+	TupleTableSlot *slot;
+	bool			fromheap = true;
+
+	/*
+	 * get information from the estate and scan state
+	 */
+	scandesc = node->ss.ss_currentScanDesc;
+	estate = node->ss.ps.state;
+	direction = estate->es_direction;
+	slot = node->ss.ss_ScanTupleSlot;
+
+	/*
+	 * get the next tuple from the table based on result tuple descriptor.
+	 */
+	tuple = shm_getnext(scandesc, node->pss_currentShmScanDesc,
+						node->pss_workerResult,
+						node->responseq,
+						node->ss.ps.ps_ResultTupleSlot->tts_tupleDescriptor,
+						direction, &fromheap);
+
+	slot->tts_fromheap = fromheap;
+
+	/*
+	 * save the tuple and the buffer returned to us by the access methods in
+	 * our scan tuple slot and return the slot.  Note: we pass '!fromheap'
+	 * because tuples returned by shm_getnext() are either pointers that are
+	 * created with palloc() or are pointers onto disk pages and so it should
+	 * be pfree()'d accordingly.  Note also that ExecStoreTuple will increment
+	 * the refcount of the buffer; the refcount will not be dropped until the
+	 * tuple table slot is cleared.
+	 */
+	if (tuple)
+		ExecStoreTuple(tuple,	/* tuple to store */
+					   slot,	/* slot to store in */
+					   scandesc->rs_cbuf,		/* buffer associated with this
+												 * tuple */
+					   !fromheap);	/* pfree this pointer if not from heap */
+	else
+		ExecClearTuple(slot);
+
+	return slot;
+}
+
+/*
+ * ParallelSeqRecheck -- access method routine to recheck a tuple in EvalPlanQual
+ */
+static bool
+ParallelSeqRecheck(SeqScanState *node, TupleTableSlot *slot)
+{
+	/*
+	 * Note that unlike IndexScan, ParallelSeqScan never use keys in
+	 * shm_beginscan/heap_beginscan (and this is very bad) - so, here
+	 * we do not check are keys ok or not.
+	 */
+	return true;
+}
+
+/* ----------------------------------------------------------------
+ *		InitParallelScanRelation
+ *
+ *		Set up to access the scan relation.
+ * ----------------------------------------------------------------
+ */
+static void
+InitParallelScanRelation(SeqScanState *node, EState *estate, int eflags)
+{
+	Relation	currentRelation;
+	HeapScanDesc currentScanDesc;
+
+	/*
+	 * get the relation object id from the relid'th entry in the range table,
+	 * open that relation and acquire appropriate lock on it.
+	 */
+	currentRelation = ExecOpenScanRelation(estate,
+									  ((SeqScan *) node->ps.plan)->scanrelid,
+										   eflags);
+
+	/* initialize a heapscan */
+	currentScanDesc = heap_beginscan(currentRelation,
+									 estate->es_snapshot,
+									 0,
+									 NULL);
+
+	node->ss_currentRelation = currentRelation;
+	node->ss_currentScanDesc = currentScanDesc;
+
+	/* and report the scan tuple slot's rowtype */
+	ExecAssignScanType(node, RelationGetDescr(currentRelation));
+}
+
+
+/* ----------------------------------------------------------------
+ *		ExecInitParallelSeqScan
+ * ----------------------------------------------------------------
+ */
+ParallelSeqScanState *
+ExecInitParallelSeqScan(ParallelSeqScan *node, EState *estate, int eflags)
+{
+	ParallelSeqScanState *parallelscanstate;
+	ShmScanDesc			 currentShmScanDesc;
+	worker_result		 workerResult;
+	BlockNumber			 end_block;
+
+	/*
+	 * Once upon a time it was possible to have an outerPlan of a SeqScan, but
+	 * not any more.
+	 */
+	Assert(outerPlan(node) == NULL);
+	Assert(innerPlan(node) == NULL);
+
+	/*
+	 * create state structure
+	 */
+	parallelscanstate = makeNode(ParallelSeqScanState);
+	parallelscanstate->ss.ps.plan = (Plan *) node;
+	parallelscanstate->ss.ps.state = estate;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * create expression context for node
+	 */
+	ExecAssignExprContext(estate, &parallelscanstate->ss.ps);
+
+	/*
+	 * initialize child expressions
+	 */
+	parallelscanstate->ss.ps.targetlist = (List *)
+		ExecInitExpr((Expr *) node->scan.plan.targetlist,
+					 (PlanState *) parallelscanstate);
+	parallelscanstate->ss.ps.qual = (List *)
+		ExecInitExpr((Expr *) node->scan.plan.qual,
+					 (PlanState *) parallelscanstate);
+
+	/*
+	 * tuple table initialization
+	 */
+	ExecInitResultTupleSlot(estate, &parallelscanstate->ss.ps);
+	ExecInitScanTupleSlot(estate, &parallelscanstate->ss);
+	
+	/*
+	 * initialize scan relation
+	 */
+	InitParallelScanRelation(&parallelscanstate->ss, estate, eflags);
+
+	parallelscanstate->ss.ps.ps_TupFromTlist = false;
+
+	/*
+	 * Initialize result tuple type and projection info.
+	 */
+	ExecAssignResultTypeFromTL(&parallelscanstate->ss.ps);
+	ExecAssignScanProjectionInfo(&parallelscanstate->ss);
+
+	/*
+	 * If we are just doing EXPLAIN (ie, aren't going to run the plan), stop
+	 * here, no need to start workers.
+	 */
+	if (eflags & EXEC_FLAG_EXPLAIN_ONLY)
+		return parallelscanstate;
+	
+	/* Initialize the workers required to perform parallel scan. */
+	InitiateWorkers(parallelscanstate->ss.ss_currentRelation->rd_id,
+					node->scan.plan.targetlist,
+					node->scan.plan.qual,
+					&parallelscanstate->responseq,
+					&parallelscanstate->pcxt,
+					node->num_blocks_per_worker,
+					node->num_workers);
+
+	/* Initialize the blocks to be scanned by master backend. */
+	end_block = (parallelscanstate->pcxt->nworkers + 1) *
+				node->num_blocks_per_worker;
+	((SeqScan*) parallelscanstate->ss.ps.plan)->startblock =
+								end_block - node->num_blocks_per_worker;
+	/*
+	 * As master backend is the last backend to scan the blocks, it
+	 * should scan all the blocks.
+	 */
+	((SeqScan*) parallelscanstate->ss.ps.plan)->endblock = InvalidBlockNumber;
+
+	/* Set the scan limits for master backend. */
+	heap_setscanlimits(parallelscanstate->ss.ss_currentScanDesc,
+					   ((SeqScan*) parallelscanstate->ss.ps.plan)->startblock,
+					   (parallelscanstate->ss.ss_currentScanDesc->rs_nblocks -
+					   ((SeqScan*) parallelscanstate->ss.ps.plan)->startblock));
+
+	/*
+	 * Use result tuple descriptor to fetch data from shared memory queues
+	 * as the worker backends would have put the data after projection.
+	 * Number of queue's must be equal to number of worker backends.
+	 */
+	currentShmScanDesc = shm_beginscan(parallelscanstate->pcxt->nworkers);
+	workerResult = ExecInitWorkerResult(parallelscanstate->ss.ps.ps_ResultTupleSlot->tts_tupleDescriptor,
+										parallelscanstate->pcxt->nworkers);
+
+	parallelscanstate->pss_currentShmScanDesc = currentShmScanDesc;
+	parallelscanstate->pss_workerResult	= workerResult;
+
+	return parallelscanstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecParallelSeqScan(node)
+ *
+ *		Scans the relation sequentially from multiple workers and returns
+ *		the next qualifying tuple.
+ *		We call the ExecScan() routine and pass it the appropriate
+ *		access method functions.
+ * ----------------------------------------------------------------
+ */
+TupleTableSlot *
+ExecParallelSeqScan(ParallelSeqScanState *node)
+{
+	return ExecScan((ScanState *) &node->ss,
+					(ExecScanAccessMtd) ParallelSeqNext,
+					(ExecScanRecheckMtd) ParallelSeqRecheck);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndParallelSeqScan
+ *
+ *		frees any storage allocated through C routines.
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndParallelSeqScan(ParallelSeqScanState *node)
+{
+	Relation	relation;
+	HeapScanDesc scanDesc;
+
+	/*
+	 * get information from node
+	 */
+	relation = node->ss.ss_currentRelation;
+	scanDesc = node->ss.ss_currentScanDesc;
+
+	/*
+	 * Free the exprcontext
+	 */
+	ExecFreeExprContext(&node->ss.ps);
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+
+	/*
+	 * close heap scan
+	 */
+	heap_endscan(scanDesc);
+
+	/*
+	 * close the heap relation.
+	 */
+	ExecCloseScanRelation(relation);
+
+	if (node->pcxt)
+	{
+		/* destroy parallel context. */
+		DestroyParallelContext(node->pcxt);
+
+		ExitParallelMode();
+	}
+}
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 3cb81fc..5780df0 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -139,6 +139,22 @@ InitScanRelation(SeqScanState *node, EState *estate, int eflags)
 									 0,
 									 NULL);
 
+	/*
+	 * set the scan limits, if requested by plan.  If the end block
+	 * is not specified, then scan all the blocks till end.
+	 */
+	if (((SeqScan *) node->ps.plan)->startblock != InvalidBlockNumber &&
+		((SeqScan *) node->ps.plan)->endblock != InvalidBlockNumber)
+		heap_setscanlimits(currentScanDesc,
+						   ((SeqScan *) node->ps.plan)->startblock,
+						   (((SeqScan *) node->ps.plan)->endblock -
+						   ((SeqScan *) node->ps.plan)->startblock));
+	else if (((SeqScan *) node->ps.plan)->startblock != InvalidBlockNumber)
+			 heap_setscanlimits(currentScanDesc,
+								((SeqScan *) node->ps.plan)->startblock,
+								(currentScanDesc->rs_nblocks -
+								((SeqScan *) node->ps.plan)->startblock));
+
 	node->ss_currentRelation = currentRelation;
 	node->ss_currentScanDesc = currentScanDesc;
 
diff --git a/src/backend/libpq/pqmq.c b/src/backend/libpq/pqmq.c
index f12f2d5..cfab8b5 100644
--- a/src/backend/libpq/pqmq.c
+++ b/src/backend/libpq/pqmq.c
@@ -26,6 +26,8 @@ static bool pq_mq_busy = false;
 static pid_t pq_mq_parallel_master_pid = 0;
 static pid_t pq_mq_parallel_master_backend_id = InvalidBackendId;
 
+static shm_mq_handle *pq_mq_tuple_handle = NULL;
+
 static void mq_comm_reset(void);
 static int	mq_flush(void);
 static int	mq_flush_if_writable(void);
@@ -61,6 +63,26 @@ pq_redirect_to_shm_mq(shm_mq *mq, shm_mq_handle *mqh)
 }
 
 /*
+ * Arrange to send some frontend/backend protocol messages to a shared-memory
+ * tuple message queue.
+ */
+void
+pq_redirect_to_tuple_shm_mq(shm_mq_handle *mqh)
+{
+	pq_mq_tuple_handle = mqh;
+}
+
+/*
+ * Check if tuples can be sent through tuple shared-memory
+ * message queue.
+ */
+bool
+is_tuple_shm_mq_enabled(void)
+{
+	return pq_mq_tuple_handle ? true : false;
+}
+
+/*
  * Arrange to SendProcSignal() to the parallel master each time we transmit
  * message data via the shm_mq.
  */
@@ -161,6 +183,42 @@ mq_putmessage(char msgtype, const char *s, size_t len)
 	return 0;
 }
 
+/*
+ * Transmit a libpq protocol message to the shared memory message queue
+ * via pq_mq_tuple_handle.  We don't include a length word, because the
+ * receiver will know the length of the message from shm_mq_receive().
+ */
+int
+mq_putmessage_direct(char msgtype, const char *s, size_t len)
+{
+	shm_mq_iovec	iov[2];
+	shm_mq_result	result;
+
+	iov[0].data = &msgtype;
+	iov[0].len = 1;
+	iov[1].data = s;
+	iov[1].len = len;
+
+	Assert(pq_mq_tuple_handle != NULL);
+
+	for (;;)
+	{
+		result = shm_mq_sendv(pq_mq_tuple_handle, iov, 2, true);
+
+		if (result != SHM_MQ_WOULD_BLOCK)
+			break;
+
+		WaitLatch(&MyProc->procLatch, WL_LATCH_SET, 0);
+		CHECK_FOR_INTERRUPTS();
+		ResetLatch(&MyProc->procLatch);
+	}
+
+	Assert(result == SHM_MQ_SUCCESS || result == SHM_MQ_DETACHED);
+	if (result != SHM_MQ_SUCCESS)
+		return EOF;
+	return 0;
+}
+
 static void
 mq_putmessage_noblock(char msgtype, const char *s, size_t len)
 {
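Annotation on mq_putmessage_direct() above: the message is sent as two segments, a one-byte type followed by the payload, with no length word, because the queue itself reports the message length to the receiver. The following is a minimal standalone sketch of that framing, not the shm_mq API: Segment, gather() and frame_message() are hypothetical names, and a plain buffer stands in for the shared-memory queue.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical analogue of shm_mq_iovec: one piece of a message. */
typedef struct Segment
{
	const char *data;
	size_t		len;
} Segment;

/*
 * Copy the segments back to back into buf.  Returns the total number of
 * bytes written, or 0 if the message does not fit.
 */
static size_t
gather(const Segment *segs, int nsegs, char *buf, size_t bufsize)
{
	size_t		used = 0;
	int			i;

	assert(buf != NULL);

	for (i = 0; i < nsegs; i++)
	{
		if (segs[i].len > bufsize - used)
			return 0;
		memcpy(buf + used, segs[i].data, segs[i].len);
		used += segs[i].len;
	}
	return used;
}

/*
 * Frame a message as <type byte><payload>.  No length word is stored;
 * the return value plays the role of the length that the transport
 * would report to the receiver.
 */
static size_t
frame_message(char msgtype, const char *payload, size_t len,
			  char *buf, size_t bufsize)
{
	Segment		segs[2];

	segs[0].data = &msgtype;
	segs[0].len = 1;
	segs[1].data = payload;
	segs[1].len = len;

	return gather(segs, 2, buf, bufsize);
}
```

Framing a CommandComplete-style message with the tag "SELECT" (seven bytes including the terminating NUL) yields an eight-byte message whose first byte is 'C'.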
diff --git a/src/backend/optimizer/path/Makefile b/src/backend/optimizer/path/Makefile
index 6864a62..6e462b1 100644
--- a/src/backend/optimizer/path/Makefile
+++ b/src/backend/optimizer/path/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = allpaths.o clausesel.o costsize.o equivclass.o indxpath.o \
-       joinpath.o joinrels.o pathkeys.o tidpath.o
+       joinpath.o joinrels.o pathkeys.o parallelpath.o tidpath.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 58d78e6..528727c 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -410,6 +410,9 @@ set_plain_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
 	/* Consider sequential scan */
 	add_path(rel, create_seqscan_path(root, rel, required_outer));
 
+	/* Consider parallel scans */
+	create_parallelscan_paths(root, rel);
+
 	/* Consider index scans */
 	create_index_paths(root, rel);
 
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 020558b..4abfd25 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -11,6 +11,9 @@
  *	cpu_tuple_cost		Cost of typical CPU time to process a tuple
  *	cpu_index_tuple_cost  Cost of typical CPU time to process an index tuple
  *	cpu_operator_cost	Cost of CPU time to execute an operator or function
+ *	cpu_tuple_comm_cost	Cost of CPU time to pass a tuple from worker to master backend
+ *	parallel_setup_cost	Cost of setting up shared memory for parallelism
+ *	parallel_startup_cost  Cost of starting up parallel workers
  *
  * We expect that the kernel will typically do some amount of read-ahead
  * optimization; this in conjunction with seek costs means that seq_page_cost
@@ -101,11 +104,16 @@ double		random_page_cost = DEFAULT_RANDOM_PAGE_COST;
 double		cpu_tuple_cost = DEFAULT_CPU_TUPLE_COST;
 double		cpu_index_tuple_cost = DEFAULT_CPU_INDEX_TUPLE_COST;
 double		cpu_operator_cost = DEFAULT_CPU_OPERATOR_COST;
+double		cpu_tuple_comm_cost = DEFAULT_CPU_TUPLE_COMM_COST;
+double		parallel_setup_cost = DEFAULT_PARALLEL_SETUP_COST;
+double		parallel_startup_cost = DEFAULT_PARALLEL_STARTUP_COST;
 
 int			effective_cache_size = DEFAULT_EFFECTIVE_CACHE_SIZE;
 
 Cost		disable_cost = 1.0e10;
 
+int	parallel_seqscan_degree = 0;
+
 bool		enable_seqscan = true;
 bool		enable_indexscan = true;
 bool		enable_indexonlyscan = true;
@@ -219,6 +227,73 @@ cost_seqscan(Path *path, PlannerInfo *root,
 }
 
 /*
+ * cost_parallelseqscan
+ *	  Determines and returns the cost of scanning a relation in parallel.
+ *
+ * 'baserel' is the relation to be scanned
+ * 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
+ */
+void
+cost_parallelseqscan(ParallelSeqPath *path, PlannerInfo *root,
+			 RelOptInfo *baserel, ParamPathInfo *param_info, int nWorkers)
+{
+	Cost		startup_cost = 0;
+	Cost		run_cost = 0;
+	double		spc_seq_page_cost;
+	QualCost	qpqual_cost;
+	Cost		cpu_per_tuple;
+
+	/* Should only be applied to base relations */
+	Assert(baserel->relid > 0);
+	Assert(baserel->rtekind == RTE_RELATION);
+
+	/* Mark the path with the correct row estimate */
+	if (param_info)
+		path->path.rows = param_info->ppi_rows;
+	else
+		path->path.rows = baserel->rows;
+
+	if (!enable_seqscan)
+		startup_cost += disable_cost;
+
+	/* fetch estimated page cost for tablespace containing table */
+	get_tablespace_page_costs(baserel->reltablespace,
+							  NULL,
+							  &spc_seq_page_cost);
+
+	/*
+	 * disk costs
+	 */
+	run_cost += spc_seq_page_cost * baserel->pages;
+
+	/* CPU costs */
+	get_restriction_qual_cost(root, baserel, param_info, &qpqual_cost);
+
+	startup_cost += qpqual_cost.startup;
+	cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple;
+	run_cost += cpu_per_tuple * baserel->tuples;
+
+	/*
+	 * Runtime cost will be equally shared by all workers.
+	 * The assumption here is that disk access cost will also be
+	 * equally shared between workers, which is generally true
+	 * unless there are too many workers working on a relatively
+	 * small number of blocks.  If we come across any such case,
+	 * then we can think of changing the current cost model for
+	 * parallel sequential scan.
+	 */
+	run_cost = run_cost / (nWorkers + 1);
+
+	/* Parallel setup and communication cost. */
+	startup_cost += parallel_setup_cost;
+	startup_cost += parallel_startup_cost * nWorkers;
+	run_cost += cpu_tuple_comm_cost * baserel->tuples;
+
+	path->path.startup_cost = startup_cost;
+	path->path.total_cost = (startup_cost + run_cost);
+}
+
+/*
  * cost_index
  *	  Determines and returns the cost of scanning a relation using an index.
  *
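Annotation on cost_parallelseqscan() above: a minimal standalone sketch of the same arithmetic, so the effect of each parameter can be checked by hand. ScanInputs, ScanCost and parallel_seqscan_cost() are hypothetical names used only for illustration; they are not planner structures, and the enable_seqscan penalty is left out.

```c
#include <assert.h>

/* Hypothetical stand-ins for the planner inputs used by the patch. */
typedef struct ScanInputs
{
	double		pages;			/* baserel->pages */
	double		tuples;			/* baserel->tuples */
	double		seq_page_cost;	/* per-page disk cost */
	double		cpu_per_tuple;	/* cpu_tuple_cost + qual per-tuple cost */
	double		qual_startup;	/* qual startup cost */
} ScanInputs;

typedef struct ScanCost
{
	double		startup;
	double		total;
} ScanCost;

static ScanCost
parallel_seqscan_cost(const ScanInputs *in, int nworkers,
					  double comm_cost, double setup_cost,
					  double worker_startup_cost)
{
	ScanCost	c;
	double		run;

	assert(nworkers >= 0);

	/* Disk and CPU cost of a plain sequential scan. */
	run = in->seq_page_cost * in->pages + in->cpu_per_tuple * in->tuples;

	/* Shared equally by the workers plus the master backend. */
	run /= (nworkers + 1);

	/* Every tuple is charged for the trip from worker to master. */
	run += comm_cost * in->tuples;

	c.startup = in->qual_startup + setup_cost +
		worker_startup_cost * nworkers;
	c.total = c.startup + run;
	return c;
}
```

With 1000 pages, 100000 tuples, seq_page_cost 1.0, 0.01 CPU cost per tuple, 3 workers, and the sample settings from this patch (cpu_tuple_comm_cost = 0.1, both parallel costs 0.0), the shared run cost is 2000 / 4 = 500 and the communication term adds 10000, for a total of 10500. Under the same assumptions the unshared scan costs 2000, so the communication term decides whether the parallel path wins.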
diff --git a/src/backend/optimizer/path/parallelpath.c b/src/backend/optimizer/path/parallelpath.c
new file mode 100644
index 0000000..5245652
--- /dev/null
+++ b/src/backend/optimizer/path/parallelpath.c
@@ -0,0 +1,126 @@
+/*-------------------------------------------------------------------------
+ *
+ * parallelpath.c
+ *	  Routines to determine which conditions are usable for scanning
+ *	  a given relation, and create ParallelPaths accordingly.
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/optimizer/path/parallelpath.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "optimizer/cost.h"
+#include "optimizer/pathnode.h"
+#include "optimizer/paths.h"
+#include "optimizer/restrictinfo.h"
+#include "optimizer/clauses.h"
+
+
+/*
+ *	IsTargetListContainNonVars -
+ *		Check if target list contains non-var entries.
+ */
+static bool
+IsTargetListContainNonVars(List *targetlist)
+{
+	ListCell   *l;
+
+	foreach(l, targetlist)
+	{
+		TargetEntry *te = (TargetEntry *) lfirst(l);
+
+		if (!IsA(te, TargetEntry))
+			continue;			/* probably should never happen */
+		if (!IsA(te->expr, Var))
+			return true;
+	}
+	return false;
+}
+
+/*
+ *	check_simple_qual -
+ *		Check if qual is made only of simple things we can
+ *		hand out directly to backend worker for execution.
+ *
+ *		XXX - Currently we don't allow pushing down an expression
+ *		if it contains a volatile function; however, eventually we
+ *		need a mechanism (proisparallel) with which we can distinguish
+ *		the functions that can be pushed down for execution by a
+ *		parallel worker.
+ */
+static bool
+check_simple_qual(Node *node)
+{
+	if (node == NULL)
+		return TRUE;
+
+	if (contain_volatile_functions(node))
+		return FALSE;
+
+	return TRUE;
+}
+
+/*
+ * create_parallelscan_paths
+ *	  Create paths corresponding to parallel scans of the given rel.
+ *	  Currently we only support parallel sequential scan.
+ *
+ *	  Candidate paths are added to the rel's pathlist (using add_path).
+ */
+void
+create_parallelscan_paths(PlannerInfo *root, RelOptInfo *rel)
+{
+	int num_parallel_workers = 0;
+
+	/*
+	 * parallel scan is possible only if user has set
+	 * parallel_seqscan_degree to value greater than 0.
+	 */
+	if (parallel_seqscan_degree <= 0)
+		return;
+
+	/*
+	 * parallel scan is not supported for joins.
+	 */
+	if (root->simple_rel_array_size > 2)
+		return;
+
+	/* parallel scan is supported only for SELECT statements. */
+	if (root->parse->commandType != CMD_SELECT)
+		return;
+
+	/*
+	 * parallel scan is not supported for non-var target list.
+	 *
+	 * XXX - This is to keep the implementation simple; we can do this
+	 * in future.  Here we are checking by passing root->parse->targetList
+	 * instead of rel->reltargetlist because rel->reltargetlist always
+	 * contains Vars (refer build_base_rel_tlists).
+	 */
+	if (IsTargetListContainNonVars(root->parse->targetList))
+		return;
+
+	/*
+	 * parallel scan is not supported for quals containing volatile functions
+	 */
+	if (!check_simple_qual((Node*) extract_actual_clauses(rel->baserestrictinfo, false)))
+		return;
+
+	/*
+	 * There should be at least one page to scan for each worker.
+	 */
+	if (parallel_seqscan_degree <= rel->pages)
+		num_parallel_workers = parallel_seqscan_degree;
+	else
+		num_parallel_workers = rel->pages;
+
+	add_path(rel, (Path *) create_parallelseqscan_path(root, rel,
+													   num_parallel_workers));
+}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 655be81..1c7f640 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -58,6 +58,9 @@ static Material *create_material_plan(PlannerInfo *root, MaterialPath *best_path
 static Plan *create_unique_plan(PlannerInfo *root, UniquePath *best_path);
 static SeqScan *create_seqscan_plan(PlannerInfo *root, Path *best_path,
 					List *tlist, List *scan_clauses);
+static Scan *create_parallelseqscan_plan(PlannerInfo *root,
+										 ParallelSeqPath *best_path,
+										 List *tlist, List *scan_clauses);
 static Scan *create_indexscan_plan(PlannerInfo *root, IndexPath *best_path,
 					  List *tlist, List *scan_clauses, bool indexonly);
 static BitmapHeapScan *create_bitmap_scan_plan(PlannerInfo *root,
@@ -100,6 +103,9 @@ static List *order_qual_clauses(PlannerInfo *root, List *clauses);
 static void copy_path_costsize(Plan *dest, Path *src);
 static void copy_plan_costsize(Plan *dest, Plan *src);
 static SeqScan *make_seqscan(List *qptlist, List *qpqual, Index scanrelid);
+static ParallelSeqScan *make_parallelseqscan(List *qptlist, List *qpqual,
+											 Index scanrelid, int nworkers,
+											 BlockNumber nblocksperworker);
 static IndexScan *make_indexscan(List *qptlist, List *qpqual, Index scanrelid,
 			   Oid indexid, List *indexqual, List *indexqualorig,
 			   List *indexorderby, List *indexorderbyorig,
@@ -228,6 +234,7 @@ create_plan_recurse(PlannerInfo *root, Path *best_path)
 	switch (best_path->pathtype)
 	{
 		case T_SeqScan:
+		case T_ParallelSeqScan:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -343,6 +350,13 @@ create_scan_plan(PlannerInfo *root, Path *best_path)
 												scan_clauses);
 			break;
 
+		case T_ParallelSeqScan:
+			plan = (Plan *) create_parallelseqscan_plan(root,
+														(ParallelSeqPath *) best_path,
+														tlist,
+														scan_clauses);
+			break;
+
 		case T_IndexScan:
 			plan = (Plan *) create_indexscan_plan(root,
 												  (IndexPath *) best_path,
@@ -1133,6 +1147,71 @@ create_seqscan_plan(PlannerInfo *root, Path *best_path,
 }
 
 /*
+ * create_worker_seqscan_plan
+ *	 Returns a seqscan plan for the base relation scanned by worker
+ *	 with restriction clauses 'scan_clauses' and targetlist 'tlist'.
+ */
+SeqScan *
+create_worker_seqscan_plan(List *targetList, List *scan_clauses,
+						   BlockNumber startBlock, BlockNumber endBlock)
+{
+	SeqScan    *scan_plan;
+
+	/*
+	 * Pass scan_relid as 1; this is okay for now as a sequential scan
+	 * worker is allowed to operate on just one relation.
+	 * XXX - we should ideally get scanrelid from master backend.
+	 */
+	scan_plan = make_seqscan(targetList,
+							 scan_clauses,
+							 1);
+
+	scan_plan->startblock = startBlock;
+	scan_plan->endblock = endBlock;
+	return scan_plan;
+}
+
+/*
+ * create_parallelseqscan_plan
+ *	 Returns a parallel seqscan plan for the base relation scanned by 'best_path'
+ *	 with restriction clauses 'scan_clauses' and targetlist 'tlist'.
+ */
+static Scan *
+create_parallelseqscan_plan(PlannerInfo *root, ParallelSeqPath *best_path,
+					List *tlist, List *scan_clauses)
+{
+	Scan    *scan_plan;
+	Index		scan_relid = best_path->path.parent->relid;
+
+	/* it should be a base rel... */
+	Assert(scan_relid > 0);
+	Assert(best_path->path.parent->rtekind == RTE_RELATION);
+
+	/* Sort clauses into best execution order */
+	scan_clauses = order_qual_clauses(root, scan_clauses);
+
+	/* Reduce RestrictInfo list to bare expressions; ignore pseudoconstants */
+	scan_clauses = extract_actual_clauses(scan_clauses, false);
+
+	/* Replace any outer-relation variables with nestloop params */
+	if (best_path->path.param_info)
+	{
+		scan_clauses = (List *)
+			replace_nestloop_params(root, (Node *) scan_clauses);
+	}
+
+	scan_plan = (Scan *) make_parallelseqscan(tlist,
+											  scan_clauses,
+											  scan_relid,
+											  best_path->num_workers,
+											  best_path->num_blocks_per_worker);
+
+	copy_path_costsize(&scan_plan->plan, &best_path->path);
+
+	return scan_plan;
+}
+
+/*
  * create_indexscan_plan
  *	  Returns an indexscan plan for the base relation scanned by 'best_path'
  *	  with restriction clauses 'scan_clauses' and targetlist 'tlist'.
@@ -3314,6 +3393,30 @@ make_seqscan(List *qptlist,
 	plan->lefttree = NULL;
 	plan->righttree = NULL;
 	node->scanrelid = scanrelid;
+	node->startblock = InvalidBlockNumber;
+	node->endblock = InvalidBlockNumber;
+
+	return node;
+}
+
+static ParallelSeqScan *
+make_parallelseqscan(List *qptlist,
+			   List *qpqual,
+			   Index scanrelid,
+			   int nworkers,
+			   BlockNumber nblocksperworker)
+{
+	ParallelSeqScan *node = makeNode(ParallelSeqScan);
+	Plan	   *plan = &node->scan.plan;
+
+	/* cost should be inserted by caller */
+	plan->targetlist = qptlist;
+	plan->qual = qpqual;
+	plan->lefttree = NULL;
+	plan->righttree = NULL;
+	node->scan.scanrelid = scanrelid;
+	node->num_workers = nworkers;
+	node->num_blocks_per_worker = nblocksperworker;
 
 	return node;
 }
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 9cbbcfb..d2b1621 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -260,6 +260,71 @@ standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
 	return result;
 }
 
+/*
+ * create_worker_seqscan_plannedstmt
+ *	Returns a planned statement to be used by worker for execution.
+ *	Ideally, master backend should form worker's planned statement
+ *	and pass the same to worker, however for now master backend
+ *	just passes the required information and PlannedStmt is then
+ *	constructed by worker.
+ */
+PlannedStmt	*
+create_worker_seqscan_plannedstmt(worker_stmt *workerstmt)
+{
+	AclMode		required_access = ACL_SELECT;
+	RangeTblEntry *rte;
+	SeqScan    *scan_plan;
+	PlannedStmt	*result;
+	ListCell   *tlist;
+
+	rte = makeNode(RangeTblEntry);
+	rte->rtekind = RTE_RELATION;
+	rte->relid = workerstmt->relId;
+	rte->relkind = 'r';
+	rte->requiredPerms = required_access;
+
+	/* Fill in opfuncid values if missing */
+	fix_opfuncids((Node*) workerstmt->qual);
+
+	/*
+	 * Avoid removing junk entries in worker as those are
+	 * required by upper nodes in master backend.
+	 */
+	foreach(tlist, workerstmt->targetList)
+	{
+		TargetEntry *tle = (TargetEntry *) lfirst(tlist);
+
+		tle->resjunk = false;
+	}
+
+	scan_plan = create_worker_seqscan_plan(workerstmt->targetList,
+										   workerstmt->qual,
+										   workerstmt->startBlock,
+										   workerstmt->endBlock);
+
+	/* build the PlannedStmt result */
+	result = makeNode(PlannedStmt);
+
+	result->commandType = CMD_SELECT;
+	result->queryId = 0;
+	result->hasReturning = 0;
+	result->hasModifyingCTE = 0;
+	result->canSetTag = 1;
+	result->transientPlan = 0;
+	result->planTree = (Plan*) scan_plan;
+	result->rtable = list_make1(rte);
+	result->resultRelations = NIL;
+	result->utilityStmt = NULL;
+	result->subplans = NIL;
+	result->rewindPlanIDs = NULL;
+	result->rowMarks = NIL;
+	result->relationOids = lappend_oid(result->relationOids, rte->relid);
+	result->invalItems = NIL;
+	result->nParamExec = 0;
+	result->hasRowSecurity = false;
+
+	return result;
+}
 
 /*--------------------
  * subquery_planner
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 7703946..3a44aef 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -436,6 +436,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_ParallelSeqScan:
 			{
 				SeqScan    *splan = (SeqScan *) plan;
 
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 1395a21..538e612 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -706,6 +706,41 @@ create_seqscan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer)
 }
 
 /*
+ * create_parallelseqscan_path
+ *	  Creates a path corresponding to a parallel sequential scan, returning the
+ *	  pathnode.
+ */
+ParallelSeqPath *
+create_parallelseqscan_path(PlannerInfo *root, RelOptInfo *rel, int nWorkers)
+{
+	ParallelSeqPath	   *pathnode = makeNode(ParallelSeqPath);
+
+	pathnode->path.pathtype = T_ParallelSeqScan;
+	pathnode->path.parent = rel;
+	pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
+													 false);
+	pathnode->path.pathkeys = NIL;	/* seqscan has unordered result */
+
+	pathnode->num_workers = nWorkers;
+	/*
+	 * Divide the work equally among all the workers.  For cases
+	 * where the division is not exact (for example, 10 blocks and
+	 * 3 workers plus the master backend, where the calculation below
+	 * gives 10 / 4 = 2 blocks each), the last worker is responsible for
+	 * scanning the remaining blocks.  We always consider master backend
+	 * as last worker because it will first try to get the tuples
+	 * scanned by other workers.  For calculation of number of blocks
+	 * per worker, an additional worker needs to be considered for
+	 * master backend.
+	 */
+	pathnode->num_blocks_per_worker = rel->pages / (nWorkers + 1);
+
+	cost_parallelseqscan(pathnode, root, rel, pathnode->path.param_info, nWorkers);
+
+	return pathnode;
+}
+
+/*
  * create_index_path
  *	  Creates a path node for an index scan.
  *
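Annotation on create_parallelseqscan_path() above: a minimal standalone sketch of how the per-worker block count translates into scan ranges. The worker range follows the start/end computation in exec_worker_message(); the master's range (everything after the last worker's share) is an assumption based on the comment that the master acts as the last worker, since that code is not part of this hunk. BlockRange and the three functions are hypothetical names, and ranges are half-open: [start, end).

```c
#include <assert.h>

/* Hypothetical half-open block range: blocks start .. end - 1. */
typedef struct BlockRange
{
	unsigned	start;
	unsigned	end;
} BlockRange;

/* Blocks per participant; the master counts as one extra worker. */
static unsigned
blocks_per_participant(unsigned nblocks, int nworkers)
{
	assert(nworkers >= 0);
	return nblocks / (unsigned) (nworkers + 1);
}

/* Range for worker number 0 .. nworkers - 1. */
static BlockRange
worker_range(int worker_number, unsigned per_worker)
{
	BlockRange	r;

	assert(worker_number >= 0);
	r.end = (unsigned) (worker_number + 1) * per_worker;
	r.start = r.end - per_worker;
	return r;
}

/* Assumed range for the master: all blocks left over after the workers. */
static BlockRange
master_range(unsigned nblocks, int nworkers, unsigned per_worker)
{
	BlockRange	r;

	assert(nworkers >= 0);
	r.start = (unsigned) nworkers * per_worker;
	r.end = nblocks;
	return r;
}
```

For 10 blocks and 3 workers the per-participant count is 10 / 4 = 2, so the workers scan [0, 2), [2, 4) and [4, 6), and under the stated assumption the master scans the remaining [6, 10).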
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 71c2321..f056bd5 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -12,7 +12,8 @@ subdir = src/backend/postmaster
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = autovacuum.o bgworker.o bgwriter.o checkpointer.o fork_process.o \
-	pgarch.o pgstat.o postmaster.o startup.o syslogger.o walwriter.o
+OBJS = autovacuum.o backendworker.o bgworker.o bgwriter.o checkpointer.o \
+	fork_process.o pgarch.o pgstat.o postmaster.o startup.o syslogger.o \
+	walwriter.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/backendworker.c b/src/backend/postmaster/backendworker.c
new file mode 100644
index 0000000..d52d1b6
--- /dev/null
+++ b/src/backend/postmaster/backendworker.c
@@ -0,0 +1,224 @@
+/*-------------------------------------------------------------------------
+ *
+ * backendworker.c
+ *	  Support routines for setting up backend workers.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/postmaster/backendworker.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * INTERFACE ROUTINES
+ *		InitiateWorkers				Setup dynamic shared memory and parallel backend workers.
+ */
+#include "postgres.h"
+
+#include "access/xact.h"
+#include "access/parallel.h"
+#include "commands/dbcommands.h"
+#include "commands/async.h"
+#include "executor/nodeParallelSeqscan.h"
+#include "miscadmin.h"
+#include "nodes/parsenodes.h"
+#include "postmaster/backendworker.h"
+#include "storage/ipc.h"
+#include "storage/procsignal.h"
+#include "storage/procarray.h"
+#include "storage/shm_toc.h"
+#include "storage/spin.h"
+#include "tcop/tcopprot.h"
+#include "utils/builtins.h"
+#include "utils/memutils.h"
+#include "utils/resowner.h"
+
+
+#define PARALLEL_TUPLE_QUEUE_SIZE					65536
+
+
+/* Table-of-contents constants for our dynamic shared memory segment. */
+#define PG_WORKER_KEY_RELID			0
+#define PG_WORKER_KEY_TARGETLIST	1
+#define PG_WORKER_KEY_QUAL			2
+#define PG_WORKER_KEY_BLOCKS		3
+#define PARALLEL_KEY_TUPLE_QUEUE	4
+
+static void exec_worker_message(dsm_segment *seg, shm_toc *toc);
+
+/*
+ * InitiateWorkers
+ *		It sets up the required infrastructure for backend workers to
+ *	perform execution and return results to the main backend.
+ */
+void
+InitiateWorkers(Oid relId, List *targetList, List *qual,
+				shm_mq_handle ***responseqp, ParallelContext **pcxtp,
+				BlockNumber numBlocksPerWorker, int nWorkers)
+{
+	bool		already_in_parallel_mode = IsInParallelMode();
+	int			i;
+	Size		targetlist_len, qual_len;
+	BlockNumber	*num_blocks_per_worker;
+	Oid		   *reliddata;
+	char	   *targetlistdata;
+	char	   *targetlist_str;
+	char	   *qualdata;
+	char	   *qual_str;
+	char	   *tuple_queue_space;
+	ParallelContext *pcxt;
+	shm_mq	   *mq;
+
+	if (!already_in_parallel_mode)
+		EnterParallelMode();
+
+	pcxt = CreateParallelContext(exec_worker_message, nWorkers);
+
+	/* Estimate space for parallel seq. scan specific contents. */
+	shm_toc_estimate_chunk(&pcxt->estimator, sizeof(relId));
+
+	targetlist_str = nodeToString(targetList);
+	targetlist_len = strlen(targetlist_str) + 1;
+	shm_toc_estimate_chunk(&pcxt->estimator, targetlist_len);
+
+	qual_str = nodeToString(qual);
+	qual_len = strlen(qual_str) + 1;
+	shm_toc_estimate_chunk(&pcxt->estimator, qual_len);
+
+	shm_toc_estimate_chunk(&pcxt->estimator, sizeof(BlockNumber));
+
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   (Size) PARALLEL_TUPLE_QUEUE_SIZE * nWorkers);
+
+	/* 5 keys for parallel seq. scan specific data. */
+	shm_toc_estimate_keys(&pcxt->estimator, 5);
+
+	InitializeParallelDSM(pcxt);
+
+	/* Store scan relation id in dynamic shared memory. */
+	reliddata = shm_toc_allocate(pcxt->toc, sizeof(Oid));
+	*reliddata = relId;
+	shm_toc_insert(pcxt->toc, PG_WORKER_KEY_RELID, reliddata);
+
+	/* Store target list in dynamic shared memory. */
+	targetlistdata = shm_toc_allocate(pcxt->toc, targetlist_len);
+	memcpy(targetlistdata, targetlist_str, targetlist_len);
+	shm_toc_insert(pcxt->toc, PG_WORKER_KEY_TARGETLIST, targetlistdata);
+
+	/* Store qual list in dynamic shared memory. */
+	qualdata = shm_toc_allocate(pcxt->toc, qual_len);
+	memcpy(qualdata, qual_str, qual_len);
+	shm_toc_insert(pcxt->toc, PG_WORKER_KEY_QUAL, qualdata);
+
+	/* Store blocks to be scanned by each worker in dynamic shared memory. */
+	num_blocks_per_worker = shm_toc_allocate(pcxt->toc, sizeof(BlockNumber));
+	*num_blocks_per_worker = numBlocksPerWorker;
+	shm_toc_insert(pcxt->toc, PG_WORKER_KEY_BLOCKS, num_blocks_per_worker);
+
+	/* Allocate memory for shared memory queue handles. */
+	*responseqp = (shm_mq_handle**) palloc(nWorkers * sizeof(shm_mq_handle*));
+
+	/*
+	 * Establish one message queue per worker in dynamic shared memory.
+	 * These queues should be used to transmit tuple data.
+	 */
+	tuple_queue_space =
+	   shm_toc_allocate(pcxt->toc, PARALLEL_TUPLE_QUEUE_SIZE * pcxt->nworkers);
+	for (i = 0; i < pcxt->nworkers; ++i)
+	{
+		mq = shm_mq_create(tuple_queue_space + i * PARALLEL_TUPLE_QUEUE_SIZE,
+						   (Size) PARALLEL_TUPLE_QUEUE_SIZE);
+
+		shm_mq_set_receiver(mq, MyProc);
+
+		/*
+		 * Attach the queue before launching a worker, so that we'll automatically
+		 * detach the queue if we error out.  (Otherwise, the worker might sit
+		 * there trying to write the queue long after we've gone away.)
+		 */
+		(*responseqp)[i] = shm_mq_attach(mq, pcxt->seg, NULL);
+	}
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_TUPLE_QUEUE, tuple_queue_space);
+
+	/* Register backend workers. */
+	LaunchParallelWorkers(pcxt);
+
+	for (i = 0; i < pcxt->nworkers; ++i)
+		shm_mq_set_handle((*responseqp)[i], pcxt->worker[i].bgwhandle);
+
+	/* Return results to caller. */
+	*pcxtp = pcxt;
+}
+
+
+/*
+ * exec_worker_message
+ *
+ * Execute the work assigned to a worker by master backend.
+ */
+void
+exec_worker_message(dsm_segment *seg, shm_toc *toc)
+{
+	char	    *targetlistdata;
+	char		*qualdata;
+	char		*tuple_queue_space;
+	BlockNumber *num_blocks_per_worker;
+	BlockNumber  start_block;
+	BlockNumber  end_block;
+	shm_mq	    *mq;
+	shm_mq_handle *responseq;
+	Oid			*relId;
+	List		*targetList = NIL;
+	List		*qual = NIL;
+	worker_stmt	*workerstmt;
+
+	relId = shm_toc_lookup(toc, PG_WORKER_KEY_RELID);
+	targetlistdata = shm_toc_lookup(toc, PG_WORKER_KEY_TARGETLIST);
+	qualdata = shm_toc_lookup(toc, PG_WORKER_KEY_QUAL);
+	num_blocks_per_worker = shm_toc_lookup(toc, PG_WORKER_KEY_BLOCKS);
+
+	tuple_queue_space = shm_toc_lookup(toc, PARALLEL_KEY_TUPLE_QUEUE);
+	mq = (shm_mq *) (tuple_queue_space +
+		ParallelWorkerNumber * PARALLEL_TUPLE_QUEUE_SIZE);
+
+	shm_mq_set_sender(mq, MyProc);
+	responseq = shm_mq_attach(mq, seg, NULL);
+
+	end_block = (ParallelWorkerNumber + 1) * (*num_blocks_per_worker);
+	start_block = end_block - (*num_blocks_per_worker);
+
+	/* Redirect protocol messages to responseq. */
+	pq_redirect_to_tuple_shm_mq(responseq);
+
+	/* Restore targetList and qual passed by main backend. */
+	targetList = (List *) stringToNode(targetlistdata);
+	qual = (List *) stringToNode(qualdata);
+
+	workerstmt = palloc(sizeof(worker_stmt));
+
+	workerstmt->relId = *relId;
+	workerstmt->targetList = targetList;
+	workerstmt->qual = qual;
+	workerstmt->startBlock = start_block;
+
+	/*
+	 * Last worker should scan all the remaining blocks.
+	 *
+	 * XXX - It is possible that expected number of workers
+	 * won't get started, so to handle such cases master
+	 * backend should scan remaining blocks.
+	 */
+	workerstmt->endBlock = end_block;
+
+	/* Execute the worker command. */
+	exec_worker_stmt(workerstmt);
+
+	/*
+	 * Once we are done with sending tuples, detach from
+	 * shared memory message queue used to send tuples.
+	 */
+	shm_mq_detach(mq);
+}
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 47ed84c..994eeba 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -103,6 +103,7 @@
 #include "miscadmin.h"
 #include "pg_getopt.h"
 #include "pgstat.h"
+#include "optimizer/cost.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/bgworker_internals.h"
 #include "postmaster/fork_process.h"
@@ -835,6 +836,12 @@ PostmasterMain(int argc, char *argv[])
 		ereport(ERROR,
 				(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"archive\", \"hot_standby\", or \"logical\"")));
 
+	if (parallel_seqscan_degree >= MaxConnections)
+	{
+		write_stderr("%s: parallel_seqscan_degree must be less than max_connections\n", progname);
+		ExitPostmaster(1);
+	}
+
 	/*
 	 * Other one-time internal sanity checks can go here, if they are fast.
 	 * (Put any slow processing further down, after postmaster.pid creation.)
diff --git a/src/backend/tcop/dest.c b/src/backend/tcop/dest.c
index bcf3895..da6e099 100644
--- a/src/backend/tcop/dest.c
+++ b/src/backend/tcop/dest.c
@@ -148,10 +148,19 @@ EndCommand(const char *commandTag, CommandDest dest)
 		case DestRemoteExecute:
 
 			/*
-			 * We assume the commandTag is plain ASCII and therefore requires
-			 * no encoding conversion.
+			 * Send the message via the shared-memory tuple queue, if it
+			 * is enabled.
 			 */
-			pq_putmessage('C', commandTag, strlen(commandTag) + 1);
+			if (is_tuple_shm_mq_enabled())
+				mq_putmessage_direct('C', commandTag, strlen(commandTag) + 1);
+			else
+			{
+				/*
+				 * We assume the commandTag is plain ASCII and therefore requires
+				 * no encoding conversion.
+				 */
+				pq_putmessage('C', commandTag, strlen(commandTag) + 1);
+			}
 			break;
 
 		case DestNone:
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index bbad0dc..411f150 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -55,6 +55,7 @@
 #include "pg_getopt.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/postmaster.h"
+#include "postmaster/backendworker.h"
 #include "replication/slot.h"
 #include "replication/walsender.h"
 #include "rewrite/rewriteHandler.h"
@@ -1132,6 +1133,100 @@ exec_simple_query(const char *query_string)
 }
 
 /*
+ * exec_worker_stmt
+ *
+ * Execute the plan for backend worker.
+ */
+void
+exec_worker_stmt(worker_stmt *workerstmt)
+{
+	Portal		portal;
+	int16		format = 1;
+	DestReceiver *receiver;
+	bool		isTopLevel = true;
+	PlannedStmt	*planned_stmt;
+	MemoryContext oldcontext;
+	MemoryContext	plancontext;
+
+	set_ps_display("SELECT", false);
+	BeginCommand("SELECT", DestNone);
+
+	/*
+	 * Unlike exec_simple_query(), in backend worker we won't allow
+	 * transaction control statements, so we can allow plancontext
+	 * to be created in TopTransaction context.
+	 */
+	plancontext = AllocSetContextCreate(CurrentMemoryContext,
+										 "worker plan",
+										 ALLOCSET_DEFAULT_MINSIZE,
+										 ALLOCSET_DEFAULT_INITSIZE,
+										 ALLOCSET_DEFAULT_MAXSIZE);
+
+	oldcontext = MemoryContextSwitchTo(plancontext);
+
+	planned_stmt = create_worker_seqscan_plannedstmt(workerstmt);
+	/*
+	 * Create unnamed portal to run the query or queries in. If there
+	 * already is one, silently drop it.
+	 */
+	portal = CreatePortal("", true, true);
+	/* Don't display the portal in pg_cursors */
+	portal->visible = false;
+
+	/*
+	 * We don't have to copy anything into the portal, because everything
+	 * we are passing here is in MessageContext, which will outlive the
+	 * portal anyway.
+	 */
+	PortalDefineQuery(portal,
+					  NULL,
+					  "",
+					  "",
+					  list_make1(planned_stmt),
+					  NULL);
+
+	/*
+	 * Start the portal.  No parameters here.
+	 */
+	PortalStart(portal, NULL, 0, InvalidSnapshot);
+
+	/* We always use binary format, for efficiency. */
+	PortalSetResultFormat(portal, 1, &format);
+
+	receiver = CreateDestReceiver(DestRemote);
+	SetRemoteDestReceiverParams(receiver, portal);
+
+	/*
+	 * Only once the portal and destreceiver have been established can
+	 * we return to the transaction context.  All that stuff needs to
+	 * survive an internal commit inside PortalRun!
+	 */
+	MemoryContextSwitchTo(oldcontext);
+
+	/*
+	 * Run the portal to completion, and then drop it (and the receiver).
+	 */
+	(void) PortalRun(portal,
+					 FETCH_ALL,
+					 isTopLevel,
+					 receiver,
+					 receiver,
+					 NULL);
+
+	(*receiver->rDestroy) (receiver);
+
+	PortalDrop(portal, false);
+
+	/*
+	 * Send appropriate CommandComplete to client.  There is no
+	 * need to send a completion tag from the worker, as it won't be
+	 * of any use: the completion tag of the master backend
+	 * will be used for sending to client.
+	 */
+	EndCommand("", DestRemote);
+}
+
+/*
  * exec_parse_message
  *
  * Execute a "Parse" protocol message.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index d9bfa25..b8f90b7 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -630,6 +630,8 @@ const char *const config_group_names[] =
 	gettext_noop("Statistics / Query and Index Statistics Collector"),
 	/* AUTOVACUUM */
 	gettext_noop("Autovacuum"),
+	/* PARALLEL_QUERY */
+	gettext_noop("parallel_seqscan_degree"),
 	/* CLIENT_CONN */
 	gettext_noop("Client Connection Defaults"),
 	/* CLIENT_CONN_STATEMENT */
@@ -2445,6 +2447,16 @@ static struct config_int ConfigureNamesInt[] =
 	},
 
 	{
+		{"parallel_seqscan_degree", PGC_SUSET, PARALLEL_QUERY,
+			gettext_noop("Sets the maximum number of simultaneously running backend worker processes."),
+			NULL
+		},
+		&parallel_seqscan_degree,
+		0, 0, MAX_BACKENDS,
+		NULL, NULL, NULL
+	},
+
+	{
 		{"autovacuum_work_mem", PGC_SIGHUP, RESOURCES_MEM,
 			gettext_noop("Sets the maximum memory to be used by each autovacuum worker process."),
 			NULL,
@@ -2632,6 +2644,36 @@ static struct config_real ConfigureNamesReal[] =
 		DEFAULT_CPU_OPERATOR_COST, 0, DBL_MAX,
 		NULL, NULL, NULL
 	},
+	{
+		{"cpu_tuple_comm_cost", PGC_USERSET, QUERY_TUNING_COST,
+			gettext_noop("Sets the planner's estimate of the cost of "
+						 "passing each tuple (row) from worker to master backend."),
+			NULL
+		},
+		&cpu_tuple_comm_cost,
+		DEFAULT_CPU_TUPLE_COMM_COST, 0, DBL_MAX,
+		NULL, NULL, NULL
+	},
+	{
+		{"parallel_setup_cost", PGC_USERSET, QUERY_TUNING_COST,
+			gettext_noop("Sets the planner's estimate of the cost of "
+						 "setting up environment (shared memory) for parallelism."),
+			NULL
+		},
+		&parallel_setup_cost,
+		DEFAULT_PARALLEL_SETUP_COST, 0, DBL_MAX,
+		NULL, NULL, NULL
+	},
+	{
+		{"parallel_startup_cost", PGC_USERSET, QUERY_TUNING_COST,
+			gettext_noop("Sets the planner's estimate of the cost of "
+						 "starting parallel workers."),
+			NULL
+		},
+		&parallel_startup_cost,
+		DEFAULT_PARALLEL_STARTUP_COST, 0, DBL_MAX,
+		NULL, NULL, NULL
+	},
 
 	{
 		{"cursor_tuple_fraction", PGC_USERSET, QUERY_TUNING_OTHER,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index b053659..784cfe0 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -287,6 +287,9 @@
 #cpu_tuple_cost = 0.01			# same scale as above
 #cpu_index_tuple_cost = 0.005		# same scale as above
 #cpu_operator_cost = 0.0025		# same scale as above
+#cpu_tuple_comm_cost = 0.1		# same scale as above
+#parallel_setup_cost = 0.0	# same scale as above
+#parallel_startup_cost = 0.0	# same scale as above
 #effective_cache_size = 4GB
 
 # - Genetic Query Optimizer -
@@ -497,6 +500,11 @@
 					# autovacuum, -1 means use
 					# vacuum_cost_limit
 
+#------------------------------------------------------------------------------
+# PARALLEL_QUERY PARAMETERS
+#------------------------------------------------------------------------------
+
+#parallel_seqscan_degree = 0		# max number of worker backend subprocesses
 
 #------------------------------------------------------------------------------
 # CLIENT CONNECTION DEFAULTS
diff --git a/src/include/access/parallel.h b/src/include/access/parallel.h
index 761ba1f..00ad468 100644
--- a/src/include/access/parallel.h
+++ b/src/include/access/parallel.h
@@ -45,6 +45,8 @@ typedef struct ParallelContext
 
 extern bool ParallelMessagePending;
 
+extern int ParallelWorkerNumber;
+
 extern ParallelContext *CreateParallelContext(parallel_worker_main_type entrypoint, int nworkers);
 extern ParallelContext *CreateParallelContextForExtension(char *library_name,
 								  char *function_name, int nworkers);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 9bb6362..3c56b49 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -105,4 +105,13 @@ typedef struct SysScanDescData
 	Snapshot	snapshot;		/* snapshot to unregister at end of scan */
 }	SysScanDescData;
 
+/* struct for scanning shared memory queues */
+typedef struct ShmScanDescData
+{
+	/* scan current state */
+	int			num_shm_queues;	/* number of shared memory queues used in scan. */
+	int			ss_cqueue;		/* current queue # in scan, if any */
+	bool		shmscan_inited;		/* false = scan not init'd yet */
+}	ShmScanDescData;
+
 #endif   /* RELSCAN_H */
diff --git a/src/include/access/shmmqam.h b/src/include/access/shmmqam.h
new file mode 100644
index 0000000..df56cfe
--- /dev/null
+++ b/src/include/access/shmmqam.h
@@ -0,0 +1,44 @@
+/*-------------------------------------------------------------------------
+ *
+ * shmmqam.h
+ *	  POSTGRES shared memory queue access method definitions.
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/shmmqam.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef SHMMQAM_H
+#define SHMMQAM_H
+
+#include "access/relscan.h"
+#include "libpq/pqmq.h"
+
+
+/* Private state maintained across calls to shm_getnext. */
+typedef struct worker_result_state
+{
+	FmgrInfo   *receive_functions;
+	Oid		   *typioparams;
+	HeapTuple  tuple;
+	int		   num_shm_queues;
+	bool	   *has_row_description;
+	bool	   *queue_detached;
+	bool	   all_queues_detached;
+	bool	   all_heap_fetched;
+} worker_result_state;
+
+typedef struct worker_result_state *worker_result;
+
+typedef struct ShmScanDescData *ShmScanDesc;
+
+extern worker_result ExecInitWorkerResult(TupleDesc tupdesc, int nWorkers);
+extern ShmScanDesc shm_beginscan(int num_queues);
+extern HeapTuple shm_getnext(HeapScanDesc scanDesc, ShmScanDesc shmScan,
+							 worker_result resultState, shm_mq_handle **responseq,
+							 TupleDesc tupdesc, ScanDirection direction, bool *fromheap);
+
+#endif   /* SHMMQAM_H */
diff --git a/src/include/executor/nodeParallelSeqscan.h b/src/include/executor/nodeParallelSeqscan.h
new file mode 100644
index 0000000..b638a24
--- /dev/null
+++ b/src/include/executor/nodeParallelSeqscan.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeParallelSeqscan.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeParallelSeqscan.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEPARALLELSEQSCAN_H
+#define NODEPARALLELSEQSCAN_H
+
+#include "nodes/execnodes.h"
+
+extern ParallelSeqScanState *ExecInitParallelSeqScan(ParallelSeqScan *node, EState *estate, int eflags);
+extern TupleTableSlot *ExecParallelSeqScan(ParallelSeqScanState *node);
+extern void ExecEndParallelSeqScan(ParallelSeqScanState *node);
+
+extern Size EstimateScanRelationIdSpace(Oid relId);
+extern void SerializeScanRelationId(Oid relId, Size maxsize,
+									char *start_address);
+extern void RestoreScanRelationId(Oid *relId, char *start_address);
+
+extern Size EstimateTargetListSpace(List *targetList);
+extern void SerializeTargetList(List *targetList, Size maxsize,
+								char *start_address);
+extern void RestoreTargetList(List **targetList, char *start_address);
+
+#endif   /* NODEPARALLELSEQSCAN_H */
diff --git a/src/include/executor/tuptable.h b/src/include/executor/tuptable.h
index 48f84bf..e5dec1e 100644
--- a/src/include/executor/tuptable.h
+++ b/src/include/executor/tuptable.h
@@ -127,6 +127,8 @@ typedef struct TupleTableSlot
 	MinimalTuple tts_mintuple;	/* minimal tuple, or NULL if none */
 	HeapTupleData tts_minhdr;	/* workspace for minimal-tuple-only case */
 	long		tts_off;		/* saved state for slot_deform_tuple */
+	bool		tts_fromheap;	/* indicates whether the tuple is fetched from
+								   heap or shared memory message queue */
 } TupleTableSlot;
 
 #define TTS_HAS_PHYSICAL_TUPLE(slot)  \
diff --git a/src/include/libpq/pqmq.h b/src/include/libpq/pqmq.h
index ad7589d..067edbe 100644
--- a/src/include/libpq/pqmq.h
+++ b/src/include/libpq/pqmq.h
@@ -19,6 +19,13 @@
 extern void	pq_redirect_to_shm_mq(shm_mq *, shm_mq_handle *);
 extern void pq_set_parallel_master(pid_t pid, BackendId backend_id);
 
+extern int
+mq_putmessage_direct(char msgtype, const char *s, size_t len);
+extern void
+pq_redirect_to_tuple_shm_mq(shm_mq_handle *mqh);
+extern bool
+is_tuple_shm_mq_enabled(void);
+
 extern void pq_parse_errornotice(StringInfo str, ErrorData *edata);
 
 #endif   /* PQMQ_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 41288ed..86f4731 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -16,9 +16,12 @@
 
 #include "access/genam.h"
 #include "access/heapam.h"
+#include "access/parallel.h"
+#include "access/shmmqam.h"
 #include "executor/instrument.h"
 #include "nodes/params.h"
 #include "nodes/plannodes.h"
+#include "storage/shm_mq.h"
 #include "utils/reltrigger.h"
 #include "utils/sortsupport.h"
 #include "utils/tuplestore.h"
@@ -1212,6 +1215,23 @@ typedef struct ScanState
 typedef ScanState SeqScanState;
 
 /*
+ * ParallelScanState extends ScanState by storing additional information
+ * related to parallel workers.
+ *		dsm_segment		dynamic shared memory segment to setup worker queues
+ *		responseq		shared memory queues to receive data from workers
+ */
+typedef struct ParallelScanState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	ParallelContext *pcxt;
+	shm_mq_handle **responseq;
+	ShmScanDesc pss_currentShmScanDesc;
+	worker_result	pss_workerResult;
+} ParallelScanState;
+
+typedef ParallelScanState ParallelSeqScanState;
+
+/*
  * These structs store information about index quals that don't have simple
  * constant right-hand sides.  See comments for ExecIndexBuildScanKeys()
  * for discussion.
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 97ef0fc..b6f1493 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -51,6 +51,7 @@ typedef enum NodeTag
 	T_BitmapOr,
 	T_Scan,
 	T_SeqScan,
+	T_ParallelSeqScan,
 	T_IndexScan,
 	T_IndexOnlyScan,
 	T_BitmapIndexScan,
@@ -97,6 +98,7 @@ typedef enum NodeTag
 	T_BitmapOrState,
 	T_ScanState,
 	T_SeqScanState,
+	T_ParallelSeqScanState,
 	T_IndexScanState,
 	T_IndexOnlyScanState,
 	T_BitmapIndexScanState,
@@ -217,6 +219,7 @@ typedef enum NodeTag
 	T_IndexOptInfo,
 	T_ParamPathInfo,
 	T_Path,
+	T_ParallelSeqPath,
 	T_IndexPath,
 	T_BitmapHeapPath,
 	T_BitmapAndPath,
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index b1dfa85..5777271 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -23,6 +23,7 @@
 #include "nodes/bitmapset.h"
 #include "nodes/primnodes.h"
 #include "nodes/value.h"
+#include "storage/block.h"
 #include "utils/lockwaitpolicy.h"
 
 /* Possible sources of a Query */
@@ -156,6 +157,15 @@ typedef struct Query
 								 * depends on to be semantically valid */
 } Query;
 
+/* worker statement required for execution. */
+typedef struct worker_stmt
+{
+	Oid			relId;
+	List		*targetList;
+	List		*qual;
+	BlockNumber startBlock;
+	BlockNumber endBlock;
+} worker_stmt;
 
 /****************************************************************************
  *	Supporting data structures for Parse Trees
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 316c9ce..3354398 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -18,6 +18,7 @@
 #include "lib/stringinfo.h"
 #include "nodes/bitmapset.h"
 #include "nodes/primnodes.h"
+#include "storage/block.h"
 #include "utils/lockwaitpolicy.h"
 
 
@@ -269,6 +270,8 @@ typedef struct Scan
 {
 	Plan		plan;
 	Index		scanrelid;		/* relid is index into the range table */
+	BlockNumber startblock;		/* block to start seq scan */
+	BlockNumber endblock;		/* block up to which scan has to be done */
 } Scan;
 
 /* ----------------
@@ -278,6 +281,17 @@ typedef struct Scan
 typedef Scan SeqScan;
 
 /* ----------------
+ *		parallel sequential scan node
+ * ----------------
+ */
+typedef struct ParallelSeqScan
+{
+	Scan		scan;
+	int			num_workers;
+	BlockNumber	num_blocks_per_worker;
+} ParallelSeqScan;
+
+/* ----------------
  *		index scan node
  *
  * indexqualorig is an implicitly-ANDed list of index qual expressions, each
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 6845a40..576add5 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -737,6 +737,13 @@ typedef struct Path
 	/* pathkeys is a List of PathKey nodes; see above */
 } Path;
 
+typedef struct ParallelSeqPath
+{
+	Path		path;
+	int			num_workers;
+	BlockNumber	num_blocks_per_worker;
+} ParallelSeqPath;
+
 /* Macro for extracting a path's parameterization relids; beware double eval */
 #define PATH_REQ_OUTER(path)  \
 	((path)->param_info ? (path)->param_info->ppi_req_outer : (Relids) NULL)
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 9c2000b..0b6a469 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -26,6 +26,14 @@
 #define DEFAULT_CPU_TUPLE_COST	0.01
 #define DEFAULT_CPU_INDEX_TUPLE_COST 0.005
 #define DEFAULT_CPU_OPERATOR_COST  0.0025
+#define DEFAULT_CPU_TUPLE_COMM_COST 0.1
+/*
+ * XXX - We need some experiments to know what could be
+ * appropriate default values for parallel setup and startup
+ * cost.
+ */
+#define	DEFAULT_PARALLEL_SETUP_COST  0.0
+#define	DEFAULT_PARALLEL_STARTUP_COST  0.0
 
 #define DEFAULT_EFFECTIVE_CACHE_SIZE  524288	/* measured in pages */
 
@@ -48,8 +56,12 @@ extern PGDLLIMPORT double random_page_cost;
 extern PGDLLIMPORT double cpu_tuple_cost;
 extern PGDLLIMPORT double cpu_index_tuple_cost;
 extern PGDLLIMPORT double cpu_operator_cost;
+extern PGDLLIMPORT double cpu_tuple_comm_cost;
+extern PGDLLIMPORT double parallel_setup_cost;
+extern PGDLLIMPORT double parallel_startup_cost;
 extern PGDLLIMPORT int effective_cache_size;
 extern Cost disable_cost;
+extern int	parallel_seqscan_degree;
 extern bool enable_seqscan;
 extern bool enable_indexscan;
 extern bool enable_indexonlyscan;
@@ -68,6 +80,8 @@ extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,
 					double index_pages, PlannerInfo *root);
 extern void cost_seqscan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 			 ParamPathInfo *param_info);
+extern void cost_parallelseqscan(ParallelSeqPath *path, PlannerInfo *root,
+			 RelOptInfo *baserel, ParamPathInfo *param_info, int nWorkers);
 extern void cost_index(IndexPath *path, PlannerInfo *root,
 		   double loop_count);
 extern void cost_bitmap_heap_scan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 9923f0e..32c3e0d 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -32,6 +32,8 @@ extern bool add_path_precheck(RelOptInfo *parent_rel,
 
 extern Path *create_seqscan_path(PlannerInfo *root, RelOptInfo *rel,
 					Relids required_outer);
+extern ParallelSeqPath *create_parallelseqscan_path(PlannerInfo *root,
+					RelOptInfo *rel, int nWorkers);
 extern IndexPath *create_index_path(PlannerInfo *root,
 				  IndexOptInfo *index,
 				  List *indexclauses,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 6cad92e..391d519 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -46,6 +46,13 @@ extern void debug_print_rel(PlannerInfo *root, RelOptInfo *rel);
 #endif
 
 /*
+ * parallelpath.c
+ *	  routines to generate parallel scan paths
+ */
+
+extern void create_parallelscan_paths(PlannerInfo *root, RelOptInfo *rel);
+
+/*
  * indxpath.c
  *	  routines to generate index paths
  */
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index 082f7d7..ef5a320 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -41,6 +41,9 @@ extern Plan *optimize_minmax_aggregates(PlannerInfo *root, List *tlist,
  * prototypes for plan/createplan.c
  */
 extern Plan *create_plan(PlannerInfo *root, Path *best_path);
+extern SeqScan *
+create_worker_seqscan_plan(List *targetList, List *scan_clauses,
+						   BlockNumber startBlock, BlockNumber endBlock);
 extern SubqueryScan *make_subqueryscan(List *qptlist, List *qpqual,
 				  Index scanrelid, Plan *subplan);
 extern ForeignScan *make_foreignscan(List *qptlist, List *qpqual,
diff --git a/src/include/optimizer/planner.h b/src/include/optimizer/planner.h
index cd62aec..91ddffe 100644
--- a/src/include/optimizer/planner.h
+++ b/src/include/optimizer/planner.h
@@ -14,6 +14,7 @@
 #ifndef PLANNER_H
 #define PLANNER_H
 
+#include "nodes/parsenodes.h"
 #include "nodes/plannodes.h"
 #include "nodes/relation.h"
 
@@ -29,6 +30,8 @@ extern PlannedStmt *planner(Query *parse, int cursorOptions,
 		ParamListInfo boundParams);
 extern PlannedStmt *standard_planner(Query *parse, int cursorOptions,
 				 ParamListInfo boundParams);
+extern PlannedStmt *
+create_worker_seqscan_plannedstmt(worker_stmt *workerstmt);
 
 extern Plan *subquery_planner(PlannerGlobal *glob, Query *parse,
 				 PlannerInfo *parent_root,
diff --git a/src/include/postmaster/backendworker.h b/src/include/postmaster/backendworker.h
new file mode 100644
index 0000000..8813b6d
--- /dev/null
+++ b/src/include/postmaster/backendworker.h
@@ -0,0 +1,30 @@
+/*--------------------------------------------------------------------
+ * backendworker.h
+ *		POSTGRES backend workers interface
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/postmaster/backendworker.h
+ *--------------------------------------------------------------------
+ */
+#ifndef BACKENDWORKER_H
+#define BACKENDWORKER_H
+
+/*---------------------------------------------------------------------
+ * External module API.
+ *---------------------------------------------------------------------
+ */
+
+#include "libpq/pqmq.h"
+
+extern int	parallel_seqscan_degree;
+extern void InitiateWorkers(Oid relId, List *targetList,
+							List *qual,
+							shm_mq_handle ***responseqp,
+							ParallelContext **pcxtp,
+							BlockNumber numBlocksPerWorker,
+							int nWorkers);
+
+#endif   /* BACKENDWORKER_H */
diff --git a/src/include/tcop/tcopprot.h b/src/include/tcop/tcopprot.h
index 0a350fd..02cf518 100644
--- a/src/include/tcop/tcopprot.h
+++ b/src/include/tcop/tcopprot.h
@@ -83,5 +83,6 @@ extern void set_debug_options(int debug_flag,
 extern bool set_plan_disabling_options(const char *arg,
 						   GucContext context, GucSource source);
 extern const char *get_stats_option_name(const char *arg);
+extern void exec_worker_stmt(worker_stmt *workerstmt);
 
 #endif   /* TCOPPROT_H */
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index cf319af..38855e5 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -85,6 +85,7 @@ enum config_group
 	STATS_MONITORING,
 	STATS_COLLECTOR,
 	AUTOVACUUM,
+	PARALLEL_QUERY,
 	CLIENT_CONN,
 	CLIENT_CONN_STATEMENT,
 	CLIENT_CONN_LOCALE,
#92Thom Brown
thom@linux.com
In reply to: Amit Kapila (#91)
Re: Parallel Seq Scan

On 20 January 2015 at 14:29, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Jan 15, 2015 at 6:57 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jan 12, 2015 at 3:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Yeah, you need two separate global variables pointing to shm_mq
objects, one of which gets used by pqmq.c for errors and the other of
which gets used by printtup.c for tuples.

Okay, I will try to change the way as suggested without doing
switching, but this way we need to do it separately for 'T', 'D', and
'C' messages.

I have taken care of integrating the parallel sequence scan with the
latest patch posted (parallel-mode-v1.patch) by Robert at below
location:

/messages/by-id/CA+TgmoZdUK4K3XHBxc9vM-82khourEZdvQWTfgLhWsd2R2aAGQ@mail.gmail.com

Changes in this version
-----------------------------------------------
1. As mentioned previously, I have exposed one parameter
ParallelWorkerNumber as used in parallel-mode patch.
2. Enabled tuple queue to be used for passing tuples from
worker backend to master backend along with error queue
as per suggestion by Robert in the mail above.
3. Involved master backend to scan the heap directly when
tuples are not available in any shared memory tuple queue.
4. Introduced 3 new parameters (cpu_tuple_comm_cost,
parallel_setup_cost, parallel_startup_cost) for deciding the cost
of parallel plan. Currently, I have kept the default values for
parallel_setup_cost and parallel_startup_cost as 0.0, as those
require some experiments.
5. Fixed some issues (related to memory increase as reported
upthread by Thom Brown and general feature issues found during
test)

Note - I have yet to handle the new node types introduced at some
of the places and need to verify prepared queries and some other
things, however I think it will be good if I can get some feedback
at current stage.
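
To make item 4 above concrete, here is a minimal sketch of how the three new parameters could enter a cost estimate. It assumes the simple model from the start of the thread (serial scan cost divided by the number of workers) plus a per-tuple communication charge and one-off setup and startup charges. It is not the patch's cost_parallelseqscan(); the struct, the function name, and the choice to charge the startup cost once per worker are all assumptions made for illustration.

```c
/*
 * Hypothetical illustration only; NOT the patch's cost_parallelseqscan().
 * Shows one way the three new cost parameters could be combined.
 */
typedef struct ParallelCostParams
{
	double		cpu_tuple_comm_cost;	/* per tuple passed worker -> master */
	double		parallel_setup_cost;	/* shared memory setup, paid once */
	double		parallel_startup_cost;	/* assumed: paid once per worker */
} ParallelCostParams;

double
estimate_parallel_scan_cost(double serial_run_cost, double ntuples,
							int nworkers, const ParallelCostParams *p)
{
	double		cost;

	if (nworkers <= 0)
		return serial_run_cost; /* no workers: plain sequential scan cost */

	cost = serial_run_cost / nworkers;			/* work split across workers */
	cost += p->cpu_tuple_comm_cost * ntuples;	/* every tuple crosses a queue */
	cost += p->parallel_setup_cost;
	cost += p->parallel_startup_cost * nworkers;
	return cost;
}
```

With setup and startup both at 0.0, as in this version of the patch, only the division by the worker count and the per-tuple communication charge would separate the parallel estimate from the serial one.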

Which commit is this based against? I'm getting errors with the latest
master:

thom@swift:~/Development/postgresql$ patch -p1 < ~/Downloads/parallel_seqscan_v4.patch
patching file src/backend/access/Makefile
patching file src/backend/access/common/printtup.c
patching file src/backend/access/shmmq/Makefile
patching file src/backend/access/shmmq/shmmqam.c
patching file src/backend/commands/explain.c
Hunk #1 succeeded at 721 (offset 8 lines).
Hunk #2 succeeded at 918 (offset 8 lines).
Hunk #3 succeeded at 1070 (offset 8 lines).
Hunk #4 succeeded at 1337 (offset 8 lines).
Hunk #5 succeeded at 2239 (offset 83 lines).
patching file src/backend/executor/Makefile
patching file src/backend/executor/execProcnode.c
patching file src/backend/executor/execScan.c
patching file src/backend/executor/execTuples.c
patching file src/backend/executor/nodeParallelSeqscan.c
patching file src/backend/executor/nodeSeqscan.c
patching file src/backend/libpq/pqmq.c
Hunk #1 succeeded at 23 with fuzz 2 (offset -3 lines).
Hunk #2 FAILED at 63.
Hunk #3 succeeded at 132 (offset -31 lines).
1 out of 3 hunks FAILED -- saving rejects to file
src/backend/libpq/pqmq.c.rej
patching file src/backend/optimizer/path/Makefile
patching file src/backend/optimizer/path/allpaths.c
patching file src/backend/optimizer/path/costsize.c
patching file src/backend/optimizer/path/parallelpath.c
patching file src/backend/optimizer/plan/createplan.c
patching file src/backend/optimizer/plan/planner.c
patching file src/backend/optimizer/plan/setrefs.c
patching file src/backend/optimizer/util/pathnode.c
patching file src/backend/postmaster/Makefile
patching file src/backend/postmaster/backendworker.c
patching file src/backend/postmaster/postmaster.c
patching file src/backend/tcop/dest.c
patching file src/backend/tcop/postgres.c
Hunk #1 succeeded at 54 (offset -1 lines).
Hunk #2 succeeded at 1132 (offset -1 lines).
patching file src/backend/utils/misc/guc.c
patching file src/backend/utils/misc/postgresql.conf.sample
can't find file to patch at input line 2105
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--------------------------
|diff --git a/src/include/access/parallel.h b/src/include/access/parallel.h
|index 761ba1f..00ad468 100644
|--- a/src/include/access/parallel.h
|+++ b/src/include/access/parallel.h
--------------------------
File to patch:

--
Thom

#93Amit Kapila
amit.kapila16@gmail.com
In reply to: Thom Brown (#92)
Re: Parallel Seq Scan

On Tue, Jan 20, 2015 at 9:43 PM, Thom Brown <thom@linux.com> wrote:

On 20 January 2015 at 14:29, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Jan 15, 2015 at 6:57 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jan 12, 2015 at 3:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Yeah, you need two separate global variables pointing to shm_mq
objects, one of which gets used by pqmq.c for errors and the other of
which gets used by printtup.c for tuples.

Okay, I will try to change the way as suggested without doing
switching, but this way we need to do it separately for 'T', 'D', and
'C' messages.

I have taken care of integrating the parallel sequence scan with the
latest patch posted (parallel-mode-v1.patch) by Robert at below
location:

/messages/by-id/CA+TgmoZdUK4K3XHBxc9vM-82khourEZdvQWTfgLhWsd2R2aAGQ@mail.gmail.com

Changes in this version
-----------------------------------------------
1. As mentioned previously, I have exposed one parameter
ParallelWorkerNumber as used in parallel-mode patch.
2. Enabled tuple queue to be used for passing tuples from
worker backend to master backend along with error queue
as per suggestion by Robert in the mail above.
3. Involved master backend to scan the heap directly when
tuples are not available in any shared memory tuple queue.
4. Introduced 3 new parameters (cpu_tuple_comm_cost,
parallel_setup_cost, parallel_startup_cost) for deciding the cost
of parallel plan. Currently, I have kept the default values for
parallel_setup_cost and parallel_startup_cost as 0.0, as those
require some experiments.
5. Fixed some issues (related to memory increase as reported
upthread by Thom Brown and general feature issues found during
test)

Note - I have yet to handle the new node types introduced at some
of the places and need to verify prepared queries and some other
things, however I think it will be good if I can get some feedback
at current stage.
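
Item 3 above (the master scanning the heap itself when no tuple queue has data) can be pictured with a small stand-alone simulation. This is only a sketch: DemoQueue, DemoHeap and demo_getnext are hypothetical names, plain int arrays stand in for shm_mq queues and heap tuples, and an empty queue is treated as finished, whereas a real queue may receive more tuples later. The fromheap output mirrors the flag the patch adds to shm_getnext and TupleTableSlot.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one worker's tuple queue; not shm_mq. */
typedef struct DemoQueue
{
	const int  *items;			/* "tuples" the worker has produced */
	size_t		nitems;
	size_t		next;			/* next item to hand out */
} DemoQueue;

/* Hypothetical stand-in for the master's own share of the heap. */
typedef struct DemoHeap
{
	const int  *items;
	size_t		nitems;
	size_t		next;
} DemoHeap;

/*
 * Fetch the next tuple.  Poll the queues round-robin starting at *cur;
 * if every queue is empty, fall back to the local heap.  Returns false
 * once both the queues and the heap are exhausted.
 */
bool
demo_getnext(DemoQueue *queues, int nqueues, int *cur,
			 DemoHeap *heap, int *result, bool *fromheap)
{
	for (int i = 0; i < nqueues; i++)
	{
		DemoQueue  *q = &queues[(*cur + i) % nqueues];

		if (q->next < q->nitems)
		{
			*result = q->items[q->next++];
			*cur = (*cur + i + 1) % nqueues;	/* resume after this queue */
			*fromheap = false;
			return true;
		}
	}

	if (heap->next < heap->nitems)
	{
		*result = heap->items[heap->next++];
		*fromheap = true;
		return true;
	}
	return false;
}
```

Starting each poll after the queue that last produced a tuple keeps one busy worker from starving the others, which is the usual reason to prefer round-robin over always starting at queue 0.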

Which commit is this based against? I'm getting errors with the latest
master:

It seems that you did not apply the parallel-mode patch before
applying this patch. Can you try again after first applying
the patch posted by Robert at the link below:
/messages/by-id/CA+TgmoZdUK4K3XHBxc9vM-82khourEZdvQWTfgLhWsd2R2aAGQ@mail.gmail.com

commit-id used for this patch - 0b49642

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#94Thom Brown
thom@linux.com
In reply to: Amit Kapila (#93)
Re: Parallel Seq Scan

On 20 January 2015 at 16:55, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jan 20, 2015 at 9:43 PM, Thom Brown <thom@linux.com> wrote:

On 20 January 2015 at 14:29, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Jan 15, 2015 at 6:57 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jan 12, 2015 at 3:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Yeah, you need two separate global variables pointing to shm_mq
objects, one of which gets used by pqmq.c for errors and the other of
which gets used by printtup.c for tuples.

Okay, I will try to change the way as suggested without doing
switching, but this way we need to do it separately for 'T', 'D', and
'C' messages.

I have taken care of integrating the parallel sequence scan with the
latest patch posted (parallel-mode-v1.patch) by Robert at below
location:

/messages/by-id/CA+TgmoZdUK4K3XHBxc9vM-82khourEZdvQWTfgLhWsd2R2aAGQ@mail.gmail.com

Changes in this version
-----------------------------------------------
1. As mentioned previously, I have exposed one parameter
ParallelWorkerNumber as used in parallel-mode patch.
2. Enabled tuple queue to be used for passing tuples from
worker backend to master backend along with error queue
as per suggestion by Robert in the mail above.
3. Involved master backend to scan the heap directly when
tuples are not available in any shared memory tuple queue.
4. Introduced 3 new parameters (cpu_tuple_comm_cost,
parallel_setup_cost, parallel_startup_cost) for deciding the cost
of parallel plan. Currently, I have kept the default values for
parallel_setup_cost and parallel_startup_cost as 0.0, as those
require some experiments.
5. Fixed some issues (related to memory increase as reported
upthread by Thom Brown and general feature issues found during
test)

Note - I have yet to handle the new node types introduced at some
of the places and need to verify prepared queries and some other
things, however I think it will be good if I can get some feedback
at current stage.

Which commit is this based against? I'm getting errors with the latest
master:

It seems that you did not apply the parallel-mode patch before
applying this patch. Can you try again after first applying
the patch posted by Robert at the link below:

/messages/by-id/CA+TgmoZdUK4K3XHBxc9vM-82khourEZdvQWTfgLhWsd2R2aAGQ@mail.gmail.com

commit-id used for this patch - 0b49642

D'oh. Yes, you're completely right. Works fine now.

Thanks.

Thom

#95Thom Brown
thom@linux.com
In reply to: Amit Kapila (#91)
Re: Parallel Seq Scan

On 20 January 2015 at 14:29, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Jan 15, 2015 at 6:57 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jan 12, 2015 at 3:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Yeah, you need two separate global variables pointing to shm_mq
objects, one of which gets used by pqmq.c for errors and the other of
which gets used by printtup.c for tuples.

Okay, I will try to change the way as suggested without doing
switching, but this way we need to do it separately for 'T', 'D', and
'C' messages.

I have taken care of integrating the parallel sequence scan with the
latest patch posted (parallel-mode-v1.patch) by Robert at below
location:

/messages/by-id/CA+TgmoZdUK4K3XHBxc9vM-82khourEZdvQWTfgLhWsd2R2aAGQ@mail.gmail.com

Changes in this version
-----------------------------------------------
1. As mentioned previously, I have exposed one parameter
ParallelWorkerNumber as used in parallel-mode patch.
2. Enabled tuple queue to be used for passing tuples from
worker backend to master backend along with error queue
as per suggestion by Robert in the mail above.
3. Involved master backend to scan the heap directly when
tuples are not available in any shared memory tuple queue.
4. Introduced 3 new parameters (cpu_tuple_comm_cost,
parallel_setup_cost, parallel_startup_cost) for deciding the cost
of parallel plan. Currently, I have kept the default values for
parallel_setup_cost and parallel_startup_cost as 0.0, as those
require some experiments.
5. Fixed some issues (related to memory increase as reported
upthread by Thom Brown and general feature issues found during
test)

Note - I have yet to handle the new node types introduced at some
of the places and need to verify prepared queries and some other
things, however I think it will be good if I can get some feedback
at current stage.

I'm getting an issue:

➤ psql://thom@[local]:5488/pgbench

# set parallel_seqscan_degree = 8;
SET
Time: 0.248 ms

➤ psql://thom@[local]:5488/pgbench

# explain select c1 from t1;
QUERY PLAN
--------------------------------------------------------------
Parallel Seq Scan on t1 (cost=0.00..21.22 rows=100 width=4)
Number of Workers: 8
Number of Blocks Per Worker: 11
(3 rows)

Time: 0.322 ms

# explain analyse select c1 from t1;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------
Parallel Seq Scan on t1 (cost=0.00..21.22 rows=100 width=4) (actual time=0.024..13.468 rows=100 loops=1)
Number of Workers: 8
Number of Blocks Per Worker: 11
Planning time: 0.040 ms
Execution time: 13.862 ms
(5 rows)

Time: 14.188 ms

➤ psql://thom@[local]:5488/pgbench

# set parallel_seqscan_degree = 10;
SET
Time: 0.219 ms

➤ psql://thom@[local]:5488/pgbench

# explain select c1 from t1;
QUERY PLAN
--------------------------------------------------------------
Parallel Seq Scan on t1 (cost=0.00..19.18 rows=100 width=4)
Number of Workers: 10
Number of Blocks Per Worker: 9
(3 rows)

Time: 0.375 ms

➤ psql://thom@[local]:5488/pgbench

# explain analyse select c1 from t1;

So setting parallel_seqscan_degree above max_worker_processes causes the
CPU to max out, and the query never returns, or at least had not returned
after 2 minutes. Shouldn't it have a ceiling of max_worker_processes?
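
One way to picture the ceiling being asked for is a simple clamp applied before any workers are requested. The sketch below is illustrative only and not code from the patch: clamp_parallel_degree is a hypothetical helper, and its max_workers argument stands in for max_worker_processes.

```c
#include <assert.h>

/*
 * Hypothetical helper (not part of the patch): limit the requested number
 * of scan workers to what the server is able to launch.
 */
int
clamp_parallel_degree(int requested_degree, int max_workers)
{
	assert(max_workers >= 0);

	if (requested_degree < 0)
		return 0;				/* treat negative input as "no parallelism" */
	if (requested_degree > max_workers)
		return max_workers;		/* never ask for more workers than exist */
	return requested_degree;
}
```

With a limit of 8 workers, a requested degree of 10 would be reduced to 8 instead of leaving the scan waiting on workers that can never start.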

The original test I performed where I was getting OOM errors now appears to
be fine:

# explain (analyse, buffers, timing) select distinct bid from pgbench_accounts;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=1400411.11..1400412.11 rows=100 width=4) (actual time=8504.333..8504.335 rows=13 loops=1)
Group Key: bid
Buffers: shared hit=32 read=18183
-> Parallel Seq Scan on pgbench_accounts (cost=0.00..1375411.11 rows=10000000 width=4) (actual time=0.054..7183.494 rows=10000000 loops=1)
Number of Workers: 8
Number of Blocks Per Worker: 18215
Buffers: shared hit=32 read=18183
Planning time: 0.058 ms
Execution time: 8876.967 ms
(9 rows)

Time: 8877.366 ms

Note that I increased seq_page_cost to force a parallel scan in this case.

Thom

#96Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Robert Haas (#90)
Re: Parallel Seq Scan

On 1/19/15 7:20 AM, Robert Haas wrote:

Another thing is that I think prefetching is not supported on all platforms
(e.g. Windows), and on such systems, as per the above algorithm, we would
need to rely on the block-by-block method.

Well, I think we should try to set up a test to see if this is hurting
us. First, do a sequential scan of a relation at least twice
as large as RAM. Then, do a parallel sequential scan of the same
relation with 2 workers. Repeat these in alternation several times.
If the operating system is accomplishing meaningful readahead, and the
parallel sequential scan is breaking it, then since the test is
I/O-bound I would expect to see the parallel scan actually being
slower than the normal way.

Or perhaps there is some other test that would be better (ideas
welcome) but the point is we may need something like this, but we
should try to figure out whether we need it before spending too much
time on it.

I'm guessing that not all supported platforms have prefetching that actually helps us... but it would be good to actually know if that's the case.
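
To make the platform dependence concrete, here is a minimal sketch of an explicit prefetch hint, assuming a POSIX system that provides posix_fadvise(). The helper name prefetch_block_hint and the TOY_BLCKSZ constant are hypothetical; on platforms without POSIX_FADV_WILLNEED the helper just reports that no hint was issued.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <fcntl.h>
#include <sys/types.h>

#define TOY_BLCKSZ 8192			/* assumed block size for this sketch */

/*
 * Hypothetical helper: ask the OS to start reading one block ahead of time.
 * Returns 0 if the hint was accepted, an errno value if posix_fadvise()
 * rejected it, or -1 if the platform has no such facility.
 */
int
prefetch_block_hint(int fd, long blockno)
{
#ifdef POSIX_FADV_WILLNEED
	return posix_fadvise(fd, (off_t) blockno * TOY_BLCKSZ,
						 TOY_BLCKSZ, POSIX_FADV_WILLNEED);
#else
	(void) fd;
	(void) blockno;
	return -1;					/* no prefetch support on this platform */
#endif
}
```

Whether such a hint actually speeds up a scan is exactly the open question above; the sketch only shows that the call is conditional on platform support.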

Where I think this gets a lot more interesting is if we could apply this to an index scan. My thought is that this would result in one worker being mostly responsible for advancing the index scan itself while the other workers issue (and wait on) heap IO. So even if this doesn't turn out to be a win for seqscan, there are other places we might well want to use it.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#97Amit Kapila
amit.kapila16@gmail.com
In reply to: Thom Brown (#95)
Re: Parallel Seq Scan

On Tue, Jan 20, 2015 at 10:59 PM, Thom Brown <thom@linux.com> wrote:

On 20 January 2015 at 14:29, Amit Kapila <amit.kapila16@gmail.com> wrote:

Note - I have yet to handle the new node types introduced at some
of the places and need to verify prepared queries and some other
things, however I think it will be good if I can get some feedback
at current stage.

I'm getting an issue:

# set parallel_seqscan_degree = 10;
SET
Time: 0.219 ms

➤ psql://thom@[local]:5488/pgbench

# explain analyse select c1 from t1;

So setting parallel_seqscan_degree above max_worker_processes causes the
CPU to max out, and the query never returns, or at least not after waiting
2 minutes. Shouldn't it have a ceiling of max_worker_processes?

Yes, it should behave that way, but the patch does not handle this yet,
because we still have to decide on the best execution strategy
(block-by-block or fixed chunks for different workers). I can handle
this scenario in the patch once that is settled.

I could return an error for such a scenario or do some work
to handle it seamlessly, but I would have to rework it if we select
a different execution approach than the one used in the patch, so I am
waiting for that decision. I am planning to collect performance data for
both approaches, so that we can decide which is the better
way to go.
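
The two strategies can be contrasted with a minimal sketch. It assumes fixed chunks are contiguous block ranges with the last worker taking the remainder, and that block-by-block means every participant claims the next block from a shared counter. All names (BlockRange, SharedScan, fixed_chunk_for_worker, claim_next_block) are hypothetical and not from the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Half-open range of block numbers: [start, end). */
typedef struct BlockRange
{
	unsigned	start;
	unsigned	end;
} BlockRange;

/*
 * Fixed chunks: worker number "worker" (0-based) of "nworkers" gets one
 * contiguous range; the last worker also takes the leftover blocks.
 */
BlockRange
fixed_chunk_for_worker(unsigned nblocks, unsigned nworkers, unsigned worker)
{
	BlockRange	r;
	unsigned	per_worker;

	assert(nworkers > 0 && worker < nworkers);
	per_worker = nblocks / nworkers;
	r.start = worker * per_worker;
	r.end = (worker == nworkers - 1) ? nblocks : r.start + per_worker;
	return r;
}

/* Block-by-block: a counter that all participants share. */
typedef struct SharedScan
{
	atomic_uint next_block;
	unsigned	nblocks;
} SharedScan;

/* Claim the next unscanned block; returns false once the scan is done. */
bool
claim_next_block(SharedScan *scan, unsigned *blockno)
{
	unsigned	b = atomic_fetch_add(&scan->next_block, 1);

	if (b >= scan->nblocks)
		return false;
	*blockno = b;
	return true;
}
```

With fixed chunks a slow worker delays the whole scan, while the shared counter balances load at the price of one atomic operation per block and a less sequential access pattern per process.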

The original test I performed where I was getting OOM errors now appears
to be fine:

Thanks for confirming the same.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#98Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Amit Kapila (#91)
Re: Parallel Seq Scan

On 20-01-2015 PM 11:29, Amit Kapila wrote:

I have taken care of integrating the parallel sequence scan with the
latest patch posted (parallel-mode-v1.patch) by Robert at below
location:
/messages/by-id/CA+TgmoZdUK4K3XHBxc9vM-82khourEZdvQWTfgLhWsd2R2aAGQ@mail.gmail.com

Changes in this version
-----------------------------------------------
1. As mentioned previously, I have exposed one parameter
ParallelWorkerNumber as used in parallel-mode patch.
2. Enabled tuple queue to be used for passing tuples from
worker backend to master backend along with error queue
as per suggestion by Robert in the mail above.
3. Involved master backend to scan the heap directly when
tuples are not available in any shared memory tuple queue.
4. Introduced 3 new parameters (cpu_tuple_comm_cost,
parallel_setup_cost, parallel_startup_cost) for deciding the cost
of parallel plan. Currently, I have kept the default values for
parallel_setup_cost and parallel_startup_cost as 0.0, as those
require some experiments.
5. Fixed some issues (related to memory increase as reported
upthread by Thom Brown and general feature issues found during
test)
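
Items 2 and 3 can be illustrated with a toy model of the master's fetch loop: poll each worker queue without blocking, and scan the heap locally when no queue has a tuple ready. This is a simplified sketch, not the patch's shm_getnext(); the ToyQueue and ToyHeap types and all function names are hypothetical, and unlike the patch it always starts polling at the first queue.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { RECV_SUCCESS, RECV_WOULD_BLOCK, RECV_DETACHED } RecvResult;
typedef enum { FETCH_TUPLE, FETCH_RETRY, FETCH_DONE } FetchResult;

/* Toy stand-in for one worker's tuple queue. */
typedef struct ToyQueue
{
	const int  *items;		/* tuples the worker will eventually send */
	int			nready;		/* how many of them are readable right now */
	int			ntotal;		/* total sent before the worker detaches */
	int			next;		/* next item to hand out */
} ToyQueue;

/* Toy stand-in for the master's own share of the heap. */
typedef struct ToyHeap
{
	const int  *items;
	int			nitems;
	int			next;
} ToyHeap;

/* Non-blocking receive from one queue. */
RecvResult
toy_receive(ToyQueue *q, int *item)
{
	if (q->next >= q->ntotal)
		return RECV_DETACHED;
	if (q->next >= q->nready)
		return RECV_WOULD_BLOCK;
	*item = q->items[q->next++];
	return RECV_SUCCESS;
}

/* Prefer worker tuples; fall back to the local heap scan otherwise. */
FetchResult
fetch_next(ToyQueue *queues, int nqueues, ToyHeap *heap,
		   int *item, bool *fromheap)
{
	int			ndetached = 0;

	for (int i = 0; i < nqueues; i++)
	{
		RecvResult	res = toy_receive(&queues[i], item);

		if (res == RECV_SUCCESS)
		{
			*fromheap = false;
			return FETCH_TUPLE;
		}
		if (res == RECV_DETACHED)
			ndetached++;
	}

	/* No queue had a tuple ready: do local work instead of idling. */
	if (heap->next < heap->nitems)
	{
		*item = heap->items[heap->next++];
		*fromheap = true;
		return FETCH_TUPLE;
	}

	return (ndetached == nqueues) ? FETCH_DONE : FETCH_RETRY;
}
```

The fromheap flag matters to the caller because a tuple formed from a queue message is palloc'd, while a heap tuple points into a shared buffer page.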

Note - I have yet to handle the new node types introduced at some
of the places and need to verify prepared queries and some other
things, however I think it will be good if I can get some feedback
at current stage.

I got an assertion failure:

In src/backend/executor/execTuples.c: ExecStoreTuple()

/* passing shouldFree=true for a tuple on a disk page is not sane */
Assert(BufferIsValid(buffer) ? (!shouldFree) : true);

when called from:

In src/backend/executor/nodeParallelSeqscan.c: ParallelSeqNext()

I think something like the following would be necessary (reading from
comments in the code):

--- a/src/backend/executor/nodeParallelSeqscan.c
+++ b/src/backend/executor/nodeParallelSeqscan.c
@@ -85,7 +85,7 @@ ParallelSeqNext(ParallelSeqScanState *node)
        if (tuple)
            ExecStoreTuple(tuple,
                           slot,
-                          scandesc->rs_cbuf,
+                          fromheap ? scandesc->rs_cbuf : InvalidBuffer,
                           !fromheap);

After fixing this, the assertion failure seems to be gone though I
observed the blocked (CPU maxed out) state as reported elsewhere by Thom
Brown.

What I was doing:

CREATE TABLE test(a) AS SELECT generate_series(1, 10000000);

postgres=# SHOW max_worker_processes;
max_worker_processes
----------------------
8
(1 row)

postgres=# SET seq_page_cost TO 100;
SET

postgres=# SET parallel_seqscan_degree TO 4;
SET

postgres=# EXPLAIN SELECT * FROM test;
QUERY PLAN
-------------------------------------------------------------------------
Parallel Seq Scan on test (cost=0.00..1801071.27 rows=8981483 width=4)
Number of Workers: 4
Number of Blocks Per Worker: 8849
(3 rows)

Plain EXPLAIN worked, though; it was EXPLAIN ANALYZE that triggered the blocked state.

Thanks,
Amit


#99Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Langote (#98)
Re: Parallel Seq Scan

On Wed, Jan 21, 2015 at 12:47 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 20-01-2015 PM 11:29, Amit Kapila wrote:

Note - I have yet to handle the new node types introduced at some
of the places and need to verify prepared queries and some other
things, however I think it will be good if I can get some feedback
at current stage.

I got an assertion failure:

In src/backend/executor/execTuples.c: ExecStoreTuple()

/* passing shouldFree=true for a tuple on a disk page is not sane */
Assert(BufferIsValid(buffer) ? (!shouldFree) : true);

Good Catch!
The reason is that while the master backend is scanning a heap
page, if it finds one or more tuples in a shared memory message
queue, it processes those tuples first, and in that scenario the scan
descriptor still holds a reference to the buffer it was using for
the heap scan. Your proposed fix will work.
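
The invariant behind that assertion can be shown with a toy model. The Buffer type and InvalidBuffer value below are simplified stand-ins for the PostgreSQL definitions, and the helper names are hypothetical; the sketch only mirrors the argument choice made in the proposed fix.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for the real definitions. */
typedef int Buffer;
#define InvalidBuffer 0

/* Mirrors the Assert: a tuple on a buffer page must not be pfree'd. */
bool
store_args_are_sane(Buffer buffer, bool shouldFree)
{
	return (buffer != InvalidBuffer) ? !shouldFree : true;
}

typedef struct StoreArgs
{
	Buffer		buffer;
	bool		shouldFree;
} StoreArgs;

/*
 * Choose the arguments as in the proposed fix: only a heap tuple is tied
 * to the scan's current buffer; a tuple built from a queue message is
 * palloc'd, so it has no buffer and must be freed by the slot.
 */
StoreArgs
store_args_for(bool fromheap, Buffer scan_current_buffer)
{
	StoreArgs	args;

	args.buffer = fromheap ? scan_current_buffer : InvalidBuffer;
	args.shouldFree = !fromheap;
	return args;
}
```

Before the fix, a queue tuple was stored with the still-valid heap buffer and shouldFree = true, which is the combination store_args_are_sane() rejects.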

After fixing this, the assertion failure seems to be gone though I
observed the blocked (CPU maxed out) state as reported elsewhere by Thom
Brown.

Does it happen only when parallel_seqscan_degree > max_worker_processes?

Thanks for checking the patch.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#100Amit Langote
amitlangote09@gmail.com
In reply to: Amit Kapila (#99)
Re: Parallel Seq Scan

On Wednesday, January 21, 2015, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jan 21, 2015 at 12:47 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 20-01-2015 PM 11:29, Amit Kapila wrote:

Note - I have yet to handle the new node types introduced at some
of the places and need to verify prepared queries and some other
things, however I think it will be good if I can get some feedback
at current stage.

I got an assertion failure:

In src/backend/executor/execTuples.c: ExecStoreTuple()

/* passing shouldFree=true for a tuple on a disk page is not sane */
Assert(BufferIsValid(buffer) ? (!shouldFree) : true);

Good Catch!
The reason is that while the master backend is scanning a heap
page, if it finds one or more tuples in a shared memory message
queue, it processes those tuples first, and in that scenario the scan
descriptor still holds a reference to the buffer it was using for
the heap scan. Your proposed fix will work.

After fixing this, the assertion failure seems to be gone though I
observed the blocked (CPU maxed out) state as reported elsewhere by Thom
Brown.

Does it happen only when parallel_seqscan_degree > max_worker_processes?

I have max_worker_processes set to the default of 8 while
parallel_seqscan_degree is 4. So, this may be a case different from Thom's.

Thanks,
Amit

#101Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Langote (#100)
1 attachment(s)
Re: Parallel Seq Scan

On Wed, Jan 21, 2015 at 4:31 PM, Amit Langote <amitlangote09@gmail.com> wrote:

On Wednesday, January 21, 2015, Amit Kapila <amit.kapila16@gmail.com> wrote:

Does it happen only when parallel_seqscan_degree > max_worker_processes?

I have max_worker_processes set to the default of 8 while
parallel_seqscan_degree is 4. So, this may be a case different from Thom's.

I think this happens because the memory used for forming tuples
in the master backend is retained for longer than necessary, which
makes this statement take much longer than it should. I have also
fixed the other issue you reported in the attached patch.

I think this patch is still not completely ready for general-purpose
testing; however, it would be helpful if we could run some tests to
see in what kinds of scenarios it gives a benefit. For example, in the
test you are doing, rather than increasing seq_page_cost, you could add
an expensive WHERE condition so that the planner selects the parallel
plan automatically. I think it is better to change one of the new
parameters (parallel_setup_cost, parallel_startup_cost and
cpu_tuple_comm_cost) if you want your statement to use a parallel plan;
in your example, if you had reduced cpu_tuple_comm_cost, it would have
selected the parallel plan. That way we can get some feedback about
what the appropriate default values for the newly added parameters
should be. I am already planning to do some tests in that regard, but
feedback from others would be helpful.
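
A toy cost model may help show why these knobs tip the planner's choice. It is not the formula used in the patch: it merely assumes that the scan cost is divided by the number of workers, that the setup and startup costs are added once, and that cpu_tuple_comm_cost is charged per tuple returned through a queue. The struct, the function names and all numeric values used with them are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct ToyCostParams
{
	double		seq_page_cost;
	double		cpu_tuple_cost;
	double		cpu_tuple_comm_cost;
	double		parallel_setup_cost;
	double		parallel_startup_cost;
} ToyCostParams;

/* Cost of a plain sequential scan in this toy model. */
double
toy_seqscan_cost(const ToyCostParams *p, double pages, double tuples,
				 double qual_cost_per_tuple)
{
	return pages * p->seq_page_cost +
		tuples * (p->cpu_tuple_cost + qual_cost_per_tuple);
}

/* Assumed parallel cost: shared scan work plus per-tuple transfer cost. */
double
toy_parallel_seqscan_cost(const ToyCostParams *p, double pages, double tuples,
						  double qual_cost_per_tuple, double selectivity,
						  int nworkers)
{
	assert(nworkers > 0);
	return p->parallel_startup_cost + p->parallel_setup_cost +
		toy_seqscan_cost(p, pages, tuples, qual_cost_per_tuple) / nworkers +
		tuples * selectivity * p->cpu_tuple_comm_cost;
}

/* True if the toy model would pick the parallel plan. */
bool
toy_prefers_parallel(const ToyCostParams *p, double pages, double tuples,
					 double qual_cost_per_tuple, double selectivity,
					 int nworkers)
{
	return toy_parallel_seqscan_cost(p, pages, tuples, qual_cost_per_tuple,
									 selectivity, nworkers) <
		toy_seqscan_cost(p, pages, tuples, qual_cost_per_tuple);
}
```

In this model an unfiltered scan only becomes parallel if the per-tuple transfer cost is small or the page cost is inflated, whereas an expensive and selective WHERE clause favors the parallel plan on its own, because the qual work is shared and few tuples cross the queue.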

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

parallel_seqscan_v5.patchapplication/octet-stream; name=parallel_seqscan_v5.patchDownload
diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index 21721b4..823d5c3 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/access
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS	    = brin common gin gist hash heap index nbtree rmgrdesc spgist transam
+SUBDIRS	    = brin common gin gist hash heap index nbtree rmgrdesc shmmq spgist transam
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/common/printtup.c b/src/backend/access/common/printtup.c
index baed981..1afac59 100644
--- a/src/backend/access/common/printtup.c
+++ b/src/backend/access/common/printtup.c
@@ -243,7 +243,19 @@ SendRowDescriptionMessage(TupleDesc typeinfo, List *targetlist, int16 *formats)
 				pq_sendint(&buf, 0, 2);
 		}
 	}
-	pq_endmessage(&buf);
+
+	/*
+	 * Send the message via shared-memory tuple queue, if the same
+	 * is enabled.
+	 */
+	if (is_tuple_shm_mq_enabled())
+	{
+		mq_putmessage_direct(buf.cursor, buf.data, buf.len);
+		pfree(buf.data);
+		buf.data = NULL;
+	}
+	else
+		pq_endmessage(&buf);
 }
 
 /*
@@ -371,7 +383,18 @@ printtup(TupleTableSlot *slot, DestReceiver *self)
 		}
 	}
 
-	pq_endmessage(&buf);
+	/*
+	 * Send the message via shared-memory tuple queue, if the same
+	 * is enabled.
+	 */
+	if (is_tuple_shm_mq_enabled())
+	{
+		mq_putmessage_direct(buf.cursor, buf.data, buf.len);
+		pfree(buf.data);
+		buf.data = NULL;
+	}
+	else
+		pq_endmessage(&buf);
 
 	/* Return to caller's context, and flush row's temporary memory */
 	MemoryContextSwitchTo(oldcontext);
diff --git a/src/backend/access/shmmq/Makefile b/src/backend/access/shmmq/Makefile
new file mode 100644
index 0000000..aeae8d9
--- /dev/null
+++ b/src/backend/access/shmmq/Makefile
@@ -0,0 +1,17 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for access/shmmq
+#
+# IDENTIFICATION
+#    src/backend/access/shmmq/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/shmmq
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = shmmqam.o 
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/shmmq/shmmqam.c b/src/backend/access/shmmq/shmmqam.c
new file mode 100644
index 0000000..758d7e8
--- /dev/null
+++ b/src/backend/access/shmmq/shmmqam.c
@@ -0,0 +1,375 @@
+/*-------------------------------------------------------------------------
+ *
+ * shmmqam.c
+ *	  shared memory queue access method code
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/shmmq/shmmqam.c
+ *
+ *
+ * INTERFACE ROUTINES
+ *		shm_getnext	- retrieve next tuple in queue
+ *
+ * NOTES
+ *	  This file contains the shm_ routines which implement the
+ *	  shared memory queue access method, used by the master backend
+ *	  to read tuples sent by worker backends.
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup.h"
+#include "access/htup_details.h"
+#include "access/shmmqam.h"
+#include "access/tupdesc.h"
+#include "fmgr.h"
+#include "libpq/libpq.h"
+#include "libpq/pqformat.h"
+#include "utils/lsyscache.h"
+
+
+static bool
+HandleParallelTupleMessage(worker_result resultState, TupleDesc tupdesc,
+						   StringInfo msg, int queueId);
+static HeapTuple
+form_result_tuple(worker_result resultState, TupleDesc tupdesc,
+				  StringInfo msg, int queueId);
+
+/*
+ * shm_beginscan -
+ *		Initializes the shared memory scan descriptor to retrieve tuples
+ *		from worker backends. 
+ */
+ShmScanDesc
+shm_beginscan(int num_queues)
+{
+	ShmScanDesc		shmscan;
+
+	shmscan = palloc(sizeof(ShmScanDescData));
+
+	shmscan->num_shm_queues = num_queues;
+	shmscan->ss_cqueue = -1;
+	shmscan->shmscan_inited	= false;
+
+	return shmscan;
+}
+
+/*
+ * ExecInitWorkerResult -
+ *		Initializes the result state to retrieve tuples from worker backends. 
+ */
+worker_result
+ExecInitWorkerResult(TupleDesc tupdesc, int nWorkers)
+{
+	worker_result	workerResult;
+	int				i;
+	int	natts = tupdesc->natts;
+
+	workerResult = palloc0(sizeof(worker_result_state));
+	workerResult->receive_functions = palloc(sizeof(FmgrInfo) * natts);
+	workerResult->typioparams = palloc(sizeof(Oid) * natts);
+	workerResult->num_shm_queues = nWorkers;
+	workerResult->has_row_description = palloc0(sizeof(bool) * nWorkers);
+	workerResult->queue_detached = palloc0(sizeof(bool) * nWorkers);
+
+	for (i = 0;	i < natts; ++i)
+	{
+		Oid	receive_function_id;
+
+		getTypeBinaryInputInfo(tupdesc->attrs[i]->atttypid,
+							   &receive_function_id,
+							   &workerResult->typioparams[i]);
+		fmgr_info(receive_function_id, &workerResult->receive_functions[i]);
+	}
+
+	return workerResult;
+}
+
+
+/*
+ * shm_getnext -
+ *		Get the next tuple from shared memory queue.  This function
+ *	is responsible for fetching tuples from all the queues associated
+ *	with worker backends used in parallel sequential scan.
+ */
+HeapTuple
+shm_getnext(HeapScanDesc scanDesc, ShmScanDesc shmScan,
+			worker_result resultState, shm_mq_handle **responseq,
+			TupleDesc tupdesc, ScanDirection direction, bool *fromheap)
+{
+	shm_mq_result	res;
+	Size			nbytes;
+	void			*data;
+	StringInfoData	msg;
+	int				queueId = 0;
+
+	/*
+	 * calculate next starting queue used for fetching tuples
+	 */
+	if(!shmScan->shmscan_inited)
+	{
+		shmScan->shmscan_inited = true;
+		Assert(shmScan->num_shm_queues > 0);
+		queueId = 0;
+	}
+	else
+		queueId = shmScan->ss_cqueue;
+
+	/* Read and process messages from the shared memory queues. */
+	for(;;)
+	{
+		if (!resultState->all_queues_detached)
+		{
+			if (queueId == shmScan->num_shm_queues)
+				queueId = 0;
+
+			/*
+			 * Don't fetch from detached queue.  This loop could continue
+			 * forever, if we reach a situation such that all queues are
+			 * detached, however we won't reach here if that is the case.
+			 */
+			while (resultState->queue_detached[queueId])
+			{
+				++queueId;
+				if (queueId == shmScan->num_shm_queues)
+					queueId = 0;
+			}
+
+			for (;;)
+			{
+				/*
+				 * mark current queue used for fetching tuples, this is used
+				 * to fetch consecutive tuples from queue used in previous
+				 * fetch.
+				 */
+				shmScan->ss_cqueue = queueId;
+
+				/* Get next message. */
+				res = shm_mq_receive(responseq[queueId], &nbytes, &data, true);
+				if (res == SHM_MQ_DETACHED)
+				{
+					/*
+					 * mark the queue that got detached, so that we don't
+					 * try to fetch from it again.
+					 */
+					resultState->queue_detached[queueId] = true;
+					resultState->has_row_description[queueId] = false;
+					--resultState->num_shm_queues;
+					/*
+					 * if we have exhausted data from all worker queues, then don't
+					 * process data from queues.
+					 */
+					if (resultState->num_shm_queues <= 0)
+						resultState->all_queues_detached = true;
+					break;
+				}
+				else if (res == SHM_MQ_WOULD_BLOCK)
+					break;
+				else if (res == SHM_MQ_SUCCESS)
+				{
+					bool rettuple;
+					initStringInfo(&msg);
+					appendBinaryStringInfo(&msg, data, nbytes);
+					rettuple = HandleParallelTupleMessage(resultState, tupdesc, &msg, queueId);
+					pfree(msg.data);
+					if (rettuple)
+					{
+						*fromheap = false;
+						return resultState->tuple;
+					}
+				}
+			}
+		}
+
+		/*
+		 * if we have checked all the message queues and didn't find
+		 * any message or we have already fetched all the data from queues,
+		 * then it's time to fetch directly from heap.  Reset the current
+		 * queue as the first queue from which we need to receive tuples.
+		 */
+		if ((queueId == shmScan->num_shm_queues - 1 ||
+			 resultState->all_queues_detached) &&
+			 !resultState->all_heap_fetched)
+		{
+			HeapTuple	tuple;
+			shmScan->ss_cqueue = 0;
+			tuple = heap_getnext(scanDesc, direction);
+			if (tuple)
+			{
+				*fromheap = true;
+				return tuple;
+			}
+			else if (tuple == NULL && resultState->all_queues_detached)
+				break;
+			else
+				resultState->all_heap_fetched = true;
+		}
+		else if (resultState->all_queues_detached &&
+				 resultState->all_heap_fetched)
+			break;
+
+		/* check the data in next queue. */
+		++queueId;
+	}
+
+	return NULL;
+}
+
+/*
+ * HandleParallelTupleMessage -
+ * Handle a single tuple related protocol message received from
+ * a single parallel worker.
+ */
+static bool
+HandleParallelTupleMessage(worker_result resultState, TupleDesc tupdesc,
+						   StringInfo msg, int queueId)
+{
+	char	msgtype;
+	bool	rettuple = false;
+
+	msgtype = pq_getmsgbyte(msg);
+
+	/* Dispatch on message type. */
+	switch (msgtype)
+	{
+		case 'T':
+			{
+				int16	natts = pq_getmsgint(msg, 2);
+				int16	i;
+
+				if (resultState->has_row_description[queueId])
+					elog(ERROR, "multiple RowDescription messages");
+				resultState->has_row_description[queueId] = true;
+				if (natts != tupdesc->natts)
+					ereport(ERROR,
+							(errcode(ERRCODE_DATATYPE_MISMATCH),
+								errmsg("worker result rowtype does not match "
+								"the specified FROM clause rowtype")));
+
+				for (i = 0; i < natts; ++i)
+				{
+					Oid		type_id;
+
+					(void) pq_getmsgstring(msg);	/* name */
+					(void) pq_getmsgint(msg, 4);	/* table OID */
+					(void) pq_getmsgint(msg, 2);	/* table attnum */
+					type_id = pq_getmsgint(msg, 4);	/* type OID */
+					(void) pq_getmsgint(msg, 2);	/* type length */
+					(void) pq_getmsgint(msg, 4);	/* typmod */
+					(void) pq_getmsgint(msg, 2);	/* format code */
+
+					if (type_id != tupdesc->attrs[i]->atttypid)
+						ereport(ERROR,
+								(errcode(ERRCODE_DATATYPE_MISMATCH),
+								 errmsg("remote query result rowtype does not match "
+										"the specified FROM clause rowtype")));
+				}
+
+				pq_getmsgend(msg);
+
+				break;
+			}
+		case 'D':
+			{
+				/* Handle DataRow message. */
+				resultState->tuple = form_result_tuple(resultState, tupdesc, msg, queueId);
+				rettuple = true;
+				break;
+			}
+		case 'C':
+			{
+				/*
+					* Handle CommandComplete message. Ignore tags sent by
+					* worker backend as we are anyway going to use tag of
+					* master backend for sending the same to client.
+					*/
+				(void) pq_getmsgstring(msg);
+				break;
+			}
+		case 'G':
+		case 'H':
+		case 'W':
+			{
+				ereport(ERROR,
+						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						 errmsg("COPY protocol not allowed in worker")));
+			}
+		default:
+			elog(WARNING, "unknown message type: %c", msg->data[0]);
+			break;
+	}
+
+	return rettuple;
+}
+
+/*
+ * form_result_tuple -
+ * Parse a DataRow message and form a result tuple.
+ */
+static HeapTuple
+form_result_tuple(worker_result resultState, TupleDesc tupdesc,
+				  StringInfo msg, int queueId)
+{
+	/* Handle DataRow message. */
+	int16	natts = pq_getmsgint(msg, 2);
+	int16	i;
+	Datum  *values = NULL;
+	bool   *isnull = NULL;
+	HeapTuple	tuple;
+	StringInfoData	buf;
+
+	if (!resultState->has_row_description[queueId])
+		elog(ERROR, "DataRow not preceded by RowDescription");
+	if (natts != tupdesc->natts)
+		elog(ERROR, "malformed DataRow");
+	if (natts > 0)
+	{
+		values = palloc(natts * sizeof(Datum));
+		isnull = palloc(natts * sizeof(bool));
+	}
+	initStringInfo(&buf);
+
+	for (i = 0; i < natts; ++i)
+	{
+		int32	bytes = pq_getmsgint(msg, 4);
+
+		if (bytes < 0)
+		{
+			values[i] = ReceiveFunctionCall(&resultState->receive_functions[i],
+											NULL,
+											resultState->typioparams[i],
+											tupdesc->attrs[i]->atttypmod);
+			isnull[i] = true;
+		}
+		else
+		{
+			resetStringInfo(&buf);
+			appendBinaryStringInfo(&buf, pq_getmsgbytes(msg, bytes), bytes);
+			values[i] = ReceiveFunctionCall(&resultState->receive_functions[i],
+											&buf,
+											resultState->typioparams[i],
+											tupdesc->attrs[i]->atttypmod);
+			isnull[i] = false;
+		}
+	}
+
+	pq_getmsgend(msg);
+
+	tuple = heap_form_tuple(tupdesc, values, isnull);
+
+	/*
+	 * Release locally palloc'd space.  XXX would probably be good to pfree
+	 * values of pass-by-reference datums, as well.
+	 */
+	pfree(values);
+	pfree(isnull);
+
+	pfree(buf.data);
+
+	return tuple;
+}
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 8a0be5d..bb581a8 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -713,6 +713,7 @@ ExplainPreScanNode(PlanState *planstate, Bitmapset **rels_used)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_ParallelSeqScan:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -909,6 +910,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_SeqScan:
 			pname = sname = "Seq Scan";
 			break;
+		case T_ParallelSeqScan:
+			pname = sname = "Parallel Seq Scan";
+			break;
 		case T_IndexScan:
 			pname = sname = "Index Scan";
 			break;
@@ -1058,6 +1062,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_ParallelSeqScan:
 		case T_BitmapHeapScan:
 		case T_TidScan:
 		case T_SubqueryScan:
@@ -1324,6 +1329,16 @@ ExplainNode(PlanState *planstate, List *ancestors,
 				show_instrumentation_count("Rows Removed by Filter", 1,
 										   planstate, es);
 			break;
+		case T_ParallelSeqScan:
+			show_scan_qual(plan->qual, "Filter", planstate, ancestors, es);
+			if (plan->qual)
+				show_instrumentation_count("Rows Removed by Filter", 1,
+										   planstate, es);
+			ExplainPropertyInteger("Number of Workers",
+				((ParallelSeqScan *) plan)->num_workers, es);
+			ExplainPropertyInteger("Number of Blocks Per Worker",
+				((ParallelSeqScan *) plan)->num_blocks_per_worker, es);
+			break;
 		case T_FunctionScan:
 			if (es->verbose)
 			{
@@ -2141,6 +2156,7 @@ ExplainTargetRel(Plan *plan, Index rti, ExplainState *es)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_ParallelSeqScan:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index af707b0..9a8ca75 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -21,7 +21,7 @@ OBJS = execAmi.o execCurrent.o execGrouping.o execJunk.o execMain.o \
        nodeLimit.o nodeLockRows.o \
        nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
        nodeNestloop.o nodeFunctionscan.o nodeRecursiveunion.o nodeResult.o \
-       nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
+       nodeSeqscan.o nodeParallelSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
        nodeValuesscan.o nodeCtescan.o nodeWorktablescan.o \
        nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
        nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o spi.o
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 9892499..f77a77f 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -100,6 +100,7 @@
 #include "executor/nodeMergejoin.h"
 #include "executor/nodeModifyTable.h"
 #include "executor/nodeNestloop.h"
+#include "executor/nodeParallelSeqscan.h"
 #include "executor/nodeRecursiveunion.h"
 #include "executor/nodeResult.h"
 #include "executor/nodeSeqscan.h"
@@ -190,6 +191,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												   estate, eflags);
 			break;
 
+		case T_ParallelSeqScan:
+			result = (PlanState *) ExecInitParallelSeqScan((ParallelSeqScan *) node,
+														   estate, eflags);
+			break;
+
 		case T_IndexScan:
 			result = (PlanState *) ExecInitIndexScan((IndexScan *) node,
 													 estate, eflags);
@@ -406,6 +412,10 @@ ExecProcNode(PlanState *node)
 			result = ExecSeqScan((SeqScanState *) node);
 			break;
 
+		case T_ParallelSeqScanState:
+			result = ExecParallelSeqScan((ParallelSeqScanState *) node);
+			break;
+
 		case T_IndexScanState:
 			result = ExecIndexScan((IndexScanState *) node);
 			break;
@@ -644,6 +654,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSeqScan((SeqScanState *) node);
 			break;
 
+		case T_ParallelSeqScanState:
+			ExecEndParallelSeqScan((ParallelSeqScanState *) node);
+			break;
+
 		case T_IndexScanState:
 			ExecEndIndexScan((IndexScanState *) node);
 			break;
diff --git a/src/backend/executor/execScan.c b/src/backend/executor/execScan.c
index 3f0d809..39c624d 100644
--- a/src/backend/executor/execScan.c
+++ b/src/backend/executor/execScan.c
@@ -191,8 +191,17 @@ ExecScan(ScanState *node,
 		 * check for non-nil qual here to avoid a function call to ExecQual()
 		 * when the qual is nil ... saves only a few cycles, but they add up
 		 * ...
+		 *
+		 * Skip the qual check for non-heap tuples (such tuples can come from
+		 * shared memory message queues in case of parallel query), because
+		 * the qualification has already been applied by the worker
+		 * backend.  This case will happen only for parallel query where we
+		 * push down the qualification.
+		 * XXX - We can do this optimization for projection as well, but for
+		 * now it is okay, as we don't allow parallel query if there are
+		 * expressions involved in target list.
 		 */
-		if (!qual || ExecQual(qual, econtext, false))
+		if (!slot->tts_fromheap || !qual || ExecQual(qual, econtext, false))
 		{
 			/*
 			 * Found a satisfactory scan tuple.
diff --git a/src/backend/executor/execTuples.c b/src/backend/executor/execTuples.c
index 753754d..4c5bd88 100644
--- a/src/backend/executor/execTuples.c
+++ b/src/backend/executor/execTuples.c
@@ -123,6 +123,7 @@ MakeTupleTableSlot(void)
 	slot->tts_values = NULL;
 	slot->tts_isnull = NULL;
 	slot->tts_mintuple = NULL;
+	slot->tts_fromheap	= true;
 
 	return slot;
 }
@@ -473,6 +474,8 @@ ExecClearTuple(TupleTableSlot *slot)	/* slot in which to store tuple */
 	slot->tts_isempty = true;
 	slot->tts_nvalid = 0;
 
+	slot->tts_fromheap = true;
+
 	return slot;
 }
 
diff --git a/src/backend/executor/nodeParallelSeqscan.c b/src/backend/executor/nodeParallelSeqscan.c
new file mode 100644
index 0000000..1855e52
--- /dev/null
+++ b/src/backend/executor/nodeParallelSeqscan.c
@@ -0,0 +1,318 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeParallelSeqscan.c
+ *	  Support routines for parallel sequential scans of relations.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeParallelSeqscan.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * INTERFACE ROUTINES
+ *		ExecParallelSeqScan				sequentially scans a relation.
+ *		ParallelSeqNext			retrieve next tuple in sequential order.
+ *		ExecInitParallelSeqScan			creates and initializes a parallel seqscan node.
+ *		ExecEndParallelSeqScan			releases any storage allocated.
+ */
+#include "postgres.h"
+
+#include "access/relscan.h"
+#include "access/shmmqam.h"
+#include "access/xact.h"
+#include "commands/dbcommands.h"
+#include "executor/execdebug.h"
+#include "executor/nodeSeqscan.h"
+#include "executor/nodeParallelSeqscan.h"
+#include "postmaster/backendworker.h"
+#include "utils/rel.h"
+
+
+
+/* ----------------------------------------------------------------
+ *						Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ParallelSeqNext
+ *
+ *		This is a workhorse for ExecParallelSeqScan
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ParallelSeqNext(ParallelSeqScanState *node)
+{
+	HeapTuple	tuple;
+	HeapScanDesc scandesc;
+	EState	   *estate;
+	ScanDirection direction;
+	TupleTableSlot *slot;
+	bool			fromheap = true;
+
+	/*
+	 * get information from the estate and scan state
+	 */
+	scandesc = node->ss.ss_currentScanDesc;
+	estate = node->ss.ps.state;
+	direction = estate->es_direction;
+	slot = node->ss.ss_ScanTupleSlot;
+
+	/*
+	 * get the next tuple from the table based on result tuple descriptor.
+	 */
+	tuple = shm_getnext(scandesc, node->pss_currentShmScanDesc,
+						node->pss_workerResult,
+						node->responseq,
+						node->ss.ps.ps_ResultTupleSlot->tts_tupleDescriptor,
+						direction, &fromheap);
+
+	slot->tts_fromheap = fromheap;
+
+	/*
+	 * save the tuple and the buffer returned to us by the access methods in
+	 * our scan tuple slot and return the slot.  Note: we pass '!fromheap'
+	 * as shouldFree because a tuple returned by shm_getnext() is either
+	 * palloc'd (read from a worker's queue), and so must be pfree()'d, or
+	 * points into a disk page, and so must not be.  Note also that
+	 * ExecStoreTuple will increment the refcount of the buffer; the refcount
+	 * will not be dropped until the tuple table slot is cleared.
+	 */
+	if (tuple)
+		ExecStoreTuple(tuple,	/* tuple to store */
+					   slot,	/* slot to store in */
+					   fromheap ? scandesc->rs_cbuf : InvalidBuffer, /* buffer associated with this
+																	  * tuple */
+					   !fromheap);	/* pfree this pointer if not from heap */
+	else
+		ExecClearTuple(slot);
+
+	return slot;
+}
+
+/*
+ * ParallelSeqRecheck -- access method routine to recheck a tuple in EvalPlanQual
+ */
+static bool
+ParallelSeqRecheck(SeqScanState *node, TupleTableSlot *slot)
+{
+	/*
+	 * Note that unlike IndexScan, ParallelSeqScan never uses keys in
+	 * shm_beginscan/heap_beginscan (and this is very bad), so there are
+	 * no keys to recheck here.
+	 */
+	return true;
+}
+
+/* ----------------------------------------------------------------
+ *		InitParallelScanRelation
+ *
+ *		Set up to access the scan relation.
+ * ----------------------------------------------------------------
+ */
+static void
+InitParallelScanRelation(SeqScanState *node, EState *estate, int eflags)
+{
+	Relation	currentRelation;
+	HeapScanDesc currentScanDesc;
+
+	/*
+	 * get the relation object id from the relid'th entry in the range table,
+	 * open that relation and acquire appropriate lock on it.
+	 */
+	currentRelation = ExecOpenScanRelation(estate,
+									  ((SeqScan *) node->ps.plan)->scanrelid,
+										   eflags);
+
+	/* initialize a heapscan */
+	currentScanDesc = heap_beginscan(currentRelation,
+									 estate->es_snapshot,
+									 0,
+									 NULL);
+
+	node->ss_currentRelation = currentRelation;
+	node->ss_currentScanDesc = currentScanDesc;
+
+	/* and report the scan tuple slot's rowtype */
+	ExecAssignScanType(node, RelationGetDescr(currentRelation));
+}
+
+
+/* ----------------------------------------------------------------
+ *		ExecInitParallelSeqScan
+ * ----------------------------------------------------------------
+ */
+ParallelSeqScanState *
+ExecInitParallelSeqScan(ParallelSeqScan *node, EState *estate, int eflags)
+{
+	ParallelSeqScanState *parallelscanstate;
+	ShmScanDesc			 currentShmScanDesc;
+	worker_result		 workerResult;
+	BlockNumber			 end_block;
+
+	/*
+	 * Once upon a time it was possible to have an outerPlan of a SeqScan, but
+	 * not any more.
+	 */
+	Assert(outerPlan(node) == NULL);
+	Assert(innerPlan(node) == NULL);
+
+	/*
+	 * create state structure
+	 */
+	parallelscanstate = makeNode(ParallelSeqScanState);
+	parallelscanstate->ss.ps.plan = (Plan *) node;
+	parallelscanstate->ss.ps.state = estate;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * create expression context for node
+	 */
+	ExecAssignExprContext(estate, &parallelscanstate->ss.ps);
+
+	/*
+	 * initialize child expressions
+	 */
+	parallelscanstate->ss.ps.targetlist = (List *)
+		ExecInitExpr((Expr *) node->scan.plan.targetlist,
+					 (PlanState *) parallelscanstate);
+	parallelscanstate->ss.ps.qual = (List *)
+		ExecInitExpr((Expr *) node->scan.plan.qual,
+					 (PlanState *) parallelscanstate);
+
+	/*
+	 * tuple table initialization
+	 */
+	ExecInitResultTupleSlot(estate, &parallelscanstate->ss.ps);
+	ExecInitScanTupleSlot(estate, &parallelscanstate->ss);
+	
+	/*
+	 * initialize scan relation
+	 */
+	InitParallelScanRelation(&parallelscanstate->ss, estate, eflags);
+
+	parallelscanstate->ss.ps.ps_TupFromTlist = false;
+
+	/*
+	 * Initialize result tuple type and projection info.
+	 */
+	ExecAssignResultTypeFromTL(&parallelscanstate->ss.ps);
+	ExecAssignScanProjectionInfo(&parallelscanstate->ss);
+
+	/*
+	 * If we are just doing EXPLAIN (ie, aren't going to run the plan), stop
+	 * here, no need to start workers.
+	 */
+	if (eflags & EXEC_FLAG_EXPLAIN_ONLY)
+		return parallelscanstate;
+	
+	/* Initialize the workers required to perform parallel scan. */
+	InitiateWorkers(parallelscanstate->ss.ss_currentRelation->rd_id,
+					node->scan.plan.targetlist,
+					node->scan.plan.qual,
+					&parallelscanstate->responseq,
+					&parallelscanstate->pcxt,
+					node->num_blocks_per_worker,
+					node->num_workers);
+
+	/* Initialize the blocks to be scanned by master backend. */
+	end_block = (parallelscanstate->pcxt->nworkers + 1) *
+				node->num_blocks_per_worker;
+	((SeqScan*) parallelscanstate->ss.ps.plan)->startblock =
+								end_block - node->num_blocks_per_worker;
+	/*
+	 * As master backend is the last backend to scan the blocks, it
+	 * should scan all the remaining blocks.
+	 */
+	((SeqScan*) parallelscanstate->ss.ps.plan)->endblock = InvalidBlockNumber;
+
+	/* Set the scan limits for master backend. */
+	heap_setscanlimits(parallelscanstate->ss.ss_currentScanDesc,
+					   ((SeqScan*) parallelscanstate->ss.ps.plan)->startblock,
+					   (parallelscanstate->ss.ss_currentScanDesc->rs_nblocks -
+					   ((SeqScan*) parallelscanstate->ss.ps.plan)->startblock));
+
+	/*
+	 * Use result tuple descriptor to fetch data from shared memory queues
+	 * as the worker backends would have put the data after projection.
+	 * The number of queues must equal the number of worker backends.
+	 */
+	currentShmScanDesc = shm_beginscan(parallelscanstate->pcxt->nworkers);
+	workerResult = ExecInitWorkerResult(parallelscanstate->ss.ps.ps_ResultTupleSlot->tts_tupleDescriptor,
+										parallelscanstate->pcxt->nworkers);
+
+	parallelscanstate->pss_currentShmScanDesc = currentShmScanDesc;
+	parallelscanstate->pss_workerResult	= workerResult;
+
+	return parallelscanstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecParallelSeqScan(node)
+ *
+ *		Scans the relation sequentially from multiple workers and returns
+ *		the next qualifying tuple.
+ *		We call the ExecScan() routine and pass it the appropriate
+ *		access method functions.
+ * ----------------------------------------------------------------
+ */
+TupleTableSlot *
+ExecParallelSeqScan(ParallelSeqScanState *node)
+{
+	return ExecScan((ScanState *) &node->ss,
+					(ExecScanAccessMtd) ParallelSeqNext,
+					(ExecScanRecheckMtd) ParallelSeqRecheck);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndParallelSeqScan
+ *
+ *		frees any storage allocated through C routines.
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndParallelSeqScan(ParallelSeqScanState *node)
+{
+	Relation	relation;
+	HeapScanDesc scanDesc;
+
+	/*
+	 * get information from node
+	 */
+	relation = node->ss.ss_currentRelation;
+	scanDesc = node->ss.ss_currentScanDesc;
+
+	/*
+	 * Free the exprcontext
+	 */
+	ExecFreeExprContext(&node->ss.ps);
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+
+	/*
+	 * close heap scan
+	 */
+	heap_endscan(scanDesc);
+
+	/*
+	 * close the heap relation.
+	 */
+	ExecCloseScanRelation(relation);
+
+	if (node->pcxt)
+	{
+		/* destroy parallel context. */
+		DestroyParallelContext(node->pcxt);
+
+		ExitParallelMode();
+	}
+}
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 3cb81fc..5780df0 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -139,6 +139,22 @@ InitScanRelation(SeqScanState *node, EState *estate, int eflags)
 									 0,
 									 NULL);
 
+	/*
+	 * set the scan limits, if requested by plan.  If the end block
+	 * is not specified, then scan all the blocks till end.
+	 */
+	if (((SeqScan *) node->ps.plan)->startblock != InvalidBlockNumber &&
+		((SeqScan *) node->ps.plan)->endblock != InvalidBlockNumber)
+		heap_setscanlimits(currentScanDesc,
+						   ((SeqScan *) node->ps.plan)->startblock,
+						   (((SeqScan *) node->ps.plan)->endblock -
+						   ((SeqScan *) node->ps.plan)->startblock));
+	else if (((SeqScan *) node->ps.plan)->startblock != InvalidBlockNumber)
+			 heap_setscanlimits(currentScanDesc,
+								((SeqScan *) node->ps.plan)->startblock,
+								(currentScanDesc->rs_nblocks -
+								((SeqScan *) node->ps.plan)->startblock));
+
 	node->ss_currentRelation = currentRelation;
 	node->ss_currentScanDesc = currentScanDesc;
 
diff --git a/src/backend/libpq/pqmq.c b/src/backend/libpq/pqmq.c
index f12f2d5..cfab8b5 100644
--- a/src/backend/libpq/pqmq.c
+++ b/src/backend/libpq/pqmq.c
@@ -26,6 +26,8 @@ static bool pq_mq_busy = false;
 static pid_t pq_mq_parallel_master_pid = 0;
 static pid_t pq_mq_parallel_master_backend_id = InvalidBackendId;
 
+static shm_mq_handle *pq_mq_tuple_handle = NULL;
+
 static void mq_comm_reset(void);
 static int	mq_flush(void);
 static int	mq_flush_if_writable(void);
@@ -61,6 +63,26 @@ pq_redirect_to_shm_mq(shm_mq *mq, shm_mq_handle *mqh)
 }
 
 /*
+ * Arrange to send some frontend/backend protocol messages to a shared-memory
+ * tuple message queue.
+ */
+void
+pq_redirect_to_tuple_shm_mq(shm_mq_handle *mqh)
+{
+	pq_mq_tuple_handle = mqh;
+}
+
+/*
+ * Check if tuples can be sent through tuple shared-memory
+ * message queue.
+ */
+bool
+is_tuple_shm_mq_enabled(void)
+{
+	return pq_mq_tuple_handle != NULL;
+}
+
+/*
  * Arrange to SendProcSignal() to the parallel master each time we transmit
  * message data via the shm_mq.
  */
@@ -161,6 +183,42 @@ mq_putmessage(char msgtype, const char *s, size_t len)
 	return 0;
 }
 
+/*
+ * Transmit a libpq protocol message to the shared memory message queue
+ * via pq_mq_tuple_handle.  We don't include a length word, because the
+ * receiver will know the length of the message from shm_mq_receive().
+ */
+int
+mq_putmessage_direct(char msgtype, const char *s, size_t len)
+{
+	shm_mq_iovec	iov[2];
+	shm_mq_result	result;
+
+	iov[0].data = &msgtype;
+	iov[0].len = 1;
+	iov[1].data = s;
+	iov[1].len = len;
+
+	Assert(pq_mq_tuple_handle != NULL);
+
+	for (;;)
+	{
+		result = shm_mq_sendv(pq_mq_tuple_handle, iov, 2, true);
+
+		if (result != SHM_MQ_WOULD_BLOCK)
+			break;
+
+		WaitLatch(&MyProc->procLatch, WL_LATCH_SET, 0);
+		CHECK_FOR_INTERRUPTS();
+		ResetLatch(&MyProc->procLatch);
+	}
+
+	Assert(result == SHM_MQ_SUCCESS || result == SHM_MQ_DETACHED);
+	if (result != SHM_MQ_SUCCESS)
+		return EOF;
+	return 0;
+}
+
 static void
 mq_putmessage_noblock(char msgtype, const char *s, size_t len)
 {
diff --git a/src/backend/optimizer/path/Makefile b/src/backend/optimizer/path/Makefile
index 6864a62..6e462b1 100644
--- a/src/backend/optimizer/path/Makefile
+++ b/src/backend/optimizer/path/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = allpaths.o clausesel.o costsize.o equivclass.o indxpath.o \
-       joinpath.o joinrels.o pathkeys.o tidpath.o
+       joinpath.o joinrels.o pathkeys.o parallelpath.o tidpath.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 58d78e6..528727c 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -410,6 +410,9 @@ set_plain_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
 	/* Consider sequential scan */
 	add_path(rel, create_seqscan_path(root, rel, required_outer));
 
+	/* Consider parallel scans */
+	create_parallelscan_paths(root, rel);
+
 	/* Consider index scans */
 	create_index_paths(root, rel);
 
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 020558b..4abfd25 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -11,6 +11,9 @@
  *	cpu_tuple_cost		Cost of typical CPU time to process a tuple
  *	cpu_index_tuple_cost  Cost of typical CPU time to process an index tuple
  *	cpu_operator_cost	Cost of CPU time to execute an operator or function
+ *	cpu_tuple_comm_cost	Cost of CPU time to pass a tuple from worker to master backend
+ *	parallel_setup_cost	Cost of setting up shared memory for parallelism
+ *	parallel_startup_cost  Cost of starting up parallel workers
  *
  * We expect that the kernel will typically do some amount of read-ahead
  * optimization; this in conjunction with seek costs means that seq_page_cost
@@ -101,11 +104,16 @@ double		random_page_cost = DEFAULT_RANDOM_PAGE_COST;
 double		cpu_tuple_cost = DEFAULT_CPU_TUPLE_COST;
 double		cpu_index_tuple_cost = DEFAULT_CPU_INDEX_TUPLE_COST;
 double		cpu_operator_cost = DEFAULT_CPU_OPERATOR_COST;
+double		cpu_tuple_comm_cost = DEFAULT_CPU_TUPLE_COMM_COST;
+double		parallel_setup_cost = DEFAULT_PARALLEL_SETUP_COST;
+double		parallel_startup_cost = DEFAULT_PARALLEL_STARTUP_COST;
 
 int			effective_cache_size = DEFAULT_EFFECTIVE_CACHE_SIZE;
 
 Cost		disable_cost = 1.0e10;
 
+int	parallel_seqscan_degree = 0;
+
 bool		enable_seqscan = true;
 bool		enable_indexscan = true;
 bool		enable_indexonlyscan = true;
@@ -219,6 +227,73 @@ cost_seqscan(Path *path, PlannerInfo *root,
 }
 
 /*
+ * cost_parallelseqscan
+ *	  Determines and returns the cost of scanning a relation in parallel.
+ *
+ * 'baserel' is the relation to be scanned
+ * 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
+ */
+void
+cost_parallelseqscan(ParallelSeqPath *path, PlannerInfo *root,
+			 RelOptInfo *baserel, ParamPathInfo *param_info, int nWorkers)
+{
+	Cost		startup_cost = 0;
+	Cost		run_cost = 0;
+	double		spc_seq_page_cost;
+	QualCost	qpqual_cost;
+	Cost		cpu_per_tuple;
+
+	/* Should only be applied to base relations */
+	Assert(baserel->relid > 0);
+	Assert(baserel->rtekind == RTE_RELATION);
+
+	/* Mark the path with the correct row estimate */
+	if (param_info)
+		path->path.rows = param_info->ppi_rows;
+	else
+		path->path.rows = baserel->rows;
+
+	if (!enable_seqscan)
+		startup_cost += disable_cost;
+
+	/* fetch estimated page cost for tablespace containing table */
+	get_tablespace_page_costs(baserel->reltablespace,
+							  NULL,
+							  &spc_seq_page_cost);
+
+	/*
+	 * disk costs
+	 */
+	run_cost += spc_seq_page_cost * baserel->pages;
+
+	/* CPU costs */
+	get_restriction_qual_cost(root, baserel, param_info, &qpqual_cost);
+
+	startup_cost += qpqual_cost.startup;
+	cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple;
+	run_cost += cpu_per_tuple * baserel->tuples;
+
+	/*
+	 * Runtime cost will be equally shared by all workers.
+	 * The assumption here is that disk access cost will also be
+	 * equally shared between workers, which is generally true
+	 * unless there are too many workers working on a relatively
+	 * small number of blocks.  If we come across any such case,
+	 * then we can think of changing the current cost model for
+	 * parallel sequential scan.
+	 */
+	run_cost = run_cost / (nWorkers + 1);
+
+	/* Parallel setup and communication cost. */
+	startup_cost += parallel_setup_cost;
+	startup_cost += parallel_startup_cost * nWorkers;
+	run_cost += cpu_tuple_comm_cost * baserel->tuples;
+
+	path->path.startup_cost = startup_cost;
+	path->path.total_cost = (startup_cost + run_cost);
+}
+
+/*
  * cost_index
  *	  Determines and returns the cost of scanning a relation using an index.
  *
diff --git a/src/backend/optimizer/path/parallelpath.c b/src/backend/optimizer/path/parallelpath.c
new file mode 100644
index 0000000..5245652
--- /dev/null
+++ b/src/backend/optimizer/path/parallelpath.c
@@ -0,0 +1,126 @@
+/*-------------------------------------------------------------------------
+ *
+ * parallelpath.c
+ *	  Routines to determine which conditions are usable for scanning
+ *	  a given relation, and create ParallelPaths accordingly.
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/optimizer/path/parallelpath.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "optimizer/cost.h"
+#include "optimizer/pathnode.h"
+#include "optimizer/paths.h"
+#include "optimizer/restrictinfo.h"
+#include "optimizer/clauses.h"
+
+
+/*
+ *	IsTargetListContainNonVars -
+ *		Check if target list contains non-Var entries.
+ */
+static bool
+IsTargetListContainNonVars(List *targetlist)
+{
+	ListCell   *l;
+
+	foreach(l, targetlist)
+	{
+		TargetEntry *te = (TargetEntry *) lfirst(l);
+
+		if (!IsA(te, TargetEntry))
+			continue;			/* probably should never happen */
+		if (!IsA(te->expr, Var))
+			return true;
+	}
+	return false;
+}
+
+/*
+ *	check_simple_qual -
+ *		Check if qual is made only of simple things we can
+ *		hand out directly to backend worker for execution.
+ *
+ *		XXX - Currently we don't push down an expression if it
+ *		contains a volatile function; eventually we need a
+ *		mechanism (proisparallel) with which we can distinguish
+ *		the functions that can be pushed down for execution by a
+ *		parallel worker.
+ */
+static bool
+check_simple_qual(Node *node)
+{
+	if (node == NULL)
+		return TRUE;
+
+	if (contain_volatile_functions(node))
+		return FALSE;
+
+	return TRUE;
+}
+
+/*
+ * create_parallelscan_paths
+ *	  Create paths corresponding to parallel scans of the given rel.
+ *	  Currently we only support parallel sequential scan.
+ *
+ *	  Candidate paths are added to the rel's pathlist (using add_path).
+ */
+void
+create_parallelscan_paths(PlannerInfo *root, RelOptInfo *rel)
+{
+	int num_parallel_workers = 0;
+
+	/*
+	 * parallel scan is possible only if user has set
+	 * parallel_seqscan_degree to value greater than 0.
+	 */
+	if (parallel_seqscan_degree <= 0)
+		return;
+
+	/*
+	 * parallel scan is not supported for joins.
+	 */
+	if (root->simple_rel_array_size > 2)
+		return;
+
+	/* parallel scan is supported only for SELECT statements. */
+	if (root->parse->commandType != CMD_SELECT)
+		return;
+
+	/*
+	 * parallel scan is not supported for non-var target list.
+	 *
+	 * XXX - This is to keep the implementation simple; it can be relaxed
+	 * in future.  Here we check root->parse->targetList instead of
+	 * rel->reltargetlist because rel->reltargetlist always contains
+	 * Vars (refer build_base_rel_tlists).
+	 */
+	if (IsTargetListContainNonVars(root->parse->targetList))
+		return;
+
+	/*
+	 * parallel scan is not supported for quals containing volatile functions
+	 */
+	if (!check_simple_qual((Node*) extract_actual_clauses(rel->baserestrictinfo, false)))
+		return;
+
+	/*
+	 * There should be at least one page to scan for each worker.
+	 */
+	if (parallel_seqscan_degree <= rel->pages)
+		num_parallel_workers = parallel_seqscan_degree;
+	else
+		num_parallel_workers = rel->pages;
+
+	add_path(rel, (Path *) create_parallelseqscan_path(root, rel,
+													   num_parallel_workers));
+}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 655be81..1c7f640 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -58,6 +58,9 @@ static Material *create_material_plan(PlannerInfo *root, MaterialPath *best_path
 static Plan *create_unique_plan(PlannerInfo *root, UniquePath *best_path);
 static SeqScan *create_seqscan_plan(PlannerInfo *root, Path *best_path,
 					List *tlist, List *scan_clauses);
+static Scan *create_parallelseqscan_plan(PlannerInfo *root,
+										 ParallelSeqPath *best_path,
+										 List *tlist, List *scan_clauses);
 static Scan *create_indexscan_plan(PlannerInfo *root, IndexPath *best_path,
 					  List *tlist, List *scan_clauses, bool indexonly);
 static BitmapHeapScan *create_bitmap_scan_plan(PlannerInfo *root,
@@ -100,6 +103,9 @@ static List *order_qual_clauses(PlannerInfo *root, List *clauses);
 static void copy_path_costsize(Plan *dest, Path *src);
 static void copy_plan_costsize(Plan *dest, Plan *src);
 static SeqScan *make_seqscan(List *qptlist, List *qpqual, Index scanrelid);
+static ParallelSeqScan *make_parallelseqscan(List *qptlist, List *qpqual,
+											 Index scanrelid, int nworkers,
+											 BlockNumber nblocksperworker);
 static IndexScan *make_indexscan(List *qptlist, List *qpqual, Index scanrelid,
 			   Oid indexid, List *indexqual, List *indexqualorig,
 			   List *indexorderby, List *indexorderbyorig,
@@ -228,6 +234,7 @@ create_plan_recurse(PlannerInfo *root, Path *best_path)
 	switch (best_path->pathtype)
 	{
 		case T_SeqScan:
+		case T_ParallelSeqScan:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -343,6 +350,13 @@ create_scan_plan(PlannerInfo *root, Path *best_path)
 												scan_clauses);
 			break;
 
+		case T_ParallelSeqScan:
+			plan = (Plan *) create_parallelseqscan_plan(root,
+														(ParallelSeqPath *) best_path,
+														tlist,
+														scan_clauses);
+			break;
+
 		case T_IndexScan:
 			plan = (Plan *) create_indexscan_plan(root,
 												  (IndexPath *) best_path,
@@ -1133,6 +1147,71 @@ create_seqscan_plan(PlannerInfo *root, Path *best_path,
 }
 
 /*
+ * create_worker_seqscan_plan
+ *	 Returns a seqscan plan for the base relation scanned by worker
+ *	 with restriction clauses 'scan_clauses' and targetlist 'tlist'.
+ */
+SeqScan *
+create_worker_seqscan_plan(List *targetList, List *scan_clauses,
+						   BlockNumber startBlock, BlockNumber endBlock)
+{
+	SeqScan    *scan_plan;
+
+	/*
+	 * Pass scan_relid as 1, this is okay for now as sequential scan worker
+	 * is allowed to operate on just one relation.
+	 * XXX - we should ideally get scanrelid from master backend.
+	 */
+	scan_plan = make_seqscan(targetList,
+							 scan_clauses,
+							 1);
+
+	scan_plan->startblock = startBlock;
+	scan_plan->endblock = endBlock;
+	return scan_plan;
+}
+
+/*
+ * create_parallelseqscan_plan
+ *	 Returns a parallel seqscan plan for the base relation scanned by 'best_path'
+ *	 with restriction clauses 'scan_clauses' and targetlist 'tlist'.
+ */
+static Scan *
+create_parallelseqscan_plan(PlannerInfo *root, ParallelSeqPath *best_path,
+					List *tlist, List *scan_clauses)
+{
+	Scan    *scan_plan;
+	Index		scan_relid = best_path->path.parent->relid;
+
+	/* it should be a base rel... */
+	Assert(scan_relid > 0);
+	Assert(best_path->path.parent->rtekind == RTE_RELATION);
+
+	/* Sort clauses into best execution order */
+	scan_clauses = order_qual_clauses(root, scan_clauses);
+
+	/* Reduce RestrictInfo list to bare expressions; ignore pseudoconstants */
+	scan_clauses = extract_actual_clauses(scan_clauses, false);
+
+	/* Replace any outer-relation variables with nestloop params */
+	if (best_path->path.param_info)
+	{
+		scan_clauses = (List *)
+			replace_nestloop_params(root, (Node *) scan_clauses);
+	}
+
+	scan_plan = (Scan *) make_parallelseqscan(tlist,
+											  scan_clauses,
+											  scan_relid,
+											  best_path->num_workers,
+											  best_path->num_blocks_per_worker);
+
+	copy_path_costsize(&scan_plan->plan, &best_path->path);
+
+	return scan_plan;
+}
+
+/*
  * create_indexscan_plan
  *	  Returns an indexscan plan for the base relation scanned by 'best_path'
  *	  with restriction clauses 'scan_clauses' and targetlist 'tlist'.
@@ -3314,6 +3393,30 @@ make_seqscan(List *qptlist,
 	plan->lefttree = NULL;
 	plan->righttree = NULL;
 	node->scanrelid = scanrelid;
+	node->startblock = InvalidBlockNumber;
+	node->endblock = InvalidBlockNumber;
+
+	return node;
+}
+
+static ParallelSeqScan *
+make_parallelseqscan(List *qptlist,
+			   List *qpqual,
+			   Index scanrelid,
+			   int nworkers,
+			   BlockNumber nblocksperworker)
+{
+	ParallelSeqScan *node = makeNode(ParallelSeqScan);
+	Plan	   *plan = &node->scan.plan;
+
+	/* cost should be inserted by caller */
+	plan->targetlist = qptlist;
+	plan->qual = qpqual;
+	plan->lefttree = NULL;
+	plan->righttree = NULL;
+	node->scan.scanrelid = scanrelid;
+	node->num_workers = nworkers;
+	node->num_blocks_per_worker = nblocksperworker;
 
 	return node;
 }
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 9cbbcfb..d2b1621 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -260,6 +260,71 @@ standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
 	return result;
 }
 
+/*
+ * create_worker_seqscan_plannedstmt
+ *	Returns a planned statement to be used by worker for execution.
+ *	Ideally, master backend should form worker's planned statement
+ *	and pass the same to worker, however for now master backend
+ *	just passes the required information and PlannedStmt is then
+ *	constructed by worker.
+ */
+PlannedStmt	*
+create_worker_seqscan_plannedstmt(worker_stmt *workerstmt)
+{
+	AclMode		required_access = ACL_SELECT;
+	RangeTblEntry *rte;
+	SeqScan    *scan_plan;
+	PlannedStmt	*result;
+	ListCell   *tlist;
+
+	rte = makeNode(RangeTblEntry);
+	rte->rtekind = RTE_RELATION;
+	rte->relid = workerstmt->relId;
+	rte->relkind = 'r';
+	rte->requiredPerms = required_access;
+
+	/* Fill in opfuncid values if missing */
+	fix_opfuncids((Node*) workerstmt->qual);
+
+	/*
+	 * Avoid removing junk entries in worker as those are
+	 * required by upper nodes in master backend.
+	 */
+	foreach(tlist, workerstmt->targetList)
+	{
+		TargetEntry *tle = (TargetEntry *) lfirst(tlist);
+
+		tle->resjunk = false;
+	}
+
+	scan_plan = create_worker_seqscan_plan(workerstmt->targetList,
+										   workerstmt->qual,
+										   workerstmt->startBlock,
+										   workerstmt->endBlock);
+
+	/* build the PlannedStmt result */
+	result = makeNode(PlannedStmt);
+
+	result->commandType = CMD_SELECT;
+	result->queryId = 0;
+	result->hasReturning = false;
+	result->hasModifyingCTE = false;
+	result->canSetTag = true;
+	result->transientPlan = false;
+	result->planTree = (Plan*) scan_plan;
+	result->rtable = list_make1(rte);
+	result->resultRelations = NIL;
+	result->utilityStmt = NULL;
+	result->subplans = NIL;
+	result->rewindPlanIDs = NULL;
+	result->rowMarks = NIL;
+	result->relationOids = lappend_oid(result->relationOids, rte->relid);
+	result->invalItems = NIL;
+	result->nParamExec = 0;
+	result->hasRowSecurity = false;
+
+	return result;
+}
 
 /*--------------------
  * subquery_planner
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 7703946..3a44aef 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -436,6 +436,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_ParallelSeqScan:
 			{
 				SeqScan    *splan = (SeqScan *) plan;
 
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 1395a21..538e612 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -706,6 +706,41 @@ create_seqscan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer)
 }
 
 /*
+ * create_parallelseqscan_path
+ *	  Creates a path corresponding to a parallel sequential scan, returning the
+ *	  pathnode.
+ */
+ParallelSeqPath *
+create_parallelseqscan_path(PlannerInfo *root, RelOptInfo *rel, int nWorkers)
+{
+	ParallelSeqPath	   *pathnode = makeNode(ParallelSeqPath);
+
+	pathnode->path.pathtype = T_ParallelSeqScan;
+	pathnode->path.parent = rel;
+	pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
+													 false);
+	pathnode->path.pathkeys = NIL;	/* seqscan has unordered result */
+
+	pathnode->num_workers = nWorkers;
+	/*
+	 * Divide the work equally among all the workers.  For cases
+	 * where the division is not equal (for example, 10 blocks
+	 * shared by 2 workers and the master backend, where as per the
+	 * calculation below each worker will scan 3 blocks), the last
+	 * worker is responsible for scanning the remaining blocks.  We
+	 * always consider master backend as last worker because it will
+	 * first try to get the tuples scanned by other workers.  For
+	 * calculation of number of blocks per worker, an additional
+	 * worker needs to be considered for master backend.
+	 */
+	pathnode->num_blocks_per_worker = rel->pages / (nWorkers + 1);
+
+	cost_parallelseqscan(pathnode, root, rel, pathnode->path.param_info, nWorkers);
+
+	return pathnode;
+}
+
+/*
  * create_index_path
  *	  Creates a path node for an index scan.
  *
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 71c2321..f056bd5 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -12,7 +12,8 @@ subdir = src/backend/postmaster
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = autovacuum.o bgworker.o bgwriter.o checkpointer.o fork_process.o \
-	pgarch.o pgstat.o postmaster.o startup.o syslogger.o walwriter.o
+OBJS = autovacuum.o backendworker.o bgworker.o bgwriter.o checkpointer.o \
+	fork_process.o pgarch.o pgstat.o postmaster.o startup.o syslogger.o \
+	walwriter.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/backendworker.c b/src/backend/postmaster/backendworker.c
new file mode 100644
index 0000000..d52d1b6
--- /dev/null
+++ b/src/backend/postmaster/backendworker.c
@@ -0,0 +1,224 @@
+/*-------------------------------------------------------------------------
+ *
+ * backendworker.c
+ *	  Support routines for setting up backend workers.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/postmaster/backendworker.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * INTERFACE ROUTINES
+ *		InitiateWorkers				Set up dynamic shared memory and parallel backend workers.
+ */
+#include "postgres.h"
+
+#include "access/xact.h"
+#include "access/parallel.h"
+#include "commands/dbcommands.h"
+#include "commands/async.h"
+#include "executor/nodeParallelSeqscan.h"
+#include "miscadmin.h"
+#include "nodes/parsenodes.h"
+#include "postmaster/backendworker.h"
+#include "storage/ipc.h"
+#include "storage/procsignal.h"
+#include "storage/procarray.h"
+#include "storage/shm_toc.h"
+#include "storage/spin.h"
+#include "tcop/tcopprot.h"
+#include "utils/builtins.h"
+#include "utils/memutils.h"
+#include "utils/resowner.h"
+
+
+#define PARALLEL_TUPLE_QUEUE_SIZE					65536
+
+
+/* Table-of-contents constants for our dynamic shared memory segment. */
+#define PG_WORKER_KEY_RELID			0
+#define PG_WORKER_KEY_TARGETLIST	1
+#define PG_WORKER_KEY_QUAL			2
+#define PG_WORKER_KEY_BLOCKS		3
+#define PARALLEL_KEY_TUPLE_QUEUE	4
+
+static void exec_worker_message(dsm_segment *seg, shm_toc *toc);
+
+/*
+ * InitiateWorkers
+ *		It sets up the required infrastructure for backend workers to
+ *	perform execution and return results to the main backend.
+ */
+void
+InitiateWorkers(Oid relId, List *targetList, List *qual,
+				shm_mq_handle ***responseqp, ParallelContext **pcxtp,
+				BlockNumber numBlocksPerWorker, int nWorkers)
+{
+	bool		already_in_parallel_mode = IsInParallelMode();
+	int			i;
+	Size		targetlist_len, qual_len;
+	BlockNumber	*num_blocks_per_worker;
+	Oid		   *reliddata;
+	char	   *targetlistdata;
+	char	   *targetlist_str;
+	char	   *qualdata;
+	char	   *qual_str;
+	char	   *tuple_queue_space;
+	ParallelContext *pcxt;
+	shm_mq	   *mq;
+
+	if (!already_in_parallel_mode)
+		EnterParallelMode();
+
+	pcxt = CreateParallelContext(exec_worker_message, nWorkers);
+
+	/* Estimate space for parallel seq. scan specific contents. */
+	shm_toc_estimate_chunk(&pcxt->estimator, sizeof(relId));
+
+	targetlist_str = nodeToString(targetList);
+	targetlist_len = strlen(targetlist_str) + 1;
+	shm_toc_estimate_chunk(&pcxt->estimator, targetlist_len);
+
+	qual_str = nodeToString(qual);
+	qual_len = strlen(qual_str) + 1;
+	shm_toc_estimate_chunk(&pcxt->estimator, qual_len);
+
+	shm_toc_estimate_chunk(&pcxt->estimator, sizeof(BlockNumber));
+
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   (Size) PARALLEL_TUPLE_QUEUE_SIZE * nWorkers);
+
+	/* 5 keys for parallel seq. scan specific data. */
+	shm_toc_estimate_keys(&pcxt->estimator, 5);
+
+	InitializeParallelDSM(pcxt);
+
+	/* Store scan relation id in dynamic shared memory. */
+	reliddata = shm_toc_allocate(pcxt->toc, sizeof(Oid));
+	*reliddata = relId;
+	shm_toc_insert(pcxt->toc, PG_WORKER_KEY_RELID, reliddata);
+
+	/* Store target list in dynamic shared memory. */
+	targetlistdata = shm_toc_allocate(pcxt->toc, targetlist_len);
+	memcpy(targetlistdata, targetlist_str, targetlist_len);
+	shm_toc_insert(pcxt->toc, PG_WORKER_KEY_TARGETLIST, targetlistdata);
+
+	/* Store qual list in dynamic shared memory. */
+	qualdata = shm_toc_allocate(pcxt->toc, qual_len);
+	memcpy(qualdata, qual_str, qual_len);
+	shm_toc_insert(pcxt->toc, PG_WORKER_KEY_QUAL, qualdata);
+
+	/* Store blocks to be scanned by each worker in dynamic shared memory. */
+	num_blocks_per_worker = shm_toc_allocate(pcxt->toc, sizeof(BlockNumber));
+	*num_blocks_per_worker = numBlocksPerWorker;
+	shm_toc_insert(pcxt->toc, PG_WORKER_KEY_BLOCKS, num_blocks_per_worker);
+
+	/* Allocate memory for shared memory queue handles. */
+	*responseqp = (shm_mq_handle**) palloc(nWorkers * sizeof(shm_mq_handle*));
+
+	/*
+	 * Establish one message queue per worker in dynamic shared memory.
+	 * These queues should be used to transmit tuple data. 
+	 */
+	tuple_queue_space =
+	   shm_toc_allocate(pcxt->toc, PARALLEL_TUPLE_QUEUE_SIZE * pcxt->nworkers);
+	for (i = 0; i < pcxt->nworkers; ++i)
+	{
+		mq = shm_mq_create(tuple_queue_space + i * PARALLEL_TUPLE_QUEUE_SIZE,
+						   (Size) PARALLEL_TUPLE_QUEUE_SIZE);
+		
+		shm_mq_set_receiver(mq, MyProc);
+
+		/*
+		 * Attach the queue before launching a worker, so that we'll automatically
+		 * detach the queue if we error out.  (Otherwise, the worker might sit
+		 * there trying to write the queue long after we've gone away.)
+		 */
+		(*responseqp)[i] = shm_mq_attach(mq, pcxt->seg, NULL);
+	}
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_TUPLE_QUEUE, tuple_queue_space);
+
+	/* Register backend workers. */
+	LaunchParallelWorkers(pcxt);
+
+	for (i = 0; i < pcxt->nworkers; ++i)
+		shm_mq_set_handle((*responseqp)[i], pcxt->worker[i].bgwhandle);
+
+	/* Return results to caller. */
+	*pcxtp = pcxt;
+}
+
+
+/*
+ * exec_worker_message
+ *
+ * Execute the work assigned to a worker by master backend.
+ */
+void
+exec_worker_message(dsm_segment *seg, shm_toc *toc)
+{
+	char	    *targetlistdata;
+	char		*qualdata;
+	char		*tuple_queue_space;
+	BlockNumber *num_blocks_per_worker;
+	BlockNumber  start_block;
+	BlockNumber  end_block;
+	shm_mq	    *mq;
+	shm_mq_handle *responseq;
+	Oid			*relId;
+	List		*targetList = NIL;
+	List		*qual = NIL;
+	worker_stmt	*workerstmt;
+	
+	relId = shm_toc_lookup(toc, PG_WORKER_KEY_RELID);
+	targetlistdata = shm_toc_lookup(toc, PG_WORKER_KEY_TARGETLIST);
+	qualdata = shm_toc_lookup(toc, PG_WORKER_KEY_QUAL);
+	num_blocks_per_worker = shm_toc_lookup(toc, PG_WORKER_KEY_BLOCKS);
+
+	tuple_queue_space = shm_toc_lookup(toc, PARALLEL_KEY_TUPLE_QUEUE);
+	mq = (shm_mq *) (tuple_queue_space +
+		ParallelWorkerNumber * PARALLEL_TUPLE_QUEUE_SIZE);
+
+	shm_mq_set_sender(mq, MyProc);
+	responseq = shm_mq_attach(mq, seg, NULL);
+
+	end_block = (ParallelWorkerNumber + 1) * (*num_blocks_per_worker);
+	start_block = end_block - (*num_blocks_per_worker);
+
+	/* Redirect protocol messages to responseq. */
+	pq_redirect_to_tuple_shm_mq(responseq);
+
+	/* Restore targetList and qual passed by main backend. */
+	targetList = (List *) stringToNode(targetlistdata);
+	qual = (List *) stringToNode(qualdata);
+
+	workerstmt = palloc(sizeof(worker_stmt));
+
+	workerstmt->relId = *relId;
+	workerstmt->targetList = targetList;
+	workerstmt->qual = qual;
+	workerstmt->startBlock = start_block;
+
+	/*
+	 * Last worker should scan all the remaining blocks.
+	 *
+	 * XXX - It is possible that expected number of workers
+	 * won't get started, so to handle such cases master
+	 * backend should scan remaining blocks.
+	 */
+	workerstmt->endBlock = end_block;
+
+	/* Execute the worker command. */
+	exec_worker_stmt(workerstmt);
+
+	/*
+	 * Once we are done with sending tuples, detach from
+	 * shared memory message queue used to send tuples.
+	 */
+	shm_mq_detach(mq);
+}
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 47ed84c..994eeba 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -103,6 +103,7 @@
 #include "miscadmin.h"
 #include "pg_getopt.h"
 #include "pgstat.h"
+#include "optimizer/cost.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/bgworker_internals.h"
 #include "postmaster/fork_process.h"
@@ -835,6 +836,12 @@ PostmasterMain(int argc, char *argv[])
 		ereport(ERROR,
 				(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"archive\", \"hot_standby\", or \"logical\"")));
 
+	if (parallel_seqscan_degree >= MaxConnections)
+	{
+		write_stderr("%s: parallel_seqscan_degree must be less than max_connections\n", progname);
+		ExitPostmaster(1);
+	}
+
 	/*
 	 * Other one-time internal sanity checks can go here, if they are fast.
 	 * (Put any slow processing further down, after postmaster.pid creation.)
diff --git a/src/backend/tcop/dest.c b/src/backend/tcop/dest.c
index bcf3895..da6e099 100644
--- a/src/backend/tcop/dest.c
+++ b/src/backend/tcop/dest.c
@@ -148,10 +148,19 @@ EndCommand(const char *commandTag, CommandDest dest)
 		case DestRemoteExecute:
 
 			/*
-			 * We assume the commandTag is plain ASCII and therefore requires
-			 * no encoding conversion.
+			 * Send the message via shared-memory tuple queue, if the same
+			 * is enabled.
 			 */
-			pq_putmessage('C', commandTag, strlen(commandTag) + 1);
+			if (is_tuple_shm_mq_enabled())
+				mq_putmessage_direct('C', commandTag, strlen(commandTag) + 1);
+			else
+			{
+				/*
+				 * We assume the commandTag is plain ASCII and therefore requires
+				 * no encoding conversion.
+				 */
+				pq_putmessage('C', commandTag, strlen(commandTag) + 1);
+			}
 			break;
 
 		case DestNone:
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index bbad0dc..411f150 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -55,6 +55,7 @@
 #include "pg_getopt.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/postmaster.h"
+#include "postmaster/backendworker.h"
 #include "replication/slot.h"
 #include "replication/walsender.h"
 #include "rewrite/rewriteHandler.h"
@@ -1132,6 +1133,100 @@ exec_simple_query(const char *query_string)
 }
 
 /*
+ * exec_worker_stmt
+ *
+ * Execute the plan for backend worker.
+ */
+void
+exec_worker_stmt(worker_stmt *workerstmt)
+{
+	Portal		portal;
+	int16		format = 1;
+	DestReceiver *receiver;
+	bool		isTopLevel = true;
+	PlannedStmt	*planned_stmt;
+	MemoryContext oldcontext;
+	MemoryContext	plancontext;
+
+	set_ps_display("SELECT", false);
+	BeginCommand("SELECT", DestNone);
+
+	/*
+	 * Unlike exec_simple_query(), in backend worker we won't allow
+	 * transaction control statements, so we can allow plancontext
+	 * to be created in TopTransaction context.
+	 */
+	plancontext = AllocSetContextCreate(CurrentMemoryContext,
+										 "worker plan",
+										 ALLOCSET_DEFAULT_MINSIZE,
+										 ALLOCSET_DEFAULT_INITSIZE,
+										 ALLOCSET_DEFAULT_MAXSIZE);
+
+	oldcontext = MemoryContextSwitchTo(plancontext);
+
+	planned_stmt = create_worker_seqscan_plannedstmt(workerstmt);
+	/*
+	 * Create unnamed portal to run the query or queries in. If there
+	 * already is one, silently drop it.
+	 */
+	portal = CreatePortal("", true, true);
+	/* Don't display the portal in pg_cursors */
+	portal->visible = false;
+
+	/*
+	 * We don't have to copy anything into the portal, because everything
+	 * we are passing here is in MessageContext, which will outlive the
+	 * portal anyway.
+	 */
+	PortalDefineQuery(portal,
+					  NULL,
+					  "",
+					  "",
+					  list_make1(planned_stmt),
+					  NULL);
+
+	/*
+	 * Start the portal.  No parameters here.
+	 */
+	PortalStart(portal, NULL, 0, InvalidSnapshot);
+
+	/* We always use binary format, for efficiency. */
+	PortalSetResultFormat(portal, 1, &format);
+
+	receiver = CreateDestReceiver(DestRemote);
+	SetRemoteDestReceiverParams(receiver, portal);
+
+	/*
+	 * Only once the portal and destreceiver have been established can
+	 * we return to the transaction context.  All that stuff needs to
+	 * survive an internal commit inside PortalRun!
+	 */
+	MemoryContextSwitchTo(oldcontext);
+
+	/*
+	 * Run the portal to completion, and then drop it (and the receiver).
+	 */
+	(void) PortalRun(portal,
+					 FETCH_ALL,
+					 isTopLevel,
+					 receiver,
+					 receiver,
+					 NULL);
+
+	(*receiver->rDestroy) (receiver);
+
+	PortalDrop(portal, false);
+
+	/*
+	 * Send appropriate CommandComplete to client.  There is no
+	 * need to send completion tag from worker as that won't be
+	 * of any use considering the completion tag of master backend
+	 * will be used for sending to client.
+	 */
+	EndCommand("", DestRemote);
+}
+
+/*
  * exec_parse_message
  *
  * Execute a "Parse" protocol message.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index d9bfa25..b8f90b7 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -630,6 +630,8 @@ const char *const config_group_names[] =
 	gettext_noop("Statistics / Query and Index Statistics Collector"),
 	/* AUTOVACUUM */
 	gettext_noop("Autovacuum"),
+	/* PARALLEL_QUERY */
+	gettext_noop("Parallel Query"),
 	/* CLIENT_CONN */
 	gettext_noop("Client Connection Defaults"),
 	/* CLIENT_CONN_STATEMENT */
@@ -2445,6 +2447,16 @@ static struct config_int ConfigureNamesInt[] =
 	},
 
 	{
+		{"parallel_seqscan_degree", PGC_SUSET, PARALLEL_QUERY,
+			gettext_noop("Sets the maximum number of simultaneously running backend worker processes."),
+			NULL
+		},
+		&parallel_seqscan_degree,
+		0, 0, MAX_BACKENDS,
+		NULL, NULL, NULL
+	},
+
+	{
 		{"autovacuum_work_mem", PGC_SIGHUP, RESOURCES_MEM,
 			gettext_noop("Sets the maximum memory to be used by each autovacuum worker process."),
 			NULL,
@@ -2632,6 +2644,36 @@ static struct config_real ConfigureNamesReal[] =
 		DEFAULT_CPU_OPERATOR_COST, 0, DBL_MAX,
 		NULL, NULL, NULL
 	},
+	{
+		{"cpu_tuple_comm_cost", PGC_USERSET, QUERY_TUNING_COST,
+			gettext_noop("Sets the planner's estimate of the cost of "
+						 "passing each tuple (row) from worker to master backend."),
+			NULL
+		},
+		&cpu_tuple_comm_cost,
+		DEFAULT_CPU_TUPLE_COMM_COST, 0, DBL_MAX,
+		NULL, NULL, NULL
+	},
+	{
+		{"parallel_setup_cost", PGC_USERSET, QUERY_TUNING_COST,
+			gettext_noop("Sets the planner's estimate of the cost of "
+						 "setting up environment (shared memory) for parallelism."),
+			NULL
+		},
+		&parallel_setup_cost,
+		DEFAULT_PARALLEL_SETUP_COST, 0, DBL_MAX,
+		NULL, NULL, NULL
+	},
+	{
+		{"parallel_startup_cost", PGC_USERSET, QUERY_TUNING_COST,
+			gettext_noop("Sets the planner's estimate of the cost of "
+						 "starting parallel workers."),
+			NULL
+		},
+		&parallel_startup_cost,
+		DEFAULT_PARALLEL_STARTUP_COST, 0, DBL_MAX,
+		NULL, NULL, NULL
+	},
 
 	{
 		{"cursor_tuple_fraction", PGC_USERSET, QUERY_TUNING_OTHER,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index b053659..784cfe0 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -287,6 +287,9 @@
 #cpu_tuple_cost = 0.01			# same scale as above
 #cpu_index_tuple_cost = 0.005		# same scale as above
 #cpu_operator_cost = 0.0025		# same scale as above
+#cpu_tuple_comm_cost = 0.1		# same scale as above
+#parallel_setup_cost = 0.0	# same scale as above
+#parallel_startup_cost = 0.0	# same scale as above
 #effective_cache_size = 4GB
 
 # - Genetic Query Optimizer -
@@ -497,6 +500,11 @@
 					# autovacuum, -1 means use
 					# vacuum_cost_limit
 
+#------------------------------------------------------------------------------
+# PARALLEL_QUERY PARAMETERS
+#------------------------------------------------------------------------------
+
+#parallel_seqscan_degree = 0		# max number of worker backend subprocesses
 
 #------------------------------------------------------------------------------
 # CLIENT CONNECTION DEFAULTS
diff --git a/src/include/access/parallel.h b/src/include/access/parallel.h
index 761ba1f..00ad468 100644
--- a/src/include/access/parallel.h
+++ b/src/include/access/parallel.h
@@ -45,6 +45,8 @@ typedef struct ParallelContext
 
 extern bool ParallelMessagePending;
 
+extern int ParallelWorkerNumber;
+
 extern ParallelContext *CreateParallelContext(parallel_worker_main_type entrypoint, int nworkers);
 extern ParallelContext *CreateParallelContextForExtension(char *library_name,
 								  char *function_name, int nworkers);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 9bb6362..3c56b49 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -105,4 +105,13 @@ typedef struct SysScanDescData
 	Snapshot	snapshot;		/* snapshot to unregister at end of scan */
 }	SysScanDescData;
 
+/* struct for scanning shared memory queues */
+typedef struct ShmScanDescData
+{
+	/* scan current state */
+	int			num_shm_queues;	/* number of shared memory queues used in scan. */
+	int			ss_cqueue;		/* current queue # in scan, if any */
+	bool		shmscan_inited;		/* false = scan not init'd yet */
+}	ShmScanDescData;
+
 #endif   /* RELSCAN_H */
diff --git a/src/include/access/shmmqam.h b/src/include/access/shmmqam.h
new file mode 100644
index 0000000..df56cfe
--- /dev/null
+++ b/src/include/access/shmmqam.h
@@ -0,0 +1,44 @@
+/*-------------------------------------------------------------------------
+ *
+ * shmmqam.h
+ *	  POSTGRES shared memory queue access method definitions.
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/shmmqam.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef SHMMQAM_H
+#define SHMMQAM_H
+
+#include "access/relscan.h"
+#include "libpq/pqmq.h"
+
+
+/* Private state maintained across calls to shm_getnext. */
+typedef struct worker_result_state
+{
+	FmgrInfo   *receive_functions;
+	Oid		   *typioparams;
+	HeapTuple  tuple;
+	int		   num_shm_queues;
+	bool	   *has_row_description;
+	bool	   *queue_detached;
+	bool	   all_queues_detached;
+	bool	   all_heap_fetched;
+} worker_result_state;
+
+typedef struct worker_result_state *worker_result;
+
+typedef struct ShmScanDescData *ShmScanDesc;
+
+extern worker_result ExecInitWorkerResult(TupleDesc tupdesc, int nWorkers);
+extern ShmScanDesc shm_beginscan(int num_queues);
+extern HeapTuple shm_getnext(HeapScanDesc scanDesc, ShmScanDesc shmScan,
+							 worker_result resultState, shm_mq_handle **responseq,
+							 TupleDesc tupdesc, ScanDirection direction, bool *fromheap);
+
+#endif   /* SHMMQAM_H */
diff --git a/src/include/executor/nodeParallelSeqscan.h b/src/include/executor/nodeParallelSeqscan.h
new file mode 100644
index 0000000..b638a24
--- /dev/null
+++ b/src/include/executor/nodeParallelSeqscan.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeParallelSeqscan.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeParallelSeqscan.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEPARALLELSEQSCAN_H
+#define NODEPARALLELSEQSCAN_H
+
+#include "nodes/execnodes.h"
+
+extern ParallelSeqScanState *ExecInitParallelSeqScan(ParallelSeqScan *node, EState *estate, int eflags);
+extern TupleTableSlot *ExecParallelSeqScan(ParallelSeqScanState *node);
+extern void ExecEndParallelSeqScan(ParallelSeqScanState *node);
+
+extern Size EstimateScanRelationIdSpace(Oid relId);
+extern void SerializeScanRelationId(Oid relId, Size maxsize,
+									char *start_address);
+extern void RestoreScanRelationId(Oid *relId, char *start_address);
+
+extern Size EstimateTargetListSpace(List *targetList);
+extern void SerializeTargetList(List *targetList, Size maxsize,
+								char *start_address);
+extern void RestoreTargetList(List **targetList, char *start_address);
+
+#endif   /* NODEPARALLELSEQSCAN_H */
diff --git a/src/include/executor/tuptable.h b/src/include/executor/tuptable.h
index 48f84bf..e5dec1e 100644
--- a/src/include/executor/tuptable.h
+++ b/src/include/executor/tuptable.h
@@ -127,6 +127,8 @@ typedef struct TupleTableSlot
 	MinimalTuple tts_mintuple;	/* minimal tuple, or NULL if none */
 	HeapTupleData tts_minhdr;	/* workspace for minimal-tuple-only case */
 	long		tts_off;		/* saved state for slot_deform_tuple */
+	bool		tts_fromheap;	/* indicates whether the tuple is fetched from
+								   heap or shared memory message queue */
 } TupleTableSlot;
 
 #define TTS_HAS_PHYSICAL_TUPLE(slot)  \
diff --git a/src/include/libpq/pqmq.h b/src/include/libpq/pqmq.h
index ad7589d..067edbe 100644
--- a/src/include/libpq/pqmq.h
+++ b/src/include/libpq/pqmq.h
@@ -19,6 +19,13 @@
 extern void	pq_redirect_to_shm_mq(shm_mq *, shm_mq_handle *);
 extern void pq_set_parallel_master(pid_t pid, BackendId backend_id);
 
+extern int
+mq_putmessage_direct(char msgtype, const char *s, size_t len);
+extern void
+pq_redirect_to_tuple_shm_mq(shm_mq_handle *mqh);
+extern bool
+is_tuple_shm_mq_enabled(void);
+
 extern void pq_parse_errornotice(StringInfo str, ErrorData *edata);
 
 #endif   /* PQMQ_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 41288ed..86f4731 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -16,9 +16,12 @@
 
 #include "access/genam.h"
 #include "access/heapam.h"
+#include "access/parallel.h"
+#include "access/shmmqam.h"
 #include "executor/instrument.h"
 #include "nodes/params.h"
 #include "nodes/plannodes.h"
+#include "storage/shm_mq.h"
 #include "utils/reltrigger.h"
 #include "utils/sortsupport.h"
 #include "utils/tuplestore.h"
@@ -1212,6 +1215,23 @@ typedef struct ScanState
 typedef ScanState SeqScanState;
 
 /*
+ * ParallelScanState extends ScanState by storing additional information
+ * related to parallel workers.
+ *		pcxt			parallel context, which owns the dynamic shared memory
+ *						segment used to set up the worker queues
+ *		responseq		shared memory queues to receive data from workers
+ */
+typedef struct ParallelScanState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	ParallelContext *pcxt;
+	shm_mq_handle **responseq;
+	ShmScanDesc pss_currentShmScanDesc;
+	worker_result	pss_workerResult;
+} ParallelScanState;
+
+typedef ParallelScanState ParallelSeqScanState;
+
+/*
  * These structs store information about index quals that don't have simple
  * constant right-hand sides.  See comments for ExecIndexBuildScanKeys()
  * for discussion.
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 97ef0fc..b6f1493 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -51,6 +51,7 @@ typedef enum NodeTag
 	T_BitmapOr,
 	T_Scan,
 	T_SeqScan,
+	T_ParallelSeqScan,
 	T_IndexScan,
 	T_IndexOnlyScan,
 	T_BitmapIndexScan,
@@ -97,6 +98,7 @@ typedef enum NodeTag
 	T_BitmapOrState,
 	T_ScanState,
 	T_SeqScanState,
+	T_ParallelSeqScanState,
 	T_IndexScanState,
 	T_IndexOnlyScanState,
 	T_BitmapIndexScanState,
@@ -217,6 +219,7 @@ typedef enum NodeTag
 	T_IndexOptInfo,
 	T_ParamPathInfo,
 	T_Path,
+	T_ParallelSeqPath,
 	T_IndexPath,
 	T_BitmapHeapPath,
 	T_BitmapAndPath,
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index b1dfa85..5777271 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -23,6 +23,7 @@
 #include "nodes/bitmapset.h"
 #include "nodes/primnodes.h"
 #include "nodes/value.h"
+#include "storage/block.h"
 #include "utils/lockwaitpolicy.h"
 
 /* Possible sources of a Query */
@@ -156,6 +157,15 @@ typedef struct Query
 								 * depends on to be semantically valid */
 } Query;
 
+/* worker statement required for execution. */
+typedef struct worker_stmt
+{
+	Oid			relId;
+	List		*targetList;
+	List		*qual;
+	BlockNumber startBlock;
+	BlockNumber endBlock;
+} worker_stmt;
 
 /****************************************************************************
  *	Supporting data structures for Parse Trees
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 316c9ce..3354398 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -18,6 +18,7 @@
 #include "lib/stringinfo.h"
 #include "nodes/bitmapset.h"
 #include "nodes/primnodes.h"
+#include "storage/block.h"
 #include "utils/lockwaitpolicy.h"
 
 
@@ -269,6 +270,8 @@ typedef struct Scan
 {
 	Plan		plan;
 	Index		scanrelid;		/* relid is index into the range table */
+	BlockNumber startblock;		/* block to start seq scan */
+	BlockNumber endblock;		/* block up to which scan has to be done */
 } Scan;
 
 /* ----------------
@@ -278,6 +281,17 @@ typedef struct Scan
 typedef Scan SeqScan;
 
 /* ----------------
+ *		parallel sequential scan node
+ * ----------------
+ */
+typedef struct ParallelSeqScan
+{
+	Scan		scan;
+	int			num_workers;
+	BlockNumber	num_blocks_per_worker;
+} ParallelSeqScan;
+
+/* ----------------
  *		index scan node
  *
  * indexqualorig is an implicitly-ANDed list of index qual expressions, each
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 6845a40..576add5 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -737,6 +737,13 @@ typedef struct Path
 	/* pathkeys is a List of PathKey nodes; see above */
 } Path;
 
+typedef struct ParallelSeqPath
+{
+	Path		path;
+	int			num_workers;
+	BlockNumber	num_blocks_per_worker;
+} ParallelSeqPath;
+
 /* Macro for extracting a path's parameterization relids; beware double eval */
 #define PATH_REQ_OUTER(path)  \
 	((path)->param_info ? (path)->param_info->ppi_req_outer : (Relids) NULL)
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 9c2000b..0b6a469 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -26,6 +26,14 @@
 #define DEFAULT_CPU_TUPLE_COST	0.01
 #define DEFAULT_CPU_INDEX_TUPLE_COST 0.005
 #define DEFAULT_CPU_OPERATOR_COST  0.0025
+#define DEFAULT_CPU_TUPLE_COMM_COST 0.1
+/*
+ * XXX - We need some experiments to know what could be
+ * appropriate default values for parallel setup and startup
+ * cost.
+ */
+#define	DEFAULT_PARALLEL_SETUP_COST  0.0
+#define	DEFAULT_PARALLEL_STARTUP_COST  0.0
 
 #define DEFAULT_EFFECTIVE_CACHE_SIZE  524288	/* measured in pages */
 
@@ -48,8 +56,12 @@ extern PGDLLIMPORT double random_page_cost;
 extern PGDLLIMPORT double cpu_tuple_cost;
 extern PGDLLIMPORT double cpu_index_tuple_cost;
 extern PGDLLIMPORT double cpu_operator_cost;
+extern PGDLLIMPORT double cpu_tuple_comm_cost;
+extern PGDLLIMPORT double parallel_setup_cost;
+extern PGDLLIMPORT double parallel_startup_cost;
 extern PGDLLIMPORT int effective_cache_size;
 extern Cost disable_cost;
+extern int	parallel_seqscan_degree;
 extern bool enable_seqscan;
 extern bool enable_indexscan;
 extern bool enable_indexonlyscan;
@@ -68,6 +80,8 @@ extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,
 					double index_pages, PlannerInfo *root);
 extern void cost_seqscan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 			 ParamPathInfo *param_info);
+extern void cost_parallelseqscan(ParallelSeqPath *path, PlannerInfo *root,
+			 RelOptInfo *baserel, ParamPathInfo *param_info, int nWorkers);
 extern void cost_index(IndexPath *path, PlannerInfo *root,
 		   double loop_count);
 extern void cost_bitmap_heap_scan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 9923f0e..32c3e0d 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -32,6 +32,8 @@ extern bool add_path_precheck(RelOptInfo *parent_rel,
 
 extern Path *create_seqscan_path(PlannerInfo *root, RelOptInfo *rel,
 					Relids required_outer);
+extern ParallelSeqPath *create_parallelseqscan_path(PlannerInfo *root,
+					RelOptInfo *rel, int nWorkers);
 extern IndexPath *create_index_path(PlannerInfo *root,
 				  IndexOptInfo *index,
 				  List *indexclauses,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 6cad92e..391d519 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -46,6 +46,13 @@ extern void debug_print_rel(PlannerInfo *root, RelOptInfo *rel);
 #endif
 
 /*
+ * parallelpath.c
+ *	  routines to generate parallel scan paths
+ */
+
+extern void create_parallelscan_paths(PlannerInfo *root, RelOptInfo *rel);
+
+/*
  * indxpath.c
  *	  routines to generate index paths
  */
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index 082f7d7..ef5a320 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -41,6 +41,9 @@ extern Plan *optimize_minmax_aggregates(PlannerInfo *root, List *tlist,
  * prototypes for plan/createplan.c
  */
 extern Plan *create_plan(PlannerInfo *root, Path *best_path);
+extern SeqScan *
+create_worker_seqscan_plan(List *targetList, List *scan_clauses,
+						   BlockNumber startBlock, BlockNumber endBlock);
 extern SubqueryScan *make_subqueryscan(List *qptlist, List *qpqual,
 				  Index scanrelid, Plan *subplan);
 extern ForeignScan *make_foreignscan(List *qptlist, List *qpqual,
diff --git a/src/include/optimizer/planner.h b/src/include/optimizer/planner.h
index cd62aec..91ddffe 100644
--- a/src/include/optimizer/planner.h
+++ b/src/include/optimizer/planner.h
@@ -14,6 +14,7 @@
 #ifndef PLANNER_H
 #define PLANNER_H
 
+#include "nodes/parsenodes.h"
 #include "nodes/plannodes.h"
 #include "nodes/relation.h"
 
@@ -29,6 +30,8 @@ extern PlannedStmt *planner(Query *parse, int cursorOptions,
 		ParamListInfo boundParams);
 extern PlannedStmt *standard_planner(Query *parse, int cursorOptions,
 				 ParamListInfo boundParams);
+extern PlannedStmt *
+create_worker_seqscan_plannedstmt(worker_stmt *workerstmt);
 
 extern Plan *subquery_planner(PlannerGlobal *glob, Query *parse,
 				 PlannerInfo *parent_root,
diff --git a/src/include/postmaster/backendworker.h b/src/include/postmaster/backendworker.h
new file mode 100644
index 0000000..8813b6d
--- /dev/null
+++ b/src/include/postmaster/backendworker.h
@@ -0,0 +1,30 @@
+/*--------------------------------------------------------------------
+ * backendworker.h
+ *		POSTGRES backend workers interface
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/postmaster/backendworker.h
+ *--------------------------------------------------------------------
+ */
+#ifndef BACKENDWORKER_H
+#define BACKENDWORKER_H
+
+/*---------------------------------------------------------------------
+ * External module API.
+ *---------------------------------------------------------------------
+ */
+
+#include "libpq/pqmq.h"
+
+extern int	parallel_seqscan_degree;
+extern void InitiateWorkers(Oid relId, List *targetList,
+							List *qual,
+							shm_mq_handle ***responseqp,
+							ParallelContext **pcxtp,
+							BlockNumber numBlocksPerWorker,
+							int nWorkers);
+
+#endif   /* BACKENDWORKER_H */
diff --git a/src/include/tcop/tcopprot.h b/src/include/tcop/tcopprot.h
index 0a350fd..02cf518 100644
--- a/src/include/tcop/tcopprot.h
+++ b/src/include/tcop/tcopprot.h
@@ -83,5 +83,6 @@ extern void set_debug_options(int debug_flag,
 extern bool set_plan_disabling_options(const char *arg,
 						   GucContext context, GucSource source);
 extern const char *get_stats_option_name(const char *arg);
+extern void exec_worker_stmt(worker_stmt *workerstmt);
 
 #endif   /* TCOPPROT_H */
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index cf319af..38855e5 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -85,6 +85,7 @@ enum config_group
 	STATS_MONITORING,
 	STATS_COLLECTOR,
 	AUTOVACUUM,
+	PARALLEL_QUERY,
 	CLIENT_CONN,
 	CLIENT_CONN_STATEMENT,
 	CLIENT_CONN_LOCALE,
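
The block-range arithmetic in exec_worker_message() can be looked at in isolation. Below is a minimal standalone sketch, not code from the patch: worker_block_range() and BlockRange are hypothetical names, the end of the range is assumed to be exclusive, and the last worker is assumed to pick up the blocks left over by integer division (the behaviour the "Last worker should scan all the remaining blocks" comment describes).

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t BlockNumber;	/* stand-in for PostgreSQL's BlockNumber */

/* Hypothetical helper type, not part of the patch. */
typedef struct BlockRange
{
	BlockNumber start;			/* first block to scan (inclusive) */
	BlockNumber end;			/* one past the last block (assumed exclusive) */
} BlockRange;

/*
 * Compute the block range for one worker when total_blocks are divided
 * among nworkers.  Worker numbers start at 0, as ParallelWorkerNumber does.
 */
BlockRange
worker_block_range(int worker, int nworkers, BlockNumber total_blocks)
{
	BlockNumber per_worker;
	BlockRange	range;

	assert(nworkers > 0);
	assert(worker >= 0 && worker < nworkers);

	per_worker = total_blocks / (BlockNumber) nworkers;
	range.start = (BlockNumber) worker * per_worker;
	range.end = range.start + per_worker;

	/* The last worker also takes the blocks left over by integer division. */
	if (worker == nworkers - 1)
		range.end = total_blocks;

	return range;
}
```

With 10 blocks and 3 workers this gives the ranges [0,3), [3,6) and [6,10). It does not address the XXX case where fewer workers start than were requested.
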
#102Kouhei Kaigai
kaigai@ak.jp.nec.com
In reply to: Amit Kapila (#101)
Re: Parallel Seq Scan

On Wed, Jan 21, 2015 at 4:31 PM, Amit Langote <amitlangote09@gmail.com> wrote:

On Wednesday, January 21, 2015, Amit Kapila <amit.kapila16@gmail.com> wrote:

Does it happen only when parallel_seqscan_degree > max_worker_processes?

I have max_worker_processes set to the default of 8 while
parallel_seqscan_degree is 4. So, this may be a case different from Thom's.

I think this happens because the memory used for forming tuples in the master
backend is retained for longer than necessary, which makes this statement take
much longer than it should. I have also fixed the other issue you reported
in the attached patch.

I think this patch is still not completely ready for general-purpose testing;
however, it would be helpful to run some tests to see in what kinds of
scenarios it gives a benefit. For example, in the test you are doing, rather
than increasing seq_page_cost, you could add an expensive WHERE condition
so that the parallel plan is selected automatically. I think it is better
to change one of the new parameters (parallel_setup_cost,
parallel_startup_cost and cpu_tuple_comm_cost) if you want your statement
to use a parallel plan; in your example, if you had reduced
cpu_tuple_comm_cost, it would have selected the parallel plan. That way we can
get some feedback about what the appropriate default values for the newly
added parameters should be. I am already planning to do some tests in that
regard; however, feedback from others would be helpful.
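
To make the trade-off between these parameters concrete, here is a rough standalone sketch. It is an assumed model, not the formula in cost_parallelseqscan(): the serial run cost is divided by the number of workers, parallel_startup_cost is charged once per worker, and cpu_tuple_comm_cost is charged once per tuple returned to the master backend. The structs and function names are hypothetical.

```c
#include <assert.h>

/* Illustrative inputs for one table scan (hypothetical struct). */
typedef struct ScanInputs
{
	double		pages;			/* heap pages to read */
	double		tuples;			/* tuples to examine */
	double		rows_out;		/* tuples that pass the WHERE clause */
	double		qual_cost;		/* per-tuple CPU cost of the WHERE clause */
} ScanInputs;

/* Cost parameters; field names follow the GUCs, the struct is hypothetical. */
typedef struct CostParams
{
	double		seq_page_cost;
	double		cpu_tuple_cost;
	double		cpu_tuple_comm_cost;
	double		parallel_setup_cost;
	double		parallel_startup_cost;
} CostParams;

/* Plain sequential scan: page reads plus per-tuple CPU work. */
double
serial_scan_cost(const ScanInputs *in, const CostParams *p)
{
	return p->seq_page_cost * in->pages +
		(p->cpu_tuple_cost + in->qual_cost) * in->tuples;
}

/*
 * ASSUMED parallel model: scan work is split evenly across workers, and
 * every tuple sent back to the master pays cpu_tuple_comm_cost.
 */
double
parallel_scan_cost(const ScanInputs *in, const CostParams *p, int nworkers)
{
	assert(nworkers > 0);

	return p->parallel_setup_cost +
		p->parallel_startup_cost * nworkers +
		serial_scan_cost(in, p) / nworkers +
		p->cpu_tuple_comm_cost * in->rows_out;
}
```

Under this model, with 1000 pages, 100000 tuples, a qual cost of 0.0025, seq_page_cost = 1.0, cpu_tuple_cost = 0.01 and 4 workers, returning every tuple at cpu_tuple_comm_cost = 0.1 makes the parallel scan far more expensive than the serial one (10562.5 versus 2250). A filter that lets only 100 rows through brings the parallel cost down to 572.5, and so does lowering cpu_tuple_comm_cost to 0.001 with all rows returned (662.5). That is the behaviour the paragraph above describes.
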

(Please point out if my understanding is incorrect.)

What happens if a dynamic background worker process tries to reference temporary
tables? Because the buffers for temporary table blocks are allocated in private
address space, their latest state is not visible to other processes unless it is
flushed to storage every time.

Do we need to prohibit create_parallelscan_paths() from generating a path when
the target relation is a temporary one?

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

#103Amit Kapila
amit.kapila16@gmail.com
In reply to: Kouhei Kaigai (#102)
Re: Parallel Seq Scan

On Thu, Jan 22, 2015 at 6:37 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

(Please point out if my understanding is incorrect.)

What happens if a dynamic background worker process tries to reference temporary
tables? Because the buffers for temporary table blocks are allocated in private
address space, their latest state is not visible to other processes unless it is
flushed to storage every time.

Do we need to prohibit create_parallelscan_paths() from generating a path when
the target relation is a temporary one?

Yes, we need to prohibit parallel scans on temporary relations. Will fix.
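
A minimal sketch of what such a check might look like, written standalone so it compiles outside the backend. The RELPERSISTENCE_* values match pg_class.h, but RelInfo and parallel_scan_allowed() are hypothetical names, and where the real test would live (for example inside create_parallelscan_paths()) is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Persistence codes as defined in PostgreSQL's pg_class.h. */
#define RELPERSISTENCE_PERMANENT	'p'
#define RELPERSISTENCE_UNLOGGED		'u'
#define RELPERSISTENCE_TEMP			't'

/* Hypothetical stand-in for the relation information the planner has. */
typedef struct RelInfo
{
	char		relpersistence;
} RelInfo;

/*
 * Decide whether a parallel seq scan path may be considered at all.
 * Temporary relations are excluded because their buffers are private
 * to the owning backend and cannot be read by worker processes.
 */
bool
parallel_scan_allowed(const RelInfo *rel, int parallel_degree)
{
	/* parallel_seqscan_degree = 0 means parallel scan is disabled. */
	if (parallel_degree <= 0)
		return false;

	if (rel->relpersistence == RELPERSISTENCE_TEMP)
		return false;

	return true;
}
```
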

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#104Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Amit Kapila (#101)
Re: Parallel Seq Scan

On 21-01-2015 PM 09:43, Amit Kapila wrote:

On Wed, Jan 21, 2015 at 4:31 PM, Amit Langote <amitlangote09@gmail.com> wrote:

On Wednesday, January 21, 2015, Amit Kapila <amit.kapila16@gmail.com> wrote:

Does it happen only when parallel_seqscan_degree > max_worker_processes?

I have max_worker_processes set to the default of 8 while
parallel_seqscan_degree is 4. So, this may be a case different from Thom's.

I think this is due to reason that memory for forming
tuple in master backend is retained for longer time which
is causing this statement to take much longer time than
required. I have fixed the other issue as well reported by
you in attached patch.

Thanks for fixing.

I think this patch is still not completely ready for general
purpose testing, however it could be helpful if we can run
some tests to see in what kind of scenarios it gives benefit.
In the test you are doing, rather than increasing
seq_page_cost, you could add an expensive WHERE condition
so that it automatically selects the parallel plan. I think it is better
to change one of the new parameters (parallel_setup_cost,
parallel_startup_cost and cpu_tuple_comm_cost) if you want
your statement to use a parallel plan; in your example, if
you had reduced cpu_tuple_comm_cost, it would have
selected the parallel plan. That way we can get some feedback about
what the appropriate default values for the newly added
parameters should be. I am already planning to do some tests in that
regard, however if I get some feedback from others that would be helpful.

Perhaps you are aware or you've postponed working on it, but I see that
a plan executing in a worker does not know about instrumentation. It
results in the EXPLAIN ANALYZE showing incorrect figures. For example
compare the normal seqscan and parallel seqscan below:

postgres=# EXPLAIN ANALYZE SELECT * FROM test WHERE sqrt(a) < 3456 AND
md5(a::text) LIKE 'ac%';
                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..310228.52 rows=16120 width=4) (actual time=0.497..17062.436 rows=39028 loops=1)
   Filter: ((sqrt((a)::double precision) < 3456::double precision) AND (md5((a)::text) ~~ 'ac%'::text))
   Rows Removed by Filter: 9960972
 Planning time: 0.206 ms
 Execution time: 17378.413 ms
(5 rows)

postgres=# EXPLAIN ANALYZE SELECT * FROM test WHERE sqrt(a) < 3456 AND
md5(a::text) LIKE 'ac%';
                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Parallel Seq Scan on test  (cost=0.00..255486.08 rows=16120 width=4) (actual time=7.329..4906.981 rows=39028 loops=1)
   Filter: ((sqrt((a)::double precision) < 3456::double precision) AND (md5((a)::text) ~~ 'ac%'::text))
   Rows Removed by Filter: 1992710
   Number of Workers: 4
   Number of Blocks Per Worker: 8849
 Planning time: 0.137 ms
 Execution time: 6077.782 ms
(7 rows)

Note the "Rows Removed by Filter". I guess the difference may be
because, all the rows filtered by workers were not accounted for. I'm
not quite sure, but since exec_worker_stmt goes the Portal way,
QueryDesc.instrument_options remains unset and hence no instrumentation
opportunities in a worker backend. One option may be to pass
instrument_options down to worker_stmt?
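
The reported 1992710 is close to one fifth of 9960972, which is consistent
with only the master's share being counted when 4 workers take part. One way
to picture the missing step is that every participant keeps its own counters
and the master adds them up before printing. The sketch below assumes exactly
that; SketchScanCounters and sum_scan_counters are hypothetical names and do
not come from the patch or from PostgreSQL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-participant counters (master backend or one worker). */
typedef struct SketchScanCounters
{
	int64_t		rows_returned;
	int64_t		rows_removed_by_filter;
} SketchScanCounters;

/*
 * Adds up the counters of all participants so that the master can report
 * totals for the whole scan instead of only its own share.
 */
static SketchScanCounters
sum_scan_counters(const SketchScanCounters *parts, size_t nparts)
{
	SketchScanCounters total = {0, 0};

	assert(parts != NULL || nparts == 0);

	for (size_t i = 0; i < nparts; i++)
	{
		total.rows_returned += parts[i].rows_returned;
		total.rows_removed_by_filter += parts[i].rows_removed_by_filter;
	}
	return total;
}
```

How the workers would hand their counters back (for example through the
dynamic shared memory segment) is left open here, as it is in the thread.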

By the way, 17s and 6s compare really well in favor of parallel seqscan
above, :)

Thanks,
Amit

#105Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Amit Kapila (#1)
Re: Parallel Seq Scan

On 22-01-2015 PM 02:30, Amit Kapila wrote:

Perhaps you are aware or you've postponed working on it, but I see that
a plan executing in a worker does not know about instrumentation.

I have deferred it until other main parts are stabilised/reviewed. Once
that is done, we can take a call what is best we can do for instrumentation.
Thom has reported the same as well upthread.

Ah, I missed Thom's report.

Note the "Rows Removed by Filter". I guess the difference may be
because, all the rows filtered by workers were not accounted for. I'm
not quite sure, but since exec_worker_stmt goes the Portal way,
QueryDesc.instrument_options remains unset and hence no instrumentation
opportunities in a worker backend. One option may be to pass
instrument_options down to worker_stmt?

I think there is more to it, master backend need to process that information
as well.

I see.

Thanks,
Amit

#106Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Langote (#104)
Re: Parallel Seq Scan

On Thu, Jan 22, 2015 at 10:44 AM, Amit Langote <
Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 21-01-2015 PM 09:43, Amit Kapila wrote:

On Wed, Jan 21, 2015 at 4:31 PM, Amit Langote <amitlangote09@gmail.com>
wrote:

On Wednesday, January 21, 2015, Amit Kapila <amit.kapila16@gmail.com> wrote:

Does it happen only when parallel_seqscan_degree > max_worker_processes?

I have max_worker_processes set to the default of 8 while
parallel_seqscan_degree is 4. So, this may be a case different from Thom's.

I think this is due to reason that memory for forming
tuple in master backend is retained for longer time which
is causing this statement to take much longer time than
required. I have fixed the other issue as well reported by
you in attached patch.

Thanks for fixing.

I think this patch is still not completely ready for general
purpose testing, however it could be helpful if we can run
some tests to see in what kind of scenarios it gives benefit.
In the test you are doing, rather than increasing
seq_page_cost, you could add an expensive WHERE condition
so that it automatically selects the parallel plan. I think it is better
to change one of the new parameters (parallel_setup_cost,
parallel_startup_cost and cpu_tuple_comm_cost) if you want
your statement to use a parallel plan; in your example, if
you had reduced cpu_tuple_comm_cost, it would have
selected the parallel plan. That way we can get some feedback about
what the appropriate default values for the newly added
parameters should be. I am already planning to do some tests in that
regard, however if I get some feedback from others that would be helpful.

Perhaps you are aware or you've postponed working on it, but I see that
a plan executing in a worker does not know about instrumentation.

I have deferred it until other main parts are stabilised/reviewed. Once
that is done, we can take a call what is best we can do for instrumentation.
Thom has reported the same as well upthread.

Note the "Rows Removed by Filter". I guess the difference may be
because, all the rows filtered by workers were not accounted for. I'm
not quite sure, but since exec_worker_stmt goes the Portal way,
QueryDesc.instrument_options remains unset and hence no instrumentation
opportunities in a worker backend. One option may be to pass
instrument_options down to worker_stmt?

I think there is more to it, master backend need to process that information
as well.

By the way, 17s and 6s compare really well in favor of parallel seqscan
above, :)

That sounds interesting.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#107Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#90)
2 attachment(s)
Re: Parallel Seq Scan

On Mon, Jan 19, 2015 at 6:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Jan 19, 2015 at 2:24 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Another thing is that I think prefetching is not supported on all
platforms (Windows) and for such systems as per above algorithm we need to
rely on block-by-block method.

Well, I think we should try to set up a test to see if this is hurting
us. First, do a sequential scan of a relation at least twice
as large as RAM. Then, do a parallel sequential scan of the same
relation with 2 workers. Repeat these in alternation several times.
If the operating system is accomplishing meaningful readahead, and the
parallel sequential scan is breaking it, then since the test is
I/O-bound I would expect to see the parallel scan actually being
slower than the normal way.

I have taken some performance data as per the above discussion. Basically,
I have used the parallel_count module, which is part of the parallel-mode
patch, as that seems better suited to verifying the I/O pattern (it doesn't
have any tuple communication overhead).

Script used to test is attached (parallel_count.sh)

Performance Data
----------------------------
Configuration and Db Details

IBM POWER-7 16 cores, 64 hardware threads
RAM = 64GB

Table Size - 120GB

Used below statements to create table -
create table tbl_perf(c1 int, c2 char(1000));
insert into tbl_perf values(generate_series(1,10000000),'aaaaa');
insert into tbl_perf values(generate_series(10000001,30000000),'aaaaa');
insert into tbl_perf values(generate_series(30000001,110000000),'aaaaa');

*Block-By-Block*

*No. of workers/Time (ms)*
*0* *2*
Run-1 267798 295051
Run-2 276646 296665
Run-3 281364 314952
Run-4 290231 326243
Run-5 288890 295684

Then I modified the parallel_count module so that it scans in
fixed chunks rather than block-by-block; the patch for this is attached
(parallel_count_fixed_chunk_v1.patch, which is based on the parallel
count module in the parallel-mode patch [1]).

*Fixed-Chunks*

*No. of workers/Time (ms)*
*0* *2*
Run-1 286346 234037
Run-2 250051 215111
Run-3 255915 254934
Run-4 263754 242228
Run-5 251399 202581

Observations
------------------------
1. Scanning block-by-block has a negative impact on performance and
I think it will degrade more if we increase the parallel count, as that
can lead to more randomness.
2. Scanning in fixed chunks improves the performance. Increasing
parallel count to a very large number might impact the performance,
but I think we can have a lower bound below which we will not allow
multiple processes to scan the relation.
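
The fixed-chunk split used in the attached patch can be sketched in
isolation as follows. This is a simplified standalone version, assuming (as
the patch does) that the master backend scans too, so there are
nworkers + 1 participants numbered from 0; SketchBlockRange and
fixed_chunk_range are hypothetical names introduced for the illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t BlockNumber;	/* same width PostgreSQL uses for blocks */

/* Half-open range of blocks [start, end) assigned to one participant. */
typedef struct SketchBlockRange
{
	BlockNumber start;
	BlockNumber end;
} SketchBlockRange;

/*
 * Splits nblocks into equal contiguous chunks, one per participant.
 * The last participant (number == nworkers) also takes the blocks left
 * over when nblocks is not evenly divisible.
 */
static SketchBlockRange
fixed_chunk_range(BlockNumber nblocks, int nworkers, int participant)
{
	SketchBlockRange range;
	BlockNumber per_participant;

	assert(nworkers >= 0);
	assert(participant >= 0 && participant <= nworkers);

	per_participant = nblocks / (BlockNumber) (nworkers + 1);
	range.start = (BlockNumber) participant * per_participant;
	range.end = (participant == nworkers)
		? nblocks
		: range.start + per_participant;
	return range;
}
```

With 10 blocks and 2 workers this gives [0,3), [3,6) and [6,10), so each
process reads one contiguous region, which is the property that keeps the
access pattern sequential per process.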

Now I can go ahead and try the prefetching approach as suggested
by you, but I have a feeling that overall it might not be beneficial,
for the reasons mentioned up-thread: mainly, prefetching is not supported
on all platforms. We can say that we don't care about such platforms, but
there would still be no mitigation strategy for them.

Thoughts?

[1]: /messages/by-id/CA+TgmoZdUK4K3XHBxc9vM-82khourEZdvQWTfgLhWsd2R2aAGQ@mail.gmail.com

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

parallel_count.sh (application/x-sh)
parallel_count_fixed_chunk_v1.patch (application/octet-stream)
diff --git a/contrib/parallel_dummy/parallel_dummy.c b/contrib/parallel_dummy/parallel_dummy.c
index 0a32ea8..0b75694 100644
--- a/contrib/parallel_dummy/parallel_dummy.c
+++ b/contrib/parallel_dummy/parallel_dummy.c
@@ -43,8 +43,13 @@ typedef struct
 	BlockNumber	lastblock;
 	BlockNumber	currentblock;
 	int64		ntuples;
+	int			workers_attached;
+	int			workers_expected;
+	BlockNumber num_blocks_per_worker;
 } ParallelCountInfo;
 
+int ParallelWorkerNumber;
+
 void		_PG_init(void);
 void		sleep_worker_main(dsm_segment *seg, shm_toc *toc);
 void		count_worker_main(dsm_segment *seg, shm_toc *toc);
@@ -122,6 +127,9 @@ parallel_count(PG_FUNCTION_ARGS)
 	info->lastblock = RelationGetNumberOfBlocks(rel);
 	info->currentblock = 0;
 	info->ntuples = 0;
+	info->workers_attached = 0;
+	info->workers_expected = nworkers;
+	info->num_blocks_per_worker = info->lastblock / (nworkers + 1);
 	shm_toc_insert(pcxt->toc, PARALLEL_DUMMY_KEY, info);
 	LaunchParallelWorkers(pcxt);
 
@@ -175,32 +183,35 @@ count_helper(Relation rel, ParallelCountInfo *info)
 	int64		mytuples = 0;
 	Oid			relid = info->relid;
 	Snapshot	snapshot = GetActiveSnapshot();
-
-	for (;;)
+	BlockNumber blkno;
+	BlockNumber end_block;
+	BlockNumber start_block;
+	Buffer		buffer;
+	Page		page;
+	int			lines;
+	OffsetNumber	lineoff;
+	ItemId		lpp;
+	bool		all_visible;
+
+
+	SpinLockAcquire(&info->mutex);
+	ParallelWorkerNumber = info->workers_attached++;
+	SpinLockRelease(&info->mutex);
+
+	end_block = (ParallelWorkerNumber + 1) * info->num_blocks_per_worker;
+	start_block = end_block - info->num_blocks_per_worker;
+
+	/*
+	 * Last worker is responsible for scanning all the remaining
+	 * blocks in relation.
+	 */
+	if (ParallelWorkerNumber == info->workers_expected)
+		end_block = info->lastblock;
+
+	for (blkno = start_block; blkno < end_block; blkno++)
 	{
-		BlockNumber blkno;
-		Buffer		buffer;
-		Page		page;
-		int			lines;
-		OffsetNumber	lineoff;
-		ItemId		lpp;
-		bool		all_visible;
-		bool		done = false;
-
 		CHECK_FOR_INTERRUPTS();
 
-		SpinLockAcquire(&info->mutex);
-		if (info->currentblock >= info->lastblock)
-			done = true;
-		else
-			blkno = info->currentblock++;
-		info->ntuples += ntuples;
-		SpinLockRelease(&info->mutex);
-
-		mytuples += ntuples;
-		if (done)
-			break;
-
 		buffer = ReadBuffer(rel, blkno);
 		LockBuffer(buffer, BUFFER_LOCK_SHARE);
 		page = BufferGetPage(buffer);
@@ -210,8 +221,8 @@ count_helper(Relation rel, ParallelCountInfo *info)
 		all_visible = PageIsAllVisible(page) && !snapshot->takenDuringRecovery;
 
 		for (lineoff = FirstOffsetNumber, lpp = PageGetItemId(page, lineoff);
-			 lineoff <= lines;
-			 lineoff++, lpp++)
+				lineoff <= lines;
+				lineoff++, lpp++)
 		{
 			HeapTupleData	loctup;
 
@@ -232,6 +243,12 @@ count_helper(Relation rel, ParallelCountInfo *info)
 		}
 
 		UnlockReleaseBuffer(buffer);
+
+		SpinLockAcquire(&info->mutex);
+		info->ntuples += ntuples;
+		SpinLockRelease(&info->mutex);
+
+		mytuples += ntuples;
 	}
 
 	elog(NOTICE, "PID %d counted " INT64_FORMAT " tuples", MyProcPid, mytuples);
#108Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#107)
Re: Parallel Seq Scan

On Thu, Jan 22, 2015 at 5:57 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

1. Scanning block-by-block has a negative impact on performance and
I think it will degrade more if we increase the parallel count, as that
can lead to more randomness.

2. Scanning in fixed chunks improves the performance. Increasing
parallel count to a very large number might impact the performance,
but I think we can have a lower bound below which we will not allow
multiple processes to scan the relation.

I'm confused. Your actual test numbers seem to show that the
performance with the block-by-block approach was slightly higher with
parallelism than without, whereas the performance with the
chunk-by-chunk approach was lower with parallelism than without, but
the text quoted above, summarizing those numbers, says the opposite.

Also, I think testing with 2 workers is probably not enough. I think
we should test with 8 or even 16.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#109Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#108)
Re: Parallel Seq Scan

On Thu, Jan 22, 2015 at 7:23 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jan 22, 2015 at 5:57 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

1. Scanning block-by-block has a negative impact on performance and
I think it will degrade more if we increase the parallel count, as that
can lead to more randomness.

2. Scanning in fixed chunks improves the performance. Increasing
parallel count to a very large number might impact the performance,
but I think we can have a lower bound below which we will not allow
multiple processes to scan the relation.

I'm confused. Your actual test numbers seem to show that the
performance with the block-by-block approach was slightly higher with
parallelism than without, whereas the performance with the
chunk-by-chunk approach was lower with parallelism than without, but
the text quoted above, summarizing those numbers, says the opposite.

Sorry for causing confusion, I should have been more explicit about
explaining the numbers. Let me try again.
The values in the columns are the time in milliseconds to complete the
execution, so higher means it took more time. In the block-by-block case,
the time taken to complete the execution with 2 workers is more than with
no workers, which means parallelism has degraded the performance.
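
To make the reading of the tables concrete, the sketch below averages the
five runs from the earlier mail for 0 and 2 workers and expresses the
difference as a percentage of the 0-worker time. The helper names mean_ms and
percent_change are hypothetical; the numbers are copied from the tables. On
these figures the block-by-block average is roughly 9% longer with 2 workers,
while the fixed-chunk average is roughly 12% shorter.

```c
#include <assert.h>
#include <stddef.h>

/* Execution times in milliseconds, copied from the posted tables. */
static const double block_by_block_0[] = {267798, 276646, 281364, 290231, 288890};
static const double block_by_block_2[] = {295051, 296665, 314952, 326243, 295684};
static const double fixed_chunk_0[] = {286346, 250051, 255915, 263754, 251399};
static const double fixed_chunk_2[] = {234037, 215111, 254934, 242228, 202581};

/* Arithmetic mean of n run times. */
static double
mean_ms(const double *runs, size_t n)
{
	double		sum = 0.0;

	assert(runs != NULL && n > 0);
	for (size_t i = 0; i < n; i++)
		sum += runs[i];
	return sum / (double) n;
}

/*
 * Change of value relative to baseline, in percent.  Since these are
 * times, a positive result means slower and a negative one means faster.
 */
static double
percent_change(double baseline, double value)
{
	assert(baseline > 0.0);
	return (value - baseline) / baseline * 100.0;
}
```

Five runs with this much spread are a small sample, so the percentages only
indicate direction, not a precise speedup.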

Also, I think testing with 2 workers is probably not enough. I think
we should test with 8 or even 16.

Sure, will do this and post the numbers.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#110Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#109)
Re: Parallel Seq Scan

On Thu, Jan 22, 2015 at 9:02 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I'm confused. Your actual test numbers seem to show that the
performance with the block-by-block approach was slightly higher with
parallelism than without, whereas the performance with the
chunk-by-chunk approach was lower with parallelism than without, but
the text quoted above, summarizing those numbers, says the opposite.

Sorry for causing confusion, I should have been more explicit about
explaining the numbers. Let me try again.
The values in the columns are the time in milliseconds to complete the
execution, so higher means it took more time. In the block-by-block case,
the time taken to complete the execution with 2 workers is more than with
no workers, which means parallelism has degraded the performance.

*facepalm*

Oh, yeah, right.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#111Josh Berkus
josh@agliodbs.com
In reply to: Amit Kapila (#8)
Re: Parallel Seq Scan

On 01/22/2015 05:53 AM, Robert Haas wrote:

Also, I think testing with 2 workers is probably not enough. I think
we should test with 8 or even 16.

FWIW, based on my experience there will also be demand to use parallel
query using 4 workers, particularly on AWS.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

#112Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#108)
Re: Parallel Seq Scan

On Thu, Jan 22, 2015 at 7:23 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jan 22, 2015 at 5:57 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

1. Scanning block-by-block has a negative impact on performance and
I think it will degrade more if we increase the parallel count, as that
can lead to more randomness.

2. Scanning in fixed chunks improves the performance. Increasing
parallel count to a very large number might impact the performance,
but I think we can have a lower bound below which we will not allow
multiple processes to scan the relation.

I'm confused. Your actual test numbers seem to show that the
performance with the block-by-block approach was slightly higher with
parallelism than without, whereas the performance with the
chunk-by-chunk approach was lower with parallelism than without, but
the text quoted above, summarizing those numbers, says the opposite.

Also, I think testing with 2 workers is probably not enough. I think
we should test with 8 or even 16.

Below is the data with a larger number of workers. The amount of data and
the other configuration remain the same as before; I have only increased
the parallel worker count:

*Block-By-Block*

*No. of workers/Time (ms)*
*0* *2* *4* *8* *16* *24* *32*
Run-1 257851 287353 350091 330193 284913 338001 295057
Run-2 263241 314083 342166 347337 378057 351916 348292
Run-3 315374 334208 389907 340327 328695 330048 330102
Run-4 301054 312790 314682 352835 323926 324042 302147
Run-5 304547 314171 349158 350191 350468 341219 281315

*Fixed-Chunks*

*No. of workers/Time (ms)*
*0* *2* *4* *8* *16* *24* *32*
Run-1 250536 266279 251263 234347 87930 50474 35474
Run-2 249587 230628 225648 193340 83036 35140 9100
Run-3 234963 220671 230002 256183 105382 62493 27903
Run-4 239111 245448 224057 189196 123780 63794 24746
Run-5 239937 222820 219025 220478 114007 77965 39766

The trend remains the same, although there is some variation.
In the block-by-block approach, the performance dips (execution takes
more time) with a larger number of workers, though it stabilizes at
some higher value; still, I feel the result is random, as this approach
leads to a random scan.
In the fixed-chunk approach, the performance improves with a larger
number of workers, especially at slightly higher worker counts.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#113Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Amit Kapila (#112)
Re: Parallel Seq Scan

On 1/23/15 5:42 AM, Amit Kapila wrote:

*Fixed-Chunks*
*No. of workers/Time (ms)*
*0* *2* *4* *8* *16* *24* *32*
Run-1 250536 266279 251263 234347 87930 50474 35474
Run-2 249587 230628 225648 193340 83036 35140 9100
Run-3 234963 220671 230002 256183 105382 62493 27903
Run-4 239111 245448 224057 189196 123780 63794 24746
Run-5 239937 222820 219025 220478 114007 77965 39766

The trend remains the same, although there is some variation.
In the block-by-block approach, the performance dips (execution takes
more time) with a larger number of workers, though it stabilizes at
some higher value; still, I feel the result is random, as this approach
leads to a random scan.
In the fixed-chunk approach, the performance improves with a larger
number of workers, especially at slightly higher worker counts.

Those fixed chunk numbers look pretty screwy. 2, 4 and 8 workers make no difference, then suddenly 16 cuts times by 1/2 to 1/3? Then 32 cuts time by another 1/2 to 1/3?
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

#114Joshua D. Drake
jd@commandprompt.com
In reply to: Jim Nasby (#113)
Re: Parallel Seq Scan

On 01/23/2015 10:44 AM, Jim Nasby wrote:

number of workers especially at slightly higher worker count.

Those fixed chunk numbers look pretty screwy. 2, 4 and 8 workers make no
difference, then suddenly 16 cuts times by 1/2 to 1/3? Then 32 cuts time
by another 1/2 to 1/3?

cached? First couple of runs gets the relations into memory?

JD

--
Command Prompt, Inc. - http://www.commandprompt.com/ 503-667-4564
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, @cmdpromptinc
"If we send our children to Caesar for their education, we should
not be surprised when they come back as Romans."

#115Amit Kapila
amit.kapila16@gmail.com
In reply to: Joshua D. Drake (#114)
Re: Parallel Seq Scan

On Sat, Jan 24, 2015 at 12:24 AM, Joshua D. Drake <jd@commandprompt.com>
wrote:

On 01/23/2015 10:44 AM, Jim Nasby wrote:

number of workers especially at slightly higher worker count.

Those fixed chunk numbers look pretty screwy. 2, 4 and 8 workers make no
difference, then suddenly 16 cuts times by 1/2 to 1/3? Then 32 cuts time
by another 1/2 to 1/3?

There is variation in the tests at different worker counts, but there is
definitely an improvement from 0 to 2 workers (if you refer to my
initial mail on this data, with 2 workers there is a benefit of ~20%).
I think if we run the tests in a similar way (comparing 0 and 2,
or 0 and 4, or 0 and 8), then the other effects could be minimised and
we might see better consistency; however, the general trend with
fixed-chunk seems to be that scanning that way is better.

I think the real benefit with the current approach/patch can be seen
with qualifications (especially costly expression evaluation).

Further, if we want to just get the benefit of parallel I/O, then
I think we can get that by parallelising partition scan where different
table partitions reside on different disk partitions, however that is
a matter of separate patch.

cached? First couple of runs gets the relations into memory?

Not entirely, as the table is roughly twice the size of RAM, so each run
has to perform I/O.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#116Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Amit Kapila (#115)
Re: Parallel Seq Scan

On 1/23/15 10:16 PM, Amit Kapila wrote:

Further, if we want to just get the benefit of parallel I/O, then
I think we can get that by parallelising partition scan where different
table partitions reside on different disk partitions, however that is
a matter of separate patch.

I don't think we even have to go that far.

My experience with Postgres is that it is *very* sensitive to IO latency (not bandwidth). I believe this is the case because complex queries tend to interleave CPU intensive code in-between IO requests. So we see this pattern:

Wait 5ms on IO
Compute for a few ms
Wait 5ms on IO
Compute for a few ms
...

We blindly assume that the kernel will magically do read-ahead for us, but I've never seen that work so great. It certainly falls apart on something like an index scan.

If we could instead do this:

Wait for first IO, issue second IO request
Compute
Already have second IO request, issue third
...

We'd be a lot less sensitive to IO latency.

I wonder what kind of gains we would see if every SeqScan in a query spawned a worker just to read tuples and shove them in a queue (or shove a pointer to a buffer in the queue). Similarly, have IndexScans have one worker reading the index and another worker taking index tuples and reading heap tuples...
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

#117Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jim Nasby (#116)
Re: Parallel Seq Scan

Jim Nasby <Jim.Nasby@BlueTreble.com> writes:

On 1/23/15 10:16 PM, Amit Kapila wrote:

Further, if we want to just get the benefit of parallel I/O, then
I think we can get that by parallelising partition scan where different
table partitions reside on different disk partitions, however that is
a matter of separate patch.

I don't think we even have to go that far.

My experience with Postgres is that it is *very* sensitive to IO latency (not bandwidth). I believe this is the case because complex queries tend to interleave CPU intensive code in-between IO requests. So we see this pattern:

Wait 5ms on IO
Compute for a few ms
Wait 5ms on IO
Compute for a few ms
...

We blindly assume that the kernel will magically do read-ahead for us, but I've never seen that work so great. It certainly falls apart on something like an index scan.

If we could instead do this:

Wait for first IO, issue second IO request
Compute
Already have second IO request, issue third
...

We'd be a lot less sensitive to IO latency.

It would take about five minutes of coding to prove or disprove this:
stick a PrefetchBuffer call into heapgetpage() to launch a request for the
next page as soon as we've read the current one, and then see if that
makes any obvious performance difference. I'm not convinced that it will,
but if it did then we could think about how to make it work for real.
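
Outside the backend, the same experiment can be approximated with plain
POSIX calls: read a file block by block and, as soon as one block has been
read, hint the kernel that the next one will be needed. The sketch below is
only an analogy to the PrefetchBuffer idea, not PostgreSQL code;
scan_with_readahead_hint and SKETCH_BLCKSZ are hypothetical names, and it
assumes a platform that provides posix_fadvise().

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L	/* for pread() and posix_fadvise() */
#endif

#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#define SKETCH_BLCKSZ 8192		/* matches PostgreSQL's default block size */

/*
 * Reads the file behind fd one block at a time.  Right after a block has
 * been read, the kernel is asked to start fetching the following block, so
 * that the I/O can overlap with the processing of the current one.
 * Returns the number of blocks read, or -1 if a read fails.
 */
static long
scan_with_readahead_hint(int fd)
{
	char		buf[SKETCH_BLCKSZ];
	long		nblocks = 0;
	off_t		offset = 0;

	for (;;)
	{
		ssize_t		nread = pread(fd, buf, sizeof(buf), offset);

		if (nread < 0)
			return -1;
		if (nread == 0)
			break;				/* end of file */

		/* Advisory only: a failure here does not affect correctness. */
		(void) posix_fadvise(fd, offset + SKETCH_BLCKSZ, SKETCH_BLCKSZ,
							 POSIX_FADV_WILLNEED);

		/* Stand-in for the per-block computation. */
		nblocks++;
		offset += SKETCH_BLCKSZ;
	}
	return nblocks;
}
```

Timing this against the same loop without the posix_fadvise() call, on a
file larger than RAM and with a non-trivial amount of work per block, would
give a rough first answer to whether the hint helps on a given system.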

regards, tom lane

#118Amit Kapila
amit.kapila16@gmail.com
In reply to: Jim Nasby (#116)
Re: Parallel Seq Scan

On Tue, Jan 27, 2015 at 3:18 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:

On 1/23/15 10:16 PM, Amit Kapila wrote:

Further, if we want to just get the benefit of parallel I/O, then
I think we can get that by parallelising partition scan where different
table partitions reside on different disk partitions, however that is
a matter of separate patch.

I don't think we even have to go that far.

We'd be a lot less sensitive to IO latency.

I wonder what kind of gains we would see if every SeqScan in a query
spawned a worker just to read tuples and shove them in a queue (or shove a
pointer to a buffer in the queue).

Here IIUC, you are suggesting that the read be done by one parallel
worker and all expression calculation (evaluation of qualification
and target list) be done in the main backend. It seems to me that done
that way, the benefit of parallelisation will be lost due to tuple
communication overhead (maybe the overhead is less if we just
pass a pointer to a buffer, but that will have other problems,
like holding buffer pins for a longer period of time).

I could see the advantage of testing along the lines suggested by Tom Lane,
but that seems not directly related to what we want to achieve with
this patch (parallel seq scan). If you think otherwise, let me know.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#119Daniel Bausch
bausch@dvs.tu-darmstadt.de
In reply to: Amit Kapila (#8)
Re: Parallel Seq Scan

Hi PG devs!

Tom Lane <tgl@sss.pgh.pa.us> writes:

Wait for first IO, issue second IO request
Compute
Already have second IO request, issue third
...

We'd be a lot less sensitive to IO latency.

It would take about five minutes of coding to prove or disprove this:
stick a PrefetchBuffer call into heapgetpage() to launch a request for the
next page as soon as we've read the current one, and then see if that
makes any obvious performance difference. I'm not convinced that it will,
but if it did then we could think about how to make it work for real.

Sorry for dropping in so late...

I did all this two years ago. For TPC-H Q8, Q9, Q17, Q20, and Q21
I saw a speedup of ~100% when using IndexScan prefetching + Nested-Loops
Look-Ahead (the outer loop!).
(On SSD with 32 Pages Prefetch/Look-Ahead + Cold Page Cache / Small RAM)

Regards,
Daniel
--
MSc. Daniel Bausch
Research Assistant (Computer Science)
Technische Universität Darmstadt
http://www.dvs.tu-darmstadt.de/staff/dbausch

#120David Fetter
david@fetter.org
In reply to: Daniel Bausch (#119)
Re: Parallel Seq Scan

On Tue, Jan 27, 2015 at 08:02:37AM +0100, Daniel Bausch wrote:

Hi PG devs!

Tom Lane <tgl@sss.pgh.pa.us> writes:

Wait for first IO, issue second IO request
Compute
Already have second IO request, issue third
...

We'd be a lot less sensitive to IO latency.

It would take about five minutes of coding to prove or disprove this:
stick a PrefetchBuffer call into heapgetpage() to launch a request for the
next page as soon as we've read the current one, and then see if that
makes any obvious performance difference. I'm not convinced that it will,
but if it did then we could think about how to make it work for real.

Sorry for dropping in so late...

I did all this two years ago. For TPC-H Q8, Q9, Q17, Q20, and Q21
I saw a speedup of ~100% when using IndexScan prefetching + Nested-Loops
Look-Ahead (the outer loop!).
(On SSD with 32 Pages Prefetch/Look-Ahead + Cold Page Cache / Small RAM)

Would you be so kind as to pass along any patches (ideally applicable
to git master), tests, and specific measurements you made?

Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

#121Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#107)
Re: Parallel Seq Scan

On Thu, Jan 22, 2015 at 5:57 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Script used to test is attached (parallel_count.sh)

Why does this use EXPLAIN ANALYZE instead of \timing ?

IBM POWER-7 16 cores, 64 hardware threads
RAM = 64GB

Table Size - 120GB

Used below statements to create table -
create table tbl_perf(c1 int, c2 char(1000));
insert into tbl_perf values(generate_series(1,10000000),'aaaaa');
insert into tbl_perf values(generate_series(10000001,30000000),'aaaaa');
insert into tbl_perf values(generate_series(30000001,110000000),'aaaaa');

I generated this table using this same method and experimented with
copying the whole file to the bit bucket using dd. I did this on
hydra, which I think is the same machine you used.

time for i in `seq 0 119`; do if [ $i -eq 0 ]; then f=16388; else
f=16388.$i; fi; dd if=$f of=/dev/null bs=8k; done

There is a considerable amount of variation in the amount of time this
takes to run based on how much of the relation is cached. Clearly,
there's no way for the system to cache it all, but it can cache a
significant portion, and that affects the results to no small degree.
dd on hydra prints information on the data transfer rate; on uncached
1GB segments, it runs at right around 400 MB/s, but that can soar to
upwards of 3GB/s when the relation is fully cached. I tried flushing
the OS cache via echo 1 > /proc/sys/vm/drop_caches, and found that
immediately after doing that, the above command took 5m21s to run -
i.e. ~321000 ms. Most of your test times are faster than that, which
means they reflect some degree of caching. When I immediately reran
the command a second time, it finished in 4m18s the second time, or
~258000 ms. The rate was the same as the first test - about 400 MB/s
- for most of the files, but 27 of the last 28 files went much faster,
between 1.3 GB/s and 3.7 GB/s.

This tells us that the OS cache on this machine has anti-spoliation
logic in it, probably not dissimilar to what we have in PG. If the
data were cycled through the system cache in strict LRU fashion, any
data that was leftover from the first run would have been flushed out
by the early part of the second run, so that all the results from the
second set of runs would have hit the disk. But in fact, that's not
what happened: the last pages from the first run remained cached even
after reading an amount of new data that exceeds the size of RAM on
that machine. What I think this demonstrates is that we're going to
have to be very careful to control for caching effects, or we may find
that we get misleading results. To make this simpler, I've installed
a setuid binary /usr/bin/drop_caches that you (or anyone who has an
account on that machine) can use to drop the caches; run 'drop_caches
1'.

Block-By-Block

No. of workers/Time (ms)        0        2
Run-1                      267798   295051
Run-2                      276646   296665
Run-3                      281364   314952
Run-4                      290231   326243
Run-5                      288890   295684

The next thing I did was run the test with the block-by-block method after
having dropped the caches. I did this with 0 workers and with 8
workers. I dropped the caches and restarted postgres before each
test, but then ran each test a second time to see the effect of
caching by both the OS and by PostgreSQL. I got these results:

With 0 workers, first run took 883465.352 ms, and second run took 295050.106 ms.
With 8 workers, first run took 340302.250 ms, and second run took 307767.758 ms.

This is a confusing result, because you expect parallelism to help
more when the relation is partly cached, and make little or no
difference when it isn't cached. But that's not what happened.

I've also got a draft of a prefetching implementation here that I'd
like to test out, but I've just discovered that it's buggy, so I'm
going to send these results for now and work on fixing that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#122Stephen Frost
sfrost@snowman.net
In reply to: Robert Haas (#121)
Re: Parallel Seq Scan

Robert, all,

* Robert Haas (robertmhaas@gmail.com) wrote:

There is a considerable amount of variation in the amount of time this
takes to run based on how much of the relation is cached. Clearly,
there's no way for the system to cache it all, but it can cache a
significant portion, and that affects the results to no small degree.
dd on hydra prints information on the data transfer rate; on uncached
1GB segments, it runs at right around 400 MB/s, but that can soar to
upwards of 3GB/s when the relation is fully cached. I tried flushing
the OS cache via echo 1 > /proc/sys/vm/drop_caches, and found that
immediately after doing that, the above command took 5m21s to run -
i.e. ~321000 ms. Most of your test times are faster than that, which
means they reflect some degree of caching. When I immediately reran
the command a second time, it finished in 4m18s the second time, or
~258000 ms. The rate was the same as the first test - about 400 MB/s
- for most of the files, but 27 of the last 28 files went much faster,
between 1.3 GB/s and 3.7 GB/s.

[...]

With 0 workers, first run took 883465.352 ms, and second run took 295050.106 ms.
With 8 workers, first run took 340302.250 ms, and second run took 307767.758 ms.

This is a confusing result, because you expect parallelism to help
more when the relation is partly cached, and make little or no
difference when it isn't cached. But that's not what happened.

These numbers seem to indicate that the oddball is the single-threaded
uncached run. If I followed correctly, the uncached 'dd' took 321s,
which is relatively close to the uncached-lots-of-workers and the two
cached runs. What in the world is the uncached single-thread case doing
that it takes an extra 543s, or over twice as long? It's clearly not
disk i/o which is causing the slowdown, based on your dd tests.

One possibility might be round-trip latency. The multi-threaded case is
able to keep the CPUs and the i/o system going, and the cached results
don't have as much latency since things are cached, but the
single-threaded uncached case going i/o -> cpu -> i/o -> cpu, ends up
with a lot of wait time as it switches between being on CPU and waiting
on the i/o.
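
A toy model makes the hypothesis concrete. It assumes every page costs a
fixed I/O wait and a fixed CPU time; both values are invented for
illustration and were not measured on hydra. A strictly alternating scan
pays the sum of the two per page, while a scan that keeps the next read in
flight pays roughly the larger of the two.

```python
# Toy model of the i/o -> cpu -> i/o -> cpu pattern described above.
# The per-page costs (in microseconds) are made-up illustrative numbers.

def serial_time(pages: int, io_us: int, cpu_us: int) -> int:
    """Each page waits for its read, then is processed: the costs add up."""
    return pages * (io_us + cpu_us)

def overlapped_time(pages: int, io_us: int, cpu_us: int) -> int:
    """The next read is in flight while the current page is processed, so
    after the first read the slower of the two stages sets the pace."""
    return io_us + (pages - 1) * max(io_us, cpu_us) + cpu_us

pages = 1000
print("alternating:", serial_time(pages, io_us=20, cpu_us=20))
print("overlapped: ", overlapped_time(pages, io_us=20, cpu_us=20))
```

When the two costs are equal, the alternating scan takes about twice as
long, which is in the same ballpark as the gap being discussed, though that
proves nothing about the real cause.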

Just some thoughts.

Thanks,

Stephen

#123Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#112)
Re: Parallel Seq Scan

On Fri, Jan 23, 2015 at 6:42 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Fixed-Chunks

No. of workers/Time (ms)       0       2       4       8      16      24      32
Run-1                     250536  266279  251263  234347   87930   50474   35474
Run-2                     249587  230628  225648  193340   83036   35140    9100
Run-3                     234963  220671  230002  256183  105382   62493   27903
Run-4                     239111  245448  224057  189196  123780   63794   24746
Run-5                     239937  222820  219025  220478  114007   77965   39766

I cannot reproduce these results. I applied your fixed-chunk size
patch and ran SELECT parallel_count('tbl_perf', 32) a few times. The
first thing I notice is that, as I predicted, there's an issue with
different workers finishing at different times. For example, from my
first run:

2015-01-27 22:13:09 UTC [34660] LOG: worker process: parallel worker
for PID 34668 (PID 34700) exited with exit code 0
2015-01-27 22:13:09 UTC [34660] LOG: worker process: parallel worker
for PID 34668 (PID 34698) exited with exit code 0
2015-01-27 22:13:09 UTC [34660] LOG: worker process: parallel worker
for PID 34668 (PID 34701) exited with exit code 0
2015-01-27 22:13:10 UTC [34660] LOG: worker process: parallel worker
for PID 34668 (PID 34699) exited with exit code 0
2015-01-27 22:15:00 UTC [34660] LOG: worker process: parallel worker
for PID 34668 (PID 34683) exited with exit code 0
2015-01-27 22:15:29 UTC [34660] LOG: worker process: parallel worker
for PID 34668 (PID 34673) exited with exit code 0
2015-01-27 22:15:58 UTC [34660] LOG: worker process: parallel worker
for PID 34668 (PID 34679) exited with exit code 0
2015-01-27 22:16:38 UTC [34660] LOG: worker process: parallel worker
for PID 34668 (PID 34689) exited with exit code 0
2015-01-27 22:16:39 UTC [34660] LOG: worker process: parallel worker
for PID 34668 (PID 34671) exited with exit code 0
2015-01-27 22:16:47 UTC [34660] LOG: worker process: parallel worker
for PID 34668 (PID 34677) exited with exit code 0
2015-01-27 22:16:47 UTC [34660] LOG: worker process: parallel worker
for PID 34668 (PID 34672) exited with exit code 0
2015-01-27 22:16:48 UTC [34660] LOG: worker process: parallel worker
for PID 34668 (PID 34680) exited with exit code 0
2015-01-27 22:16:50 UTC [34660] LOG: worker process: parallel worker
for PID 34668 (PID 34686) exited with exit code 0
2015-01-27 22:16:51 UTC [34660] LOG: worker process: parallel worker
for PID 34668 (PID 34670) exited with exit code 0
2015-01-27 22:16:51 UTC [34660] LOG: worker process: parallel worker
for PID 34668 (PID 34690) exited with exit code 0
2015-01-27 22:16:51 UTC [34660] LOG: worker process: parallel worker
for PID 34668 (PID 34674) exited with exit code 0
2015-01-27 22:16:52 UTC [34660] LOG: worker process: parallel worker
for PID 34668 (PID 34684) exited with exit code 0
2015-01-27 22:16:53 UTC [34660] LOG: worker process: parallel worker
for PID 34668 (PID 34675) exited with exit code 0
2015-01-27 22:16:53 UTC [34660] LOG: worker process: parallel worker
for PID 34668 (PID 34682) exited with exit code 0
2015-01-27 22:16:53 UTC [34660] LOG: worker process: parallel worker
for PID 34668 (PID 34691) exited with exit code 0
2015-01-27 22:16:54 UTC [34660] LOG: worker process: parallel worker
for PID 34668 (PID 34676) exited with exit code 0
2015-01-27 22:16:54 UTC [34660] LOG: worker process: parallel worker
for PID 34668 (PID 34685) exited with exit code 0
2015-01-27 22:16:55 UTC [34660] LOG: worker process: parallel worker
for PID 34668 (PID 34692) exited with exit code 0
2015-01-27 22:16:56 UTC [34660] LOG: worker process: parallel worker
for PID 34668 (PID 34687) exited with exit code 0
2015-01-27 22:16:56 UTC [34660] LOG: worker process: parallel worker
for PID 34668 (PID 34678) exited with exit code 0
2015-01-27 22:16:57 UTC [34660] LOG: worker process: parallel worker
for PID 34668 (PID 34681) exited with exit code 0
2015-01-27 22:16:57 UTC [34660] LOG: worker process: parallel worker
for PID 34668 (PID 34688) exited with exit code 0
2015-01-27 22:16:59 UTC [34660] LOG: worker process: parallel worker
for PID 34668 (PID 34694) exited with exit code 0
2015-01-27 22:16:59 UTC [34660] LOG: worker process: parallel worker
for PID 34668 (PID 34693) exited with exit code 0
2015-01-27 22:17:02 UTC [34660] LOG: worker process: parallel worker
for PID 34668 (PID 34695) exited with exit code 0
2015-01-27 22:17:02 UTC [34660] LOG: worker process: parallel worker
for PID 34668 (PID 34697) exited with exit code 0
2015-01-27 22:17:02 UTC [34660] LOG: worker process: parallel worker
for PID 34668 (PID 34696) exited with exit code 0

That run started at 22:13:01. Within 4 seconds, 4 workers exited. So
clearly we are not getting the promised 32-way parallelism for the
whole test. Granted, in this instance, *most* of the workers run
until the end, but I think we'll find that there are
uncomfortably-frequent cases where we get significantly less
parallelism than we planned on because the work isn't divided evenly.
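
One way to put a number on "less parallelism than we planned on" is to
divide the total worker-seconds by the wall-clock time of the slowest
worker. The sketch below does that for a hypothetical run; the durations
are invented for illustration and are not taken from the log above, and the
function name is mine.

```python
def effective_parallelism(busy_seconds: list) -> float:
    """Average number of workers busy over the run: total worker-seconds
    divided by the run time of the slowest worker."""
    return sum(busy_seconds) / max(busy_seconds)

# Hypothetical run: 4 of 32 workers get fully cached chunks and exit after
# 9 s, while the other 28 run for the full 240 s.
times = [9] * 4 + [240] * 28
print(round(effective_parallelism(times), 2))
```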

But leaving that aside, I've run this test 6 times in a row now, with
a warm cache, and the best time I have is 237310.042 ms and the worst
time I have is 242936.315 ms. So there's very little variation, and
it's reasonably close to the results I got with dd, suggesting that
the system is fairly well I/O bound. At a sequential read speed of
400 MB/s, 240 s = 96 GB of data. Assuming it takes no time at all to
process the cached data (which seems to be not far from wrong judging
by how quickly the first few workers exit), that means we're getting
24 GB of data from cache on a 64 GB machine. That seems a little low,
but if the kernel is refusing to cache the whole relation to avoid
cache-trashing, it could be right.
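
Spelled out, the estimate looks like this. It assumes decimal units
(1 GB = 1000 MB), a steady 400 MB/s from disk for the whole run, and zero
cost for cached data, exactly as in the paragraph above; the variable names
are only for illustration.

```python
# Back-of-the-envelope estimate of how much of the table came from cache.
# Assumptions: 1 GB = 1000 MB, steady disk rate, cached data is free.
DISK_RATE_MB_S = 400   # sequential read rate observed with dd
ELAPSED_S = 240        # approximate warm-cache run time
TABLE_GB = 120         # table size

read_from_disk_gb = DISK_RATE_MB_S * ELAPSED_S / 1000
served_from_cache_gb = TABLE_GB - read_from_disk_gb
print(read_from_disk_gb, served_from_cache_gb)
```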

Now, when you did what I understand to be the same test on the same
machine, you got times ranging from 9.1 seconds to 35.4 seconds.
Clearly, there is some difference between our test setups. Moreover,
I'm kind of suspicious about whether your results are actually
physically possible. Even in the best case where you somehow had the
maximum possible amount of data - 64 GB on a 64 GB machine - cached,
leaving no space for cache duplication between PG and the OS and no
space for the operating system or postgres itself - the table is 120
GB, so you've got to read *at least* 56 GB from disk. Reading 56 GB
from disk in 9 seconds represents an I/O rate of >6 GB/s. I grant that
there could be some speedup from issuing I/O requests in parallel
instead of serially, but that is a 15x speedup over dd, so I am a
little suspicious that there is some problem with the test setup,
especially because I cannot reproduce the results.
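
The same reasoning as a short calculation, under the deliberately generous
assumption that all 64 GB of RAM holds table data and using the fastest
reported run of 9.1 seconds:

```python
# Lower bound on the disk rate implied by the fastest reported run.
# Assumption: every byte of RAM is used to cache the table (best case).
RAM_GB = 64
TABLE_GB = 120
FASTEST_RUN_S = 9.1    # best time in the fixed-chunk results
DD_RATE_GB_S = 0.4     # roughly what dd achieved on uncached segments

min_disk_gb = TABLE_GB - RAM_GB
required_rate_gb_s = min_disk_gb / FASTEST_RUN_S
speedup_over_dd = required_rate_gb_s / DD_RATE_GB_S
print(f"{min_disk_gb} GB from disk, {required_rate_gb_s:.2f} GB/s, "
      f"{speedup_over_dd:.1f}x dd")
```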

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#124Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Amit Kapila (#118)
Re: Parallel Seq Scan

On 1/26/15 11:11 PM, Amit Kapila wrote:

On Tue, Jan 27, 2015 at 3:18 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:

On 1/23/15 10:16 PM, Amit Kapila wrote:

Further, if we want to just get the benefit of parallel I/O, then
I think we can get that by parallelising partition scan where different
table partitions reside on different disk partitions, however that is
a matter of separate patch.

I don't think we even have to go that far.

We'd be a lot less sensitive to IO latency.

I wonder what kind of gains we would see if every SeqScan in a query spawned a worker just to read tuples and shove them in a queue (or shove a pointer to a buffer in the queue).

Here IIUC, you want to say that just get the read done by one parallel
worker and then all expression calculation (evaluation of qualification
and target list) in the main backend, it seems to me that by doing it
that way, the benefit of parallelisation will be lost due to tuple
communication overhead (may be the overhead is less if we just
pass a pointer to buffer but that will have another kind of problems
like holding buffer pins for a longer period of time).

I could see the advantage of testing on lines as suggested by Tom Lane,
but that seems to be not directly related to what we want to achieve by
this patch (parallel seq scan) or if you think otherwise then let me know?

There's some low-hanging fruit when it comes to improving our IO performance (or more specifically, decreasing our sensitivity to IO latency). Perhaps the way to do that is with the parallel infrastructure, perhaps not. But I think it's premature to look at parallelism for increasing IO performance, or worrying about things like how many IO threads we should have before we at least look at simpler things we could do. We shouldn't assume there's nothing to be gained short of a full parallelization implementation.

That's not to say there's nothing else we could use parallelism for. Sort, merge and hash operations come to mind.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

#125Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Stephen Frost (#122)
Re: Parallel Seq Scan

On 1/27/15 3:46 PM, Stephen Frost wrote:

With 0 workers, first run took 883465.352 ms, and second run took 295050.106 ms.

With 8 workers, first run took 340302.250 ms, and second run took 307767.758 ms.

This is a confusing result, because you expect parallelism to help
more when the relation is partly cached, and make little or no
difference when it isn't cached. But that's not what happened.

These numbers seem to indicate that the oddball is the single-threaded
uncached run. If I followed correctly, the uncached 'dd' took 321s,
which is relatively close to the uncached-lots-of-workers and the two
cached runs. What in the world is the uncached single-thread case doing
that it takes an extra 543s, or over twice as long? It's clearly not
disk i/o which is causing the slowdown, based on your dd tests.

One possibility might be round-trip latency. The multi-threaded case is
able to keep the CPUs and the i/o system going, and the cached results
don't have as much latency since things are cached, but the
single-threaded uncached case going i/o -> cpu -> i/o -> cpu, ends up
with a lot of wait time as it switches between being on CPU and waiting
on the i/o.

This exactly mirrors what I've seen on production systems. On a single SeqScan I can't get anywhere close to the IO performance I could get with dd. Once I got up to 4-8 SeqScans of different tables running together, I saw iostat numbers that were similar to what a single dd bs=8k would do. I've tested this with iSCSI SAN volumes on both 1Gbit and 10Gbit ethernet.

This is why I think that when it comes to IO performance, before we start worrying about real parallelization we should investigate ways to do some kind of async IO.

I only have my SSD laptop and a really old server to test on, but I'll try Tom's suggestion of adding a PrefetchBuffer call into heapgetpage() unless someone beats me to it. I should be able to do it tomorrow.
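
To show the shape of the experiment outside the backend, here is a rough user-space sketch in Python. It is not PostgreSQL code: it reads a plain file in 8kB blocks and, before processing each block, hints the kernel about the next one with posix_fadvise(POSIX_FADV_WILLNEED). Treating that hint as a stand-in for what a PrefetchBuffer call would achieve is an assumption, and scan_with_prefetch and process_block are made-up names.

```python
import os

BLOCK_SIZE = 8192  # matches the bs=8k used with dd in this thread

def scan_with_prefetch(path: str, process_block) -> int:
    """Read a file block by block, hinting the kernel about the next block
    before processing the current one. Returns the number of blocks read."""
    blocks = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        offset = 0
        while True:
            data = os.pread(fd, BLOCK_SIZE, offset)
            if not data:
                break
            # Ask the kernel to start fetching the following block now, so
            # that read can overlap with process_block(). Advisory only,
            # and skipped on platforms without posix_fadvise.
            if hasattr(os, "posix_fadvise"):
                os.posix_fadvise(fd, offset + BLOCK_SIZE, BLOCK_SIZE,
                                 os.POSIX_FADV_WILLNEED)
            process_block(data)
            blocks += 1
            offset += len(data)
    finally:
        os.close(fd)
    return blocks
```

Timing this against the same loop without the hint, on a cold cache, would be the user-space analogue of the five-minute test.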
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

#126Robert Haas
robertmhaas@gmail.com
In reply to: Stephen Frost (#122)
Re: Parallel Seq Scan

On Tue, Jan 27, 2015 at 4:46 PM, Stephen Frost <sfrost@snowman.net> wrote:

With 0 workers, first run took 883465.352 ms, and second run took 295050.106 ms.
With 8 workers, first run took 340302.250 ms, and second run took 307767.758 ms.

This is a confusing result, because you expect parallelism to help
more when the relation is partly cached, and make little or no
difference when it isn't cached. But that's not what happened.

These numbers seem to indicate that the oddball is the single-threaded
uncached run. If I followed correctly, the uncached 'dd' took 321s,
which is relatively close to the uncached-lots-of-workers and the two
cached runs. What in the world is the uncached single-thread case doing
that it takes an extra 543s, or over twice as long? It's clearly not
disk i/o which is causing the slowdown, based on your dd tests.

Yeah, I'm wondering if the disk just froze up on that run for a long
while, which has been known to occasionally happen on this machine,
because I can't reproduce that crappy number. I did the 0-worker test
a few more times, with the block-by-block method, dropping the caches
and restarting PostgreSQL each time, and got:

322222.968 ms
322873.325 ms
322967.722 ms
321759.273 ms

After that last run, I ran it a few more times without restarting
PostgreSQL or dropping the caches, and got:

257629.348 ms
289668.976 ms
290342.970 ms
258035.226 ms
284237.729 ms

Then I redid the 8-worker test. Cold cache, I got 337312.554 ms. On
the rerun, 323423.813 ms. Third run, 324940.785 ms.

There is more variability than I would like here. Clearly, it goes a
bit faster when the cache is warm, but that's about all I can say with
any confidence.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#127Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#123)
Re: Parallel Seq Scan

On Tue, Jan 27, 2015 at 6:00 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Now, when you did what I understand to be the same test on the same
machine, you got times ranging from 9.1 seconds to 35.4 seconds.
Clearly, there is some difference between our test setups. Moreover,
I'm kind of suspicious about whether your results are actually
physically possible. Even in the best case where you somehow had the
maximum possible amount of data - 64 GB on a 64 GB machine - cached,
leaving no space for cache duplication between PG and the OS and no
space for the operating system or postgres itself - the table is 120
GB, so you've got to read *at least* 56 GB from disk. Reading 56 GB
from disk in 9 seconds represents an I/O rate of >6 GB/s. I grant that
there could be some speedup from issuing I/O requests in parallel
instead of serially, but that is a 15x speedup over dd, so I am a
little suspicious that there is some problem with the test setup,
especially because I cannot reproduce the results.

So I thought about this a little more, and I realized after some
poking around that hydra's disk subsystem is actually six disks
configured in a software RAID5[1]. So one advantage of the
chunk-by-chunk approach you are proposing is that you might be able to
get all of the disks chugging away at once, because the data is
presumably striped across all of them. Reading one block at a time,
you'll never have more than 1 or 2 disks going, but if you do
sequential reads from a bunch of different places in the relation, you
might manage to get all 6. So that's something to think about.

One could imagine an algorithm like this: as long as there are more
1GB segments remaining than there are workers, each worker tries to
chug through a separate 1GB segment. When there are not enough 1GB
segments remaining for that to work, then they start ganging up on the
same segments. That way, you get the benefit of spreading out the I/O
across multiple files (and thus hopefully multiple members of the RAID
group) when the data is coming from disk, but you can still keep
everyone busy until the end, which will be important when the data is
all in-memory and you're just limited by CPU bandwidth.
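
A minimal single-process sketch of that idea follows. It simplifies the
rule slightly: a worker takes a fresh segment whenever an untouched one
exists, and only joins a segment someone else has started once none are
left. The names SegmentScheduler and get_block are hypothetical, and a real
implementation would need shared memory and locking around this state.

```python
from collections import deque

class SegmentScheduler:
    """Hands out (segment, block) pairs to workers, one block at a time."""

    def __init__(self, n_segments: int, blocks_per_segment: int):
        self.blocks_per_segment = blocks_per_segment
        self.unclaimed = deque(range(n_segments))  # segments nobody started
        self.next_block = {}                       # segment -> next unread block
        self.current = {}                          # worker -> segment being read

    def _has_blocks(self, seg: int) -> bool:
        return self.next_block[seg] < self.blocks_per_segment

    def get_block(self, worker: int):
        """Return the next (segment, block) for this worker, or None."""
        seg = self.current.get(worker)
        if seg is None or not self._has_blocks(seg):
            if self.unclaimed:
                # Plenty of work left: start a segment of our own.
                seg = self.unclaimed.popleft()
                self.next_block[seg] = 0
            else:
                # No untouched segments: gang up on one still in progress.
                open_segs = [s for s in self.next_block if self._has_blocks(s)]
                if not open_segs:
                    return None
                seg = open_segs[0]
            self.current[worker] = seg
        block = self.next_block[seg]
        self.next_block[seg] += 1
        return seg, block
```

With three segments and two workers, each worker reads its own segment
first, and both then share the blocks of the third.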

All that aside, I still can't account for the numbers you are seeing.
When I run with your patch and what I think is your test case, I get
different (slower) numbers. And even if we've got 6 drives cranking
along at 400MB/s each, that's still only 2.4 GB/s, not >6 GB/s. So
I'm still perplexed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1]: Not my idea.

#128Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Robert Haas (#127)
Re: Parallel Seq Scan

On 01/28/2015 04:16 AM, Robert Haas wrote:

On Tue, Jan 27, 2015 at 6:00 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Now, when you did what I understand to be the same test on the same
machine, you got times ranging from 9.1 seconds to 35.4 seconds.
Clearly, there is some difference between our test setups. Moreover,
I'm kind of suspicious about whether your results are actually
physically possible. Even in the best case where you somehow had the
maximum possible amount of data - 64 GB on a 64 GB machine - cached,
leaving no space for cache duplication between PG and the OS and no
space for the operating system or postgres itself - the table is 120
GB, so you've got to read *at least* 56 GB from disk. Reading 56 GB
from disk in 9 seconds represents an I/O rate of >6 GB/s. I grant that
there could be some speedup from issuing I/O requests in parallel
instead of serially, but that is a 15x speedup over dd, so I am a
little suspicious that there is some problem with the test setup,
especially because I cannot reproduce the results.

So I thought about this a little more, and I realized after some
poking around that hydra's disk subsystem is actually six disks
configured in a software RAID5[1]. So one advantage of the
chunk-by-chunk approach you are proposing is that you might be able to
get all of the disks chugging away at once, because the data is
presumably striped across all of them. Reading one block at a time,
you'll never have more than 1 or 2 disks going, but if you do
sequential reads from a bunch of different places in the relation, you
might manage to get all 6. So that's something to think about.

One could imagine an algorithm like this: as long as there are more
1GB segments remaining than there are workers, each worker tries to
chug through a separate 1GB segment. When there are not enough 1GB
segments remaining for that to work, then they start ganging up on the
same segments. That way, you get the benefit of spreading out the I/O
across multiple files (and thus hopefully multiple members of the RAID
group) when the data is coming from disk, but you can still keep
everyone busy until the end, which will be important when the data is
all in-memory and you're just limited by CPU bandwidth.

OTOH, spreading the I/O across multiple files is not a good thing, if
you don't have a RAID setup like that. With a single spindle, you'll
just induce more seeks.

Perhaps the OS is smart enough to read in large-enough chunks that the
occasional seek doesn't hurt much. But then again, why isn't the OS
smart enough to read in large-enough chunks to take advantage of the
RAID even when you read just a single file?

- Heikki

#129Amit Kapila
amit.kapila16@gmail.com
In reply to: Heikki Linnakangas (#128)
Re: Parallel Seq Scan

On Wed, Jan 28, 2015 at 12:38 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

On 01/28/2015 04:16 AM, Robert Haas wrote:

On Tue, Jan 27, 2015 at 6:00 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Now, when you did what I understand to be the same test on the same
machine, you got times ranging from 9.1 seconds to 35.4 seconds.
Clearly, there is some difference between our test setups. Moreover,
I'm kind of suspicious about whether your results are actually
physically possible. Even in the best case where you somehow had the
maximum possible amount of data - 64 GB on a 64 GB machine - cached,
leaving no space for cache duplication between PG and the OS and no
space for the operating system or postgres itself - the table is 120
GB, so you've got to read *at least* 56 GB from disk. Reading 56 GB
from disk in 9 seconds represents an I/O rate of >6 GB/s. I grant that
there could be some speedup from issuing I/O requests in parallel
instead of serially, but that is a 15x speedup over dd, so I am a
little suspicious that there is some problem with the test setup,
especially because I cannot reproduce the results.

So I thought about this a little more, and I realized after some
poking around that hydra's disk subsystem is actually six disks
configured in a software RAID5[1]. So one advantage of the
chunk-by-chunk approach you are proposing is that you might be able to
get all of the disks chugging away at once, because the data is
presumably striped across all of them. Reading one block at a time,
you'll never have more than 1 or 2 disks going, but if you do
sequential reads from a bunch of different places in the relation, you
might manage to get all 6. So that's something to think about.

One could imagine an algorithm like this: as long as there are more
1GB segments remaining than there are workers, each worker tries to
chug through a separate 1GB segment. When there are not enough 1GB
segments remaining for that to work, then they start ganging up on the
same segments. That way, you get the benefit of spreading out the I/O
across multiple files (and thus hopefully multiple members of the RAID
group) when the data is coming from disk, but you can still keep
everyone busy until the end, which will be important when the data is
all in-memory and you're just limited by CPU bandwidth.

OTOH, spreading the I/O across multiple files is not a good thing, if you
don't have a RAID setup like that. With a single spindle, you'll just
induce more seeks.

Yeah, in that case the user is unlikely to get any major benefit from
parallel sequential scan unless the qualification or other expressions
used in the statement are costly. One option is for the user to
configure the parallel scan parameters so that a parallel scan is
chosen only when it can be beneficial (for example by increasing
parallel_tuple_comm_cost, or we could add another parameter). The
other is simply not to use parallel sequential scan
(parallel_seqscan_degree=0).

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#130Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#128)
Re: Parallel Seq Scan

On Wed, Jan 28, 2015 at 2:08 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

OTOH, spreading the I/O across multiple files is not a good thing, if you
don't have a RAID setup like that. With a single spindle, you'll just induce
more seeks.

Perhaps the OS is smart enough to read in large-enough chunks that the
occasional seek doesn't hurt much. But then again, why isn't the OS smart
enough to read in large-enough chunks to take advantage of the RAID even
when you read just a single file?

Suppose we have N spindles and N worker processes and it just so
happens that the amount of computation is such that each spindle can
keep one CPU busy. Let's suppose the chunk size is 4MB. If you read
from the relation at N staggered offsets, you might be lucky enough
that each one of them keeps a spindle busy, and you might be lucky
enough to have that stay true as the scans advance. You don't need
any particularly large amount of read-ahead; you just need to stay at
least one block ahead of the CPU. But if you read the relation in one
pass from beginning to end, you need at least N*4MB of read-ahead to
have data in cache for all N spindles, and the read-ahead will
certainly fail you at the end of every 1GB segment.
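
To make the two options concrete, the sketch below computes evenly
staggered starting offsets for N scans and the read-ahead a single
front-to-back scan would need under the N*4MB estimate. The 4MB chunk size
is the supposition from the paragraph above, the six spindles and 120GB
relation are taken from earlier in the thread, and the function names are
invented.

```python
CHUNK_MB = 4  # chunk size supposed in the text, not a measured value

def staggered_offsets_mb(relation_mb: int, n_workers: int) -> list:
    """Starting offsets (in MB) for n_workers scans spread evenly over the
    relation, each rounded down to a chunk boundary."""
    stride = relation_mb // n_workers
    return [(i * stride) // CHUNK_MB * CHUNK_MB for i in range(n_workers)]

def single_pass_readahead_mb(n_spindles: int) -> int:
    """Read-ahead one sequential scan needs to keep every spindle busy,
    following the N*4MB estimate."""
    return n_spindles * CHUNK_MB

print(staggered_offsets_mb(120 * 1024, 6))
print(single_pass_readahead_mb(6))
```

Nothing here knows which offsets land on which disk, which is exactly the
difficulty.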

The problem here, as I see it, is that we're flying blind. If there's
just one spindle, I think it's got to be right to read the relation
sequentially. But if there are multiple spindles, it might not be,
but it seems hard to predict what we should do. We don't know what
the RAID chunk size is or how many spindles there are, so any guess as
to how to chunk up the relation and divide up the work between workers
is just a shot in the dark.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#131Thom Brown
thom@linux.com
In reply to: Robert Haas (#130)
Re: Parallel Seq Scan

On 28 January 2015 at 14:03, Robert Haas <robertmhaas@gmail.com> wrote:

The problem here, as I see it, is that we're flying blind. If there's
just one spindle, I think it's got to be right to read the relation
sequentially. But if there are multiple spindles, it might not be,
but it seems hard to predict what we should do. We don't know what
the RAID chunk size is or how many spindles there are, so any guess as
to how to chunk up the relation and divide up the work between workers
is just a shot in the dark.

Can't the planner take effective_io_concurrency into account?

Thom

#132Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#127)
Re: Parallel Seq Scan

On Wed, Jan 28, 2015 at 7:46 AM, Robert Haas <robertmhaas@gmail.com> wrote:

All that aside, I still can't account for the numbers you are seeing.
When I run with your patch and what I think is your test case, I get
different (slower) numbers. And even if we've got 6 drives cranking
along at 400MB/s each, that's still only 2.4 GB/s, not >6 GB/s. So
I'm still perplexed.

I ran the tests again and found that I had forgotten to increase
max_worker_processes, which is why the data is so different. Basically,
at higher client counts it was just scanning fewer blocks in the
fixed chunk approach. So today I tried again after changing
max_worker_processes and found that there is not much difference in
performance at higher client counts. I will take some more data for
both the block_by_block and fixed_chunk approaches and repost it.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#133Robert Haas
robertmhaas@gmail.com
In reply to: Thom Brown (#131)
Re: Parallel Seq Scan

On Wed, Jan 28, 2015 at 9:12 AM, Thom Brown <thom@linux.com> wrote:

On 28 January 2015 at 14:03, Robert Haas <robertmhaas@gmail.com> wrote:

The problem here, as I see it, is that we're flying blind. If there's
just one spindle, I think it's got to be right to read the relation
sequentially. But if there are multiple spindles, it might not be,
but it seems hard to predict what we should do. We don't know what
the RAID chunk size is or how many spindles there are, so any guess as
to how to chunk up the relation and divide up the work between workers
is just a shot in the dark.

Can't the planner take effective_io_concurrency into account?

Maybe. It's answering somewhat the right question -- to tell us how
many parallel I/O channels we think we've got. But I'm not quite sure
what to do with that information in this case. I mean, if we've
got effective_io_concurrency = 6, does that mean it's right to start
scans in 6 arbitrary places in the relation and hope that keeps all
the drives busy? That seems like throwing darts at the wall. We have
no idea which parts are on which underlying devices. Or maybe it means
we should prefetch 24MB, on the assumption that the RAID stripe is
4MB? That's definitely blind guesswork.

Considering the email Amit just sent, it looks like on this machine,
regardless of what algorithm we used, the scan took between 3 minutes
and 5.5 minutes, and most of them took between 4 minutes and 5.5
minutes. The results aren't very predictable, more workers don't
necessarily help, and it's not really clear that any algorithm we've
tried is clearly better than any other. I experimented with
prefetching a bit yesterday, too, and it was pretty much the same.
Some settings made it slightly faster. Others made it slower. Whee!

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#134Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#130)
Re: Parallel Seq Scan

Robert Haas <robertmhaas@gmail.com> writes:

The problem here, as I see it, is that we're flying blind. If there's
just one spindle, I think it's got to be right to read the relation
sequentially. But if there are multiple spindles, it might not be,
but it seems hard to predict what we should do. We don't know what
the RAID chunk size is or how many spindles there are, so any guess as
to how to chunk up the relation and divide up the work between workers
is just a shot in the dark.

I thought the proposal to chunk on the basis of "each worker processes
one 1GB-sized segment" should work all right. The kernel should see that
as sequential reads of different files, issued by different processes;
and if it can't figure out how to process that efficiently then it's a
very sad excuse for a kernel.
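
To make the segment-per-worker idea concrete, here is a minimal sketch of how a relation's block range splits along segment boundaries. It assumes the default 8kB block size and 1GB segment size (so 131072 blocks per segment); the function name and the tuple layout are invented for illustration and are not taken from any patch in this thread.

```python
BLOCK_SIZE = 8192                  # assumed default block size (BLCKSZ)
SEGMENT_BYTES = 1024 ** 3          # assumed default 1GB segment file size
BLOCKS_PER_SEGMENT = SEGMENT_BYTES // BLOCK_SIZE


def segment_ranges(nblocks):
    """Return (segment_no, first_block, end_block_exclusive) per segment."""
    ranges = []
    start = 0
    seg = 0
    while start < nblocks:
        # The last segment may be shorter than a full 1GB.
        end = min(start + BLOCKS_PER_SEGMENT, nblocks)
        ranges.append((seg, start, end))
        start = end
        seg += 1
    return ranges
```

Each tuple could then be handed to one worker, so that every worker issues sequential reads against a different file.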

You are right that trying to do any detailed I/O scheduling by ourselves
is a doomed exercise. For better or worse, we have kept ourselves at
sufficient remove from the hardware that we can't possibly do that
successfully.

regards, tom lane

#135Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#134)
Re: Parallel Seq Scan

On Wed, Jan 28, 2015 at 10:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

The problem here, as I see it, is that we're flying blind. If there's
just one spindle, I think it's got to be right to read the relation
sequentially. But if there are multiple spindles, it might not be,
but it seems hard to predict what we should do. We don't know what
the RAID chunk size is or how many spindles there are, so any guess as
to how to chunk up the relation and divide up the work between workers
is just a shot in the dark.

I thought the proposal to chunk on the basis of "each worker processes
one 1GB-sized segment" should work all right. The kernel should see that
as sequential reads of different files, issued by different processes;
and if it can't figure out how to process that efficiently then it's a
very sad excuse for a kernel.

I agree. But there's only value in doing something like that if we
have evidence that it improves anything. Such evidence is presently a
bit thin on the ground.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#136Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#135)
Re: Parallel Seq Scan

Robert Haas <robertmhaas@gmail.com> writes:

On Wed, Jan 28, 2015 at 10:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I thought the proposal to chunk on the basis of "each worker processes
one 1GB-sized segment" should work all right. The kernel should see that
as sequential reads of different files, issued by different processes;
and if it can't figure out how to process that efficiently then it's a
very sad excuse for a kernel.

I agree. But there's only value in doing something like that if we
have evidence that it improves anything. Such evidence is presently a
bit thin on the ground.

Well, of course none of this should get committed without convincing
evidence that it's a win. But I think that chunking on relation segment
boundaries is a plausible way of dodging the problem that we can't do
explicitly hardware-aware scheduling.

regards, tom lane

#137Stephen Frost
sfrost@snowman.net
In reply to: Robert Haas (#135)
Re: Parallel Seq Scan

* Robert Haas (robertmhaas@gmail.com) wrote:

On Wed, Jan 28, 2015 at 10:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I thought the proposal to chunk on the basis of "each worker processes
one 1GB-sized segment" should work all right. The kernel should see that
as sequential reads of different files, issued by different processes;
and if it can't figure out how to process that efficiently then it's a
very sad excuse for a kernel.

Agreed.

I agree. But there's only value in doing something like that if we
have evidence that it improves anything. Such evidence is presently a
bit thin on the ground.

You need an i/o subsystem that's fast enough to keep a single CPU busy,
otherwise (as you mentioned elsewhere), you're just going to be i/o
bound and having more processes isn't going to help (and could hurt).

Such i/o systems do exist, but a single RAID5 group over spinning rust
with a simple filter isn't going to cut it with a modern CPU- we're just
too darn efficient to end up i/o bound in that case. A more complex
filter might be able to change it over to being more CPU bound than i/o
bound and produce the performance improvements you're looking for.

The caveat to this is if you have multiple i/o *channels* (which it
looks like you don't in this case) where you can parallelize across
those channels by having multiple processes involved. We only support
multiple i/o channels today with tablespaces and we can't span tables
across tablespaces. That's a problem when working with large data sets,
but I'm hopeful that this work will eventually lead to a parallelized
Append node that operates against a partitioned/inherited table to work
across multiple tablespaces.

Thanks,

Stephen

#138Stephen Frost
sfrost@snowman.net
In reply to: Stephen Frost (#137)
Re: Parallel Seq Scan

* Stephen Frost (sfrost@snowman.net) wrote:

Such i/o systems do exist, but a single RAID5 group over spinning rust
with a simple filter isn't going to cut it with a modern CPU- we're just
too darn efficient to end up i/o bound in that case.

err, to *not* end up i/o bound.

Thanks,

Stephen

#139Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Stephen Frost (#137)
Re: Parallel Seq Scan

On 1/28/15 9:56 AM, Stephen Frost wrote:

* Robert Haas (robertmhaas@gmail.com) wrote:

On Wed, Jan 28, 2015 at 10:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I thought the proposal to chunk on the basis of "each worker processes
one 1GB-sized segment" should work all right. The kernel should see that
as sequential reads of different files, issued by different processes;
and if it can't figure out how to process that efficiently then it's a
very sad excuse for a kernel.

Agreed.

I agree. But there's only value in doing something like that if we
have evidence that it improves anything. Such evidence is presently a
bit thin on the ground.

You need an i/o subsystem that's fast enough to keep a single CPU busy,
otherwise (as you mentioned elsewhere), you're just going to be i/o
bound and having more processes isn't going to help (and could hurt).

Such i/o systems do exist, but a single RAID5 group over spinning rust
with a simple filter isn't going to cut it with a modern CPU- we're just
too darn efficient to end up i/o bound in that case. A more complex
filter might be able to change it over to being more CPU bound than i/o
bound and produce the performance improvements you're looking for.

Except we're nowhere near being IO efficient. The vast difference between Postgres IO rates and dd shows this. I suspect that's because we're not giving the OS a list of IO to perform while we're doing our thing, but that's just a guess.

The caveat to this is if you have multiple i/o *channels* (which it
looks like you don't in this case) where you can parallelize across
those channels by having multiple processes involved.

Keep in mind that multiple processes is in no way a requirement for that. Async IO would do that, or even just requesting stuff from the OS before we need it.
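
As a rough sketch of "requesting stuff from the OS before we need it" from a single process: the loop below hints the kernel about the next block before reading the current one. It uses Python's os.posix_fadvise with POSIX_FADV_WILLNEED, which is only available on Unix-like systems; the 8kB block size, the function name and the callback are assumptions made for the example, and whether such a hint helps at all depends on the kernel and the storage.

```python
import os

BLOCK_SIZE = 8192  # assumed block size, matching the PostgreSQL default


def scan_with_readahead(path, process_block):
    """Read a file block by block, hinting the OS about the next block."""
    nblocks = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        offset = 0
        while offset < size:
            next_offset = offset + BLOCK_SIZE
            # Ask the kernel to start fetching the next block before we
            # read and process the current one (skipped where unsupported).
            if next_offset < size and hasattr(os, "posix_fadvise"):
                os.posix_fadvise(fd, next_offset, BLOCK_SIZE,
                                 os.POSIX_FADV_WILLNEED)
            data = os.pread(fd, BLOCK_SIZE, offset)
            process_block(data)
            nblocks += 1
            offset = next_offset
    finally:
        os.close(fd)
    return nblocks
```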

We only support
multiple i/o channels today with tablespaces and we can't span tables
across tablespaces. That's a problem when working with large data sets,
but I'm hopeful that this work will eventually lead to a parallelized
Append node that operates against a partitioned/inherited table to work
across multiple tablespaces.

Until we can get a single seqscan close to dd performance, I fear worrying about tablespaces and IO channels is entirely premature.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

#140Stephen Frost
sfrost@snowman.net
In reply to: Jim Nasby (#139)
Re: Parallel Seq Scan

Jim,

* Jim Nasby (Jim.Nasby@BlueTreble.com) wrote:

On 1/28/15 9:56 AM, Stephen Frost wrote:

Such i/o systems do exist, but a single RAID5 group over spinning rust
with a simple filter isn't going to cut it with a modern CPU- we're just
too darn efficient to end up i/o bound in that case. A more complex
filter might be able to change it over to being more CPU bound than i/o
bound and produce the performance improvements you're looking for.

Except we're nowhere near being IO efficient. The vast difference between Postgres IO rates and dd shows this. I suspect that's because we're not giving the OS a list of IO to perform while we're doing our thing, but that's just a guess.

Uh, huh? The dd was ~321000 and the slowest uncached PG run from
Robert's latest tests was 337312.554, based on my inbox history at
least. I don't consider ~4-5% difference to be vast.
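
For reference, the gap between those two figures works out as follows. This is just the arithmetic on the numbers quoted above (same units for both); the variable names are made up for the example.

```python
dd_time = 321000.0      # approximate dd figure quoted above
pg_time = 337312.554    # slowest uncached PG run quoted above

# Gap expressed relative to the dd time and relative to the PG time.
slowdown_vs_dd = (pg_time - dd_time) / dd_time
share_of_pg = (pg_time - dd_time) / pg_time

print(f"PG slower than dd by {slowdown_vs_dd:.1%}")
print(f"gap as a share of the PG time: {share_of_pg:.1%}")
```

Depending on which figure is used as the base, the gap lands just under or just over 5%.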

The caveat to this is if you have multiple i/o *channels* (which it
looks like you don't in this case) where you can parallelize across
those channels by having multiple processes involved.

Keep in mind that multiple processes is in no way a requirement for that. Async IO would do that, or even just requesting stuff from the OS before we need it.

While I agree with this in principle, experience has shown that it
doesn't tend to work out as well as we'd like with a single process.

We only support
multiple i/o channels today with tablespaces and we can't span tables
across tablespaces. That's a problem when working with large data sets,
but I'm hopeful that this work will eventually lead to a parallelized
Append node that operates against a partitioned/inherited table to work
across multiple tablespaces.

Until we can get a single seqscan close to dd performance, I fear worrying about tablespaces and IO channels is entirely premature.

I feel like one of us is misunderstanding the numbers, which is probably
in part because they're a bit piecemeal over email, but the seqscan
speed in this case looks pretty close to dd performance for this
particular test, when things are uncached. Cached numbers are
different, but that's not what we're discussing here, I don't think.

Don't get me wrong- I've definitely seen cases where we're CPU bound
because of complex filters, etc, but that doesn't seem to be the case
here.

Thanks!

Stephen

#141Robert Haas
robertmhaas@gmail.com
In reply to: Stephen Frost (#140)
Re: Parallel Seq Scan

On Wed, Jan 28, 2015 at 8:27 PM, Stephen Frost <sfrost@snowman.net> wrote:

I feel like one of us is misunderstanding the numbers, which is probably
in part because they're a bit piecemeal over email, but the seqscan
speed in this case looks pretty close to dd performance for this
particular test, when things are uncached. Cached numbers are
different, but that's not what we're discussing here, I don't think.

Don't get me wrong- I've definitely seen cases where we're CPU bound
because of complex filters, etc, but that doesn't seem to be the case
here.

To try to clarify a bit: What we're testing here is a function I wrote
called parallel_count(regclass), which counts all the visible tuples
in a named relation. That runs as fast as dd, and giving it extra
workers or prefetching or the ability to read the relation with
different I/O patterns never seems to speed anything up very much.

The story with parallel sequential scan itself may well be different,
since that has a lot more CPU overhead than a dumb-simple
tuple-counter.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#142Daniel Bausch
bausch@dvs.tu-darmstadt.de
In reply to: Robert Haas (#133)
Re: Parallel Seq Scan

Robert Haas <robertmhaas@gmail.com> writes:

On Wed, Jan 28, 2015 at 9:12 AM, Thom Brown <thom@linux.com> wrote:

On 28 January 2015 at 14:03, Robert Haas <robertmhaas@gmail.com> wrote:

The problem here, as I see it, is that we're flying blind. If there's
just one spindle, I think it's got to be right to read the relation
sequentially. But if there are multiple spindles, it might not be,
but it seems hard to predict what we should do. We don't know what
the RAID chunk size is or how many spindles there are, so any guess as
to how to chunk up the relation and divide up the work between workers
is just a shot in the dark.

Can't the planner take effective_io_concurrency into account?

Maybe. It's answering somewhat the right question -- to tell us how
many parallel I/O channels we think we've got. But I'm not quite sure
what to do with that information in this case. I mean, if we've
got effective_io_concurrency = 6, does that mean it's right to start
scans in 6 arbitrary places in the relation and hope that keeps all
the drives busy? That seems like throwing darts at the wall. We have
no idea which parts are on which underlying devices. Or maybe it means
we should prefetch 24MB, on the assumption that the RAID stripe is
4MB? That's definitely blind guesswork.

Considering the email Amit just sent, it looks like on this machine,
regardless of what algorithm we used, the scan took between 3 minutes
and 5.5 minutes, and most of them took between 4 minutes and 5.5
minutes. The results aren't very predictable, more workers don't
necessarily help, and it's not really clear that any algorithm we've
tried is clearly better than any other. I experimented with
prefetching a bit yesterday, too, and it was pretty much the same.
Some settings made it slightly faster. Others made it slower. Whee!

I researched this topic a long time ago. One notable fact is
that active prefetching disables automatic readahead prefetching (by
the Linux kernel), which can occur in larger granularities than 8K.
Automatic readahead prefetching occurs when consecutive addresses are
read, which may happen in a seqscan but also by "accident" through an
indexscan in correlated cases.

My conclusion was to NOT prefetch seqscans, because the OS does well
enough without advice. Prefetching indexscan heap accesses is very
valuable, though, but you need to detect the accidental sequential
accesses so as not to hurt your performance in correlated cases.

In general I can give you the hint to not only focus on HDDs with their
single spindle. A single SATA SSD scales up to 32 (31 on Linux)
requests in parallel (without RAID or anything else). The difference in
throughput is extreme for this type of storage device. While single
spinning HDDs can only gain up to ~20% by NCQ, SATA SSDs can easily gain
up to 700%.

+1 for using effective_io_concurrency to tune for this, since
prefetching random addresses is effectively a type of parallel I/O.

Regards,
Daniel
--
MSc. Daniel Bausch
Research Assistant (Computer Science)
Technische Universität Darmstadt
http://www.dvs.tu-darmstadt.de/staff/dbausch

#143Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#132)
Re: Parallel Seq Scan

On Wed, Jan 28, 2015 at 8:59 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

I ran the tests again and found that I had forgotten to increase
max_worker_processes, which is why the data is so different. Basically,
at higher client counts it was just scanning fewer blocks in the
fixed chunk approach. So today I tried again after changing
max_worker_processes and found that there is not much difference in
performance at higher client counts. I will take some more data for
both the block_by_block and fixed_chunk approaches and repost it.

I have taken the data again and found that there is not much difference
between the block-by-block and fixed_chunk approaches; the data is at the
end of this mail for your reference. There is variation in some cases: for
example, with the fixed_chunk approach the 8-worker case shows a lower time,
however on certain executions it has taken almost the same time as the
other worker counts.

Now if we go with the block-by-block approach, we have the advantage that
the work distribution granularity will be smaller and hence better; if
we go with the chunk-by-chunk (fixed_chunk of 1GB) approach, there
is a good chance that the kernel can better optimize the reads.

Based on inputs on this thread, one way for execution strategy could
be:

a. In the optimizer, based on effective_io_concurrency, the size of the
relation and parallel_seqscan_degree, we can decide how many workers can
be used for executing the plan:
- choose number_of_workers equal to effective_io_concurrency
if it is less than parallel_seqscan_degree, else number_of_workers
will be equal to parallel_seqscan_degree.
- if the size of the relation is greater than number_of_workers times 1GB
(if number_of_workers is 8, then we need to compare the size of the
relation with 8GB), then keep number_of_workers intact and distribute
the remaining chunks/segments during execution, else
reduce number_of_workers such that each worker gets 1GB
to operate on.
- if the size of the relation is less than 1GB, then we can either not
choose parallel_seqscan at all, or use smaller chunks,
or use the block-by-block approach to execute.
- here we need to consider other parameters like parallel_setup,
parallel_startup and tuple_communication cost as well.

b. In the executor, if fewer workers are available than are required
for statement execution, then we can redistribute the remaining
work among the available workers.
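
The planner-side rules in (a) could be sketched roughly as below. This is only an illustration of the rules as stated, not code from the patch: the function name is hypothetical, the cost parameters are ignored, and for relations under 1GB it simply picks the "do not use parallel seq scan" option out of the three listed.

```python
GB = 1024 ** 3


def choose_workers(rel_bytes, effective_io_concurrency, parallel_seqscan_degree):
    """Return the worker count, or 0 meaning 'no parallel seq scan'."""
    # Rule 1: the smaller of the two settings caps the worker count.
    workers = min(effective_io_concurrency, parallel_seqscan_degree)
    if workers <= 0:
        return 0
    # Rule 3 (one of the listed options): skip parallelism below 1GB.
    if rel_bytes < GB:
        return 0
    # Rule 2: keep the count if every worker can get at least 1GB ...
    if rel_bytes >= workers * GB:
        return workers
    # ... otherwise shrink it so that each worker gets about 1GB.
    return rel_bytes // GB
```

For example, a 3GB relation with effective_io_concurrency = 6 and parallel_seqscan_degree = 8 would get 3 workers under these rules.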

Performance Data - Before the first run for each worker count, I executed
drop_caches to clear the cache and restarted the server, so we can
assume that all runs except Run-1 have some caching effect.

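*Fixed-Chunks*

| No. of workers / Time (ms) |      0 |      8 |     16 |     32 |
|----------------------------+--------+--------+--------+--------|
| Run-1                      | 322822 | 245759 | 330097 | 330002 |
| Run-2                      | 275685 | 275428 | 301625 | 286251 |
| Run-3                      | 252129 | 244167 | 303494 | 278604 |
| Run-4                      | 252528 | 259273 | 250438 | 258636 |
| Run-5                      | 250612 | 242072 | 235384 | 265918 |

*Block-By-Block*

| No. of workers / Time (ms) |      0 |      8 |     16 |     32 |
|----------------------------+--------+--------+--------+--------|
| Run-1                      | 323084 | 341950 | 338999 | 334100 |
| Run-2                      | 310968 | 349366 | 344272 | 322643 |
| Run-3                      | 250312 | 336227 | 346276 | 322274 |
| Run-4                      | 262744 | 314489 | 351652 | 325135 |
| Run-5                      | 265987 | 316260 | 342924 | 319200 |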
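
To compare the two approaches at a glance, the per-worker-count means of the figures above can be computed with a few lines like these. The numbers are copied from the data above; the helper name is made up, and with only five runs in mixed cache states the means are a summary, not evidence of a significant difference.

```python
from statistics import mean

WORKERS = [0, 8, 16, 32]

# Times in ms: one row per run (Run-1 .. Run-5), one column per worker count.
FIXED_CHUNKS = [
    [322822, 245759, 330097, 330002],
    [275685, 275428, 301625, 286251],
    [252129, 244167, 303494, 278604],
    [252528, 259273, 250438, 258636],
    [250612, 242072, 235384, 265918],
]
BLOCK_BY_BLOCK = [
    [323084, 341950, 338999, 334100],
    [310968, 349366, 344272, 322643],
    [250312, 336227, 346276, 322274],
    [262744, 314489, 351652, 325135],
    [265987, 316260, 342924, 319200],
]


def column_means(rows):
    """Mean time per worker count across all runs."""
    return [mean(col) for col in zip(*rows)]


for name, rows in (("fixed_chunk", FIXED_CHUNKS),
                   ("block_by_block", BLOCK_BY_BLOCK)):
    for workers, avg in zip(WORKERS, column_means(rows)):
        print(f"{name:>14} workers={workers:<2} mean={avg:.0f} ms")
```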

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#144Jeff Janes
jeff.janes@gmail.com
In reply to: Heikki Linnakangas (#128)
Re: Parallel Seq Scan

On Tue, Jan 27, 2015 at 11:08 PM, Heikki Linnakangas <
hlinnakangas@vmware.com> wrote:

On 01/28/2015 04:16 AM, Robert Haas wrote:

On Tue, Jan 27, 2015 at 6:00 PM, Robert Haas <robertmhaas@gmail.com>
wrote:

Now, when you did what I understand to be the same test on the same
machine, you got times ranging from 9.1 seconds to 35.4 seconds.
Clearly, there is some difference between our test setups. Moreover,
I'm kind of suspicious about whether your results are actually
physically possible. Even in the best case where you somehow had the
maximum possible amount of data - 64 GB on a 64 GB machine - cached,
leaving no space for cache duplication between PG and the OS and no
space for the operating system or postgres itself - the table is 120
GB, so you've got to read *at least* 56 GB from disk. Reading 56 GB
from disk in 9 seconds represents an I/O rate of >6 GB/s. I grant that
there could be some speedup from issuing I/O requests in parallel
instead of serially, but that is a 15x speedup over dd, so I am a
little suspicious that there is some problem with the test setup,
especially because I cannot reproduce the results.

So I thought about this a little more, and I realized after some
poking around that hydra's disk subsystem is actually six disks
configured in a software RAID5[1]. So one advantage of the
chunk-by-chunk approach you are proposing is that you might be able to
get all of the disks chugging away at once, because the data is
presumably striped across all of them. Reading one block at a time,
you'll never have more than 1 or 2 disks going, but if you do
sequential reads from a bunch of different places in the relation, you
might manage to get all 6. So that's something to think about.

One could imagine an algorithm like this: as long as there are more
1GB segments remaining than there are workers, each worker tries to
chug through a separate 1GB segment. When there are not enough 1GB
segments remaining for that to work, then they start ganging up on the
same segments. That way, you get the benefit of spreading out the I/O
across multiple files (and thus hopefully multiple members of the RAID
group) when the data is coming from disk, but you can still keep
everyone busy until the end, which will be important when the data is
all in-memory and you're just limited by CPU bandwidth.
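
One possible reading of the algorithm in the paragraph above, as a toy single-process sketch: workers first take segments nobody has touched, and once none are left they gang up on whichever segment has the most blocks remaining. The class and method names are hypothetical, the "more segments than workers" condition is simplified to "any untouched segment left", and a real implementation would need this state in shared memory under a lock.

```python
class SegmentAllocator:
    """Toy block allocator: private segments first, then shared leftovers."""

    def __init__(self, nblocks, blocks_per_segment):
        # next_block[i] .. end_block[i] is the unread part of segment i.
        self.next_block = list(range(0, nblocks, blocks_per_segment))
        self.end_block = [min(s + blocks_per_segment, nblocks)
                          for s in self.next_block]
        self.claimed = [False] * len(self.next_block)
        self.current = {}  # worker id -> segment index

    def _remaining(self, seg):
        return self.end_block[seg] - self.next_block[seg]

    def next_block_for(self, worker):
        """Return the next block number for this worker, or None when done."""
        seg = self.current.get(worker)
        if seg is None or self._remaining(seg) == 0:
            unclaimed = [i for i, c in enumerate(self.claimed) if not c]
            if unclaimed:
                seg = unclaimed[0]      # take a segment nobody has touched
                self.claimed[seg] = True
            else:
                # No untouched segment left: gang up on the fullest one.
                seg = max(range(len(self.next_block)),
                          key=self._remaining, default=None)
                if seg is None or self._remaining(seg) == 0:
                    return None
            self.current[worker] = seg
        block = self.next_block[seg]
        self.next_block[seg] += 1
        return block
```

With 10 blocks, 4 blocks per segment and two workers called in turn, each worker reads its own segment sequentially, and both end up sharing the short last segment.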

OTOH, spreading the I/O across multiple files is not a good thing, if you
don't have a RAID setup like that. With a single spindle, you'll just
induce more seeks.

Perhaps the OS is smart enough to read in large-enough chunks that the
occasional seek doesn't hurt much. But then again, why isn't the OS smart
enough to read in large-enough chunks to take advantage of the RAID even
when you read just a single file?

In my experience with RAID, it is smart enough to take advantage of that.
If the raid controller detects a sequential access pattern read, it
initiates a read ahead on each disk to pre-position the data it will need
(or at least, the behavior I observe is as-if it did that). But maybe if
the sequential read is a bunch of "random" reads from different processes
which just happen to add up to sequential, that confuses the algorithm?

Cheers,

Jeff

#145Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jeff Janes (#144)
Re: Parallel Seq Scan

Jeff Janes <jeff.janes@gmail.com> writes:

On Tue, Jan 27, 2015 at 11:08 PM, Heikki Linnakangas <
hlinnakangas@vmware.com> wrote:

OTOH, spreading the I/O across multiple files is not a good thing, if you
don't have a RAID setup like that. With a single spindle, you'll just
induce more seeks.

Perhaps the OS is smart enough to read in large-enough chunks that the
occasional seek doesn't hurt much. But then again, why isn't the OS smart
enough to read in large-enough chunks to take advantage of the RAID even
when you read just a single file?

In my experience with RAID, it is smart enough to take advantage of that.
If the raid controller detects a sequential access pattern read, it
initiates a read ahead on each disk to pre-position the data it will need
(or at least, the behavior I observe is as-if it did that). But maybe if
the sequential read is a bunch of "random" reads from different processes
which just happen to add up to sequential, that confuses the algorithm?

If seqscan detection is being done at the level of the RAID controller,
I rather imagine that the controller would not know which process had
initiated which read anyway. But if it's being done at the level of the
kernel, it's a whole nother thing, and I bet it *would* matter.

regards, tom lane

#146Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#145)
Re: Parallel Seq Scan

On Thu, Jan 29, 2015 at 11:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

In my experience with RAID, it is smart enough to take advantage of that.
If the raid controller detects a sequential access pattern read, it
initiates a read ahead on each disk to pre-position the data it will need
(or at least, the behavior I observe is as-if it did that). But maybe if
the sequential read is a bunch of "random" reads from different processes
which just happen to add up to sequential, that confuses the algorithm?

If seqscan detection is being done at the level of the RAID controller,
I rather imagine that the controller would not know which process had
initiated which read anyway. But if it's being done at the level of the
kernel, it's a whole nother thing, and I bet it *would* matter.

That was my feeling too. On the machine that Amit and I have been
using for testing, we can't find any really convincing evidence that
it matters. I won't be a bit surprised if there are other systems
where it does matter, but I don't know how to find them except to
encourage other people to help test.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#147Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Stephen Frost (#140)
Re: Parallel Seq Scan

On 1/28/15 7:27 PM, Stephen Frost wrote:

* Jim Nasby (Jim.Nasby@BlueTreble.com) wrote:

On 1/28/15 9:56 AM, Stephen Frost wrote:

Such i/o systems do exist, but a single RAID5 group over spinning rust
with a simple filter isn't going to cut it with a modern CPU- we're just
too darn efficient to end up i/o bound in that case. A more complex
filter might be able to change it over to being more CPU bound than i/o
bound and produce the performance improvements you're looking for.

Except we're nowhere near being IO efficient. The vast difference between Postgres IO rates and dd shows this. I suspect that's because we're not giving the OS a list of IO to perform while we're doing our thing, but that's just a guess.

Uh, huh? The dd was ~321000 and the slowest uncached PG run from
Robert's latest tests was 337312.554, based on my inbox history at
least. I don't consider ~4-5% difference to be vast.

Sorry, I was speaking more generally than this specific test. In the past I've definitely seen SeqScan performance that was an order of magnitude slower than what dd would do. This was an older version of Postgres and an older version of Linux, running on an iSCSI SAN. My suspicion is that the added IO latency imposed by iSCSI is what was causing this, but that's just conjecture.

I think Robert was saying that he hasn't been able to see this effect on their test server... that makes me think it's doing read-ahead on the OS level. But I suspect it's pretty touch and go to rely on that; I'd prefer we have some way to explicitly get that behavior where we want it.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

#148Stephen Frost
sfrost@snowman.net
In reply to: Daniel Bausch (#142)
Re: Parallel Seq Scan

Daniel,

* Daniel Bausch (bausch@dvs.tu-darmstadt.de) wrote:

I researched this topic a long time ago. One notable fact is
that active prefetching disables automatic readahead prefetching (by
the Linux kernel), which can occur in larger granularities than 8K.
Automatic readahead prefetching occurs when consecutive addresses are
read, which may happen in a seqscan but also by "accident" through an
indexscan in correlated cases.

That strikes me as a pretty good point to consider.

My conclusion was to NOT prefetch seqscans, because the OS does well
enough without advice. Prefetching indexscan heap accesses is very
valuable, though, but you need to detect the accidental sequential
accesses so as not to hurt your performance in correlated cases.

Seems like we might be able to do that, it's not that different from
what we do with the bitmap scan case, we'd just look at the bitmap and
see if there's long runs of 1's.
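
A minimal sketch of that check, assuming the bitmap is available as a plain sequence of 0/1 flags, one per heap block: it reports every run of set bits that is at least min_run long. The function name and the min_run threshold are hypothetical; blocks inside such runs would be left to kernel readahead, and only the scattered ones would be prefetched explicitly.

```python
def sequential_runs(bitmap, min_run):
    """Return (start, length) for each run of set bits >= min_run long."""
    runs = []
    start = None
    for i, bit in enumerate(bitmap):
        if bit:
            if start is None:
                start = i               # a new run begins here
        elif start is not None:
            if i - start >= min_run:
                runs.append((start, i - start))
            start = None
    # Close a run that extends to the end of the bitmap.
    if start is not None and len(bitmap) - start >= min_run:
        runs.append((start, len(bitmap) - start))
    return runs
```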

In general I can give you the hint to not only focus on HDDs with their
single spindle. A single SATA SSD scales up to 32 (31 on Linux)
requests in parallel (without RAID or anything else). The difference in
throughput is extreme for this type of storage device. While single
spinning HDDs can only gain up to ~20% by NCQ, SATA SSDs can easily gain
up to 700%.

I definitely agree with the idea that we should be looking at SSD-based
systems but I don't know if anyone happens to have easy access to server
gear with SSDs. I've got an SSD in my laptop, but that's not really the
same thing.

Thanks!

Stephen

#149Daniel Bausch
bausch@dvs.tu-darmstadt.de
In reply to: David Fetter (#120)
4 attachment(s)
Re: Parallel Seq Scan

Hi David and others!

David Fetter <david@fetter.org> writes:

On Tue, Jan 27, 2015 at 08:02:37AM +0100, Daniel Bausch wrote:

Tom Lane <tgl@sss.pgh.pa.us> writes:

Wait for first IO, issue second IO request
Compute
Already have second IO request, issue third
...

We'd be a lot less sensitive to IO latency.

It would take about five minutes of coding to prove or disprove this:
stick a PrefetchBuffer call into heapgetpage() to launch a request for the
next page as soon as we've read the current one, and then see if that
makes any obvious performance difference. I'm not convinced that it will,
but if it did then we could think about how to make it work for real.
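
For readers without a PostgreSQL tree at hand, the experiment can be
approximated outside the server. As far as I know, on builds with
USE_PREFETCH, PrefetchBuffer ends up as a posix_fadvise(...,
POSIX_FADV_WILLNEED) hint for blocks that are not already in shared buffers.
The sketch below is not heapgetpage(); scan_with_prefetch and its loop are
hypothetical, and it assumes a platform that provides posix_fadvise (Linux,
for instance).

```c
#define _XOPEN_SOURCE 600
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192				/* PostgreSQL's default page size */

/*
 * Hypothetical stand-alone scan loop: read a file block by block and, as
 * soon as block n has been read, hint the kernel that block n + 1 will be
 * needed.  Returns the number of (full or partial) blocks read, or -1 on
 * a read error.
 */
static long
scan_with_prefetch(int fd)
{
	char		page[BLCKSZ];
	long		blkno = 0;
	ssize_t		nread;

	assert(fd >= 0);

	for (;;)
	{
		nread = pread(fd, page, BLCKSZ, (off_t) blkno * BLCKSZ);
		if (nread < 0)
			return -1;
		if (nread == 0)
			break;				/* end of file */

		/* Only a hint, so the return value is deliberately ignored. */
		(void) posix_fadvise(fd, (off_t) (blkno + 1) * BLCKSZ, BLCKSZ,
							 POSIX_FADV_WILLNEED);

		/* ... the "compute" step on page would go here ... */
		blkno++;
	}
	return blkno;
}
```

Timing this loop with and without the posix_fadvise call on a cold page
cache would show whether the advice helps or merely replaces the kernel's own
readahead on a given system.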

Sorry for dropping in so late...

I did all this two years ago. For TPC-H Q8, Q9, Q17, Q20, and Q21
I see a speedup of ~100% when using IndexScan prefetching + Nested-Loops
Look-Ahead (the outer loop!).
(On SSD with 32 Pages Prefetch/Look-Ahead + Cold Page Cache / Small RAM)

Would you be so kind as to pass along any patches (ideally applicable
to git master), tests, and specific measurements you made?

Attached are my patches, based on the old revision
36f4c7843cf3d201279855ed9a6ebc1deb3c9463
(Adjust cube.out expected output for new test queries.)

I have not yet tested whether they apply against HEAD.

Disclaimer: This was just a proof of concept, so the implementation
quality is poor. Nevertheless, performance looked promising, although
it still needs a lot of extra rules for special cases, like
detecting accidental sequential scans. The general assumption is no
concurrency: a single query owning the machine.

Here is a comparison using dbt3. Q8, Q9, Q17, Q20, and Q21 are
significantly improved.

| Query | baseline | indexscan (patch 1+2) | indexscan+nestloop (patch 3) |
|-----+------------+------------+--------------------|
| Q1 | 76.124261 | 73.165161 | 76.323119 |
| Q2 | 9.676956 | 11.211073 | 10.480668 |
| Q3 | 36.836417 | 36.268022 | 36.837226 |
| Q4 | 48.707501 | 64.2255 | 30.872218 |
| Q5 | 59.371467 | 59.205048 | 58.646096 |
| Q6 | 70.514214 | 73.021006 | 72.64643 |
| Q7 | 63.667594 | 63.258499 | 62.758288 |
| Q8 | 70.640973 | 33.144454 | 32.530732 |
| Q9 | 446.630473 | 379.063773 | 219.926094 |
| Q10 | 49.616125 | 49.244744 | 48.411664 |
| Q11 | 6.122317 | 6.158616 | 6.160189 |
| Q12 | 74.294292 | 87.780442 | 87.533936 |
| Q13 | 32.37932 | 32.771938 | 33.483444 |
| Q14 | 47.836053 | 48.093996 | 47.72221 |
| Q15 | 139.350038 | 138.880208 | 138.681336 |
| Q16 | 12.092429 | 12.120661 | 11.668971 |
| Q17 | 9.346636 | 4.106042 | 4.018951 |
| Q18 | 66.106875 | 123.754111 | 122.623193 |
| Q19 | 22.750504 | 23.191532 | 22.34084 |
| Q20 | 80.481986 | 29.906274 | 28.58106 |
| Q21 | 396.897269 | 355.45988 | 214.44184 |
| Q22 | 6.834841 | 6.600922 | 6.524032 |
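
The "~100%" figure can be checked directly from the table. The sketch below
copies the baseline and "indexscan+nestloop" columns for the five highlighted
queries, plus Q12 and Q18 for contrast. The struct and function names are
made up for illustration, and the factor is unit-free, so it does not matter
which unit the table uses.

```c
#include <assert.h>
#include <stdio.h>

typedef struct QueryTiming
{
	const char *query;
	double		baseline;
	double		patched;		/* "indexscan+nestloop" column */
} QueryTiming;

/* Rows copied from the table above. */
static const QueryTiming timings[] = {
	{"Q8", 70.640973, 32.530732},
	{"Q9", 446.630473, 219.926094},
	{"Q17", 9.346636, 4.018951},
	{"Q20", 80.481986, 28.58106},
	{"Q21", 396.897269, 214.44184},
	{"Q12", 74.294292, 87.533936},
	{"Q18", 66.106875, 122.623193},
};

/* Above 1.0 the patched run was faster; below 1.0 it was slower. */
static double
speedup(const QueryTiming *t)
{
	assert(t->patched > 0.0);
	return t->baseline / t->patched;
}

/* Print one line per query with its speedup factor. */
static void
print_speedups(void)
{
	size_t		i;

	for (i = 0; i < sizeof(timings) / sizeof(timings[0]); i++)
		printf("%-4s %.2fx\n", timings[i].query, speedup(&timings[i]));
}
```

Calling print_speedups() gives factors of roughly 2.17, 2.03, 2.33, 2.82 and
1.85 for Q8, Q9, Q17, Q20 and Q21, which matches the "~100%" claim. The same
arithmetic gives about 0.85 for Q12 and 0.54 for Q18, so those two queries
became slower in this run.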

Regards,
Daniel
--
MSc. Daniel Bausch
Research Assistant (Computer Science)
Technische Universität Darmstadt
http://www.dvs.tu-darmstadt.de/staff/dbausch

Attachments:

0001-Quick-proof-of-concept-for-indexscan-prefetching.patch (text/x-diff)
From 569398929d899100b769abfd919bc3383626ac9f Mon Sep 17 00:00:00 2001
From: Daniel Bausch <bausch@dvs.tu-darmstadt.de>
Date: Tue, 22 Oct 2013 15:22:25 +0200
Subject: [PATCH 1/4] Quick proof-of-concept for indexscan prefetching

This implements a prefetching queue of tuples whose tid is read ahead.
Their block number is quickly checked for random properties (not current
block and not the block prefetched last).  Random reads are prefetched.
Up to 32 tuples are considered by default.  The tids are queued in a
fixed ring buffer.

The prefetching is implemented in the generic part of the index scan, so
it applies to all access methods.
---
 src/backend/access/index/indexam.c | 96 ++++++++++++++++++++++++++++++++++++++
 src/include/access/relscan.h       | 12 +++++
 2 files changed, 108 insertions(+)

diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index b878155..1c54ef5 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -251,6 +251,12 @@ index_beginscan(Relation heapRelation,
 	scan->heapRelation = heapRelation;
 	scan->xs_snapshot = snapshot;
 
+#ifdef USE_PREFETCH
+	scan->xs_prefetch_head = scan->xs_prefetch_tail = -1;
+	scan->xs_last_prefetch = -1;
+	scan->xs_done = false;
+#endif
+
 	return scan;
 }
 
@@ -432,6 +438,55 @@ index_restrpos(IndexScanDesc scan)
 	FunctionCall1(procedure, PointerGetDatum(scan));
 }
 
+static int
+index_prefetch_queue_space(IndexScanDesc scan)
+{
+	if (scan->xs_prefetch_tail < 0)
+		return INDEXSCAN_PREFETCH_COUNT;
+
+	Assert(scan->xs_prefetch_head >= 0);
+
+	return (INDEXSCAN_PREFETCH_COUNT
+			- (scan->xs_prefetch_tail - scan->xs_prefetch_head + 1))
+		% INDEXSCAN_PREFETCH_COUNT;
+}
+
+/* makes copy of ItemPointerData */
+static bool
+index_prefetch_queue_push(IndexScanDesc scan, ItemPointer tid)
+{
+	Assert(index_prefetch_queue_space(scan) > 0);
+
+	if (scan->xs_prefetch_tail == -1)
+		scan->xs_prefetch_head = scan->xs_prefetch_tail = 0;
+	else
+		scan->xs_prefetch_tail =
+			(scan->xs_prefetch_tail + 1) % INDEXSCAN_PREFETCH_COUNT;
+
+	scan->xs_prefetch_queue[scan->xs_prefetch_tail] = *tid;
+
+	return true;
+}
+
+static ItemPointer
+index_prefetch_queue_pop(IndexScanDesc scan)
+{
+	ItemPointer res;
+
+	if (scan->xs_prefetch_head < 0)
+		return NULL;
+
+	res = &scan->xs_prefetch_queue[scan->xs_prefetch_head];
+
+	if (scan->xs_prefetch_head == scan->xs_prefetch_tail)
+		scan->xs_prefetch_head = scan->xs_prefetch_tail = -1;
+	else
+		scan->xs_prefetch_head =
+			(scan->xs_prefetch_head + 1) % INDEXSCAN_PREFETCH_COUNT;
+
+	return res;
+}
+
 /* ----------------
  * index_getnext_tid - get the next TID from a scan
  *
@@ -444,12 +499,52 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
 {
 	FmgrInfo   *procedure;
 	bool		found;
+	ItemPointer	from_queue;
+	BlockNumber	pf_block;
 
 	SCAN_CHECKS;
 	GET_SCAN_PROCEDURE(amgettuple);
 
 	Assert(TransactionIdIsValid(RecentGlobalXmin));
 
+#ifdef USE_PREFETCH
+	while (!scan->xs_done && index_prefetch_queue_space(scan) > 0) {
+		/*
+		 * The AM's amgettuple proc finds the next index entry matching the
+		 * scan keys, and puts the TID into scan->xs_ctup.t_self.  It should
+		 * also set scan->xs_recheck and possibly scan->xs_itup, though we pay
+		 * no attention to those fields here.
+		 */
+		found = DatumGetBool(FunctionCall2(procedure,
+										   PointerGetDatum(scan),
+										   Int32GetDatum(direction)));
+		if (found)
+		{
+			index_prefetch_queue_push(scan, &scan->xs_ctup.t_self);
+			pf_block = ItemPointerGetBlockNumber(&scan->xs_ctup.t_self);
+			/* prefetch only if not the current buffer and not exactly the
+			 * previously prefetched buffer (heuristic random detection)
+			 * because sequential read-ahead would be redundant */
+			if ((!BufferIsValid(scan->xs_cbuf) ||
+				 pf_block != BufferGetBlockNumber(scan->xs_cbuf)) &&
+				pf_block != scan->xs_last_prefetch)
+			{
+				PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, pf_block);
+				scan->xs_last_prefetch = pf_block;
+			}
+		}
+		else
+			scan->xs_done = true;
+	}
+	from_queue = index_prefetch_queue_pop(scan);
+	if (from_queue)
+	{
+		scan->xs_ctup.t_self = *from_queue;
+		found = true;
+	}
+	else
+		found = false;
+#else
 	/*
 	 * The AM's amgettuple proc finds the next index entry matching the scan
 	 * keys, and puts the TID into scan->xs_ctup.t_self.  It should also set
@@ -459,6 +554,7 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
 	found = DatumGetBool(FunctionCall2(procedure,
 									   PointerGetDatum(scan),
 									   Int32GetDatum(direction)));
+#endif
 
 	/* Reset kill flag immediately for safety */
 	scan->kill_prior_tuple = false;
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 3a86ca4..bccc1a4 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -93,6 +93,18 @@ typedef struct IndexScanDescData
 
 	/* state data for traversing HOT chains in index_getnext */
 	bool		xs_continue_hot;	/* T if must keep walking HOT chain */
+
+#ifdef USE_PREFETCH
+# ifndef INDEXSCAN_PREFETCH_COUNT
+#  define INDEXSCAN_PREFETCH_COUNT 32
+# endif
+	/* prefetch queue - ringbuffer */
+	ItemPointerData xs_prefetch_queue[INDEXSCAN_PREFETCH_COUNT];
+	int			xs_prefetch_head;
+	int			xs_prefetch_tail;
+	BlockNumber	xs_last_prefetch;
+	bool		xs_done;
+#endif
 }	IndexScanDescData;
 
 /* Struct for heap-or-index scans of system tables */
-- 
2.0.5
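
The ring buffer in patch 0001 above is easier to follow in isolation. Below
is a stand-alone sketch of the same idea: a fixed array with head and tail
both set to -1 while the queue is empty. BlockQueue and the queue_* functions
are hypothetical names, the payload is a plain block number rather than an
ItemPointerData, and the free-space formula is written differently from the
patch but should give the same results.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_CAPACITY 32		/* mirrors INDEXSCAN_PREFETCH_COUNT */

typedef struct BlockQueue		/* stand-in for the xs_prefetch_* fields */
{
	unsigned	items[QUEUE_CAPACITY];
	int			head;			/* index of oldest entry, -1 when empty */
	int			tail;			/* index of newest entry, -1 when empty */
} BlockQueue;

static void
queue_init(BlockQueue *q)
{
	assert(q != NULL);
	q->head = q->tail = -1;
}

/* Number of free slots, accounting for wrap-around. */
static int
queue_space(const BlockQueue *q)
{
	int			used;

	if (q->tail < 0)
		return QUEUE_CAPACITY;
	used = (q->tail - q->head + QUEUE_CAPACITY) % QUEUE_CAPACITY + 1;
	return QUEUE_CAPACITY - used;
}

/* Returns false instead of asserting when the queue is full. */
static bool
queue_push(BlockQueue *q, unsigned blkno)
{
	if (queue_space(q) == 0)
		return false;
	if (q->tail < 0)
		q->head = q->tail = 0;
	else
		q->tail = (q->tail + 1) % QUEUE_CAPACITY;
	q->items[q->tail] = blkno;
	return true;
}

/* Copies the oldest entry into *blkno; returns false when empty. */
static bool
queue_pop(BlockQueue *q, unsigned *blkno)
{
	if (q->head < 0)
		return false;
	*blkno = q->items[q->head];
	if (q->head == q->tail)
		q->head = q->tail = -1;	/* last entry taken, back to empty */
	else
		q->head = (q->head + 1) % QUEUE_CAPACITY;
	return true;
}
```

Copying the value out on pop is a deliberate difference from the patch, which
returns a pointer into the array. The pointer stays valid there only because
the caller copies the TID before anything else is pushed.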

0002-Fix-index-only-scan-and-rescan.patch (text/x-diff)
From 7cb5839dd7751bcdcae6e4cbf69cfd24af10a694 Mon Sep 17 00:00:00 2001
From: Daniel Bausch <bausch@dvs.tu-darmstadt.de>
Date: Wed, 23 Oct 2013 09:45:11 +0200
Subject: [PATCH 2/4] Fix index-only scan and rescan

Prefetching heap data for index-only scans does not make any sense and
it uses a different field (itup), nevertheless.  Deactivate the prefetch
logic for index-only scans.

Reset xs_done and the queue on rescan, so we find tuples again.
Remember last prefetch to detect correlation.
---
 src/backend/access/index/indexam.c | 85 +++++++++++++++++++++-----------------
 1 file changed, 47 insertions(+), 38 deletions(-)

diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 1c54ef5..d8a4622 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -353,6 +353,12 @@ index_rescan(IndexScanDesc scan,
 
 	scan->kill_prior_tuple = false;		/* for safety */
 
+#ifdef USE_PREFETCH
+	/* I think, it does not hurt to remember xs_last_prefetch */
+	scan->xs_prefetch_head = scan->xs_prefetch_tail = -1;
+	scan->xs_done = false;
+#endif
+
 	FunctionCall5(procedure,
 				  PointerGetDatum(scan),
 				  PointerGetDatum(keys),
@@ -508,7 +514,47 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
 	Assert(TransactionIdIsValid(RecentGlobalXmin));
 
 #ifdef USE_PREFETCH
-	while (!scan->xs_done && index_prefetch_queue_space(scan) > 0) {
+	if (!scan->xs_want_itup)
+	{
+		while (!scan->xs_done && index_prefetch_queue_space(scan) > 0) {
+			/*
+			 * The AM's amgettuple proc finds the next index entry matching
+			 * the scan keys, and puts the TID into scan->xs_ctup.t_self.  It
+			 * should also set scan->xs_recheck and possibly scan->xs_itup,
+			 * though we pay no attention to those fields here.
+			 */
+			found = DatumGetBool(FunctionCall2(procedure,
+											   PointerGetDatum(scan),
+											   Int32GetDatum(direction)));
+			if (found)
+			{
+				index_prefetch_queue_push(scan, &scan->xs_ctup.t_self);
+				pf_block = ItemPointerGetBlockNumber(&scan->xs_ctup.t_self);
+				/* prefetch only if not the current buffer and not exactly the
+				 * previously prefetched buffer (heuristic random detection)
+				 * because sequential read-ahead would be redundant */
+				if ((!BufferIsValid(scan->xs_cbuf) ||
+					 pf_block != BufferGetBlockNumber(scan->xs_cbuf)) &&
+					pf_block != scan->xs_last_prefetch)
+				{
+					PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, pf_block);
+					scan->xs_last_prefetch = pf_block;
+				}
+			}
+			else
+				scan->xs_done = true;
+		}
+		from_queue = index_prefetch_queue_pop(scan);
+		if (from_queue)
+		{
+			scan->xs_ctup.t_self = *from_queue;
+			found = true;
+		}
+		else
+			found = false;
+	}
+	else
+#endif
 		/*
 		 * The AM's amgettuple proc finds the next index entry matching the
 		 * scan keys, and puts the TID into scan->xs_ctup.t_self.  It should
@@ -518,43 +564,6 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
 		found = DatumGetBool(FunctionCall2(procedure,
 										   PointerGetDatum(scan),
 										   Int32GetDatum(direction)));
-		if (found)
-		{
-			index_prefetch_queue_push(scan, &scan->xs_ctup.t_self);
-			pf_block = ItemPointerGetBlockNumber(&scan->xs_ctup.t_self);
-			/* prefetch only if not the current buffer and not exactly the
-			 * previously prefetched buffer (heuristic random detection)
-			 * because sequential read-ahead would be redundant */
-			if ((!BufferIsValid(scan->xs_cbuf) ||
-				 pf_block != BufferGetBlockNumber(scan->xs_cbuf)) &&
-				pf_block != scan->xs_last_prefetch)
-			{
-				PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, pf_block);
-				scan->xs_last_prefetch = pf_block;
-			}
-		}
-		else
-			scan->xs_done = true;
-	}
-	from_queue = index_prefetch_queue_pop(scan);
-	if (from_queue)
-	{
-		scan->xs_ctup.t_self = *from_queue;
-		found = true;
-	}
-	else
-		found = false;
-#else
-	/*
-	 * The AM's amgettuple proc finds the next index entry matching the scan
-	 * keys, and puts the TID into scan->xs_ctup.t_self.  It should also set
-	 * scan->xs_recheck and possibly scan->xs_itup, though we pay no attention
-	 * to those fields here.
-	 */
-	found = DatumGetBool(FunctionCall2(procedure,
-									   PointerGetDatum(scan),
-									   Int32GetDatum(direction)));
-#endif
 
 	/* Reset kill flag immediately for safety */
 	scan->kill_prior_tuple = false;
-- 
2.0.5

0003-First-try-on-tuple-look-ahead-in-nestloop.patch (text/x-diff)
From d8b1533955e3471fb2eb6a030619dcbc258955a8 Mon Sep 17 00:00:00 2001
From: Daniel Bausch <bausch@dvs.tu-darmstadt.de>
Date: Mon, 28 Oct 2013 10:43:16 +0100
Subject: [PATCH 3/4] First try on tuple look-ahead in nestloop

Similarly to the prefetching logic just added to the index scan, look
ahead tuples in the outer loop of a nested loop scan.  For every tuple
looked ahead issue an explicit request for prefetching to the inner
plan.  Modify the index scan to react on this request.
---
 src/backend/access/index/indexam.c   |  81 +++++++++-----
 src/backend/executor/execProcnode.c  |  36 +++++++
 src/backend/executor/nodeIndexscan.c |  16 +++
 src/backend/executor/nodeNestloop.c  | 200 ++++++++++++++++++++++++++++++++++-
 src/include/access/genam.h           |   4 +
 src/include/executor/executor.h      |   3 +
 src/include/executor/nodeIndexscan.h |   1 +
 src/include/nodes/execnodes.h        |  12 +++
 8 files changed, 323 insertions(+), 30 deletions(-)

diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index d8a4622..5f44dec 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -493,6 +493,57 @@ index_prefetch_queue_pop(IndexScanDesc scan)
 	return res;
 }
 
+#ifdef USE_PREFETCH
+int
+index_prefetch(IndexScanDesc scan, int maxPrefetch, ScanDirection direction)
+{
+	FmgrInfo   *procedure;
+	int			numPrefetched = 0;
+	bool		found;
+	BlockNumber	pf_block;
+	FILE	   *logfile;
+
+	GET_SCAN_PROCEDURE(amgettuple);
+
+	while (numPrefetched < maxPrefetch && !scan->xs_done &&
+		   index_prefetch_queue_space(scan) > 0)
+	{
+		/*
+		 * The AM's amgettuple proc finds the next index entry matching the
+		 * scan keys, and puts the TID into scan->xs_ctup.t_self.  It should
+		 * also set scan->xs_recheck and possibly scan->xs_itup, though we pay
+		 * no attention to those fields here.
+		 */
+		found = DatumGetBool(FunctionCall2(procedure,
+										   PointerGetDatum(scan),
+										   Int32GetDatum(direction)));
+		if (found)
+		{
+			index_prefetch_queue_push(scan, &scan->xs_ctup.t_self);
+			pf_block = ItemPointerGetBlockNumber(&scan->xs_ctup.t_self);
+
+			/*
+			 * Prefetch only if not the current buffer and not exactly the
+			 * previously prefetched buffer (heuristic random detection)
+			 * because sequential read-ahead would be redundant
+			 */
+			if ((!BufferIsValid(scan->xs_cbuf) ||
+				 pf_block != BufferGetBlockNumber(scan->xs_cbuf)) &&
+				pf_block != scan->xs_last_prefetch)
+			{
+				PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, pf_block);
+				scan->xs_last_prefetch = pf_block;
+				numPrefetched++;
+			}
+		}
+		else
+			scan->xs_done = true;
+	}
+
+	return numPrefetched;
+}
+#endif
+
 /* ----------------
  * index_getnext_tid - get the next TID from a scan
  *
@@ -506,7 +557,6 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
 	FmgrInfo   *procedure;
 	bool		found;
 	ItemPointer	from_queue;
-	BlockNumber	pf_block;
 
 	SCAN_CHECKS;
 	GET_SCAN_PROCEDURE(amgettuple);
@@ -516,34 +566,7 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
 #ifdef USE_PREFETCH
 	if (!scan->xs_want_itup)
 	{
-		while (!scan->xs_done && index_prefetch_queue_space(scan) > 0) {
-			/*
-			 * The AM's amgettuple proc finds the next index entry matching
-			 * the scan keys, and puts the TID into scan->xs_ctup.t_self.  It
-			 * should also set scan->xs_recheck and possibly scan->xs_itup,
-			 * though we pay no attention to those fields here.
-			 */
-			found = DatumGetBool(FunctionCall2(procedure,
-											   PointerGetDatum(scan),
-											   Int32GetDatum(direction)));
-			if (found)
-			{
-				index_prefetch_queue_push(scan, &scan->xs_ctup.t_self);
-				pf_block = ItemPointerGetBlockNumber(&scan->xs_ctup.t_self);
-				/* prefetch only if not the current buffer and not exactly the
-				 * previously prefetched buffer (heuristic random detection)
-				 * because sequential read-ahead would be redundant */
-				if ((!BufferIsValid(scan->xs_cbuf) ||
-					 pf_block != BufferGetBlockNumber(scan->xs_cbuf)) &&
-					pf_block != scan->xs_last_prefetch)
-				{
-					PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, pf_block);
-					scan->xs_last_prefetch = pf_block;
-				}
-			}
-			else
-				scan->xs_done = true;
-		}
+		index_prefetch(scan, INDEXSCAN_PREFETCH_COUNT, direction);
 		from_queue = index_prefetch_queue_pop(scan);
 		if (from_queue)
 		{
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 76dd62f..a8f2c90 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -741,3 +741,39 @@ ExecEndNode(PlanState *node)
 			break;
 	}
 }
+
+
+#ifdef USE_PREFETCH
+/* ----------------------------------------------------------------
+ *		ExecPrefetchNode
+ *
+ *		Request explicit prefetching from a subtree/node without
+ *		actually forming a tuple.
+ *
+ *		The node shall request at most 'maxPrefetch' pages being
+ *		prefetched.
+ *
+ *		The function returns how many pages have been requested.
+ *
+ *		Calling this function for a type that does not support
+ *		prefetching is not an error.  It just returns 0 as if no
+ *		prefetching was possible.
+ * ----------------------------------------------------------------
+ */
+int
+ExecPrefetchNode(PlanState *node, int maxPrefetch)
+{
+	if (node == NULL)
+		return 0;
+
+	switch (nodeTag(node))
+	{
+		case T_IndexScanState:
+			return ExecPrefetchIndexScan((IndexScanState *) node,
+										 maxPrefetch);
+
+		default:
+			return 0;
+	}
+}
+#endif
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index f1062f1..bab0e7a 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -192,6 +192,22 @@ ExecReScanIndexScan(IndexScanState *node)
 	ExecScanReScan(&node->ss);
 }
 
+#ifdef USE_PREFETCH
+/* ----------------------------------------------------------------
+ *		ExecPrefetchIndexScan(node, maxPrefetch)
+ *
+ *		Trigger prefetching of index scan without actually fetching
+ *		a tuple.
+ * ----------------------------------------------------------------
+ */
+int
+ExecPrefetchIndexScan(IndexScanState *node, int maxPrefetch)
+{
+	return index_prefetch(node->iss_ScanDesc, maxPrefetch,
+						  node->ss.ps.state->es_direction);
+}
+#endif
+
 
 /*
  * ExecIndexEvalRuntimeKeys
diff --git a/src/backend/executor/nodeNestloop.c b/src/backend/executor/nodeNestloop.c
index c7a08ed..21ad5f8 100644
--- a/src/backend/executor/nodeNestloop.c
+++ b/src/backend/executor/nodeNestloop.c
@@ -25,6 +25,90 @@
 #include "executor/nodeNestloop.h"
 #include "utils/memutils.h"
 
+#ifdef USE_PREFETCH
+static int
+NestLoopLookAheadQueueSpace(NestLoopState *node)
+{
+	if (node->nl_lookAheadQueueTail < 0)
+		return NESTLOOP_PREFETCH_COUNT;
+
+	Assert(node->nl_lookAheadQueueHead >= 0);
+
+	return (NESTLOOP_PREFETCH_COUNT
+			- (node->nl_lookAheadQueueTail - node->nl_lookAheadQueueHead + 1))
+		% NESTLOOP_PREFETCH_COUNT;
+}
+
+/* makes materialized copy of tuple table slot */
+static bool
+NestLoopLookAheadQueuePush(NestLoopState *node, TupleTableSlot *tuple)
+{
+	TupleTableSlot **queueEntry;
+
+	Assert(NestLoopLookAheadQueueSpace(node) > 0);
+
+	if (node->nl_lookAheadQueueTail == -1)
+		node->nl_lookAheadQueueHead = node->nl_lookAheadQueueTail = 0;
+	else
+		node->nl_lookAheadQueueTail =
+			(node->nl_lookAheadQueueTail +1) % NESTLOOP_PREFETCH_COUNT;
+
+	queueEntry = &node->nl_lookAheadQueue[node->nl_lookAheadQueueTail];
+
+	if (!(*queueEntry))
+	{
+		*queueEntry = ExecInitExtraTupleSlot(node->js.ps.state);
+		ExecSetSlotDescriptor(*queueEntry,
+							  ExecGetResultType(outerPlanState(node)));
+	}
+
+	ExecCopySlot(*queueEntry, tuple);
+
+	return true;
+}
+
+static TupleTableSlot *
+NestLoopLookAheadQueuePop(NestLoopState *node)
+{
+	TupleTableSlot *res;
+
+	if (node->nl_lookAheadQueueHead < 0)
+		return NULL;
+
+	res = node->nl_lookAheadQueue[node->nl_lookAheadQueueHead];
+
+	if (node->nl_lookAheadQueueHead == node->nl_lookAheadQueueTail)
+		node->nl_lookAheadQueueHead = node->nl_lookAheadQueueTail = -1;
+	else
+		node->nl_lookAheadQueueHead =
+			(node->nl_lookAheadQueueHead + 1) % NESTLOOP_PREFETCH_COUNT;
+
+	return res;
+}
+
+static void
+NestLoopLookAheadQueueClear(NestLoopState *node)
+{
+	TupleTableSlot *lookAheadTuple;
+	int		i;
+
+	/*
+	 * As we do not clear the tuple table slots on pop, we need to scan the
+	 * whole array, regardless of the current queue fill.
+	 *
+	 * We cannot really free the slot, as there is no well defined interface
+	 * for that, but the emptied slots will be freed when the query ends.
+	 */
+	for (i = 0; i < NESTLOOP_PREFETCH_COUNT; i++)
+	{
+		lookAheadTuple = node->nl_lookAheadQueue[i];
+		/* look only on pointer - all non NULL fields are non-empty */
+		if (lookAheadTuple)
+			ExecClearTuple(lookAheadTuple);
+	}
+
+}
+#endif /* USE_PREFETCH */
 
 /* ----------------------------------------------------------------
  *		ExecNestLoop(node)
@@ -120,7 +204,87 @@ ExecNestLoop(NestLoopState *node)
 		if (node->nl_NeedNewOuter)
 		{
 			ENL1_printf("getting new outer tuple");
-			outerTupleSlot = ExecProcNode(outerPlan);
+
+#ifdef USE_PREFETCH
+			/*
+			 * While we have outer tuples and were not able to request enough
+			 * prefetching from the inner plan to properly load the system,
+			 * request more outer tuples and inner prefetching for them.
+			 *
+			 * Unfortunately we can do outer look-ahead directed prefetching
+			 * only when we are rescanning the inner plan anyway; otherwise we
+			 * would break the inner scan.  Only an independent copy of the
+			 * inner plan state would allow us to prefetch across inner loops
+			 * regardless of inner scan position.
+			 */
+			while (!node->nl_lookAheadDone &&
+				   node->nl_numInnerPrefetched < NESTLOOP_PREFETCH_COUNT &&
+				   NestLoopLookAheadQueueSpace(node) > 0)
+			{
+				TupleTableSlot *lookAheadTupleSlot = ExecProcNode(outerPlan);
+
+				if (!TupIsNull(lookAheadTupleSlot))
+				{
+					NestLoopLookAheadQueuePush(node, lookAheadTupleSlot);
+
+					/*
+					 * Set inner params according to look-ahead tuple.
+					 *
+					 * Fetch the values of any outer Vars that must be passed
+					 * to the inner scan, and store them in the appropriate
+					 * PARAM_EXEC slots.
+					 */
+					foreach(lc, nl->nestParams)
+					{
+						NestLoopParam *nlp = (NestLoopParam *) lfirst(lc);
+						int			paramno = nlp->paramno;
+						ParamExecData *prm;
+
+						prm = &(econtext->ecxt_param_exec_vals[paramno]);
+						/* Param value should be an OUTER_VAR var */
+						Assert(IsA(nlp->paramval, Var));
+						Assert(nlp->paramval->varno == OUTER_VAR);
+						Assert(nlp->paramval->varattno > 0);
+						prm->value = slot_getattr(lookAheadTupleSlot,
+												  nlp->paramval->varattno,
+												  &(prm->isnull));
+						/* Flag parameter value as changed */
+						innerPlan->chgParam =
+							bms_add_member(innerPlan->chgParam, paramno);
+					}
+
+					/*
+					 * Rescan inner plan with changed parameters and request
+					 * explicit prefetch.  Limit the inner prefetch amount
+					 * according to our own bookkeeping.
+					 *
+					 * When the so processed outer tuple gets finally active
+					 * in the inner loop, the inner plan will autonomously
+					 * prefetch the same tuples again.  This is redundant but
+					 * avoiding that seems too complicated for now.  It should
+					 * not hurt too much and may even help in case the
+					 * prefetched blocks have been evicted again in the
+					 * meantime.
+					 */
+					ExecReScan(innerPlan);
+					node->nl_numInnerPrefetched +=
+						ExecPrefetchNode(innerPlan,
+										 NESTLOOP_PREFETCH_COUNT -
+										 node->nl_numInnerPrefetched);
+				}
+				else
+					node->nl_lookAheadDone = true; /* outer plan exhausted */
+			}
+
+			/*
+			 * If there is already the next outerPlan in our look-ahead queue,
+			 * get the next outer tuple from there, otherwise execute the
+			 * outer plan.
+			 */
+			outerTupleSlot = NestLoopLookAheadQueuePop(node);
+			if (TupIsNull(outerTupleSlot) && !node->nl_lookAheadDone)
+#endif /* USE_PREFETCH */
+				outerTupleSlot = ExecProcNode(outerPlan);
 
 			/*
 			 * if there are no more outer tuples, then the join is complete..
@@ -174,6 +338,18 @@ ExecNestLoop(NestLoopState *node)
 		innerTupleSlot = ExecProcNode(innerPlan);
 		econtext->ecxt_innertuple = innerTupleSlot;
 
+#ifdef USE_PREFETCH
+		/*
+		 * Decrement prefetch counter as we consume inner tuples.  We need to
+		 * check for >0 because prefetching might not have happened for the
+		 * consumed tuple, maybe because explicit prefetching is not supported
+		 * by the inner plan or because the explicit prefetching requested by
+		 * us is exhausted and the inner plan is doing it on its own now.
+		 */
+		if (node->nl_numInnerPrefetched > 0)
+			node->nl_numInnerPrefetched--;
+#endif
+
 		if (TupIsNull(innerTupleSlot))
 		{
 			ENL1_printf("no inner tuple, need new outer tuple");
@@ -296,6 +472,9 @@ NestLoopState *
 ExecInitNestLoop(NestLoop *node, EState *estate, int eflags)
 {
 	NestLoopState *nlstate;
+#ifdef USE_PREFETCH
+	int i;
+#endif
 
 	/* check for unsupported flags */
 	Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
@@ -381,6 +560,15 @@ ExecInitNestLoop(NestLoop *node, EState *estate, int eflags)
 	nlstate->nl_NeedNewOuter = true;
 	nlstate->nl_MatchedOuter = false;
 
+#ifdef USE_PREFETCH
+	nlstate->nl_lookAheadQueueHead = nlstate->nl_lookAheadQueueTail = -1;
+	nlstate->nl_lookAheadDone = false;
+	nlstate->nl_numInnerPrefetched = 0;
+
+	for (i = 0; i < NESTLOOP_PREFETCH_COUNT; i++)
+		nlstate->nl_lookAheadQueue[i] = NULL;
+#endif
+
 	NL1_printf("ExecInitNestLoop: %s\n",
 			   "node initialized");
 
@@ -409,6 +597,10 @@ ExecEndNestLoop(NestLoopState *node)
 	 */
 	ExecClearTuple(node->js.ps.ps_ResultTupleSlot);
 
+#ifdef USE_PREFETCH
+	NestLoopLookAheadQueueClear(node);
+#endif
+
 	/*
 	 * close down subplans
 	 */
@@ -444,4 +636,10 @@ ExecReScanNestLoop(NestLoopState *node)
 	node->js.ps.ps_TupFromTlist = false;
 	node->nl_NeedNewOuter = true;
 	node->nl_MatchedOuter = false;
+
+#ifdef USE_PREFETCH
+	NestLoopLookAheadQueueClear(node);
+	node->nl_lookAheadDone = false;
+	node->nl_numInnerPrefetched = 0;
+#endif
 }
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index a800041..7733b3c 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -146,6 +146,10 @@ extern void index_markpos(IndexScanDesc scan);
 extern void index_restrpos(IndexScanDesc scan);
 extern ItemPointer index_getnext_tid(IndexScanDesc scan,
 				  ScanDirection direction);
+#ifdef USE_PREFETCH
+extern int index_prefetch(IndexScanDesc scan, int maxPrefetch,
+						  ScanDirection direction);
+#endif
 extern HeapTuple index_fetch_heap(IndexScanDesc scan);
 extern HeapTuple index_getnext(IndexScanDesc scan, ScanDirection direction);
 extern int64 index_getbitmap(IndexScanDesc scan, TIDBitmap *bitmap);
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 75841c8..88d0522 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -221,6 +221,9 @@ extern PlanState *ExecInitNode(Plan *node, EState *estate, int eflags);
 extern TupleTableSlot *ExecProcNode(PlanState *node);
 extern Node *MultiExecProcNode(PlanState *node);
 extern void ExecEndNode(PlanState *node);
+#ifdef USE_PREFETCH
+extern int ExecPrefetchNode(PlanState *node, int maxPrefetch);
+#endif
 
 /*
  * prototypes from functions in execQual.c
diff --git a/src/include/executor/nodeIndexscan.h b/src/include/executor/nodeIndexscan.h
index 71dbd9c..f93632c 100644
--- a/src/include/executor/nodeIndexscan.h
+++ b/src/include/executor/nodeIndexscan.h
@@ -18,6 +18,7 @@
 
 extern IndexScanState *ExecInitIndexScan(IndexScan *node, EState *estate, int eflags);
 extern TupleTableSlot *ExecIndexScan(IndexScanState *node);
+extern int ExecPrefetchIndexScan(IndexScanState *node, int maxPrefetch);
 extern void ExecEndIndexScan(IndexScanState *node);
 extern void ExecIndexMarkPos(IndexScanState *node);
 extern void ExecIndexRestrPos(IndexScanState *node);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 3b430e0..27fe65d 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1526,6 +1526,18 @@ typedef struct NestLoopState
 	bool		nl_NeedNewOuter;
 	bool		nl_MatchedOuter;
 	TupleTableSlot *nl_NullInnerTupleSlot;
+
+#ifdef USE_PREFETCH
+# ifndef NESTLOOP_PREFETCH_COUNT
+#  define NESTLOOP_PREFETCH_COUNT 32
+# endif
+	/* look-ahead queue (for prefetching) - ringbuffer */
+	TupleTableSlot *nl_lookAheadQueue[NESTLOOP_PREFETCH_COUNT];
+	int			nl_lookAheadQueueHead;
+	int			nl_lookAheadQueueTail;
+	bool		nl_lookAheadDone;
+	int			nl_numInnerPrefetched;
+#endif
 } NestLoopState;
 
 /* ----------------
-- 
2.0.5
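
The look-ahead bookkeeping in patch 0003 above can be reduced to a few
counters. The following is a heavily simplified sketch, not the executor
code. LookAhead, ToyPlan and all functions are hypothetical, the outer plan
is just a countdown, and every outer tuple is assumed to need the same number
of inner pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PREFETCH_BUDGET 32		/* mirrors NESTLOOP_PREFETCH_COUNT */

/* Simplified stand-in for the look-ahead fields of NestLoopState. */
typedef struct LookAhead
{
	int			queued_outer;		/* outer tuples waiting in the queue */
	int			inner_prefetched;	/* inner requests not yet consumed */
	bool		outer_done;			/* outer side exhausted */
} LookAhead;

/* Toy plans: a fixed number of outer tuples, same page count for each. */
typedef struct ToyPlan
{
	int			outer_remaining;
	int			pages_per_outer;
} ToyPlan;

static bool
toy_next_outer(ToyPlan *plan)
{
	if (plan->outer_remaining == 0)
		return false;
	plan->outer_remaining--;
	return true;
}

/* Request up to max_pages inner pages; return how many were requested. */
static int
toy_prefetch_inner(const ToyPlan *plan, int max_pages)
{
	return plan->pages_per_outer < max_pages ? plan->pages_per_outer
											 : max_pages;
}

/*
 * Pull outer tuples ahead of time until the prefetch budget is used up,
 * the queue is full, or the outer side ends.  Returns the number of outer
 * tuples pulled by this call.
 */
static int
look_ahead_fill(LookAhead *la, ToyPlan *plan)
{
	int			pulled = 0;

	assert(la != NULL && plan != NULL);

	while (!la->outer_done &&
		   la->inner_prefetched < PREFETCH_BUDGET &&
		   la->queued_outer < PREFETCH_BUDGET)
	{
		if (!toy_next_outer(plan))
		{
			la->outer_done = true;
			break;
		}
		la->queued_outer++;
		la->inner_prefetched +=
			toy_prefetch_inner(plan, PREFETCH_BUDGET - la->inner_prefetched);
		pulled++;
	}
	return pulled;
}

/* Called for each inner tuple consumed; frees one unit of budget. */
static void
consume_inner_tuple(LookAhead *la)
{
	if (la->inner_prefetched > 0)
		la->inner_prefetched--;
}
```

As in the patch, the budget is counted in requested pages but released per
consumed inner tuple, so it is only an approximation of what is actually in
flight. The sketch also leaves out popping outer tuples from the queue and
the ExecReScan of the inner plan, which is where most of the real complexity
lies.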

0004-Limit-recursive-prefetching-for-merge-join.patch (text/x-diff)
From a1fcab2d9d001505a5fc25accdca71e88148e4ff Mon Sep 17 00:00:00 2001
From: Daniel Bausch <bausch@dvs.tu-darmstadt.de>
Date: Tue, 29 Oct 2013 16:41:09 +0100
Subject: [PATCH 4/4] Limit recursive prefetching for merge join

Add switch facility to limit the prefetching of a subtree recursively.
In a first try add support for some variants of merge join.  Distribute
the prefetch allowance evenly between outer and inner subplan.
---
 src/backend/access/index/indexam.c   |  5 +++-
 src/backend/executor/execProcnode.c  | 47 +++++++++++++++++++++++++++++++++++-
 src/backend/executor/nodeAgg.c       | 10 ++++++++
 src/backend/executor/nodeIndexscan.c | 18 ++++++++++++++
 src/backend/executor/nodeMaterial.c  | 14 +++++++++++
 src/backend/executor/nodeMergejoin.c | 22 +++++++++++++++++
 src/include/access/relscan.h         |  1 +
 src/include/executor/executor.h      |  1 +
 src/include/executor/nodeAgg.h       |  3 +++
 src/include/executor/nodeIndexscan.h |  3 +++
 src/include/executor/nodeMaterial.h  |  3 +++
 src/include/executor/nodeMergejoin.h |  3 +++
 src/include/nodes/execnodes.h        |  6 +++++
 13 files changed, 134 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 5f44dec..354bde6 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -255,6 +255,7 @@ index_beginscan(Relation heapRelation,
 	scan->xs_prefetch_head = scan->xs_prefetch_tail = -1;
 	scan->xs_last_prefetch = -1;
 	scan->xs_done = false;
+	scan->xs_prefetch_limit = INDEXSCAN_PREFETCH_COUNT;
 #endif
 
 	return scan;
@@ -506,7 +507,9 @@ index_prefetch(IndexScanDesc scan, int maxPrefetch, ScanDirection direction)
 	GET_SCAN_PROCEDURE(amgettuple);
 
 	while (numPrefetched < maxPrefetch && !scan->xs_done &&
-		   index_prefetch_queue_space(scan) > 0)
+		   index_prefetch_queue_space(scan) > 0 &&
+		   index_prefetch_queue_space(scan) >
+		   (INDEXSCAN_PREFETCH_COUNT - scan->xs_prefetch_limit))
 	{
 		/*
 		 * The AM's amgettuple proc finds the next index entry matching the
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index a8f2c90..a14a0d0 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -745,6 +745,51 @@ ExecEndNode(PlanState *node)
 
 #ifdef USE_PREFETCH
 /* ----------------------------------------------------------------
+ *		ExecLimitPrefetchNode
+ *
+ *		Limit the amount of prefetching that may be requested by
+ *		a subplan.
+ *
+ *		Most of the handlers just pass-through the received value
+ *		to their subplans.  That is the case, when they have just
+ *		one subplan that might prefetch.  If they have two subplans
+ *		intelligent heuristics need to be applied to distribute the
+ *		prefetch allowance in a way delivering overall advantage.
+ * ----------------------------------------------------------------
+ */
+void
+ExecLimitPrefetchNode(PlanState *node, int limit)
+{
+	if (node == NULL)
+		return;
+
+	switch (nodeTag(node))
+	{
+		case T_IndexScanState:
+			ExecLimitPrefetchIndexScan((IndexScanState *) node, limit);
+			break;
+
+		case T_MergeJoinState:
+			ExecLimitPrefetchMergeJoin((MergeJoinState *) node, limit);
+			break;
+
+		case T_MaterialState:
+			ExecLimitPrefetchMaterial((MaterialState *) node, limit);
+			break;
+
+		case T_AggState:
+			ExecLimitPrefetchAgg((AggState *) node, limit);
+			break;
+
+		default:
+			elog(INFO,
+				 "missing ExecLimitPrefetchNode handler for node type: %d",
+				 (int) nodeTag(node));
+			break;
+	}
+}
+
+/* ----------------------------------------------------------------
  *		ExecPrefetchNode
  *
  *		Request explicit prefetching from a subtree/node without
@@ -776,4 +821,4 @@ ExecPrefetchNode(PlanState *node, int maxPrefetch)
 			return 0;
 	}
 }
-#endif
+#endif /* USE_PREFETCH */
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index e02a6ff..94f6d77 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -1877,6 +1877,16 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
 	return aggstate;
 }
 
+#ifdef USE_PREFETCH
+void
+ExecLimitPrefetchAgg(AggState *node, int limit)
+{
+	Assert(node != NULL);
+
+	ExecLimitPrefetchNode(outerPlanState(node), limit);
+}
+#endif
+
 static Datum
 GetAggInitVal(Datum textInitVal, Oid transtype)
 {
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index bab0e7a..6ea236e 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -640,6 +640,24 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
 	return indexstate;
 }
 
+#ifdef USE_PREFETCH
+/* ----------------------------------------------------------------
+ *		ExecLimitPrefetchIndexScan
+ *
+ *		Sets/changes the number of tuples whose pages to request in
+ *		advance.
+ * ----------------------------------------------------------------
+ */
+void
+ExecLimitPrefetchIndexScan(IndexScanState *node, int limit)
+{
+	Assert(node != NULL);
+	Assert(node->iss_ScanDesc != NULL);
+
+	node->iss_ScanDesc->xs_prefetch_limit = limit;
+}
+#endif
+
 
 /*
  * ExecIndexBuildScanKeys
diff --git a/src/backend/executor/nodeMaterial.c b/src/backend/executor/nodeMaterial.c
index 7a82f56..3370362 100644
--- a/src/backend/executor/nodeMaterial.c
+++ b/src/backend/executor/nodeMaterial.c
@@ -232,6 +232,20 @@ ExecInitMaterial(Material *node, EState *estate, int eflags)
 	return matstate;
 }
 
+#ifdef USE_PREFETCH
+/* ----------------------------------------------------------------
+ *		ExecLimitPrefetchMaterial
+ * ----------------------------------------------------------------
+ */
+void
+ExecLimitPrefetchMaterial(MaterialState *node, int limit)
+{
+	Assert(node != NULL);
+
+	ExecLimitPrefetchNode(outerPlanState(node), limit);
+}
+#endif
+
 /* ----------------------------------------------------------------
  *		ExecEndMaterial
  * ----------------------------------------------------------------
diff --git a/src/backend/executor/nodeMergejoin.c b/src/backend/executor/nodeMergejoin.c
index e69bc64..f25e074 100644
--- a/src/backend/executor/nodeMergejoin.c
+++ b/src/backend/executor/nodeMergejoin.c
@@ -1627,6 +1627,10 @@ ExecInitMergeJoin(MergeJoin *node, EState *estate, int eflags)
 	mergestate->mj_OuterTupleSlot = NULL;
 	mergestate->mj_InnerTupleSlot = NULL;
 
+#ifdef USE_PREFETCH
+	ExecLimitPrefetchMergeJoin(mergestate, MERGEJOIN_PREFETCH_COUNT);
+#endif
+
 	/*
 	 * initialization successful
 	 */
@@ -1636,6 +1640,24 @@ ExecInitMergeJoin(MergeJoin *node, EState *estate, int eflags)
 	return mergestate;
 }
 
+#ifdef USE_PREFETCH
+/* ----------------------------------------------------------------
+ *		ExecLimitPrefetchMergeJoin
+ * ----------------------------------------------------------------
+ */
+void
+ExecLimitPrefetchMergeJoin(MergeJoinState *node, int limit)
+{
+	int outerLimit = limit/2;
+	int innerLimit = limit/2;
+
+	Assert(node != NULL);
+
+	ExecLimitPrefetchNode(outerPlanState(node), outerLimit);
+	ExecLimitPrefetchNode(innerPlanState(node), innerLimit);
+}
+#endif
+
 /* ----------------------------------------------------------------
  *		ExecEndMergeJoin
  *
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index bccc1a4..3297900 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -104,6 +104,7 @@ typedef struct IndexScanDescData
 	int			xs_prefetch_tail;
 	BlockNumber	xs_last_prefetch;
 	bool		xs_done;
+	int			xs_prefetch_limit;
 #endif
 }	IndexScanDescData;
 
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 88d0522..09b94e0 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -222,6 +222,7 @@ extern TupleTableSlot *ExecProcNode(PlanState *node);
 extern Node *MultiExecProcNode(PlanState *node);
 extern void ExecEndNode(PlanState *node);
 #ifdef USE_PREFETCH
+extern void ExecLimitPrefetchNode(PlanState *node, int limit);
 extern int ExecPrefetchNode(PlanState *node, int maxPrefetch);
 #endif
 
diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index 38823d6..f775ec8 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -17,6 +17,9 @@
 #include "nodes/execnodes.h"
 
 extern AggState *ExecInitAgg(Agg *node, EState *estate, int eflags);
+#ifdef USE_PREFETCH
+extern void ExecLimitPrefetchAgg(AggState *node, int limit);
+#endif
 extern TupleTableSlot *ExecAgg(AggState *node);
 extern void ExecEndAgg(AggState *node);
 extern void ExecReScanAgg(AggState *node);
diff --git a/src/include/executor/nodeIndexscan.h b/src/include/executor/nodeIndexscan.h
index f93632c..ccf3121 100644
--- a/src/include/executor/nodeIndexscan.h
+++ b/src/include/executor/nodeIndexscan.h
@@ -17,6 +17,9 @@
 #include "nodes/execnodes.h"
 
 extern IndexScanState *ExecInitIndexScan(IndexScan *node, EState *estate, int eflags);
+#ifdef USE_PREFETCH
+extern void ExecLimitPrefetchIndexScan(IndexScanState *node, int limit);
+#endif
 extern TupleTableSlot *ExecIndexScan(IndexScanState *node);
 extern int ExecPrefetchIndexScan(IndexScanState *node, int maxPrefetch);
 extern void ExecEndIndexScan(IndexScanState *node);
diff --git a/src/include/executor/nodeMaterial.h b/src/include/executor/nodeMaterial.h
index cfca0a5..5c81fe8 100644
--- a/src/include/executor/nodeMaterial.h
+++ b/src/include/executor/nodeMaterial.h
@@ -17,6 +17,9 @@
 #include "nodes/execnodes.h"
 
 extern MaterialState *ExecInitMaterial(Material *node, EState *estate, int eflags);
+#ifdef USE_PREFETCH
+extern void ExecLimitPrefetchMaterial(MaterialState *node, int limit);
+#endif
 extern TupleTableSlot *ExecMaterial(MaterialState *node);
 extern void ExecEndMaterial(MaterialState *node);
 extern void ExecMaterialMarkPos(MaterialState *node);
diff --git a/src/include/executor/nodeMergejoin.h b/src/include/executor/nodeMergejoin.h
index fa6b5e0..e402b42 100644
--- a/src/include/executor/nodeMergejoin.h
+++ b/src/include/executor/nodeMergejoin.h
@@ -17,6 +17,9 @@
 #include "nodes/execnodes.h"
 
 extern MergeJoinState *ExecInitMergeJoin(MergeJoin *node, EState *estate, int eflags);
+#ifdef USE_PREFETCH
+extern void ExecLimitPrefetchMergeJoin(MergeJoinState *node, int limit);
+#endif
 extern TupleTableSlot *ExecMergeJoin(MergeJoinState *node);
 extern void ExecEndMergeJoin(MergeJoinState *node);
 extern void ExecReScanMergeJoin(MergeJoinState *node);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 27fe65d..64ed6fb 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1585,6 +1585,12 @@ typedef struct MergeJoinState
 	ExprContext *mj_InnerEContext;
 } MergeJoinState;
 
+#ifdef USE_PREFETCH
+# ifndef MERGEJOIN_PREFETCH_COUNT
+#  define MERGEJOIN_PREFETCH_COUNT 32
+# endif
+#endif
+
 /* ----------------
  *	 HashJoinState information
  *
-- 
2.0.5
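
As a reading aid for the patch above, here is a minimal standalone sketch of how the prefetch allowance travels down a plan tree. It is not the patch's code: ToyNode, NodeKind and toy_limit_prefetch are hypothetical stand-ins for PlanState, the node tags and ExecLimitPrefetchNode. It only mirrors the rules visible in the diff: single-child nodes pass the value through, a merge join gives each side limit/2, and an index scan stores the value.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the executor's plan-state nodes. */
typedef enum
{
	NODE_INDEX_SCAN,
	NODE_MERGE_JOIN,
	NODE_MATERIAL,
	NODE_AGG
} NodeKind;

typedef struct ToyNode
{
	NodeKind	kind;
	struct ToyNode *outer;		/* left (or only) child, may be NULL */
	struct ToyNode *inner;		/* right child, may be NULL */
	int			prefetch_limit; /* only meaningful for index scans */
} ToyNode;

/* Push a prefetch allowance down the tree, following the patch's rules. */
void
toy_limit_prefetch(ToyNode *node, int limit)
{
	assert(limit >= 0);

	if (node == NULL)
		return;

	switch (node->kind)
	{
		case NODE_INDEX_SCAN:
			/* leaf: remember how many tuples' pages may be requested */
			node->prefetch_limit = limit;
			break;

		case NODE_MERGE_JOIN:
			/* two prefetching children: split the allowance evenly */
			toy_limit_prefetch(node->outer, limit / 2);
			toy_limit_prefetch(node->inner, limit / 2);
			break;

		case NODE_MATERIAL:
		case NODE_AGG:
			/* one child: pass the value through unchanged */
			toy_limit_prefetch(node->outer, limit);
			break;
	}
}
```

With a tree of Agg over MergeJoin over two index scans and a starting limit of 32, each scan ends up with 16. Because the split uses integer division, an odd limit leaves one slot unassigned (33 becomes 16 + 16).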

#150Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#103)
1 attachment(s)
Re: Parallel Seq Scan

On Thu, Jan 22, 2015 at 10:30 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Jan 22, 2015 at 6:37 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

(Please point it out if my understanding is incorrect.)

What happens if a dynamic background worker process tries to reference
temporary tables? Because the buffers of temporary table blocks are
allocated in a private address space, their recent state is not visible
to other processes unless it is flushed to storage every time.

Do we need to prohibit create_parallelscan_paths() from generating a path
when the target relation is a temporary one?

Yes, we need to prohibit parallel scans on temporary relations. Will fix.
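
To make the restriction concrete, here is a minimal standalone sketch of the kind of guard being discussed. It is only an illustration under stated assumptions, not the code in the attached patch: ToyRel, the TOY_PERSISTENCE_* constants and toy_parallel_scan_allowed are hypothetical names. The persistence letters follow PostgreSQL's relpersistence convention, and the degree check reflects the parallel_seqscan_degree GUC, whose default of zero disables parallel seq scan.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical persistence codes, modelled on pg_class.relpersistence. */
#define TOY_PERSISTENCE_PERMANENT	'p'
#define TOY_PERSISTENCE_UNLOGGED	'u'
#define TOY_PERSISTENCE_TEMP		't'

/* Hypothetical, simplified stand-in for a relation descriptor. */
typedef struct ToyRel
{
	char		persistence;	/* one of the TOY_PERSISTENCE_* codes */
	unsigned	nblocks;		/* relation size in blocks */
} ToyRel;

/*
 * Decide whether a parallel seq scan path may be considered at all.
 * Temporary relations are rejected because their buffers live in the
 * owning backend's private memory and are not visible to workers.
 */
bool
toy_parallel_scan_allowed(const ToyRel *rel, int parallel_degree)
{
	assert(rel != NULL);

	if (parallel_degree <= 0)
		return false;			/* feature not configured */

	if (rel->persistence == TOY_PERSISTENCE_TEMP)
		return false;			/* workers cannot see local buffers */

	return true;
}
```

A planner hook along these lines would simply skip path generation for a temporary relation, so the ordinary sequential scan path remains the only candidate.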

Here is the latest patch, which fixes the reported issues and adds
support for Prepared Statements and the Explain Statement for parallel
sequential scan.

The main purpose is to get feedback, if possible, on the overall
structure/design of the code before I go ahead.

Note -
a. it is still based on the parallel-mode-v1 [1] patch of Robert.
b. based on CommitId - fd496129 [on top of this commit, apply
Robert's patch and then the attached patch]
c. only built and tested on Windows; my Linux box has some
problem, will fix that soon and verify this on Linux as well.

[1]: /messages/by-id/CA+TgmoZdUK4K3XHBxc9vM-82khourEZdvQWTfgLhWsd2R2aAGQ@mail.gmail.com

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

parallel_seqscan_v6.patch (application/octet-stream)
diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index 21721b4..823d5c3 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/access
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS	    = brin common gin gist hash heap index nbtree rmgrdesc spgist transam
+SUBDIRS	    = brin common gin gist hash heap index nbtree rmgrdesc shmmq spgist transam
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/common/printtup.c b/src/backend/access/common/printtup.c
index baed981..1afac59 100644
--- a/src/backend/access/common/printtup.c
+++ b/src/backend/access/common/printtup.c
@@ -243,7 +243,19 @@ SendRowDescriptionMessage(TupleDesc typeinfo, List *targetlist, int16 *formats)
 				pq_sendint(&buf, 0, 2);
 		}
 	}
-	pq_endmessage(&buf);
+
+	/*
+	 * Send the message via shared-memory tuple queue, if the same
+	 * is enabled.
+	 */
+	if (is_tuple_shm_mq_enabled())
+	{
+		mq_putmessage_direct(buf.cursor, buf.data, buf.len);
+		pfree(buf.data);
+		buf.data = NULL;
+	}
+	else
+		pq_endmessage(&buf);
 }
 
 /*
@@ -371,7 +383,18 @@ printtup(TupleTableSlot *slot, DestReceiver *self)
 		}
 	}
 
-	pq_endmessage(&buf);
+	/*
+	 * Send the message via shared-memory tuple queue, if the same
+	 * is enabled.
+	 */
+	if (is_tuple_shm_mq_enabled())
+	{
+		mq_putmessage_direct(buf.cursor, buf.data, buf.len);
+		pfree(buf.data);
+		buf.data = NULL;
+	}
+	else
+		pq_endmessage(&buf);
 
 	/* Return to caller's context, and flush row's temporary memory */
 	MemoryContextSwitchTo(oldcontext);
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 57408d3..784d79d 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -307,6 +307,12 @@ heap_setscanlimits(HeapScanDesc scan, BlockNumber startBlk, BlockNumber numBlks)
 	scan->rs_numblocks = numBlks;
 }
 
+void
+heap_setsyncscan(HeapScanDesc scan, bool sync_scan)
+{
+	scan->rs_syncscan = sync_scan;
+}
+
 /*
  * heapgetpage - subroutine for heapgettup()
  *
diff --git a/src/backend/access/shmmq/Makefile b/src/backend/access/shmmq/Makefile
new file mode 100644
index 0000000..aeae8d9
--- /dev/null
+++ b/src/backend/access/shmmq/Makefile
@@ -0,0 +1,17 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for access/shmmq
+#
+# IDENTIFICATION
+#    src/backend/access/shmmq/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/shmmq
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = shmmqam.o 
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/shmmq/shmmqam.c b/src/backend/access/shmmq/shmmqam.c
new file mode 100644
index 0000000..758d7e8
--- /dev/null
+++ b/src/backend/access/shmmq/shmmqam.c
@@ -0,0 +1,375 @@
+/*-------------------------------------------------------------------------
+ *
+ * shmmqam.c
+ *	  shared memory queue access method code
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/shmmq/shmmqam.c
+ *
+ *
+ * INTERFACE ROUTINES
+ *		shm_getnext	- retrieve next tuple in queue
+ *
+ * NOTES
+ *	  This file contains the shm_ routines which implement
+ *	  the shared memory queue access method used to fetch tuples
+ *	  from parallel worker backends.
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup.h"
+#include "access/htup_details.h"
+#include "access/shmmqam.h"
+#include "access/tupdesc.h"
+#include "fmgr.h"
+#include "libpq/libpq.h"
+#include "libpq/pqformat.h"
+#include "utils/lsyscache.h"
+
+
+static bool
+HandleParallelTupleMessage(worker_result resultState, TupleDesc tupdesc,
+						   StringInfo msg, int queueId);
+static HeapTuple
+form_result_tuple(worker_result resultState, TupleDesc tupdesc,
+				  StringInfo msg, int queueId);
+
+/*
+ * shm_beginscan -
+ *		Initializes the shared memory scan descriptor to retrieve tuples
+ *		from worker backends. 
+ */
+ShmScanDesc
+shm_beginscan(int num_queues)
+{
+	ShmScanDesc		shmscan;
+
+	shmscan = palloc(sizeof(ShmScanDescData));
+
+	shmscan->num_shm_queues = num_queues;
+	shmscan->ss_cqueue = -1;
+	shmscan->shmscan_inited	= false;
+
+	return shmscan;
+}
+
+/*
+ * ExecInitWorkerResult -
+ *		Initializes the result state to retrieve tuples from worker backends. 
+ */
+worker_result
+ExecInitWorkerResult(TupleDesc tupdesc, int nWorkers)
+{
+	worker_result	workerResult;
+	int				i;
+	int	natts = tupdesc->natts;
+
+	workerResult = palloc0(sizeof(worker_result_state));
+	workerResult->receive_functions = palloc(sizeof(FmgrInfo) * natts);
+	workerResult->typioparams = palloc(sizeof(Oid) * natts);
+	workerResult->num_shm_queues = nWorkers;
+	workerResult->has_row_description = palloc0(sizeof(bool) * nWorkers);
+	workerResult->queue_detached = palloc0(sizeof(bool) * nWorkers);
+
+	for (i = 0;	i < natts; ++i)
+	{
+		Oid	receive_function_id;
+
+		getTypeBinaryInputInfo(tupdesc->attrs[i]->atttypid,
+							   &receive_function_id,
+							   &workerResult->typioparams[i]);
+		fmgr_info(receive_function_id, &workerResult->receive_functions[i]);
+	}
+
+	return workerResult;
+}
+
+
+/*
+ * shm_getnext -
+ *		Get the next tuple from shared memory queue.  This function
+ *	is responsible for fetching tuples from all the queues associated
+ *	with worker backends used in parallel sequential scan.
+ */
+HeapTuple
+shm_getnext(HeapScanDesc scanDesc, ShmScanDesc shmScan,
+			worker_result resultState, shm_mq_handle **responseq,
+			TupleDesc tupdesc, ScanDirection direction, bool *fromheap)
+{
+	shm_mq_result	res;
+	Size			nbytes;
+	void			*data;
+	StringInfoData	msg;
+	int				queueId = 0;
+
+	/*
+	 * calculate next starting queue used for fetching tuples
+	 */
+	if(!shmScan->shmscan_inited)
+	{
+		shmScan->shmscan_inited = true;
+		Assert(shmScan->num_shm_queues > 0);
+		queueId = 0;
+	}
+	else
+		queueId = shmScan->ss_cqueue;
+
+	/* Read and process messages from the shared memory queues. */
+	for(;;)
+	{
+		if (!resultState->all_queues_detached)
+		{
+			if (queueId == shmScan->num_shm_queues)
+				queueId = 0;
+
+			/*
+			 * Don't fetch from a detached queue.  This loop would spin
+			 * forever if every queue were detached, but we never get here
+			 * in that case because all_queues_detached is checked first.
+			 */
+			while (resultState->queue_detached[queueId])
+			{
+				++queueId;
+				if (queueId == shmScan->num_shm_queues)
+					queueId = 0;
+			}
+
+			for (;;)
+			{
+				/*
+				 * mark current queue used for fetching tuples, this is used
+				 * to fetch consecutive tuples from queue used in previous
+				 * fetch.
+				 */
+				shmScan->ss_cqueue = queueId;
+
+				/* Get next message. */
+				res = shm_mq_receive(responseq[queueId], &nbytes, &data, true);
+				if (res == SHM_MQ_DETACHED)
+				{
+					/*
+					 * mark the queue that got detached, so that we don't
+					 * try to fetch from it again.
+					 */
+					resultState->queue_detached[queueId] = true;
+					resultState->has_row_description[queueId] = false;
+					--resultState->num_shm_queues;
+					/*
+					 * if we have exhausted data from all worker queues, then don't
+					 * process data from queues.
+					 */
+					if (resultState->num_shm_queues <= 0)
+						resultState->all_queues_detached = true;
+					break;
+				}
+				else if (res == SHM_MQ_WOULD_BLOCK)
+					break;
+				else if (res == SHM_MQ_SUCCESS)
+				{
+					bool rettuple;
+					initStringInfo(&msg);
+					appendBinaryStringInfo(&msg, data, nbytes);
+					rettuple = HandleParallelTupleMessage(resultState, tupdesc, &msg, queueId);
+					pfree(msg.data);
+					if (rettuple)
+					{
+						*fromheap = false;
+						return resultState->tuple;
+					}
+				}
+			}
+		}
+
+		/*
+		 * If we have checked all the message queues and found no
+		 * message, or have already fetched all the data from the queues,
+		 * then it's time to fetch directly from the heap.  Reset the current
+		 * queue to the first queue from which we need to receive tuples.
+		 */
+		if ((queueId == shmScan->num_shm_queues - 1 ||
+			 resultState->all_queues_detached) &&
+			 !resultState->all_heap_fetched)
+		{
+			HeapTuple	tuple;
+			shmScan->ss_cqueue = 0;
+			tuple = heap_getnext(scanDesc, direction);
+			if (tuple)
+			{
+				*fromheap = true;
+				return tuple;
+			}
+			else if (tuple == NULL && resultState->all_queues_detached)
+				break;
+			else
+				resultState->all_heap_fetched = true;
+		}
+		else if (resultState->all_queues_detached &&
+				 resultState->all_heap_fetched)
+			break;
+
+		/* check the data in next queue. */
+		++queueId;
+	}
+
+	return NULL;
+}
+
+/*
+ * HandleParallelTupleMessage -
+ * Handle a single tuple related protocol message received from
+ * a single parallel worker.
+ */
+static bool
+HandleParallelTupleMessage(worker_result resultState, TupleDesc tupdesc,
+						   StringInfo msg, int queueId)
+{
+	char	msgtype;
+	bool	rettuple = false;
+
+	msgtype = pq_getmsgbyte(msg);
+
+	/* Dispatch on message type. */
+	switch (msgtype)
+	{
+		case 'T':
+			{
+				int16	natts = pq_getmsgint(msg, 2);
+				int16	i;
+
+				if (resultState->has_row_description[queueId])
+					elog(ERROR, "multiple RowDescription messages");
+				resultState->has_row_description[queueId] = true;
+				if (natts != tupdesc->natts)
+					ereport(ERROR,
+							(errcode(ERRCODE_DATATYPE_MISMATCH),
+								errmsg("worker result rowtype does not match "
+								"the specified FROM clause rowtype")));
+
+				for (i = 0; i < natts; ++i)
+				{
+					Oid		type_id;
+
+					(void) pq_getmsgstring(msg);	/* name */
+					(void) pq_getmsgint(msg, 4);	/* table OID */
+					(void) pq_getmsgint(msg, 2);	/* table attnum */
+					type_id = pq_getmsgint(msg, 4);	/* type OID */
+					(void) pq_getmsgint(msg, 2);	/* type length */
+					(void) pq_getmsgint(msg, 4);	/* typmod */
+					(void) pq_getmsgint(msg, 2);	/* format code */
+
+					if (type_id != tupdesc->attrs[i]->atttypid)
+						ereport(ERROR,
+								(errcode(ERRCODE_DATATYPE_MISMATCH),
+								 errmsg("remote query result rowtype does not match "
+										"the specified FROM clause rowtype")));
+				}
+
+				pq_getmsgend(msg);
+
+				break;
+			}
+		case 'D':
+			{
+				/* Handle DataRow message. */
+				resultState->tuple = form_result_tuple(resultState, tupdesc, msg, queueId);
+				rettuple = true;
+				break;
+			}
+		case 'C':
+			{
+				/*
+				 * Handle CommandComplete message. Ignore tags sent by
+				 * worker backend as we are anyway going to use tag of
+				 * master backend for sending the same to client.
+				 */
+				(void) pq_getmsgstring(msg);
+				break;
+			}
+		case 'G':
+		case 'H':
+		case 'W':
+			{
+				ereport(ERROR,
+						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						 errmsg("COPY protocol not allowed in worker")));
+			}
+		default:
+			elog(WARNING, "unknown message type: %c", msg->data[0]);
+			break;
+	}
+
+	return rettuple;
+}
+
+/*
+ * form_result_tuple -
+ * Parse a DataRow message and form a result tuple.
+ */
+static HeapTuple
+form_result_tuple(worker_result resultState, TupleDesc tupdesc,
+				  StringInfo msg, int queueId)
+{
+	/* Handle DataRow message. */
+	int16	natts = pq_getmsgint(msg, 2);
+	int16	i;
+	Datum  *values = NULL;
+	bool   *isnull = NULL;
+	HeapTuple	tuple;
+	StringInfoData	buf;
+
+	if (!resultState->has_row_description[queueId])
+		elog(ERROR, "DataRow not preceded by RowDescription");
+	if (natts != tupdesc->natts)
+		elog(ERROR, "malformed DataRow");
+	if (natts > 0)
+	{
+		values = palloc(natts * sizeof(Datum));
+		isnull = palloc(natts * sizeof(bool));
+	}
+	initStringInfo(&buf);
+
+	for (i = 0; i < natts; ++i)
+	{
+		int32	bytes = pq_getmsgint(msg, 4);
+
+		if (bytes < 0)
+		{
+			values[i] = ReceiveFunctionCall(&resultState->receive_functions[i],
+											NULL,
+											resultState->typioparams[i],
+											tupdesc->attrs[i]->atttypmod);
+			isnull[i] = true;
+		}
+		else
+		{
+			resetStringInfo(&buf);
+			appendBinaryStringInfo(&buf, pq_getmsgbytes(msg, bytes), bytes);
+			values[i] = ReceiveFunctionCall(&resultState->receive_functions[i],
+											&buf,
+											resultState->typioparams[i],
+											tupdesc->attrs[i]->atttypmod);
+			isnull[i] = false;
+		}
+	}
+
+	pq_getmsgend(msg);
+
+	tuple = heap_form_tuple(tupdesc, values, isnull);
+
+	/*
+	 * Release locally palloc'd space.  XXX would probably be good to pfree
+	 * values of pass-by-reference datums, as well.
+	 */
+	pfree(values);
+	pfree(isnull);
+
+	pfree(buf.data);
+
+	return tuple;
+}
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 7cfc9bb..3b5b4c6 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -721,6 +721,7 @@ ExplainPreScanNode(PlanState *planstate, Bitmapset **rels_used)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_ParallelSeqScan:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -917,6 +918,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_SeqScan:
 			pname = sname = "Seq Scan";
 			break;
+		case T_ParallelSeqScan:
+			pname = sname = "Parallel Seq Scan";
+			break;
 		case T_IndexScan:
 			pname = sname = "Index Scan";
 			break;
@@ -1066,6 +1070,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_ParallelSeqScan:
 		case T_BitmapHeapScan:
 		case T_TidScan:
 		case T_SubqueryScan:
@@ -1207,6 +1212,24 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	}
 
 	/*
+	 * Aggregate instrumentation information of all the backend
+	 * workers for parallel sequence scan.
+	 */
+	if (nodeTag(plan) == T_ParallelSeqScan)
+	{
+		int i;
+		Instrumentation *instrument_worker;
+		int nworkers = ((ParallelSeqScanState *)planstate)->pcxt->nworkers;
+		char *inst_info_workers = ((ParallelSeqScanState *)planstate)->inst_options_space;
+
+		for (i = 0; i < nworkers; i++)
+		{
+			instrument_worker = (Instrumentation *)(inst_info_workers + (i * sizeof(Instrumentation)));
+			InstrAggNode(planstate->instrument, instrument_worker);
+		}
+	}
+
+	/*
 	 * We have to forcibly clean up the instrumentation state because we
 	 * haven't done ExecutorEnd yet.  This is pretty grotty ...
 	 *
@@ -1332,6 +1355,16 @@ ExplainNode(PlanState *planstate, List *ancestors,
 				show_instrumentation_count("Rows Removed by Filter", 1,
 										   planstate, es);
 			break;
+		case T_ParallelSeqScan:
+			show_scan_qual(plan->qual, "Filter", planstate, ancestors, es);
+			if (plan->qual)
+				show_instrumentation_count("Rows Removed by Filter", 1,
+										   planstate, es);
+			ExplainPropertyInteger("Number of Workers",
+				((ParallelSeqScan *) plan)->num_workers, es);
+			ExplainPropertyInteger("Number of Blocks Per Worker",
+				((ParallelSeqScan *) plan)->num_blocks_per_worker, es);
+			break;
 		case T_FunctionScan:
 			if (es->verbose)
 			{
@@ -2224,6 +2257,7 @@ ExplainTargetRel(Plan *plan, Index rti, ExplainState *es)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_ParallelSeqScan:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
diff --git a/src/backend/commands/portalcmds.c b/src/backend/commands/portalcmds.c
index 2794537..33eef6e 100644
--- a/src/backend/commands/portalcmds.c
+++ b/src/backend/commands/portalcmds.c
@@ -121,7 +121,7 @@ PerformCursorOpen(PlannedStmt *stmt, ParamListInfo params,
 	/*
 	 * Start execution, inserting parameters if any.
 	 */
-	PortalStart(portal, params, 0, GetActiveSnapshot());
+	PortalStart(portal, params, 0, GetActiveSnapshot(), 0);
 
 	Assert(portal->strategy == PORTAL_ONE_SELECT);
 
diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index 71b08f0..93ae6b3 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -289,7 +289,7 @@ ExecuteQuery(ExecuteStmt *stmt, IntoClause *intoClause,
 	/*
 	 * Run the portal as appropriate.
 	 */
-	PortalStart(portal, paramLI, eflags, GetActiveSnapshot());
+	PortalStart(portal, paramLI, eflags, GetActiveSnapshot(), 0);
 
 	(void) PortalRun(portal, count, false, dest, dest, completionTag);
 
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index af707b0..9a8ca75 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -21,7 +21,7 @@ OBJS = execAmi.o execCurrent.o execGrouping.o execJunk.o execMain.o \
        nodeLimit.o nodeLockRows.o \
        nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
        nodeNestloop.o nodeFunctionscan.o nodeRecursiveunion.o nodeResult.o \
-       nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
+       nodeSeqscan.o nodeParallelSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
        nodeValuesscan.o nodeCtescan.o nodeWorktablescan.o \
        nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
        nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o spi.o
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 6414cb9..858e5e8 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -332,7 +332,29 @@ standard_ExecutorRun(QueryDesc *queryDesc,
 		(*dest->rShutdown) (dest);
 
 	if (queryDesc->totaltime)
+	{
 		InstrStopNode(queryDesc->totaltime, estate->es_processed);
+		/*
+		 * Aggregate instrumentation information of all the backend
+		 * workers for parallel sequence scan.
+		 */
+		/*if (nodeTag(queryDesc->planstate->plan) == T_ParallelSeqScan)
+		{
+			int i;
+			Instrumentation *instrument_worker;
+			int nworkers =
+				((ParallelSeqScanState *)queryDesc->planstate)->pcxt->nworkers;
+			char *inst_info_workers =
+				((ParallelSeqScanState *)queryDesc->planstate)->inst_options_space;
+
+			for (i = 0; i < nworkers; i++)
+			{
+				instrument_worker =
+					(Instrumentation *)(inst_info_workers + (i * sizeof(Instrumentation)));
+				InstrAggNode(queryDesc->planstate->instrument, instrument_worker);
+			}
+		}*/
+	}
 
 	MemoryContextSwitchTo(oldcontext);
 }
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 9892499..f77a77f 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -100,6 +100,7 @@
 #include "executor/nodeMergejoin.h"
 #include "executor/nodeModifyTable.h"
 #include "executor/nodeNestloop.h"
+#include "executor/nodeParallelSeqscan.h"
 #include "executor/nodeRecursiveunion.h"
 #include "executor/nodeResult.h"
 #include "executor/nodeSeqscan.h"
@@ -190,6 +191,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												   estate, eflags);
 			break;
 
+		case T_ParallelSeqScan:
+			result = (PlanState *) ExecInitParallelSeqScan((ParallelSeqScan *) node,
+														   estate, eflags);
+			break;
+
 		case T_IndexScan:
 			result = (PlanState *) ExecInitIndexScan((IndexScan *) node,
 													 estate, eflags);
@@ -406,6 +412,10 @@ ExecProcNode(PlanState *node)
 			result = ExecSeqScan((SeqScanState *) node);
 			break;
 
+		case T_ParallelSeqScanState:
+			result = ExecParallelSeqScan((ParallelSeqScanState *) node);
+			break;
+
 		case T_IndexScanState:
 			result = ExecIndexScan((IndexScanState *) node);
 			break;
@@ -644,6 +654,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSeqScan((SeqScanState *) node);
 			break;
 
+		case T_ParallelSeqScanState:
+			ExecEndParallelSeqScan((ParallelSeqScanState *) node);
+			break;
+
 		case T_IndexScanState:
 			ExecEndIndexScan((IndexScanState *) node);
 			break;
diff --git a/src/backend/executor/execScan.c b/src/backend/executor/execScan.c
index 3f0d809..39c624d 100644
--- a/src/backend/executor/execScan.c
+++ b/src/backend/executor/execScan.c
@@ -191,8 +191,17 @@ ExecScan(ScanState *node,
 		 * check for non-nil qual here to avoid a function call to ExecQual()
 		 * when the qual is nil ... saves only a few cycles, but they add up
 		 * ...
+		 *
+		 * Also check for non-heap tuples, which arrive from the shared memory
+		 * message queues in a parallel query.  Such tuples need no
+		 * qualification here, because the backend worker has already applied
+		 * it.  This happens only for parallel queries, where we push down
+		 * the qualification.
+		 * XXX - We can do this optimization for projection as well, but for
+		 * now it is okay, as we don't allow parallel query if there are
+		 * expressions involved in target list.
 		 */
-		if (!qual || ExecQual(qual, econtext, false))
+		if (!slot->tts_fromheap || !qual || ExecQual(qual, econtext, false))
 		{
 			/*
 			 * Found a satisfactory scan tuple.
diff --git a/src/backend/executor/execTuples.c b/src/backend/executor/execTuples.c
index 753754d..4c5bd88 100644
--- a/src/backend/executor/execTuples.c
+++ b/src/backend/executor/execTuples.c
@@ -123,6 +123,7 @@ MakeTupleTableSlot(void)
 	slot->tts_values = NULL;
 	slot->tts_isnull = NULL;
 	slot->tts_mintuple = NULL;
+	slot->tts_fromheap	= true;
 
 	return slot;
 }
@@ -473,6 +474,8 @@ ExecClearTuple(TupleTableSlot *slot)	/* slot in which to store tuple */
 	slot->tts_isempty = true;
 	slot->tts_nvalid = 0;
 
+	slot->tts_fromheap = true;
+
 	return slot;
 }
 
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index f5351eb..b7898a5 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -21,6 +21,8 @@ BufferUsage pgBufferUsage;
 
 static void BufferUsageAccumDiff(BufferUsage *dst,
 					 const BufferUsage *add, const BufferUsage *sub);
+static void
+BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
 
 
 /* Allocate new instrumentation structure(s) */
@@ -127,6 +129,28 @@ InstrEndLoop(Instrumentation *instr)
 	instr->tuplecount = 0;
 }
 
+/*
+ * Aggregate the instrumentation information.  This is used
+ * to aggregate the information of worker backends.  We only
+ * need to sum the buffer usage and tuple count statistics as
+ * for other timing related statistics it is sufficient to
+ * have the master backend's information.
+ */
+void
+InstrAggNode(Instrumentation *instr1, Instrumentation *instr2)
+{
+	/* count the returned tuples */
+	instr1->tuplecount += instr2->tuplecount;
+
+	instr1->nfiltered1 += instr2->nfiltered1;
+	instr1->nfiltered2 += instr2->nfiltered2;
+
+	/* Add the worker's buffer usage to the node's totals */
+	if (instr1->need_bufusage)
+		BufferUsageAdd(&instr1->bufusage, &instr2->bufusage);
+
+}
+
 /* dst += add - sub */
 static void
 BufferUsageAccumDiff(BufferUsage *dst,
@@ -148,3 +172,21 @@ BufferUsageAccumDiff(BufferUsage *dst,
 	INSTR_TIME_ACCUM_DIFF(dst->blk_write_time,
 						  add->blk_write_time, sub->blk_write_time);
 }
+
+/* dst += add */
+static void
+BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
+{
+	dst->shared_blks_hit += add->shared_blks_hit;
+	dst->shared_blks_read += add->shared_blks_read;
+	dst->shared_blks_dirtied += add->shared_blks_dirtied;
+	dst->shared_blks_written += add->shared_blks_written;
+	dst->local_blks_hit += add->local_blks_hit;
+	dst->local_blks_read += add->local_blks_read;
+	dst->local_blks_dirtied += add->local_blks_dirtied;
+	dst->local_blks_written += add->local_blks_written;
+	dst->temp_blks_read += add->temp_blks_read;
+	dst->temp_blks_written += add->temp_blks_written;
+	INSTR_TIME_ADD(dst->blk_read_time, add->blk_read_time);
+	INSTR_TIME_ADD(dst->blk_write_time, add->blk_write_time);
+}
\ No newline at end of file
diff --git a/src/backend/executor/nodeParallelSeqscan.c b/src/backend/executor/nodeParallelSeqscan.c
new file mode 100644
index 0000000..b7a9e79
--- /dev/null
+++ b/src/backend/executor/nodeParallelSeqscan.c
@@ -0,0 +1,329 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeParallelSeqscan.c
+ *	  Support routines for parallel sequential scans of relations.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeParallelSeqscan.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * INTERFACE ROUTINES
+ *		ExecParallelSeqScan				sequentially scans a relation.
+ *		ParallelSeqNext			retrieve next tuple from the heap or a worker queue.
+ *		ExecInitParallelSeqScan			creates and initializes a parallel seqscan node.
+ *		ExecEndParallelSeqScan			releases any storage allocated.
+ */
+#include "postgres.h"
+
+#include "access/relscan.h"
+#include "access/shmmqam.h"
+#include "access/xact.h"
+#include "commands/dbcommands.h"
+#include "executor/execdebug.h"
+#include "executor/nodeSeqscan.h"
+#include "executor/nodeParallelSeqscan.h"
+#include "postmaster/backendworker.h"
+#include "utils/rel.h"
+
+
+
+/* ----------------------------------------------------------------
+ *						Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ParallelSeqNext
+ *
+ *		This is a workhorse for ExecParallelSeqScan
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ParallelSeqNext(ParallelSeqScanState *node)
+{
+	HeapTuple	tuple;
+	HeapScanDesc scandesc;
+	EState	   *estate;
+	ScanDirection direction;
+	TupleTableSlot *slot;
+	bool			fromheap = true;
+
+	/*
+	 * get information from the estate and scan state
+	 */
+	scandesc = node->ss.ss_currentScanDesc;
+	estate = node->ss.ps.state;
+	direction = estate->es_direction;
+	slot = node->ss.ss_ScanTupleSlot;
+
+	/*
+	 * get the next tuple from the table based on result tuple descriptor.
+	 */
+	tuple = shm_getnext(scandesc, node->pss_currentShmScanDesc,
+						node->pss_workerResult,
+						node->responseq,
+						node->ss.ps.ps_ResultTupleSlot->tts_tupleDescriptor,
+						direction, &fromheap);
+
+	slot->tts_fromheap = fromheap;
+
+	/*
+	 * save the tuple and the buffer returned to us by the access methods in
+	 * our scan tuple slot and return the slot.  Note: we pass '!fromheap'
+	 * because tuples returned by shm_getnext() are either pointers created
+	 * with palloc() (tuples from a worker's queue), which must be pfree()'d,
+	 * or pointers onto disk pages, which must not.  Note also that
+	 * ExecStoreTuple will increment the refcount of the buffer; the refcount
+	 * will not be dropped until the tuple table slot is cleared.
+	 */
+	if (tuple)
+		ExecStoreTuple(tuple,	/* tuple to store */
+					   slot,	/* slot to store in */
+					   fromheap ? scandesc->rs_cbuf : InvalidBuffer, /* buffer associated with this
+																	  * tuple */
+					   !fromheap);	/* pfree this pointer if not from heap */
+	else
+		ExecClearTuple(slot);
+
+	return slot;
+}
+
+/*
+ * ParallelSeqRecheck -- access method routine to recheck a tuple in EvalPlanQual
+ */
+static bool
+ParallelSeqRecheck(SeqScanState *node, TupleTableSlot *slot)
+{
+	/*
+	 * Note that unlike IndexScan, ParallelSeqScan never uses keys in
+	 * shm_beginscan/heap_beginscan (and this is very bad) - so, here
+	 * we do not check whether the keys are ok.
+	 */
+	return true;
+}
+
+/* ----------------------------------------------------------------
+ *		InitParallelScanRelation
+ *
+ *		Set up to access the scan relation.
+ * ----------------------------------------------------------------
+ */
+static void
+InitParallelScanRelation(SeqScanState *node, EState *estate, int eflags)
+{
+	Relation	currentRelation;
+	HeapScanDesc currentScanDesc;
+
+	/*
+	 * get the relation object id from the relid'th entry in the range table,
+	 * open that relation and acquire appropriate lock on it.
+	 */
+	currentRelation = ExecOpenScanRelation(estate,
+									  ((SeqScan *) node->ps.plan)->scanrelid,
+										   eflags);
+
+	/* initialize a heapscan */
+	currentScanDesc = heap_beginscan(currentRelation,
+									 estate->es_snapshot,
+									 0,
+									 NULL);
+
+	/*
+	 * Each backend worker participating in a parallel sequential
+	 * scan operates on a different set of blocks, so there doesn't
+	 * seem to be much benefit in allowing sync scans.
+	 */
+	heap_setsyncscan(currentScanDesc, false);
+
+	node->ss_currentRelation = currentRelation;
+	node->ss_currentScanDesc = currentScanDesc;
+
+	/* and report the scan tuple slot's rowtype */
+	ExecAssignScanType(node, RelationGetDescr(currentRelation));
+}
+
+
+/* ----------------------------------------------------------------
+ *		ExecInitParallelSeqScan
+ * ----------------------------------------------------------------
+ */
+ParallelSeqScanState *
+ExecInitParallelSeqScan(ParallelSeqScan *node, EState *estate, int eflags)
+{
+	ParallelSeqScanState *parallelscanstate;
+	ShmScanDesc			 currentShmScanDesc;
+	worker_result		 workerResult;
+	BlockNumber			 end_block;
+
+	/*
+	 * Once upon a time it was possible to have an outerPlan of a SeqScan, but
+	 * not any more.
+	 */
+	Assert(outerPlan(node) == NULL);
+	Assert(innerPlan(node) == NULL);
+
+	/*
+	 * create state structure
+	 */
+	parallelscanstate = makeNode(ParallelSeqScanState);
+	parallelscanstate->ss.ps.plan = (Plan *) node;
+	parallelscanstate->ss.ps.state = estate;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * create expression context for node
+	 */
+	ExecAssignExprContext(estate, &parallelscanstate->ss.ps);
+
+	/*
+	 * initialize child expressions
+	 */
+	parallelscanstate->ss.ps.targetlist = (List *)
+		ExecInitExpr((Expr *) node->scan.plan.targetlist,
+					 (PlanState *) parallelscanstate);
+	parallelscanstate->ss.ps.qual = (List *)
+		ExecInitExpr((Expr *) node->scan.plan.qual,
+					 (PlanState *) parallelscanstate);
+
+	/*
+	 * tuple table initialization
+	 */
+	ExecInitResultTupleSlot(estate, &parallelscanstate->ss.ps);
+	ExecInitScanTupleSlot(estate, &parallelscanstate->ss);
+	
+	/*
+	 * initialize scan relation
+	 */
+	InitParallelScanRelation(&parallelscanstate->ss, estate, eflags);
+
+	parallelscanstate->ss.ps.ps_TupFromTlist = false;
+
+	/*
+	 * Initialize result tuple type and projection info.
+	 */
+	ExecAssignResultTypeFromTL(&parallelscanstate->ss.ps);
+	ExecAssignScanProjectionInfo(&parallelscanstate->ss);
+
+	/*
+	 * If we are just doing EXPLAIN (ie, aren't going to run the plan), stop
+	 * here, no need to start workers.
+	 */
+	if (eflags & EXEC_FLAG_EXPLAIN_ONLY)
+		return parallelscanstate;
+
+	/* Initialize the workers required to perform parallel scan. */
+	InitiateWorkers(((SeqScan *) parallelscanstate->ss.ps.plan)->scanrelid,
+					node->scan.plan.targetlist,
+					node->scan.plan.qual,
+					estate->es_range_table,
+					estate->es_param_list_info,
+					estate->es_instrument,
+					&parallelscanstate->inst_options_space,
+					&parallelscanstate->responseq,
+					&parallelscanstate->pcxt,
+					node->num_blocks_per_worker,
+					node->num_workers);
+
+	/* Initialize the blocks to be scanned by master backend. */
+	end_block = (parallelscanstate->pcxt->nworkers + 1) *
+				node->num_blocks_per_worker;
+	((SeqScan*) parallelscanstate->ss.ps.plan)->startblock =
+								end_block - node->num_blocks_per_worker;
+	/*
+	 * As the master backend takes the last range of blocks, it should
+	 * scan all the remaining blocks up to the end of the relation.
+	 */
+	((SeqScan*) parallelscanstate->ss.ps.plan)->endblock = InvalidBlockNumber;
+
+	/* Set the scan limits for master backend. */
+	heap_setscanlimits(parallelscanstate->ss.ss_currentScanDesc,
+					   ((SeqScan*) parallelscanstate->ss.ps.plan)->startblock,
+					   (parallelscanstate->ss.ss_currentScanDesc->rs_nblocks -
+					   ((SeqScan*) parallelscanstate->ss.ps.plan)->startblock));
+
+	/*
+	 * Use the result tuple descriptor to fetch data from the shared memory
+	 * queues, as the worker backends put the data there after projection.
+	 * The number of queues must equal the number of worker backends.
+	 */
+	currentShmScanDesc = shm_beginscan(parallelscanstate->pcxt->nworkers);
+	workerResult = ExecInitWorkerResult(parallelscanstate->ss.ps.ps_ResultTupleSlot->tts_tupleDescriptor,
+										parallelscanstate->pcxt->nworkers);
+
+	parallelscanstate->pss_currentShmScanDesc = currentShmScanDesc;
+	parallelscanstate->pss_workerResult	= workerResult;
+
+	return parallelscanstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecParallelSeqScan(node)
+ *
+ *		Scans the relation sequentially from multiple workers and returns
+ *		the next qualifying tuple.
+ *		We call the ExecScan() routine and pass it the appropriate
+ *		access method functions.
+ * ----------------------------------------------------------------
+ */
+TupleTableSlot *
+ExecParallelSeqScan(ParallelSeqScanState *node)
+{
+	return ExecScan((ScanState *) &node->ss,
+					(ExecScanAccessMtd) ParallelSeqNext,
+					(ExecScanRecheckMtd) ParallelSeqRecheck);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndParallelSeqScan
+ *
+ *		frees any storage allocated through C routines.
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndParallelSeqScan(ParallelSeqScanState *node)
+{
+	Relation	relation;
+	HeapScanDesc scanDesc;
+
+	/*
+	 * get information from node
+	 */
+	relation = node->ss.ss_currentRelation;
+	scanDesc = node->ss.ss_currentScanDesc;
+
+	/*
+	 * Free the exprcontext
+	 */
+	ExecFreeExprContext(&node->ss.ps);
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+
+	/*
+	 * close heap scan
+	 */
+	heap_endscan(scanDesc);
+
+	/*
+	 * close the heap relation.
+	 */
+	ExecCloseScanRelation(relation);
+
+	if (node->pcxt)
+	{
+		/* destroy parallel context. */
+		DestroyParallelContext(node->pcxt);
+
+		ExitParallelMode();
+	}
+}
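
For readers following the block arithmetic in ExecInitParallelSeqScan above, here is a minimal standalone sketch; it is not part of the patch. It assumes that worker i scans the range starting at i * num_blocks_per_worker, which is not shown in this file and is only a guess at the worker side. The master's start block (nworkers * num_blocks_per_worker) and its scan to the end of the relation do follow from the code above. The names BlockRange and block_range_for are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t BlockNumber;	/* stand-in for PostgreSQL's BlockNumber */

typedef struct BlockRange
{
	BlockNumber start;			/* first block to scan */
	BlockNumber nblocks;		/* number of blocks to scan */
} BlockRange;

/*
 * Return the block range for participant 'idx', where idx < nworkers
 * denotes a worker and idx == nworkers denotes the master backend.
 */
static BlockRange
block_range_for(int idx, int nworkers,
				BlockNumber blocks_per_worker, BlockNumber rel_nblocks)
{
	BlockRange	r;

	assert(idx >= 0 && idx <= nworkers);
	r.start = (BlockNumber) idx * blocks_per_worker;
	assert(r.start <= rel_nblocks);

	if (idx < nworkers)
		r.nblocks = blocks_per_worker;		/* assumed worker layout */
	else
		r.nblocks = rel_nblocks - r.start;	/* master takes the remainder */

	return r;
}
```

With rel_nblocks = 103, nworkers = 3 and blocks_per_worker = 25, the workers would get blocks 0-24, 25-49 and 50-74, and the master would scan the remaining 28 blocks, 75 through 102.
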
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 3cb81fc..5107950 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -139,6 +139,40 @@ InitScanRelation(SeqScanState *node, EState *estate, int eflags)
 									 0,
 									 NULL);
 
+	/*
+	 * set the scan limits, if requested by plan.  If the end block
+	 * is not specified, then scan all the blocks till end.
+	 */
+	if (((SeqScan *) node->ps.plan)->startblock != InvalidBlockNumber &&
+		((SeqScan *) node->ps.plan)->endblock != InvalidBlockNumber)
+	{
+		heap_setscanlimits(currentScanDesc,
+						   ((SeqScan *) node->ps.plan)->startblock,
+						   (((SeqScan *) node->ps.plan)->endblock -
+						   ((SeqScan *) node->ps.plan)->startblock));
+
+		/*
+		 * Each backend worker participating in a parallel sequential
+		 * scan operates on a different set of blocks, so there doesn't
+		 * seem to be much benefit in allowing sync scans.
+		 */
+		heap_setsyncscan(currentScanDesc, false);
+	}
+	else if (((SeqScan *) node->ps.plan)->startblock != InvalidBlockNumber)
+	{
+		heap_setscanlimits(currentScanDesc,
+						   ((SeqScan *) node->ps.plan)->startblock,
+						   (currentScanDesc->rs_nblocks -
+						   ((SeqScan *) node->ps.plan)->startblock));
+
+		/*
+		 * Each backend worker participating in a parallel sequential
+		 * scan operates on a different set of blocks, so there doesn't
+		 * seem to be much benefit in allowing sync scans.
+		 */
+		heap_setsyncscan(currentScanDesc, false);
+	}
+
 	node->ss_currentRelation = currentRelation;
 	node->ss_currentScanDesc = currentScanDesc;
 
diff --git a/src/backend/executor/spi.c b/src/backend/executor/spi.c
index 3a93a04..f7da680 100644
--- a/src/backend/executor/spi.c
+++ b/src/backend/executor/spi.c
@@ -1376,7 +1376,7 @@ SPI_cursor_open_internal(const char *name, SPIPlanPtr plan,
 	/*
 	 * Start portal execution.
 	 */
-	PortalStart(portal, paramLI, 0, snapshot);
+	PortalStart(portal, paramLI, 0, snapshot, 0);
 
 	Assert(portal->strategy != PORTAL_MULTI_QUERY);
 
diff --git a/src/backend/libpq/pqmq.c b/src/backend/libpq/pqmq.c
index f12f2d5..cfab8b5 100644
--- a/src/backend/libpq/pqmq.c
+++ b/src/backend/libpq/pqmq.c
@@ -26,6 +26,8 @@ static bool pq_mq_busy = false;
 static pid_t pq_mq_parallel_master_pid = 0;
 static pid_t pq_mq_parallel_master_backend_id = InvalidBackendId;
 
+static shm_mq_handle *pq_mq_tuple_handle = NULL;
+
 static void mq_comm_reset(void);
 static int	mq_flush(void);
 static int	mq_flush_if_writable(void);
@@ -61,6 +63,26 @@ pq_redirect_to_shm_mq(shm_mq *mq, shm_mq_handle *mqh)
 }
 
 /*
+ * Arrange to send some frontend/backend protocol messages to a shared-memory
+ * tuple message queue.
+ */
+void
+pq_redirect_to_tuple_shm_mq(shm_mq_handle *mqh)
+{
+	pq_mq_tuple_handle = mqh;
+}
+
+/*
+ * Check if tuples can be sent through tuple shared-memory
+ * message queue.
+ */
+bool
+is_tuple_shm_mq_enabled(void)
+{
+	return pq_mq_tuple_handle != NULL;
+}
+
+/*
  * Arrange to SendProcSignal() to the parallel master each time we transmit
  * message data via the shm_mq.
  */
@@ -161,6 +183,42 @@ mq_putmessage(char msgtype, const char *s, size_t len)
 	return 0;
 }
 
+/*
+ * Transmit a libpq protocol message to the shared memory message queue
+ * via pq_mq_tuple_handle.  We don't include a length word, because the
+ * receiver will know the length of the message from shm_mq_receive().
+ */
+int
+mq_putmessage_direct(char msgtype, const char *s, size_t len)
+{
+	shm_mq_iovec	iov[2];
+	shm_mq_result	result;
+
+	iov[0].data = &msgtype;
+	iov[0].len = 1;
+	iov[1].data = s;
+	iov[1].len = len;
+
+	Assert(pq_mq_tuple_handle != NULL);
+
+	for (;;)
+	{
+		result = shm_mq_sendv(pq_mq_tuple_handle, iov, 2, true);
+
+		if (result != SHM_MQ_WOULD_BLOCK)
+			break;
+
+		WaitLatch(&MyProc->procLatch, WL_LATCH_SET, 0);
+		CHECK_FOR_INTERRUPTS();
+		ResetLatch(&MyProc->procLatch);
+	}
+
+	Assert(result == SHM_MQ_SUCCESS || result == SHM_MQ_DETACHED);
+	if (result != SHM_MQ_SUCCESS)
+		return EOF;
+	return 0;
+}
+
 static void
 mq_putmessage_noblock(char msgtype, const char *s, size_t len)
 {
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index f1a24f5..b1e1d19 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -317,6 +317,8 @@ CopyScanFields(const Scan *from, Scan *newnode)
 	CopyPlanFields((const Plan *) from, (Plan *) newnode);
 
 	COPY_SCALAR_FIELD(scanrelid);
+	COPY_SCALAR_FIELD(startblock);
+	COPY_SCALAR_FIELD(endblock);
 }
 
 /*
@@ -352,6 +354,28 @@ _copySeqScan(const SeqScan *from)
 }
 
 /*
+ * _copyParallelSeqScan
+ */
+static ParallelSeqScan *
+_copyParallelSeqScan(const ParallelSeqScan *from)
+{
+	ParallelSeqScan    *newnode = makeNode(ParallelSeqScan);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopyScanFields((const Scan *) from, (Scan *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(num_workers);
+	COPY_SCALAR_FIELD(num_blocks_per_worker);
+
+	return newnode;
+}
+
+/*
  * _copyIndexScan
  */
 static IndexScan *
@@ -4039,6 +4063,9 @@ copyObject(const void *from)
 		case T_SeqScan:
 			retval = _copySeqScan(from);
 			break;
+		case T_ParallelSeqScan:
+			retval = _copyParallelSeqScan(from);
+			break;
 		case T_IndexScan:
 			retval = _copyIndexScan(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index dd1278b..0b9c969 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -285,6 +285,8 @@ _outScanInfo(StringInfo str, const Scan *node)
 	_outPlanInfo(str, (const Plan *) node);
 
 	WRITE_UINT_FIELD(scanrelid);
+	WRITE_UINT_FIELD(startblock);
+	WRITE_UINT_FIELD(endblock);
 }
 
 /*
@@ -437,6 +439,17 @@ _outSeqScan(StringInfo str, const SeqScan *node)
 }
 
 static void
+_outParallelSeqScan(StringInfo str, const ParallelSeqScan *node)
+{
+	WRITE_NODE_TYPE("PARALLELSEQSCAN");
+
+	_outScanInfo(str, (const Scan *) node);
+
+	WRITE_UINT_FIELD(num_workers);
+	WRITE_UINT_FIELD(num_blocks_per_worker);
+}
+
+static void
 _outIndexScan(StringInfo str, const IndexScan *node)
 {
 	WRITE_NODE_TYPE("INDEXSCAN");
@@ -2851,6 +2864,9 @@ _outNode(StringInfo str, const void *obj)
 			case T_SeqScan:
 				_outSeqScan(str, obj);
 				break;
+			case T_ParallelSeqScan:
+				_outParallelSeqScan(str, obj);
+				break;
 			case T_IndexScan:
 				_outIndexScan(str, obj);
 				break;
diff --git a/src/backend/nodes/params.c b/src/backend/nodes/params.c
index 2f2f5ed..b56f6c7 100644
--- a/src/backend/nodes/params.c
+++ b/src/backend/nodes/params.c
@@ -16,9 +16,22 @@
 #include "postgres.h"
 
 #include "nodes/params.h"
+#include "storage/shmem.h"
 #include "utils/datum.h"
 #include "utils/lsyscache.h"
 
+/*
+ * For each bind parameter, we pass this structure, followed by the
+ * parameter's value unless the parameter is pass-by-value or NULL.
+ */
+typedef struct SerializedParamExternData
+{
+	Datum		value;			/* pass-by-value datums are stored directly */
+	Size		length;			/* length of parameter value */
+	bool		isnull;			/* is it NULL? */
+	uint16		pflags;			/* flag bits, see nodes/params.h */
+	Oid			ptype;			/* parameter's datatype, or 0 */
+} SerializedParamExternData;
 
 /*
  * Copy a ParamListInfo structure.
@@ -74,3 +87,185 @@ copyParamList(ParamListInfo from)
 
 	return retval;
 }
+
+/*
+ * Estimate the amount of space required to serialize the bound
+ * parameters.
+ */
+Size
+EstimateBoundParametersSpace(ParamListInfo paramInfo)
+{
+	Size		size;
+	int			i;
+
+	/* Add space required for saving numParams */
+	size = sizeof(int);
+
+	if (paramInfo)
+	{
+		/* Add space required for saving the param data */
+		for (i = 0; i < paramInfo->numParams; i++)
+		{
+			/*
+			 * for each parameter, calculate the size of fixed part
+			 * of parameter (SerializedParamExternData) and length of
+			 * parameter value.
+			 */
+			ParamExternData *oprm;
+			int16		typLen;
+			bool		typByVal;
+			Size		length;
+
+			length = sizeof(SerializedParamExternData);
+
+			oprm = &paramInfo->params[i];
+
+			get_typlenbyval(oprm->ptype, &typLen, &typByVal);
+			/*
+			 * pass-by-value parameters are stored directly in
+			 * SerializedParamExternData, so no additional space
+			 * is needed for them.
+			 */
+			if (!(typByVal || oprm->isnull))
+			{
+				length += datumGetSize(oprm->value, typByVal, typLen);
+				size = add_size(size, length);
+
+				/* Allow space for terminating zero-byte */
+				size = add_size(size, 1);
+			}
+			else
+				size = add_size(size, length);
+		}
+	}
+
+	return size;
+}
+
+/*
+ * Serialize the bind parameters into the memory, beginning at start_address.
+ * maxsize should be at least as large as the value returned by
+ * EstimateBoundParametersSpace.
+ */
+void
+SerializeBoundParams(ParamListInfo paramInfo, Size maxsize, char *start_address)
+{
+	char	   *curptr;
+	SerializedParamExternData *retval;
+	int i;
+
+	/*
+	 * First, we store the number of bind parameters.  If there are
+	 * no bind parameters, there is nothing more to store.
+	 */
+	if (paramInfo && paramInfo->numParams > 0)
+		* (int *) start_address = paramInfo->numParams;
+	else
+	{
+		* (int *) start_address = 0;
+		return;
+	}
+	curptr = start_address + sizeof(int);
+
+
+	for (i = 0; i < paramInfo->numParams; i++)
+	{
+		ParamExternData *oprm;
+		int16		typLen;
+		bool		typByVal;
+		Size		datumlength, length;
+		const char	*s;
+
+		Assert (curptr <= start_address + maxsize);
+		retval = (SerializedParamExternData*) curptr;
+		oprm = &paramInfo->params[i];
+
+		retval->isnull = oprm->isnull;
+		retval->pflags = oprm->pflags;
+		retval->ptype = oprm->ptype;
+		retval->value = oprm->value;
+
+		curptr = curptr + sizeof(SerializedParamExternData);
+
+		if (retval->isnull)
+			continue;
+
+		get_typlenbyval(oprm->ptype, &typLen, &typByVal);
+
+		if (!typByVal)
+		{
+			datumlength = datumGetSize(oprm->value, typByVal, typLen);
+			s = (char *) DatumGetPointer(oprm->value);
+			memcpy(curptr, s, datumlength);
+			length = datumlength;
+			curptr[length] = '\0';
+			retval->length = length;
+			curptr += length + 1;
+		}
+	}
+}
+
+/*
+ * RestoreBoundParams
+ *		Restore bind parameters from the specified address.
+ *
+ * The params are palloc'd in CurrentMemoryContext.
+ */
+ParamListInfo
+RestoreBoundParams(char *start_address)
+{
+	ParamListInfo retval;
+	Size		size;
+	int			num_params,i;
+	char	   *curptr;
+
+	num_params = * (int *) start_address;
+
+	if (num_params <= 0)
+		return NULL;
+	/* sizeof(ParamListInfoData) includes the first array element */
+	size = sizeof(ParamListInfoData) +
+		(num_params - 1) * sizeof(ParamExternData);
+	retval = (ParamListInfo) palloc(size);
+	retval->paramFetch = NULL;
+	retval->paramFetchArg = NULL;
+	retval->parserSetup = NULL;
+	retval->parserSetupArg = NULL;
+	retval->numParams = num_params;
+
+	curptr = start_address + sizeof(int);
+
+	for (i = 0; i < num_params; i++)
+	{
+		SerializedParamExternData *nprm;
+		char	*s;
+		int16		typLen;
+		bool		typByVal;
+
+		nprm = (SerializedParamExternData *) curptr;
+
+		/* copy the parameter info */
+		retval->params[i].isnull = nprm->isnull;
+		retval->params[i].pflags = nprm->pflags;
+		retval->params[i].ptype = nprm->ptype;
+		retval->params[i].value = nprm->value;
+
+		curptr = curptr + sizeof(SerializedParamExternData);
+
+		if (nprm->isnull)
+			continue;
+
+		get_typlenbyval(nprm->ptype, &typLen, &typByVal);
+
+		if (!typByVal)
+		{
+			s = palloc(nprm->length + 1);
+			memcpy(s, curptr, nprm->length + 1);
+			retval->params[i].value = PointerGetDatum(s);
+
+			curptr += nprm->length + 1;
+		}
+	}
+
+	return retval;
+}
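
As a reading aid for the estimate/serialize/restore trio above, here is a minimal standalone sketch of the same layout: a count word, then one fixed-size header per parameter, followed by the payload for by-reference values. It is not the patch's code. Param, SerializedParam and the three function names are hypothetical, NULL parameters and the terminating zero byte are left out, and headers are copied with memcpy rather than by casting the buffer pointer, which sidesteps alignment questions once a variable-length value has been written.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified parameter: an integer or a byte string. */
typedef struct Param
{
	int			byval;			/* nonzero: 'ival' holds the value */
	long		ival;
	const char *data;			/* by-reference payload, if !byval */
	size_t		len;			/* payload length, if !byval */
} Param;

/* Fixed-size header written for every parameter. */
typedef struct SerializedParam
{
	long		ival;
	size_t		length;
	int			byval;
} SerializedParam;

/* Space needed: a count word, then a header (plus payload) per parameter. */
static size_t
estimate_space(const Param *params, int n)
{
	size_t		size = sizeof(int);
	int			i;

	for (i = 0; i < n; i++)
		size += sizeof(SerializedParam) + (params[i].byval ? 0 : params[i].len);
	return size;
}

/* Write the parameters to 'buf', which must hold estimate_space() bytes. */
static void
serialize_params(const Param *params, int n, char *buf)
{
	char	   *cur = buf;
	int			i;

	assert(n >= 0);
	memcpy(cur, &n, sizeof(int));
	cur += sizeof(int);
	for (i = 0; i < n; i++)
	{
		SerializedParam hdr;

		memset(&hdr, 0, sizeof(hdr));
		hdr.byval = params[i].byval;
		hdr.ival = params[i].ival;
		hdr.length = params[i].byval ? 0 : params[i].len;
		memcpy(cur, &hdr, sizeof(hdr));	/* no aligned store required */
		cur += sizeof(hdr);
		if (!hdr.byval)
		{
			memcpy(cur, params[i].data, hdr.length);
			cur += hdr.length;
		}
	}
}

/*
 * Rebuild the array.  By-reference payloads point into 'buf' here, whereas
 * the patch copies them with palloc().  Returns NULL for zero parameters.
 */
static Param *
restore_params(const char *buf, int *n_out)
{
	const char *cur = buf;
	Param	   *params;
	int			n, i;

	memcpy(&n, cur, sizeof(int));
	cur += sizeof(int);
	*n_out = n;
	if (n <= 0)
		return NULL;
	params = calloc((size_t) n, sizeof(Param));
	if (params == NULL)
		abort();
	for (i = 0; i < n; i++)
	{
		SerializedParam hdr;

		memcpy(&hdr, cur, sizeof(hdr));
		cur += sizeof(hdr);
		params[i].byval = hdr.byval;
		params[i].ival = hdr.ival;
		if (!hdr.byval)
		{
			params[i].data = cur;
			params[i].len = hdr.length;
			cur += hdr.length;
		}
	}
	return params;
}
```

A caller would size the buffer with estimate_space(), fill it with serialize_params(), and hand the same address to restore_params() on the receiving side.
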
diff --git a/src/backend/optimizer/path/Makefile b/src/backend/optimizer/path/Makefile
index 6864a62..6e462b1 100644
--- a/src/backend/optimizer/path/Makefile
+++ b/src/backend/optimizer/path/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = allpaths.o clausesel.o costsize.o equivclass.o indxpath.o \
-       joinpath.o joinrels.o pathkeys.o tidpath.o
+       joinpath.o joinrels.o pathkeys.o parallelpath.o tidpath.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 58d78e6..528727c 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -410,6 +410,9 @@ set_plain_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
 	/* Consider sequential scan */
 	add_path(rel, create_seqscan_path(root, rel, required_outer));
 
+	/* Consider parallel scans */
+	create_parallelscan_paths(root, rel);
+
 	/* Consider index scans */
 	create_index_paths(root, rel);
 
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 020558b..4abfd25 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -11,6 +11,9 @@
  *	cpu_tuple_cost		Cost of typical CPU time to process a tuple
  *	cpu_index_tuple_cost  Cost of typical CPU time to process an index tuple
  *	cpu_operator_cost	Cost of CPU time to execute an operator or function
+ *	cpu_tuple_comm_cost	Cost of CPU time to pass a tuple from worker to master backend
+ *	parallel_setup_cost	Cost of setting up shared memory for parallelism
+ *	parallel_startup_cost	Cost of starting up parallel workers
  *
  * We expect that the kernel will typically do some amount of read-ahead
  * optimization; this in conjunction with seek costs means that seq_page_cost
@@ -101,11 +104,16 @@ double		random_page_cost = DEFAULT_RANDOM_PAGE_COST;
 double		cpu_tuple_cost = DEFAULT_CPU_TUPLE_COST;
 double		cpu_index_tuple_cost = DEFAULT_CPU_INDEX_TUPLE_COST;
 double		cpu_operator_cost = DEFAULT_CPU_OPERATOR_COST;
+double		cpu_tuple_comm_cost = DEFAULT_CPU_TUPLE_COMM_COST;
+double		parallel_setup_cost = DEFAULT_PARALLEL_SETUP_COST;
+double		parallel_startup_cost = DEFAULT_PARALLEL_STARTUP_COST;
 
 int			effective_cache_size = DEFAULT_EFFECTIVE_CACHE_SIZE;
 
 Cost		disable_cost = 1.0e10;
 
+int	parallel_seqscan_degree = 0;
+
 bool		enable_seqscan = true;
 bool		enable_indexscan = true;
 bool		enable_indexonlyscan = true;
@@ -219,6 +227,73 @@ cost_seqscan(Path *path, PlannerInfo *root,
 }
 
 /*
+ * cost_parallelseqscan
+ *	  Determines and returns the cost of scanning a relation in parallel.
+ *
+ * 'baserel' is the relation to be scanned
+ * 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
+ */
+void
+cost_parallelseqscan(ParallelSeqPath *path, PlannerInfo *root,
+			 RelOptInfo *baserel, ParamPathInfo *param_info, int nWorkers)
+{
+	Cost		startup_cost = 0;
+	Cost		run_cost = 0;
+	double		spc_seq_page_cost;
+	QualCost	qpqual_cost;
+	Cost		cpu_per_tuple;
+
+	/* Should only be applied to base relations */
+	Assert(baserel->relid > 0);
+	Assert(baserel->rtekind == RTE_RELATION);
+
+	/* Mark the path with the correct row estimate */
+	if (param_info)
+		path->path.rows = param_info->ppi_rows;
+	else
+		path->path.rows = baserel->rows;
+
+	if (!enable_seqscan)
+		startup_cost += disable_cost;
+
+	/* fetch estimated page cost for tablespace containing table */
+	get_tablespace_page_costs(baserel->reltablespace,
+							  NULL,
+							  &spc_seq_page_cost);
+
+	/*
+	 * disk costs
+	 */
+	run_cost += spc_seq_page_cost * baserel->pages;
+
+	/* CPU costs */
+	get_restriction_qual_cost(root, baserel, param_info, &qpqual_cost);
+
+	startup_cost += qpqual_cost.startup;
+	cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple;
+	run_cost += cpu_per_tuple * baserel->tuples;
+
+	/*
+	 * Runtime cost will be equally shared by all workers.
+	 * The assumption here is that disk access cost will also be
+	 * equally shared between workers, which is generally true
+	 * unless there are too many workers working on a relatively
+	 * small number of blocks.  If we come across any such case,
+	 * then we can think of changing the current cost model for
+	 * parallel sequential scan.
+	 */
+	run_cost = run_cost / (nWorkers + 1);
+
+	/* Parallel setup and communication cost. */
+	startup_cost += parallel_setup_cost;
+	startup_cost += parallel_startup_cost * nWorkers;
+	run_cost += cpu_tuple_comm_cost * baserel->tuples;
+
+	path->path.startup_cost = startup_cost;
+	path->path.total_cost = (startup_cost + run_cost);
+}
+
+/*
  * cost_index
  *	  Determines and returns the cost of scanning a relation using an index.
  *
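
To make the arithmetic in cost_parallelseqscan easier to check by hand, here is a minimal standalone sketch of the same formula with every input passed explicitly; it is not part of the patch. The struct and function names are hypothetical, and the qual startup cost and disable_cost terms are left out for brevity.

```c
#include <assert.h>

typedef struct ParallelScanCost
{
	double		startup_cost;
	double		total_cost;
} ParallelScanCost;

/*
 * Mirror of the arithmetic in cost_parallelseqscan.  'cpu_per_tuple' stands
 * for cpu_tuple_cost plus the per-tuple qual cost.
 */
static ParallelScanCost
parallel_seqscan_cost(double pages, double tuples, int nworkers,
					  double seq_page_cost, double cpu_per_tuple,
					  double tuple_comm_cost, double setup_cost,
					  double worker_startup_cost)
{
	ParallelScanCost c;
	double		run_cost;

	assert(nworkers >= 0);

	/* disk and CPU cost of a plain sequential scan */
	run_cost = seq_page_cost * pages + cpu_per_tuple * tuples;

	/* shared equally by the workers and the master backend */
	run_cost /= (nworkers + 1);

	/* every tuple of the relation is charged the transfer cost */
	run_cost += tuple_comm_cost * tuples;

	c.startup_cost = setup_cost + worker_startup_cost * nworkers;
	c.total_cost = c.startup_cost + run_cost;
	return c;
}
```

With made-up inputs (1000 pages, 100000 tuples, 3 workers, seq_page_cost 1.0, cpu_per_tuple 0.01, tuple_comm_cost 0.001, setup cost 10, per-worker startup cost 5), the scan cost of 2000 is divided by 4 to give 500, communication adds 100, and startup adds 25, for a total of about 625. The communication term is not divided by the number of participants, so it sets a floor that adding workers cannot reduce.
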
diff --git a/src/backend/optimizer/path/parallelpath.c b/src/backend/optimizer/path/parallelpath.c
new file mode 100644
index 0000000..fda6f40
--- /dev/null
+++ b/src/backend/optimizer/path/parallelpath.c
@@ -0,0 +1,148 @@
+/*-------------------------------------------------------------------------
+ *
+ * parallelpath.c
+ *	  Routines to determine which conditions are usable for scanning
+ *	  a given relation, and create ParallelPaths accordingly.
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/optimizer/path/parallelpath.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heapam.h"
+#include "nodes/relation.h"
+#include "optimizer/cost.h"
+#include "optimizer/pathnode.h"
+#include "optimizer/paths.h"
+#include "optimizer/restrictinfo.h"
+#include "optimizer/clauses.h"
+#include "parser/parsetree.h"
+#include "utils/rel.h"
+
+
+/*
+ *	IsTargetListContainNonVars -
+ *		Check if target list contains non-var entries.
+ */
+static bool
+IsTargetListContainNonVars(List *targetlist)
+{
+	ListCell   *l;
+
+	foreach(l, targetlist)
+	{
+		TargetEntry *te = (TargetEntry *) lfirst(l);
+
+		if (!IsA(te, TargetEntry))
+			continue;			/* probably should never happen */
+		if (!IsA(te->expr, Var))
+			return true;
+	}
+	return false;
+}
+
+/*
+ *	check_simple_qual -
+ *		Check if qual is made only of simple things we can
+ *		hand out directly to backend worker for execution.
+ *
+ *		XXX - Currently we don't allow pushing down an expression
+ *		if it contains a volatile function; however, eventually we
+ *		need a mechanism (proisparallel) with which we can distinguish
+ *		the functions that can be pushed down for execution by a
+ *		parallel worker.
+ */
+static bool
+check_simple_qual(Node *node)
+{
+	if (node == NULL)
+		return TRUE;
+
+	if (contain_volatile_functions(node))
+		return FALSE;
+
+	return TRUE;
+}
+
+/*
+ * create_parallelscan_paths
+ *	  Create paths corresponding to parallel scans of the given rel.
+ *	  Currently we only support parallel sequential scan.
+ *
+ *	  Candidate paths are added to the rel's pathlist (using add_path).
+ */
+void
+create_parallelscan_paths(PlannerInfo *root, RelOptInfo *rel)
+{
+	int num_parallel_workers = 0;
+	Oid			reloid;
+	Relation	relation;
+
+	/*
+	 * parallel scan is possible only if user has set
+	 * parallel_seqscan_degree to value greater than 0.
+	 */
+	if (parallel_seqscan_degree <= 0)
+		return;
+
+	/*
+	 * parallel scan is not supported for joins.
+	 */
+	if (root->simple_rel_array_size > 2)
+		return;
+
+	/* parallel scan is supported only for SELECT statements. */
+	if (root->parse->commandType != CMD_SELECT)
+		return;
+
+	reloid = planner_rt_fetch(rel->relid, root)->relid;
+
+	relation = heap_open(reloid, NoLock);
+
+	/*
+	 * Temporary relations can't be scanned by parallel workers as
+	 * they are visible only to local sessions.
+	 */
+	if (RelationUsesLocalBuffers(relation))
+	{
+		heap_close(relation, NoLock);
+		return;
+	}
+
+	heap_close(relation, NoLock);
+
+	/*
+	 * parallel scan is not supported for non-var target list.
+	 *
+	 * XXX - This is to keep the implementation simple; we can do this
+	 * in future.  Here we check root->parse->targetList instead of
+	 * rel->reltargetlist because rel->reltargetlist always contains
+	 * Vars (refer to build_base_rel_tlists).
+	 */
+	if (IsTargetListContainNonVars(root->parse->targetList))
+	   return;
+
+	/*
+	 * parallel scan is not supported for quals with volatile functions
+	 */
+	if (!check_simple_qual((Node*) extract_actual_clauses(rel->baserestrictinfo, false)))
+		return;
+
+	/*
+	 * There should be at least one page to scan for each worker.
+	 */
+	if (parallel_seqscan_degree <= rel->pages)
+		num_parallel_workers = parallel_seqscan_degree;
+	else
+		num_parallel_workers = rel->pages;
+
+	add_path(rel, (Path *) create_parallelseqscan_path(root, rel,
+													   num_parallel_workers));
+}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 655be81..8abad5e 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -58,6 +58,9 @@ static Material *create_material_plan(PlannerInfo *root, MaterialPath *best_path
 static Plan *create_unique_plan(PlannerInfo *root, UniquePath *best_path);
 static SeqScan *create_seqscan_plan(PlannerInfo *root, Path *best_path,
 					List *tlist, List *scan_clauses);
+static Scan *create_parallelseqscan_plan(PlannerInfo *root,
+										 ParallelSeqPath *best_path,
+										 List *tlist, List *scan_clauses);
 static Scan *create_indexscan_plan(PlannerInfo *root, IndexPath *best_path,
 					  List *tlist, List *scan_clauses, bool indexonly);
 static BitmapHeapScan *create_bitmap_scan_plan(PlannerInfo *root,
@@ -100,6 +103,9 @@ static List *order_qual_clauses(PlannerInfo *root, List *clauses);
 static void copy_path_costsize(Plan *dest, Path *src);
 static void copy_plan_costsize(Plan *dest, Plan *src);
 static SeqScan *make_seqscan(List *qptlist, List *qpqual, Index scanrelid);
+static ParallelSeqScan *make_parallelseqscan(List *qptlist, List *qpqual,
+											 Index scanrelid, int nworkers,
+											 BlockNumber nblocksperworker);
 static IndexScan *make_indexscan(List *qptlist, List *qpqual, Index scanrelid,
 			   Oid indexid, List *indexqual, List *indexqualorig,
 			   List *indexorderby, List *indexorderbyorig,
@@ -228,6 +234,7 @@ create_plan_recurse(PlannerInfo *root, Path *best_path)
 	switch (best_path->pathtype)
 	{
 		case T_SeqScan:
+		case T_ParallelSeqScan:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -343,6 +350,13 @@ create_scan_plan(PlannerInfo *root, Path *best_path)
 												scan_clauses);
 			break;
 
+		case T_ParallelSeqScan:
+			plan = (Plan *) create_parallelseqscan_plan(root,
+														(ParallelSeqPath *) best_path,
+														tlist,
+														scan_clauses);
+			break;
+
 		case T_IndexScan:
 			plan = (Plan *) create_indexscan_plan(root,
 												  (IndexPath *) best_path,
@@ -546,6 +560,7 @@ disuse_physical_tlist(PlannerInfo *root, Plan *plan, Path *path)
 	switch (path->pathtype)
 	{
 		case T_SeqScan:
+		case T_ParallelSeqScan:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -1133,6 +1148,65 @@ create_seqscan_plan(PlannerInfo *root, Path *best_path,
 }
 
 /*
+ * create_worker_seqscan_plan
+ *	 Returns a seqscan plan for the base relation scanned by worker
+ *	 with restriction clauses 'qual' and targetlist 'tlist'.
+ */
+SeqScan *
+create_worker_seqscan_plan(worker_stmt *workerstmt)
+{
+	SeqScan    *scan_plan;
+
+	scan_plan = make_seqscan(workerstmt->targetList,
+							 workerstmt->qual,
+							 workerstmt->scanrelId);
+
+	scan_plan->startblock = workerstmt->startBlock;
+	scan_plan->endblock = workerstmt->endBlock;
+	return scan_plan;
+}
+
+/*
+ * create_parallelseqscan_plan
+ *	 Returns a parallel seqscan plan for the base relation scanned by 'best_path'
+ *	 with restriction clauses 'scan_clauses' and targetlist 'tlist'.
+ */
+static Scan *
+create_parallelseqscan_plan(PlannerInfo *root, ParallelSeqPath *best_path,
+					List *tlist, List *scan_clauses)
+{
+	Scan    *scan_plan;
+	Index		scan_relid = best_path->path.parent->relid;
+
+	/* it should be a base rel... */
+	Assert(scan_relid > 0);
+	Assert(best_path->path.parent->rtekind == RTE_RELATION);
+
+	/* Sort clauses into best execution order */
+	scan_clauses = order_qual_clauses(root, scan_clauses);
+
+	/* Reduce RestrictInfo list to bare expressions; ignore pseudoconstants */
+	scan_clauses = extract_actual_clauses(scan_clauses, false);
+
+	/* Replace any outer-relation variables with nestloop params */
+	if (best_path->path.param_info)
+	{
+		scan_clauses = (List *)
+			replace_nestloop_params(root, (Node *) scan_clauses);
+	}
+
+	scan_plan = (Scan *) make_parallelseqscan(tlist,
+											  scan_clauses,
+											  scan_relid,
+											  best_path->num_workers,
+											  best_path->num_blocks_per_worker);
+
+	copy_path_costsize(&scan_plan->plan, &best_path->path);
+
+	return scan_plan;
+}
+
+/*
  * create_indexscan_plan
  *	  Returns an indexscan plan for the base relation scanned by 'best_path'
  *	  with restriction clauses 'scan_clauses' and targetlist 'tlist'.
@@ -3314,6 +3388,30 @@ make_seqscan(List *qptlist,
 	plan->lefttree = NULL;
 	plan->righttree = NULL;
 	node->scanrelid = scanrelid;
+	node->startblock = InvalidBlockNumber;
+	node->endblock = InvalidBlockNumber;
+
+	return node;
+}
+
+static ParallelSeqScan *
+make_parallelseqscan(List *qptlist,
+			   List *qpqual,
+			   Index scanrelid,
+			   int nworkers,
+			   BlockNumber nblocksperworker)
+{
+	ParallelSeqScan *node = makeNode(ParallelSeqScan);
+	Plan	   *plan = &node->scan.plan;
+
+	/* cost should be inserted by caller */
+	plan->targetlist = qptlist;
+	plan->qual = qpqual;
+	plan->lefttree = NULL;
+	plan->righttree = NULL;
+	node->scan.scanrelid = scanrelid;
+	node->num_workers = nworkers;
+	node->num_blocks_per_worker = nblocksperworker;
 
 	return node;
 }
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 9cbbcfb..4f8e4d3 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -260,6 +260,69 @@ standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
 	return result;
 }
 
+/*
+ * create_worker_seqscan_plannedstmt
+ *	Returns a planned statement to be used by worker for execution.
+ *	Ideally, master backend should form worker's planned statement
+ *	and pass the same to worker, however for now master backend
+ *	just passes the required information and PlannedStmt is then
+ *	constructed by worker.
+ */
+PlannedStmt	*
+create_worker_seqscan_plannedstmt(worker_stmt *workerstmt)
+{
+	SeqScan    *scan_plan;
+	PlannedStmt	*result;
+	ListCell   *tlist;
+	Oid			reloid;
+
+	/* get the relid to save the same as part of planned statement. */
+	reloid = getrelid(workerstmt->scanrelId, workerstmt->rangetableList);
+
+	/* Fill in opfuncid values if missing */
+	fix_opfuncids((Node*) workerstmt->qual);
+
+	/*
+	 * Avoid removing junk entries in worker as those are
+	 * required by upper nodes in master backend.
+	 */
+	foreach(tlist, workerstmt->targetList)
+	{
+		TargetEntry *tle = (TargetEntry *) lfirst(tlist);
+
+		tle->resjunk = false;
+	}
+
+	scan_plan = create_worker_seqscan_plan(workerstmt);
+
+	/* build the PlannedStmt result */
+	result = makeNode(PlannedStmt);
+
+	result->commandType = CMD_SELECT;
+	result->queryId = 0;
+	result->hasReturning = 0;
+	result->hasModifyingCTE = 0;
+	result->canSetTag = 1;
+	result->transientPlan = 0;
+	result->planTree = (Plan*) scan_plan;
+	result->rtable = workerstmt->rangetableList;
+	result->resultRelations = NIL;
+	result->utilityStmt = NULL;
+	result->subplans = NIL;
+	result->rewindPlanIDs = NULL;
+	result->rowMarks = NIL;
+	result->relationOids = lappend_oid(result->relationOids, reloid);
+	result->invalItems = NIL;
+	result->nParamExec = 0;
+	/*
+	 * Don't bother to get hasRowSecurity passed from master
+	 * backend as this is used only for invalidation and in
+	 * worker backend plans are not saved, so can't be invalidated.
+	 */
+	result->hasRowSecurity = false;
+
+	return result;
+}
 
 /*--------------------
  * subquery_planner
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 7703946..3a44aef 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -436,6 +436,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_ParallelSeqScan:
 			{
 				SeqScan    *splan = (SeqScan *) plan;
 
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 78fb6b1..c35f934 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2163,6 +2163,7 @@ finalize_plan(PlannerInfo *root, Plan *plan, Bitmapset *valid_params,
 			break;
 
 		case T_SeqScan:
+		case T_ParallelSeqScan:
 			context.paramids = bms_add_members(context.paramids, scan_params);
 			break;
 
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 1395a21..538e612 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -706,6 +706,41 @@ create_seqscan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer)
 }
 
 /*
+ * create_parallelseqscan_path
+ *	  Creates a path corresponding to a parallel sequential scan, returning the
+ *	  pathnode.
+ */
+ParallelSeqPath *
+create_parallelseqscan_path(PlannerInfo *root, RelOptInfo *rel, int nWorkers)
+{
+	ParallelSeqPath	   *pathnode = makeNode(ParallelSeqPath);
+
+	pathnode->path.pathtype = T_ParallelSeqScan;
+	pathnode->path.parent = rel;
+	pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
+													 false);
+	pathnode->path.pathkeys = NIL;	/* seqscan has unordered result */
+
+	pathnode->num_workers = nWorkers;
+	/*
+	 * Divide the work equally among all the workers.  When the blocks
+	 * can't be divided equally (for example, with 10 blocks and 2 workers
+	 * plus the master backend, the calculation below gives each worker
+	 * 3 blocks), the last worker is responsible for scanning the
+	 * remaining blocks.  We always consider the master backend to be
+	 * the last worker because it will first try to get the tuples
+	 * scanned by other workers.  For the calculation of the number of
+	 * blocks per worker, an additional worker needs to be considered for
+	 * the master backend.
+	 */
+	pathnode->num_blocks_per_worker = rel->pages / (nWorkers + 1);
+
+	cost_parallelseqscan(pathnode, root, rel, pathnode->path.param_info, nWorkers);
+
+	return pathnode;
+}
+
+/*
  * create_index_path
  *	  Creates a path node for an index scan.
  *
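To make the block division in create_parallelseqscan_path easier to review, here is a small standalone sketch of the arithmetic. It is not part of the patch: participant_range and BlockRange are hypothetical names, and I am assuming half-open [start, end) ranges, with participant number nworkers standing for the master backend that picks up the remainder.

```c
#include <assert.h>

typedef unsigned int BlockNumber;

/* Hypothetical type for illustration: blocks in [start, end). */
typedef struct BlockRange
{
	BlockNumber start;
	BlockNumber end;
} BlockRange;

/*
 * Hypothetical helper, not in the patch.  Participants 0 .. nworkers-1
 * are the workers; participant == nworkers is the master backend,
 * which is treated as the last worker and scans whatever is left.
 */
static BlockRange
participant_range(BlockNumber total_pages, int nworkers, int participant)
{
	BlockNumber per_worker;
	BlockRange	range;

	assert(nworkers >= 0);
	assert(participant >= 0 && participant <= nworkers);

	/* Same division as the patch: one extra share for the master. */
	per_worker = total_pages / (BlockNumber) (nworkers + 1);

	range.start = (BlockNumber) participant * per_worker;
	if (participant == nworkers)
		range.end = total_pages;	/* master takes the remainder */
	else
		range.end = range.start + per_worker;

	return range;
}
```

With 10 pages and 2 workers this gives [0, 3) and [3, 6) for the workers and [6, 10) for the master, so the master can end up with more blocks than any single worker.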
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 71c2321..f056bd5 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -12,7 +12,8 @@ subdir = src/backend/postmaster
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = autovacuum.o bgworker.o bgwriter.o checkpointer.o fork_process.o \
-	pgarch.o pgstat.o postmaster.o startup.o syslogger.o walwriter.o
+OBJS = autovacuum.o backendworker.o bgworker.o bgwriter.o checkpointer.o \
+	fork_process.o pgarch.o pgstat.o postmaster.o startup.o syslogger.o \
+	walwriter.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/backendworker.c b/src/backend/postmaster/backendworker.c
new file mode 100644
index 0000000..c8afe99
--- /dev/null
+++ b/src/backend/postmaster/backendworker.c
@@ -0,0 +1,306 @@
+/*-------------------------------------------------------------------------
+ *
+ * backendworker.c
+ *	  Support routines for setting up backend workers.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/postmaster/backendworker.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * INTERFACE ROUTINES
+ *		InitiateWorkers				Setup dynamic shared memory and parallel backend workers.
+ */
+#include "postgres.h"
+
+#include "access/xact.h"
+#include "access/parallel.h"
+#include "commands/dbcommands.h"
+#include "commands/async.h"
+#include "executor/nodeParallelSeqscan.h"
+#include "miscadmin.h"
+#include "nodes/parsenodes.h"
+#include "postmaster/backendworker.h"
+#include "storage/ipc.h"
+#include "storage/procsignal.h"
+#include "storage/procarray.h"
+#include "storage/shm_toc.h"
+#include "storage/spin.h"
+#include "tcop/tcopprot.h"
+#include "utils/builtins.h"
+#include "utils/memutils.h"
+#include "utils/resowner.h"
+
+
+#define PARALLEL_TUPLE_QUEUE_SIZE					65536
+
+
+/* Table-of-contents constants for our dynamic shared memory segment. */
+#define PARALLEL_KEY_RELID			0
+#define PARALLEL_KEY_TARGETLIST		1
+#define PARALLEL_KEY_QUAL			2
+#define	PARALLEL_KEY_RANGETBL		3
+#define	PARALLEL_KEY_PARAMS			4
+#define PARALLEL_KEY_BLOCKS			5
+#define PARALLEL_KEY_INST_OPTIONS	6
+#define PARALLEL_KEY_INST_INFO		7
+#define PARALLEL_KEY_TUPLE_QUEUE	8
+#define PARALLEL_SEQSCAN_KEYS		9
+
+static void exec_worker_message(dsm_segment *seg, shm_toc *toc);
+
+/*
+ * InitiateWorkers
+ *		It sets up the required infrastructure for backend workers to
+ *	perform execution and return results to the main backend.
+ */
+void
+InitiateWorkers(Index scanrelId, List *targetList, List *qual,
+				List *rangeTable, ParamListInfo params, int instOptions,
+				char **inst_options_space, shm_mq_handle ***responseqp,
+				ParallelContext **pcxtp, BlockNumber numBlocksPerWorker,
+				int nWorkers)
+{
+	bool		already_in_parallel_mode = IsInParallelMode();
+	int			i;
+	Size		targetlist_len, qual_len, rangetbl_len, params_len;
+	BlockNumber	*num_blocks_per_worker;
+	Index	   *scanreliddata;
+	char	   *targetlistdata;
+	char	   *targetlist_str;
+	char	   *qualdata;
+	char	   *qual_str;
+	char	   *rangetbldata;
+	char	   *rangetbl_str;
+	char	   *paramsdata;
+	int		   *inst_options;
+	char	   *tuple_queue_space;
+	ParallelContext *pcxt;
+	shm_mq	   *mq;
+
+	if (!already_in_parallel_mode)
+		EnterParallelMode();
+
+	pcxt = CreateParallelContext(exec_worker_message, nWorkers);
+
+	/* Estimate space for parallel seq. scan specific contents. */
+	shm_toc_estimate_chunk(&pcxt->estimator, sizeof(scanrelId));
+
+	targetlist_str = nodeToString(targetList);
+	targetlist_len = strlen(targetlist_str) + 1;
+	shm_toc_estimate_chunk(&pcxt->estimator, targetlist_len);
+
+	qual_str = nodeToString(qual);
+	qual_len = strlen(qual_str) + 1;
+	shm_toc_estimate_chunk(&pcxt->estimator, qual_len);
+
+	rangetbl_str = nodeToString(rangeTable);
+	rangetbl_len = strlen(rangetbl_str) + 1;
+	shm_toc_estimate_chunk(&pcxt->estimator, rangetbl_len);
+
+	params_len = EstimateBoundParametersSpace(params);
+	shm_toc_estimate_chunk(&pcxt->estimator, params_len);
+
+	shm_toc_estimate_chunk(&pcxt->estimator, sizeof(BlockNumber));
+
+	/* account for instrumentation options. */
+	shm_toc_estimate_chunk(&pcxt->estimator, sizeof(int));
+
+	/*
+	 * We expect each worker to populate the instrumentation structure
+	 * allocated by master backend and then master backend will aggregate
+	 * all the information, so account it for each worker.
+	 */
+	if (instOptions)
+		shm_toc_estimate_chunk(&pcxt->estimator,
+							   sizeof(Instrumentation) * nWorkers);
+
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   (Size) PARALLEL_TUPLE_QUEUE_SIZE * nWorkers);
+
+	/* keys for parallel sequence scan specific data. */
+	shm_toc_estimate_keys(&pcxt->estimator, PARALLEL_SEQSCAN_KEYS);
+
+	InitializeParallelDSM(pcxt);
+
+	/* Store scan relation id in dynamic shared memory. */
+	scanreliddata = shm_toc_allocate(pcxt->toc, sizeof(Index));
+	*scanreliddata = scanrelId;
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_RELID, scanreliddata);
+
+	/* Store target list in dynamic shared memory. */
+	targetlistdata = shm_toc_allocate(pcxt->toc, targetlist_len);
+	memcpy(targetlistdata, targetlist_str, targetlist_len);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_TARGETLIST, targetlistdata);
+
+	/* Store qual list in dynamic shared memory. */
+	qualdata = shm_toc_allocate(pcxt->toc, qual_len);
+	memcpy(qualdata, qual_str, qual_len);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_QUAL, qualdata);
+
+	/* Store range table list in dynamic shared memory. */
+	rangetbldata = shm_toc_allocate(pcxt->toc, rangetbl_len);
+	memcpy(rangetbldata, rangetbl_str, rangetbl_len);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_RANGETBL, rangetbldata);
+
+	/*
+	 * Store parameter list in dynamic shared memory.  This is
+	 * used for parameters in prepared query.
+	 */
+	paramsdata = shm_toc_allocate(pcxt->toc, params_len);
+	SerializeBoundParams(params, params_len, paramsdata);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_PARAMS, paramsdata);
+
+	/* Store blocks to be scanned by each worker in dynamic shared memory. */
+	num_blocks_per_worker = shm_toc_allocate(pcxt->toc, sizeof(BlockNumber));
+	*num_blocks_per_worker = numBlocksPerWorker;
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_BLOCKS, num_blocks_per_worker);
+
+	/* Store instrumentation options in dynamic shared memory. */
+	inst_options = shm_toc_allocate(pcxt->toc, sizeof(int));
+	*inst_options = instOptions;
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_INST_OPTIONS, inst_options);
+
+	/*
+	 * allocate space for instrumentation information to be filled by
+	 * each worker.
+	 */
+	if (instOptions)
+	{
+		*inst_options_space =
+			shm_toc_allocate(pcxt->toc, sizeof(Instrumentation) * pcxt->nworkers);
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_INST_INFO, *inst_options_space);
+	}
+
+	/* Allocate memory for shared memory queue handles. */
+	*responseqp = (shm_mq_handle**) palloc(nWorkers * sizeof(shm_mq_handle*));
+
+	/*
+	 * Establish one message queue per worker in dynamic shared memory.
+	 * These queues should be used to transmit tuple data.
+	 */
+	tuple_queue_space =
+	   shm_toc_allocate(pcxt->toc, PARALLEL_TUPLE_QUEUE_SIZE * pcxt->nworkers);
+	for (i = 0; i < pcxt->nworkers; ++i)
+	{
+		mq = shm_mq_create(tuple_queue_space + i * PARALLEL_TUPLE_QUEUE_SIZE,
+						   (Size) PARALLEL_TUPLE_QUEUE_SIZE);
+
+		shm_mq_set_receiver(mq, MyProc);
+
+		/*
+		 * Attach the queue before launching a worker, so that we'll automatically
+		 * detach the queue if we error out.  (Otherwise, the worker might sit
+		 * there trying to write the queue long after we've gone away.)
+		 */
+		(*responseqp)[i] = shm_mq_attach(mq, pcxt->seg, NULL);
+	}
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_TUPLE_QUEUE, tuple_queue_space);
+
+	/* Register backend workers. */
+	LaunchParallelWorkers(pcxt);
+
+	for (i = 0; i < pcxt->nworkers; ++i)
+		shm_mq_set_handle((*responseqp)[i], pcxt->worker[i].bgwhandle);
+
+	/* Return results to caller. */
+	*pcxtp = pcxt;
+}
+
+
+/*
+ * exec_worker_message
+ *
+ * Execute the work assigned to a worker by master backend.
+ */
+static void
+exec_worker_message(dsm_segment *seg, shm_toc *toc)
+{
+	char	    *targetlistdata;
+	char		*qualdata;
+	char		*rangetbldata;
+	char		*paramsdata;
+	char		*tuple_queue_space;
+	BlockNumber *num_blocks_per_worker;
+	BlockNumber  start_block;
+	BlockNumber  end_block;
+	int			*inst_options;
+	char		*inst_options_space;
+	char		*instrument = NULL;
+	shm_mq	    *mq;
+	shm_mq_handle *responseq;
+	Index		*scanrelId;
+	List		*targetList = NIL;
+	List		*qual = NIL;
+	List		*rangeTableList = NIL;
+	ParamListInfo params = NULL;
+	worker_stmt	*workerstmt;
+
+	scanrelId = shm_toc_lookup(toc, PARALLEL_KEY_RELID);
+	targetlistdata = shm_toc_lookup(toc, PARALLEL_KEY_TARGETLIST);
+	qualdata = shm_toc_lookup(toc, PARALLEL_KEY_QUAL);
+	rangetbldata = shm_toc_lookup(toc, PARALLEL_KEY_RANGETBL);
+	paramsdata = shm_toc_lookup(toc, PARALLEL_KEY_PARAMS);
+	num_blocks_per_worker = shm_toc_lookup(toc, PARALLEL_KEY_BLOCKS);
+	inst_options	= shm_toc_lookup(toc, PARALLEL_KEY_INST_OPTIONS);
+
+	if (*inst_options)
+	{
+		inst_options_space = shm_toc_lookup(toc, PARALLEL_KEY_INST_INFO);
+		instrument = (inst_options_space +
+			ParallelWorkerNumber * sizeof(Instrumentation));
+	}
+
+	tuple_queue_space = shm_toc_lookup(toc, PARALLEL_KEY_TUPLE_QUEUE);
+	mq = (shm_mq *) (tuple_queue_space +
+		ParallelWorkerNumber * PARALLEL_TUPLE_QUEUE_SIZE);
+
+	shm_mq_set_sender(mq, MyProc);
+	responseq = shm_mq_attach(mq, seg, NULL);
+
+	end_block = (ParallelWorkerNumber + 1) * (*num_blocks_per_worker);
+	start_block = end_block - (*num_blocks_per_worker);
+
+	/* Redirect protocol messages to responseq. */
+	pq_redirect_to_tuple_shm_mq(responseq);
+
+	/* Restore targetList, qual and rangeTableList passed by main backend. */
+	targetList = (List *) stringToNode(targetlistdata);
+	qual = (List *) stringToNode(qualdata);
+	rangeTableList = (List *) stringToNode(rangetbldata);
+	params = RestoreBoundParams(paramsdata);
+
+	workerstmt = palloc(sizeof(worker_stmt));
+
+	workerstmt->scanrelId = *scanrelId;
+	workerstmt->targetList = targetList;
+	workerstmt->qual = qual;
+	workerstmt->rangetableList = rangeTableList;
+	workerstmt->params	= params;
+	workerstmt->startBlock = start_block;
+	workerstmt->inst_options = *inst_options;
+	workerstmt->instrument = instrument;
+
+	/*
+	 * Last worker should scan all the remaining blocks.
+	 *
+	 * XXX - It is possible that expected number of workers
+	 * won't get started, so to handle such cases master
+	 * backend should scan remaining blocks.
+	 */
+	workerstmt->endBlock = end_block;
+
+	/* Execute the worker command. */
+	exec_worker_stmt(workerstmt);
+
+	/*
+	 * Once we are done with sending tuples, detach from
+	 * shared memory message queue used to send tuples.
+	 */
+	shm_mq_detach(mq);
+}
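For anyone reading the tuple queue setup in backendworker.c, the layout is one contiguous region carved into fixed-size slots, one per worker. The sketch below only illustrates that addressing. queue_space_size and queue_slot are hypothetical helpers, QUEUE_SIZE mirrors PARALLEL_TUPLE_QUEUE_SIZE from the patch, and nothing here calls the real shm_mq or shm_toc API.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors PARALLEL_TUPLE_QUEUE_SIZE in the patch. */
#define QUEUE_SIZE 65536

/*
 * Hypothetical helper: bytes needed for all per-worker queues, as
 * estimated by the master before the segment is created.
 */
static size_t
queue_space_size(int nworkers)
{
	assert(nworkers >= 0);
	return (size_t) QUEUE_SIZE * (size_t) nworkers;
}

/*
 * Hypothetical helper: start of the slot owned by a given worker.
 * The master uses the same offset when creating queue i, and the
 * worker uses it (with its own worker number) to find its queue.
 */
static char *
queue_slot(char *queue_space, int worker_number)
{
	assert(queue_space != NULL);
	assert(worker_number >= 0);
	return queue_space + (size_t) worker_number * QUEUE_SIZE;
}
```

Since both sides derive the slot from the worker number alone, no per-worker key is needed in the table of contents; the cost is that every queue has the same fixed size.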
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 47ed84c..994eeba 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -103,6 +103,7 @@
 #include "miscadmin.h"
 #include "pg_getopt.h"
 #include "pgstat.h"
+#include "optimizer/cost.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/bgworker_internals.h"
 #include "postmaster/fork_process.h"
@@ -835,6 +836,12 @@ PostmasterMain(int argc, char *argv[])
 		ereport(ERROR,
 				(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"archive\", \"hot_standby\", or \"logical\"")));
 
+	if (parallel_seqscan_degree >= MaxConnections)
+	{
+		write_stderr("%s: parallel_seqscan_degree must be less than max_connections\n", progname);
+		ExitPostmaster(1);
+	}
+
 	/*
 	 * Other one-time internal sanity checks can go here, if they are fast.
 	 * (Put any slow processing further down, after postmaster.pid creation.)
diff --git a/src/backend/tcop/dest.c b/src/backend/tcop/dest.c
index bcf3895..da6e099 100644
--- a/src/backend/tcop/dest.c
+++ b/src/backend/tcop/dest.c
@@ -148,10 +148,19 @@ EndCommand(const char *commandTag, CommandDest dest)
 		case DestRemoteExecute:
 
 			/*
-			 * We assume the commandTag is plain ASCII and therefore requires
-			 * no encoding conversion.
+			 * Send the message via shared-memory tuple queue, if the same
+			 * is enabled.
 			 */
-			pq_putmessage('C', commandTag, strlen(commandTag) + 1);
+			if (is_tuple_shm_mq_enabled())
+				mq_putmessage_direct('C', commandTag, strlen(commandTag) + 1);
+			else
+			{
+				/*
+				 * We assume the commandTag is plain ASCII and therefore requires
+				 * no encoding conversion.
+				 */
+				pq_putmessage('C', commandTag, strlen(commandTag) + 1);
+			}
 			break;
 
 		case DestNone:
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index bbad0dc..8c6946b 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -55,6 +55,7 @@
 #include "pg_getopt.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/postmaster.h"
+#include "postmaster/backendworker.h"
 #include "replication/slot.h"
 #include "replication/walsender.h"
 #include "rewrite/rewriteHandler.h"
@@ -1003,7 +1004,7 @@ exec_simple_query(const char *query_string)
 		/*
 		 * Start the portal.  No parameters here.
 		 */
-		PortalStart(portal, NULL, 0, InvalidSnapshot);
+		PortalStart(portal, NULL, 0, InvalidSnapshot, 0);
 
 		/*
 		 * Select the appropriate output format: text unless we are doing a
@@ -1132,6 +1133,121 @@ exec_simple_query(const char *query_string)
 }
 
 /*
+ * exec_worker_stmt
+ *
+ * Execute the plan for backend worker.
+ */
+void
+exec_worker_stmt(worker_stmt *workerstmt)
+{
+	Portal		portal;
+	int16		format = 1;
+	DestReceiver *receiver;
+	bool		isTopLevel = true;
+	PlannedStmt	*planned_stmt;
+	MemoryContext oldcontext;
+	MemoryContext	plancontext;
+
+	set_ps_display("SELECT", false);
+	BeginCommand("SELECT", DestNone);
+
+	/*
+	 * Unlike exec_simple_query(), in backend worker we won't allow
+	 * transaction control statements, so we can allow plancontext
+	 * to be created in TopTransaction context.
+	 */
+	plancontext = AllocSetContextCreate(CurrentMemoryContext,
+										 "worker plan",
+										 ALLOCSET_DEFAULT_MINSIZE,
+										 ALLOCSET_DEFAULT_INITSIZE,
+										 ALLOCSET_DEFAULT_MAXSIZE);
+
+	oldcontext = MemoryContextSwitchTo(plancontext);
+
+	planned_stmt = create_worker_seqscan_plannedstmt(workerstmt);
+
+	/*
+	 * Create unnamed portal to run the query or queries in. If there
+	 * already is one, silently drop it.
+	 */
+	portal = CreatePortal("", true, true);
+	/* Don't display the portal in pg_cursors */
+	portal->visible = false;
+
+	/*
+	 * We don't have to copy anything into the portal, because everything
+	 * we are passing here is in MessageContext, which will outlive the
+	 * portal anyway.
+	 */
+	PortalDefineQuery(portal,
+					  NULL,
+					  "",
+					  "",
+					  list_make1(planned_stmt),
+					  NULL);
+
+	/*
+	 * Start the portal, passing the parameters received from master backend.
+	 */
+	PortalStart(portal,
+				workerstmt->params,
+				0,
+				InvalidSnapshot,
+				workerstmt->inst_options);
+
+	/* We always use binary format, for efficiency. */
+	PortalSetResultFormat(portal, 1, &format);
+
+	if (workerstmt->inst_options)
+		receiver = None_Receiver;
+	else
+	{
+		receiver = CreateDestReceiver(DestRemote);
+		SetRemoteDestReceiverParams(receiver, portal);
+	}
+
+	/*
+	 * Only once the portal and destreceiver have been established can
+	 * we return to the transaction context.  All that stuff needs to
+	 * survive an internal commit inside PortalRun!
+	 */
+	MemoryContextSwitchTo(oldcontext);
+
+	/*
+	 * Run the portal to completion, and then drop it (and the receiver).
+	 */
+	(void) PortalRun(portal,
+					 FETCH_ALL,
+					 isTopLevel,
+					 receiver,
+					 receiver,
+					 NULL);
+
+
+	if (!workerstmt->inst_options)
+		(*receiver->rDestroy) (receiver);
+
+	/*
+	 * copy instrumentation information into shared memory if requested
+	 * by master backend.
+	 */
+	if (workerstmt->inst_options)
+		memcpy(workerstmt->instrument,
+			   portal->queryDesc->planstate->instrument,
+			   sizeof(Instrumentation));
+
+	PortalDrop(portal, false);
+
+	/*
+	 * Send appropriate CommandComplete to client.  There is no
+	 * need to send completion tag from worker as that won't be
+	 * of any use considering the completion tag of master backend
+	 * will be used for sending to client.
+	 */
+	EndCommand("", DestRemote);
+}
+
+/*
  * exec_parse_message
  *
  * Execute a "Parse" protocol message.
@@ -1735,7 +1851,7 @@ exec_bind_message(StringInfo input_message)
 	/*
 	 * And we're ready to start portal execution.
 	 */
-	PortalStart(portal, params, 0, InvalidSnapshot);
+	PortalStart(portal, params, 0, InvalidSnapshot, 0);
 
 	/*
 	 * Apply the result format requests to the portal.
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 9c14e8a..5c83799 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -452,12 +452,15 @@ FetchStatementTargetList(Node *stmt)
  * presently ignored for non-PORTAL_ONE_SELECT portals (it's only intended
  * to be used for cursors).
  *
+ * The caller can also provide instrumentation options to be passed
+ * to CreateQueryDesc.  Most callers should simply pass zero.
+ *
  * On return, portal is ready to accept PortalRun() calls, and the result
  * tupdesc (if any) is known.
  */
 void
 PortalStart(Portal portal, ParamListInfo params,
-			int eflags, Snapshot snapshot)
+			int eflags, Snapshot snapshot, int inst_options)
 {
 	Portal		saveActivePortal;
 	ResourceOwner saveResourceOwner;
@@ -515,7 +518,7 @@ PortalStart(Portal portal, ParamListInfo params,
 											InvalidSnapshot,
 											None_Receiver,
 											params,
-											0);
+											inst_options);
 
 				/*
 				 * If it's a scrollable cursor, executor needs to support
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index d9bfa25..b8f90b7 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -630,6 +630,8 @@ const char *const config_group_names[] =
 	gettext_noop("Statistics / Query and Index Statistics Collector"),
 	/* AUTOVACUUM */
 	gettext_noop("Autovacuum"),
+	/* PARALLEL_QUERY */
+	gettext_noop("Parallel Query"),
 	/* CLIENT_CONN */
 	gettext_noop("Client Connection Defaults"),
 	/* CLIENT_CONN_STATEMENT */
@@ -2445,6 +2447,16 @@ static struct config_int ConfigureNamesInt[] =
 	},
 
 	{
+		{"parallel_seqscan_degree", PGC_SUSET, PARALLEL_QUERY,
+			gettext_noop("Sets the maximum number of simultaneously running backend worker processes."),
+			NULL
+		},
+		&parallel_seqscan_degree,
+		0, 0, MAX_BACKENDS,
+		NULL, NULL, NULL
+	},
+
+	{
 		{"autovacuum_work_mem", PGC_SIGHUP, RESOURCES_MEM,
 			gettext_noop("Sets the maximum memory to be used by each autovacuum worker process."),
 			NULL,
@@ -2632,6 +2644,36 @@ static struct config_real ConfigureNamesReal[] =
 		DEFAULT_CPU_OPERATOR_COST, 0, DBL_MAX,
 		NULL, NULL, NULL
 	},
+	{
+		{"cpu_tuple_comm_cost", PGC_USERSET, QUERY_TUNING_COST,
+			gettext_noop("Sets the planner's estimate of the cost of "
+						 "passing each tuple (row) from worker to master backend."),
+			NULL
+		},
+		&cpu_tuple_comm_cost,
+		DEFAULT_CPU_TUPLE_COMM_COST, 0, DBL_MAX,
+		NULL, NULL, NULL
+	},
+	{
+		{"parallel_setup_cost", PGC_USERSET, QUERY_TUNING_COST,
+			gettext_noop("Sets the planner's estimate of the cost of "
+						 "setting up environment (shared memory) for parallelism."),
+			NULL
+		},
+		&parallel_setup_cost,
+		DEFAULT_PARALLEL_SETUP_COST, 0, DBL_MAX,
+		NULL, NULL, NULL
+	},
+	{
+		{"parallel_startup_cost", PGC_USERSET, QUERY_TUNING_COST,
+			gettext_noop("Sets the planner's estimate of the cost of "
+						 "starting parallel workers."),
+			NULL
+		},
+		&parallel_startup_cost,
+		DEFAULT_PARALLEL_STARTUP_COST, 0, DBL_MAX,
+		NULL, NULL, NULL
+	},
 
 	{
 		{"cursor_tuple_fraction", PGC_USERSET, QUERY_TUNING_OTHER,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index b053659..784cfe0 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -287,6 +287,9 @@
 #cpu_tuple_cost = 0.01			# same scale as above
 #cpu_index_tuple_cost = 0.005		# same scale as above
 #cpu_operator_cost = 0.0025		# same scale as above
+#cpu_tuple_comm_cost = 0.1		# same scale as above
+#parallel_setup_cost = 0.0	# same scale as above
+#parallel_startup_cost = 0.0	# same scale as above
 #effective_cache_size = 4GB
 
 # - Genetic Query Optimizer -
@@ -497,6 +500,11 @@
 					# autovacuum, -1 means use
 					# vacuum_cost_limit
 
+#------------------------------------------------------------------------------
+# PARALLEL_QUERY PARAMETERS
+#------------------------------------------------------------------------------
+
+#parallel_seqscan_degree = 0		# max number of worker backend subprocesses
 
 #------------------------------------------------------------------------------
 # CLIENT CONNECTION DEFAULTS
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 939d93d..71ef2c2 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -115,6 +115,7 @@ extern HeapScanDesc heap_beginscan_bm(Relation relation, Snapshot snapshot,
 				  int nkeys, ScanKey key);
 extern void heap_setscanlimits(HeapScanDesc scan, BlockNumber startBlk,
 		   BlockNumber endBlk);
+extern void heap_setsyncscan(HeapScanDesc scan, bool sync_scan);
 extern void heap_rescan(HeapScanDesc scan, ScanKey key);
 extern void heap_endscan(HeapScanDesc scan);
 extern HeapTuple heap_getnext(HeapScanDesc scan, ScanDirection direction);
diff --git a/src/include/access/parallel.h b/src/include/access/parallel.h
index 761ba1f..00ad468 100644
--- a/src/include/access/parallel.h
+++ b/src/include/access/parallel.h
@@ -45,6 +45,8 @@ typedef struct ParallelContext
 
 extern bool ParallelMessagePending;
 
+extern int ParallelWorkerNumber;
+
 extern ParallelContext *CreateParallelContext(parallel_worker_main_type entrypoint, int nworkers);
 extern ParallelContext *CreateParallelContextForExtension(char *library_name,
 								  char *function_name, int nworkers);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 9bb6362..3c56b49 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -105,4 +105,13 @@ typedef struct SysScanDescData
 	Snapshot	snapshot;		/* snapshot to unregister at end of scan */
 }	SysScanDescData;
 
+/* struct for scanning shared memory queues */
+typedef struct ShmScanDescData
+{
+	/* scan current state */
+	int			num_shm_queues;	/* number of shared memory queues used in scan. */
+	int			ss_cqueue;		/* current queue # in scan, if any */
+	bool		shmscan_inited;		/* false = scan not init'd yet */
+}	ShmScanDescData;
+
 #endif   /* RELSCAN_H */
diff --git a/src/include/access/shmmqam.h b/src/include/access/shmmqam.h
new file mode 100644
index 0000000..df56cfe
--- /dev/null
+++ b/src/include/access/shmmqam.h
@@ -0,0 +1,44 @@
+/*-------------------------------------------------------------------------
+ *
+ * shmmqam.h
+ *	  POSTGRES shared memory queue access method definitions.
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/shmmqam.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef SHMMQAM_H
+#define SHMMQAM_H
+
+#include "access/relscan.h"
+#include "libpq/pqmq.h"
+
+
+/* Private state maintained across calls to shm_getnext. */
+typedef struct worker_result_state
+{
+	FmgrInfo   *receive_functions;
+	Oid		   *typioparams;
+	HeapTuple  tuple;
+	int		   num_shm_queues;
+	bool	   *has_row_description;
+	bool	   *queue_detached;
+	bool	   all_queues_detached;
+	bool	   all_heap_fetched;
+} worker_result_state;
+
+typedef struct worker_result_state *worker_result;
+
+typedef struct ShmScanDescData *ShmScanDesc;
+
+extern worker_result ExecInitWorkerResult(TupleDesc tupdesc, int nWorkers);
+extern ShmScanDesc shm_beginscan(int num_queues);
+extern HeapTuple shm_getnext(HeapScanDesc scanDesc, ShmScanDesc shmScan,
+							 worker_result resultState, shm_mq_handle **responseq,
+							 TupleDesc tupdesc, ScanDirection direction, bool *fromheap);
+
+#endif   /* SHMMQAM_H */
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 1c3b2b0..e8522fe 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -69,5 +69,6 @@ extern Instrumentation *InstrAlloc(int n, int instrument_options);
 extern void InstrStartNode(Instrumentation *instr);
 extern void InstrStopNode(Instrumentation *instr, double nTuples);
 extern void InstrEndLoop(Instrumentation *instr);
+extern void InstrAggNode(Instrumentation *instr1, Instrumentation *instr2);
 
 #endif   /* INSTRUMENT_H */
diff --git a/src/include/executor/nodeParallelSeqscan.h b/src/include/executor/nodeParallelSeqscan.h
new file mode 100644
index 0000000..b638a24
--- /dev/null
+++ b/src/include/executor/nodeParallelSeqscan.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeparallelSeqscan.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeParallelSeqscan.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEPARALLELSEQSCAN_H
+#define NODEPARALLELSEQSCAN_H
+
+#include "nodes/execnodes.h"
+
+extern ParallelSeqScanState *ExecInitParallelSeqScan(ParallelSeqScan *node, EState *estate, int eflags);
+extern TupleTableSlot *ExecParallelSeqScan(ParallelSeqScanState *node);
+extern void ExecEndParallelSeqScan(ParallelSeqScanState *node);
+
+extern Size EstimateScanRelationIdSpace(Oid relId);
+extern void SerializeScanRelationId(Oid relId, Size maxsize,
+									char *start_address);
+extern void RestoreScanRelationId(Oid *relId, char *start_address);
+
+extern Size EstimateTargetListSpace(List *targetList);
+extern void SerializeTargetList(List *targetList, Size maxsize,
+								char *start_address);
+extern void RestoreTargetList(List **targetList, char *start_address);
+
+#endif   /* NODEPARALLELSEQSCAN_H */
diff --git a/src/include/executor/tuptable.h b/src/include/executor/tuptable.h
index 48f84bf..e5dec1e 100644
--- a/src/include/executor/tuptable.h
+++ b/src/include/executor/tuptable.h
@@ -127,6 +127,8 @@ typedef struct TupleTableSlot
 	MinimalTuple tts_mintuple;	/* minimal tuple, or NULL if none */
 	HeapTupleData tts_minhdr;	/* workspace for minimal-tuple-only case */
 	long		tts_off;		/* saved state for slot_deform_tuple */
+	bool		tts_fromheap;	/* indicates whether the tuple is fetched from
+								   heap or shared memory message queue */
 } TupleTableSlot;
 
 #define TTS_HAS_PHYSICAL_TUPLE(slot)  \
diff --git a/src/include/libpq/pqmq.h b/src/include/libpq/pqmq.h
index ad7589d..067edbe 100644
--- a/src/include/libpq/pqmq.h
+++ b/src/include/libpq/pqmq.h
@@ -19,6 +19,13 @@
 extern void	pq_redirect_to_shm_mq(shm_mq *, shm_mq_handle *);
 extern void pq_set_parallel_master(pid_t pid, BackendId backend_id);
 
+extern int
+mq_putmessage_direct(char msgtype, const char *s, size_t len);
+extern void
+pq_redirect_to_tuple_shm_mq(shm_mq_handle *mqh);
+extern bool
+is_tuple_shm_mq_enabled(void);
+
 extern void pq_parse_errornotice(StringInfo str, ErrorData *edata);
 
 #endif   /* PQMQ_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 41288ed..844a9eb 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -16,9 +16,12 @@
 
 #include "access/genam.h"
 #include "access/heapam.h"
+#include "access/parallel.h"
+#include "access/shmmqam.h"
 #include "executor/instrument.h"
 #include "nodes/params.h"
 #include "nodes/plannodes.h"
+#include "storage/shm_mq.h"
 #include "utils/reltrigger.h"
 #include "utils/sortsupport.h"
 #include "utils/tuplestore.h"
@@ -1212,6 +1215,24 @@ typedef struct ScanState
 typedef ScanState SeqScanState;
 
 /*
+ * ParallelScanState extends ScanState by storing additional information
+ * related to parallel workers.
+ *		dsm_segment		dynamic shared memory segment to setup worker queues
+ *		responseq		shared memory queues to receive data from workers
+ */
+typedef struct ParallelScanState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	ParallelContext *pcxt;
+	shm_mq_handle **responseq;
+	ShmScanDesc pss_currentShmScanDesc;
+	worker_result	pss_workerResult;
+	char	*inst_options_space;
+} ParallelScanState;
+
+typedef ParallelScanState ParallelSeqScanState;
+
+/*
  * These structs store information about index quals that don't have simple
  * constant right-hand sides.  See comments for ExecIndexBuildScanKeys()
  * for discussion.
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 97ef0fc..b6f1493 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -51,6 +51,7 @@ typedef enum NodeTag
 	T_BitmapOr,
 	T_Scan,
 	T_SeqScan,
+	T_ParallelSeqScan,
 	T_IndexScan,
 	T_IndexOnlyScan,
 	T_BitmapIndexScan,
@@ -97,6 +98,7 @@ typedef enum NodeTag
 	T_BitmapOrState,
 	T_ScanState,
 	T_SeqScanState,
+	T_ParallelSeqScanState,
 	T_IndexScanState,
 	T_IndexOnlyScanState,
 	T_BitmapIndexScanState,
@@ -217,6 +219,7 @@ typedef enum NodeTag
 	T_IndexOptInfo,
 	T_ParamPathInfo,
 	T_Path,
+	T_ParallelSeqPath,
 	T_IndexPath,
 	T_BitmapHeapPath,
 	T_BitmapAndPath,
diff --git a/src/include/nodes/params.h b/src/include/nodes/params.h
index 5b096c5..eb8c86a 100644
--- a/src/include/nodes/params.h
+++ b/src/include/nodes/params.h
@@ -103,4 +103,9 @@ typedef struct ParamExecData
 /* Functions found in src/backend/nodes/params.c */
 extern ParamListInfo copyParamList(ParamListInfo from);
 
+extern Size
+EstimateBoundParametersSpace(ParamListInfo params);
+extern void
+SerializeBoundParams(ParamListInfo params, Size maxsize, char *start_address);
+extern ParamListInfo RestoreBoundParams(char *start_address);
 #endif   /* PARAMS_H */
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index b1dfa85..f08448f 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -20,9 +20,13 @@
 #ifndef PARSENODES_H
 #define PARSENODES_H
 
+#include "executor/instrument.h"
 #include "nodes/bitmapset.h"
+#include "nodes/params.h"
 #include "nodes/primnodes.h"
 #include "nodes/value.h"
+#include "nodes/params.h"
+#include "storage/block.h"
 #include "utils/lockwaitpolicy.h"
 
 /* Possible sources of a Query */
@@ -156,6 +160,19 @@ typedef struct Query
 								 * depends on to be semantically valid */
 } Query;
 
+/* worker statement required for execution. */
+typedef struct worker_stmt
+{
+	Index		scanrelId;
+	List		*targetList;
+	List		*qual;
+	List		*rangetableList;
+	ParamListInfo params;
+	BlockNumber startBlock;
+	BlockNumber endBlock;
+	int			inst_options;
+	char		*instrument;
+} worker_stmt;
 
 /****************************************************************************
  *	Supporting data structures for Parse Trees
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 316c9ce..3354398 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -18,6 +18,7 @@
 #include "lib/stringinfo.h"
 #include "nodes/bitmapset.h"
 #include "nodes/primnodes.h"
+#include "storage/block.h"
 #include "utils/lockwaitpolicy.h"
 
 
@@ -269,6 +270,8 @@ typedef struct Scan
 {
 	Plan		plan;
 	Index		scanrelid;		/* relid is index into the range table */
+	BlockNumber startblock;		/* block to start seq scan */
+	BlockNumber endblock;		/* block upto which scan has to be done */
 } Scan;
 
 /* ----------------
@@ -278,6 +281,17 @@ typedef struct Scan
 typedef Scan SeqScan;
 
 /* ----------------
+ *		parallel sequential scan node
+ * ----------------
+ */
+typedef struct ParallelSeqScan
+{
+	Scan		scan;
+	int			num_workers;
+	BlockNumber	num_blocks_per_worker;
+} ParallelSeqScan;
+
+/* ----------------
  *		index scan node
  *
  * indexqualorig is an implicitly-ANDed list of index qual expressions, each
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 6845a40..576add5 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -737,6 +737,13 @@ typedef struct Path
 	/* pathkeys is a List of PathKey nodes; see above */
 } Path;
 
+typedef struct ParallelSeqPath
+{
+	Path		path;
+	int			num_workers;
+	BlockNumber	num_blocks_per_worker;
+} ParallelSeqPath;
+
 /* Macro for extracting a path's parameterization relids; beware double eval */
 #define PATH_REQ_OUTER(path)  \
 	((path)->param_info ? (path)->param_info->ppi_req_outer : (Relids) NULL)
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 9c2000b..0b6a469 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -26,6 +26,14 @@
 #define DEFAULT_CPU_TUPLE_COST	0.01
 #define DEFAULT_CPU_INDEX_TUPLE_COST 0.005
 #define DEFAULT_CPU_OPERATOR_COST  0.0025
+#define DEFAULT_CPU_TUPLE_COMM_COST 0.1
+/*
+ * XXX - We need some experiments to know what could be
+ * appropriate default values for parallel setup and startup
+ * cost.
+ */
+#define	DEFAULT_PARALLEL_SETUP_COST  0.0
+#define	DEFAULT_PARALLEL_STARTUP_COST  0.0
 
 #define DEFAULT_EFFECTIVE_CACHE_SIZE  524288	/* measured in pages */
 
@@ -48,8 +56,12 @@ extern PGDLLIMPORT double random_page_cost;
 extern PGDLLIMPORT double cpu_tuple_cost;
 extern PGDLLIMPORT double cpu_index_tuple_cost;
 extern PGDLLIMPORT double cpu_operator_cost;
+extern PGDLLIMPORT double cpu_tuple_comm_cost;
+extern PGDLLIMPORT double parallel_setup_cost;
+extern PGDLLIMPORT double parallel_startup_cost;
 extern PGDLLIMPORT int effective_cache_size;
 extern Cost disable_cost;
+extern int	parallel_seqscan_degree;
 extern bool enable_seqscan;
 extern bool enable_indexscan;
 extern bool enable_indexonlyscan;
@@ -68,6 +80,8 @@ extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,
 					double index_pages, PlannerInfo *root);
 extern void cost_seqscan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 			 ParamPathInfo *param_info);
+extern void cost_parallelseqscan(ParallelSeqPath *path, PlannerInfo *root,
+			 RelOptInfo *baserel, ParamPathInfo *param_info, int nWorkers);
 extern void cost_index(IndexPath *path, PlannerInfo *root,
 		   double loop_count);
 extern void cost_bitmap_heap_scan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 9923f0e..32c3e0d 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -32,6 +32,8 @@ extern bool add_path_precheck(RelOptInfo *parent_rel,
 
 extern Path *create_seqscan_path(PlannerInfo *root, RelOptInfo *rel,
 					Relids required_outer);
+extern ParallelSeqPath *create_parallelseqscan_path(PlannerInfo *root,
+					RelOptInfo *rel, int nWorkers);
 extern IndexPath *create_index_path(PlannerInfo *root,
 				  IndexOptInfo *index,
 				  List *indexclauses,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 6cad92e..391d519 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -46,6 +46,13 @@ extern void debug_print_rel(PlannerInfo *root, RelOptInfo *rel);
 #endif
 
 /*
+ * parallelpath.c
+ *	  routines to generate parallel scan paths
+ */
+
+extern void create_parallelscan_paths(PlannerInfo *root, RelOptInfo *rel);
+
+/*
  * indxpath.c
  *	  routines to generate index paths
  */
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index 082f7d7..a4faf32 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -41,6 +41,8 @@ extern Plan *optimize_minmax_aggregates(PlannerInfo *root, List *tlist,
  * prototypes for plan/createplan.c
  */
 extern Plan *create_plan(PlannerInfo *root, Path *best_path);
+extern SeqScan *
+create_worker_seqscan_plan(worker_stmt *workerstmt);
 extern SubqueryScan *make_subqueryscan(List *qptlist, List *qpqual,
 				  Index scanrelid, Plan *subplan);
 extern ForeignScan *make_foreignscan(List *qptlist, List *qpqual,
diff --git a/src/include/optimizer/planner.h b/src/include/optimizer/planner.h
index cd62aec..91ddffe 100644
--- a/src/include/optimizer/planner.h
+++ b/src/include/optimizer/planner.h
@@ -14,6 +14,7 @@
 #ifndef PLANNER_H
 #define PLANNER_H
 
+#include "nodes/parsenodes.h"
 #include "nodes/plannodes.h"
 #include "nodes/relation.h"
 
@@ -29,6 +30,8 @@ extern PlannedStmt *planner(Query *parse, int cursorOptions,
 		ParamListInfo boundParams);
 extern PlannedStmt *standard_planner(Query *parse, int cursorOptions,
 				 ParamListInfo boundParams);
+extern PlannedStmt *
+create_worker_seqscan_plannedstmt(worker_stmt *workerstmt);
 
 extern Plan *subquery_planner(PlannerGlobal *glob, Query *parse,
 				 PlannerInfo *parent_root,
diff --git a/src/include/postmaster/backendworker.h b/src/include/postmaster/backendworker.h
new file mode 100644
index 0000000..c0b9b42
--- /dev/null
+++ b/src/include/postmaster/backendworker.h
@@ -0,0 +1,33 @@
+/*--------------------------------------------------------------------
+ * backendworker.h
+ *		POSTGRES backend workers interface
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/postmaster/backendworker.h
+ *--------------------------------------------------------------------
+ */
+#ifndef BACKENDWORKER_H
+#define BACKENDWORKER_H
+
+/*---------------------------------------------------------------------
+ * External module API.
+ *---------------------------------------------------------------------
+ */
+
+#include "libpq/pqmq.h"
+
+extern int	parallel_seqscan_degree;
+extern void InitiateWorkers(Index scanrelId, List *targetList,
+							List *qual, List *rangeTable,
+							ParamListInfo params,
+							int instOptions,
+							char **inst_options_space,
+							shm_mq_handle ***responseqp,
+							ParallelContext **pcxtp,
+							BlockNumber numBlocksPerWorker,
+							int nWorkers);
+
+#endif   /* BACKENDWORKER_H */
diff --git a/src/include/tcop/pquery.h b/src/include/tcop/pquery.h
index 8073a6e..d14d876 100644
--- a/src/include/tcop/pquery.h
+++ b/src/include/tcop/pquery.h
@@ -28,7 +28,7 @@ extern List *FetchPortalTargetList(Portal portal);
 extern List *FetchStatementTargetList(Node *stmt);
 
 extern void PortalStart(Portal portal, ParamListInfo params,
-			int eflags, Snapshot snapshot);
+			int eflags, Snapshot snapshot, int inst_options);
 
 extern void PortalSetResultFormat(Portal portal, int nFormats,
 					  int16 *formats);
diff --git a/src/include/tcop/tcopprot.h b/src/include/tcop/tcopprot.h
index 0a350fd..02cf518 100644
--- a/src/include/tcop/tcopprot.h
+++ b/src/include/tcop/tcopprot.h
@@ -83,5 +83,6 @@ extern void set_debug_options(int debug_flag,
 extern bool set_plan_disabling_options(const char *arg,
 						   GucContext context, GucSource source);
 extern const char *get_stats_option_name(const char *arg);
+extern void exec_worker_stmt(worker_stmt *workerstmt);
 
 #endif   /* TCOPPROT_H */
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index cf319af..38855e5 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -85,6 +85,7 @@ enum config_group
 	STATS_MONITORING,
 	STATS_COLLECTOR,
 	AUTOVACUUM,
+	PARALLEL_QUERY,
 	CLIENT_CONN,
 	CLIENT_CONN_STATEMENT,
 	CLIENT_CONN_LOCALE,
#151Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#150)
Re: Parallel Seq Scan

On Fri, Feb 6, 2015 at 9:43 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Here is the latest patch which fixes reported issues and supported
Prepared Statements and Explain Statement for parallel sequential
scan.

The main purpose is to get the feedback if possible on overall
structure/design of code before I go ahead.

I'm not very happy with the way this is modularized:

1. The new parallel sequential scan node runs only in the master. The
workers are running a regular sequential scan with a hack to make them
scan a subset of the blocks. I think this is wrong; parallel
sequential scan shouldn't require this kind of modification to the
non-parallel case.

2. InitiateWorkers() is entirely specific to the concerns of parallel
sequential scan. After looking this over, I think there are four
categories of things that need to be clearly separated. Some stuff is
going to be needed for any parallel query; some stuff is going to be
needed only for parallel scans but will be needed for any type of
parallel scan, not just parallel sequential scan[1]; some stuff is
needed for any type of node that returns tuples but not for nodes that
don't return tuples (e.g. needed for ParallelSeqScan and
ParallelHashJoin, but not needed for ParallelHash); and some stuff is
only going to be needed for parallel sequential scan specifically.
This patch mixes all of those concerns together in a single function.
That won't do; this needs to be easily extensible to whatever someone
wants to parallelize next.

3. I think the whole idea of using the portal infrastructure for this
is wrong. We've talked about this before, but the fact that you're
doing it this way is having a major impact on the whole design of the
patch, and I can't believe it's ever going to be committable this way.
To create a portal, you have to pretend that you received a protocol
message, which you didn't; and you have to pretend there is an SQL
query so you can call PortalDefineQuery. That's ugly. As far as I
can see the only thing we really get out of any of that is that we can
use the DestReceiver stuff to get the tuples back to the master, but
that doesn't really work either, because you're having to hack
printtup.c anyway. So from my point of view you're going through a
bunch of layers that really don't have any value. Considering the way
the parallel mode patch has evolved, I no longer think there's much
point to passing anything other than raw tuples between the backends,
so the whole idea of going through a deform/send/recv/form cycle seems
like something we can entirely skip.

4.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1]: It is of course arguable whether a parallel index-scan or parallel
bitmap index-scan or parallel index-only-scan or parallel custom scan
makes sense, but this patch shouldn't assume that we won't want to do
those things. We have other places in the code that know about the
concept of a scan as opposed to some other kind of executor construct,
and we should preserve that distinction here.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#152Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#151)
Re: Parallel Seq Scan

On Fri, Feb 6, 2015 at 12:34 PM, Robert Haas <robertmhaas@gmail.com> wrote:

4.

Obviously that went out a bit too soon. Anyway, what I think we
should do here is back up a bit and talk about what the problems are
that we need to solve here and how each of them should be solved. I
think there is some good code in this patch, but we really need to
think about what the interfaces should look like and achieve a clean
separation of concerns.

Looking at the code for the non-parallel SeqScan node, there are
basically two things going on here:

1. We call heap_getnext() to get the next tuple and store it into a
TupleTableSlot.
2. Via ExecScan(), we do projection and apply quals.

My first comment here is that I think we should actually teach
heapam.c about parallelism. In other words, let's have an interface
like this:

extern Size heap_parallelscan_estimate(Snapshot snapshot);
extern void heap_parallelscan_initialize(ParallelHeapScanDesc target,
                                         Relation relation, Snapshot snapshot);
extern HeapScanDesc heap_beginscan_parallel(ParallelHeapScanDesc);

So the idea is that if you want to do a parallel scan, you call
heap_parallelscan_estimate() to figure out how much space to reserve
in your dynamic shared memory segment. Then you call
heap_parallelscan_initialize() to initialize the chunk of memory once
you have it. Then each backend that wants to assist in the parallel
scan calls heap_beginscan_parallel() on that chunk of memory and gets
its own HeapScanDesc. Then, they can all call heap_getnext() just as
in the non-parallel case. The ParallelHeapScanDesc needs to contain
the relation OID, the snapshot, the ending block number, and a
current-block counter. Instead of automatically advancing to the next
block, they use one of Andres's nifty new atomic ops to bump the
current-block counter and then scan the block just before the new
value. All this seems pretty straightforward, and if we decide to
later change the way the relation gets scanned (e.g. in 1GB chunks
rather than block-by-block) it can be handled here pretty easily.
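
A minimal sketch of the block-claiming idea described above, under loud
assumptions: C11 <stdatomic.h> stands in for PostgreSQL's atomic ops, and
every name here (ToyParallelScan, toy_parallelscan_initialize,
toy_parallelscan_nextpage, toy_scan_with_participants) is hypothetical and
not taken from any patch. The participants are simulated in a single thread,
so this only illustrates the bookkeeping, not real concurrency.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef unsigned int ToyBlockNumber;	/* stand-in for BlockNumber */

/* Hypothetical shared descriptor; it would live in the DSM segment. */
typedef struct ToyParallelScan
{
	unsigned int	relid;		/* relation OID */
	ToyBlockNumber	nblocks;	/* ending block number, fixed at init */
	atomic_uint		next_block;	/* shared current-block counter */
} ToyParallelScan;

void
toy_parallelscan_initialize(ToyParallelScan *ps, unsigned int relid,
							ToyBlockNumber nblocks)
{
	ps->relid = relid;
	ps->nblocks = nblocks;
	atomic_init(&ps->next_block, 0);
}

/*
 * Bump the shared counter and claim the block just before the new value.
 * Returns false once every block has been handed out.
 */
bool
toy_parallelscan_nextpage(ToyParallelScan *ps, ToyBlockNumber *blk)
{
	ToyBlockNumber claimed = atomic_fetch_add(&ps->next_block, 1);

	if (claimed >= ps->nblocks)
		return false;
	*blk = claimed;
	return true;
}

/*
 * Simulate several participants taking turns until the scan is exhausted.
 * claims[] must have nblocks zeroed entries; returns total blocks claimed.
 */
unsigned int
toy_scan_with_participants(ToyBlockNumber nblocks, int nparticipants,
						   unsigned char *claims)
{
	ToyParallelScan ps;
	unsigned int total = 0;
	bool		progressed = true;

	toy_parallelscan_initialize(&ps, 16384, nblocks);
	while (progressed)
	{
		progressed = false;
		for (int i = 0; i < nparticipants; i++)
		{
			ToyBlockNumber blk;

			if (toy_parallelscan_nextpage(&ps, &blk))
			{
				claims[blk]++;
				total++;
				progressed = true;
			}
		}
	}
	return total;
}
```

Because each participant only ever sees the value returned by the atomic
increment, no block is handed out twice, and changing the unit of work (say,
larger chunks) would only touch the next-page function.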

Now, let's suppose that we have this interface and for some reason we
don't care about quals and projection - we just want to get the tuples
back to the master. It's easy enough to create a parallel context
that fires up a worker and lets the worker call
heap_beginscan_parallel() and then heap_getnext() in a loop, but what
does it do with the resulting tuples? We need a tuple queue that can
be used to send the tuples back to master. That's also pretty easy:
just set up a shm_mq and use shm_mq_send() to send each tuple. Use
shm_mq_receive() in the master to read them back out. The only thing
we need to be careful about is that the tuple descriptors match. It
must be that they do, because the way the current parallel context
patch works, the master is guaranteed to hold a lock on the relation
from before the worker starts up until after it dies. But we could
stash the tuple descriptor in shared memory and cross-check that it
matches just to be sure. Anyway, this doesn't seem terribly complex
although we might want to wrap some abstraction around it somehow so
that every kind of parallelism that uses tuple queues can benefit from
it. Perhaps this could even be built into the parallel context
machinery somehow, or maybe it's something executor-specific. At any
rate it looks simpler than what you've got now.
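
To make the tuple-queue idea concrete, here is a toy, single-process sketch
of a length-prefixed message ring. It is not shm_mq and does not use its
API: ToyQueue, toy_mq_send and toy_mq_receive are hypothetical names, there
is no locking or blocking, and the tuple-descriptor cross-check is left out.
It only shows raw tuple bytes going in on one side and coming out intact on
the other.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_MQ_SIZE 256			/* ring capacity in bytes */

typedef struct ToyQueue
{
	size_t		head;			/* total bytes consumed so far */
	size_t		tail;			/* total bytes produced so far */
	unsigned char data[TOY_MQ_SIZE];
} ToyQueue;

static void
toy_copy_in(ToyQueue *q, const void *src, size_t len)
{
	const unsigned char *p = src;

	for (size_t i = 0; i < len; i++)
		q->data[(q->tail + i) % TOY_MQ_SIZE] = p[i];
	q->tail += len;
}

static void
toy_copy_out(ToyQueue *q, void *dst, size_t len)
{
	unsigned char *p = dst;

	for (size_t i = 0; i < len; i++)
		p[i] = q->data[(q->head + i) % TOY_MQ_SIZE];
	q->head += len;
}

/* Worker side: enqueue one tuple; false means "queue full, retry later". */
bool
toy_mq_send(ToyQueue *q, const void *tuple, size_t len)
{
	size_t		used = q->tail - q->head;

	if (len > TOY_MQ_SIZE - sizeof(len) ||
		sizeof(len) + len > TOY_MQ_SIZE - used)
		return false;
	toy_copy_in(q, &len, sizeof(len));	/* length prefix */
	toy_copy_in(q, tuple, len);			/* raw tuple bytes */
	return true;
}

/* Master side: dequeue one tuple into buf; false if empty or buf too small. */
bool
toy_mq_receive(ToyQueue *q, void *buf, size_t bufsize, size_t *len)
{
	size_t		saved_head = q->head;
	size_t		msglen;

	if (q->head == q->tail)
		return false;
	toy_copy_out(q, &msglen, sizeof(msglen));
	if (msglen > bufsize)
	{
		q->head = saved_head;	/* leave the message in the queue */
		return false;
	}
	toy_copy_out(q, buf, msglen);
	*len = msglen;
	return true;
}
```

A real queue would block or set a latch instead of returning false, which is
the kind of detail a shared abstraction around tuple queues could hide from
each parallel node.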

The complicated part here seems to me to figure out what we need to
pass from the parallel leader to the parallel worker to create enough
state for quals and projection. If we want to be able to call
ExecScan() without modification, which seems like a good goal, we're
going to need a ScanState node, which is going to need to contain
valid pointers to (at least) a ProjectionInfo, an ExprContext, and a
List of quals. That in turn is going to require an ExecutorState.
Serializing those things directly doesn't seem very practical; what we
instead want to do is figure out what we can pass that will allow easy
reconstruction of those data structures. Right now, you're passing
the target list, the qual list, the range table, and the params, but
the range table doesn't seem to be getting used anywhere. I wonder if
we need it. If we could get away with just passing the target list
and qual list, and params, we'd be doing pretty well, I think. But
I'm not sure exactly what that looks like.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#153Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#152)
Re: Parallel Seq Scan

On Fri, Feb 6, 2015 at 2:13 PM, Robert Haas <robertmhaas@gmail.com> wrote:

The complicated part here seems to me to figure out what we need to
pass from the parallel leader to the parallel worker to create enough
state for quals and projection. If we want to be able to call
ExecScan() without modification, which seems like a good goal, we're
going to need a ScanState node, which is going to need to contain
valid pointers to (at least) a ProjectionInfo, an ExprContext, and a
List of quals. That in turn is going to require an ExecutorState.
Serializing those things directly doesn't seem very practical; what we
instead want to do is figure out what we can pass that will allow easy
reconstruction of those data structures. Right now, you're passing
the target list, the qual list, the range table, and the params, but
the range table doesn't seem to be getting used anywhere. I wonder if
we need it. If we could get away with just passing the target list
and qual list, and params, we'd be doing pretty well, I think. But
I'm not sure exactly what that looks like.

IndexBuildHeapRangeScan shows how to do qual evaluation with
relatively little setup:

estate = CreateExecutorState();
econtext = GetPerTupleExprContext(estate);
slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation));

/* Arrange for econtext's scan tuple to be the tuple under test */
econtext->ecxt_scantuple = slot;

/* Set up execution state for predicate, if any. */
predicate = (List *)
    ExecPrepareExpr((Expr *) indexInfo->ii_Predicate,
                    estate);

Then, for each tuple:

ExecStoreTuple(heapTuple, slot, InvalidBuffer, false);

And:

if (!ExecQual(predicate, econtext, false))
    continue;

This looks like a good model to follow for parallel sequential scan.
The point though is that I think we should do it directly rather than
letting the portal machinery do it for us. Not sure how to get
projection working yet.
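
The per-tuple pattern above (store the tuple in a slot, test the qual, skip
on failure) can be sketched as a standalone toy loop. This assumes nothing
about the executor: ToyTuple, ToySlot, ToyExprContext, ToyQual and
toy_worker_scan are hypothetical stand-ins, and projection is not attempted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct ToyTuple
{
	int			id;
	int			value;
} ToyTuple;

/* Stand-ins for TupleTableSlot and ExprContext. */
typedef struct ToySlot
{
	const ToyTuple *tuple;
} ToySlot;

typedef struct ToyExprContext
{
	ToySlot    *scantuple;		/* the tuple under test */
} ToyExprContext;

typedef bool (*ToyQual) (const ToyExprContext *econtext);

/* Example qual: value > 10. */
bool
toy_qual_value_gt_10(const ToyExprContext *econtext)
{
	return econtext->scantuple->tuple->value > 10;
}

/*
 * Scan ntuples tuples, keep those passing the qual (or all if qual is NULL).
 * out[] must have room for ntuples pointers; returns the number kept.
 */
size_t
toy_worker_scan(const ToyTuple *heap, size_t ntuples, ToyQual qual,
				const ToyTuple **out)
{
	ToySlot		slot = {NULL};
	ToyExprContext econtext = {&slot};
	size_t		nout = 0;

	for (size_t i = 0; i < ntuples; i++)
	{
		slot.tuple = &heap[i];	/* plays the role of storing into the slot */

		if (qual != NULL && !qual(&econtext))
			continue;			/* qual failed: skip this tuple */

		out[nout++] = slot.tuple;	/* a worker would queue it here */
	}
	return nout;
}
```

The setup happens once before the loop and only the slot contents change per
tuple, which is what keeps the per-tuple cost low in this pattern.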

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#154Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#152)
2 attachment(s)
Re: Parallel Seq Scan

On Fri, Feb 6, 2015 at 2:13 PM, Robert Haas <robertmhaas@gmail.com> wrote:

My first comment here is that I think we should actually teach
heapam.c about parallelism.

I coded this up; see attached. I'm also attaching an updated version
of the parallel count code revised to use this API. It's now called
"parallel_count" rather than "parallel_dummy" and I removed some
stupid stuff from it. I'm curious to see what other people think, but
this seems much cleaner to me. With the old approach, the
parallel-count code was duplicating some of the guts of heapam.c and
dropping the rest on the floor; now it just asks for a parallel scan
and away it goes. Similarly, if your parallel-seqscan patch wanted to
scan block-by-block rather than splitting the relation into equal
parts, or if it wanted to participate in the synchronized-seqscan
stuff, there was no clean way to do that. With this approach, those
decisions are - as they quite properly should be - isolated within
heapam.c, rather than creeping into the executor.

(These patches should be applied over parallel-mode-v4.patch.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

parallel-heap-scan.patch (application/x-patch)
commit 096b3d5bdb4df5de095104fd3f58efa97e08a2ff
Author: Robert Haas <rhaas@postgresql.org>
Date:   Fri Feb 6 21:19:40 2015 -0500

    Support parallel heap scans.

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 50bede8..abfe8c2 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -62,6 +62,7 @@
 #include "storage/predicate.h"
 #include "storage/procarray.h"
 #include "storage/smgr.h"
+#include "storage/spin.h"
 #include "storage/standby.h"
 #include "utils/datum.h"
 #include "utils/inval.h"
@@ -79,8 +80,10 @@ bool		synchronize_seqscans = true;
 static HeapScanDesc heap_beginscan_internal(Relation relation,
 						Snapshot snapshot,
 						int nkeys, ScanKey key,
+						ParallelHeapScanDesc parallel_scan,
 						bool allow_strat, bool allow_sync,
 						bool is_bitmapscan, bool temp_snap);
+static BlockNumber heap_parallelscan_nextpage(ParallelHeapScanDesc);
 static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
 					TransactionId xid, CommandId cid, int options);
 static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
@@ -221,7 +224,10 @@ initscan(HeapScanDesc scan, ScanKey key, bool is_rescan)
 	 * results for a non-MVCC snapshot, the caller must hold some higher-level
 	 * lock that ensures the interesting tuple(s) won't change.)
 	 */
-	scan->rs_nblocks = RelationGetNumberOfBlocks(scan->rs_rd);
+	if (scan->rs_parallel != NULL)
+		scan->rs_nblocks = scan->rs_parallel->phs_nblocks;
+	else
+		scan->rs_nblocks = RelationGetNumberOfBlocks(scan->rs_rd);
 
 	/*
 	 * If the table is large relative to NBuffers, use a bulk-read access
@@ -480,7 +486,18 @@ heapgettup(HeapScanDesc scan,
 				tuple->t_data = NULL;
 				return;
 			}
-			page = scan->rs_startblock; /* first page */
+			if (scan->rs_parallel != NULL)
+			{
+				page = heap_parallelscan_nextpage(scan->rs_parallel);
+				if (page >= scan->rs_nblocks)
+				{
+					Assert(!BufferIsValid(scan->rs_cbuf));
+					tuple->t_data = NULL;
+					return;
+				}
+			}
+			else
+				page = scan->rs_startblock; /* first page */
 			heapgetpage(scan, page);
 			lineoff = FirstOffsetNumber;		/* first offnum */
 			scan->rs_inited = true;
@@ -503,6 +520,9 @@ heapgettup(HeapScanDesc scan,
 	}
 	else if (backward)
 	{
+		/* backward parallel scan not supported */
+		Assert(scan->rs_parallel == NULL);
+
 		if (!scan->rs_inited)
 		{
 			/*
@@ -655,11 +675,19 @@ heapgettup(HeapScanDesc scan,
 		}
 		else
 		{
-			page++;
-			if (page >= scan->rs_nblocks)
-				page = 0;
-			finished = (page == scan->rs_startblock) ||
-				(scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
+			if (scan->rs_parallel != NULL)
+			{
+				page = heap_parallelscan_nextpage(scan->rs_parallel);
+				finished = (page >= scan->rs_nblocks);
+			}
+			else
+			{
+				page++;
+				if (page >= scan->rs_nblocks)
+					page = 0;
+				finished = (page == scan->rs_startblock) ||
+					(scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
+			}
 
 			/*
 			 * Report our new scan position for synchronization purposes. We
@@ -757,7 +785,18 @@ heapgettup_pagemode(HeapScanDesc scan,
 				tuple->t_data = NULL;
 				return;
 			}
-			page = scan->rs_startblock; /* first page */
+			if (scan->rs_parallel != NULL)
+			{
+				page = heap_parallelscan_nextpage(scan->rs_parallel);
+				if (page >= scan->rs_nblocks)
+				{
+					Assert(!BufferIsValid(scan->rs_cbuf));
+					tuple->t_data = NULL;
+					return;
+				}
+			}
+			else
+				page = scan->rs_startblock; /* first page */
 			heapgetpage(scan, page);
 			lineindex = 0;
 			scan->rs_inited = true;
@@ -777,6 +816,9 @@ heapgettup_pagemode(HeapScanDesc scan,
 	}
 	else if (backward)
 	{
+		/* backward parallel scan not supported */
+		Assert(scan->rs_parallel == NULL);
+
 		if (!scan->rs_inited)
 		{
 			/*
@@ -918,11 +960,19 @@ heapgettup_pagemode(HeapScanDesc scan,
 		}
 		else
 		{
-			page++;
-			if (page >= scan->rs_nblocks)
-				page = 0;
-			finished = (page == scan->rs_startblock) ||
-				(scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
+			if (scan->rs_parallel != NULL)
+			{
+				page = heap_parallelscan_nextpage(scan->rs_parallel);
+				finished = (page >= scan->rs_nblocks);
+			}
+			else
+			{
+				page++;
+				if (page >= scan->rs_nblocks)
+					page = 0;
+				finished = (page == scan->rs_startblock) ||
+					(scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
+			}
 
 			/*
 			 * Report our new scan position for synchronization purposes. We
@@ -1303,7 +1353,7 @@ HeapScanDesc
 heap_beginscan(Relation relation, Snapshot snapshot,
 			   int nkeys, ScanKey key)
 {
-	return heap_beginscan_internal(relation, snapshot, nkeys, key,
+	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
 								   true, true, false, false);
 }
 
@@ -1313,7 +1363,7 @@ heap_beginscan_catalog(Relation relation, int nkeys, ScanKey key)
 	Oid			relid = RelationGetRelid(relation);
 	Snapshot	snapshot = RegisterSnapshot(GetCatalogSnapshot(relid));
 
-	return heap_beginscan_internal(relation, snapshot, nkeys, key,
+	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
 								   true, true, false, true);
 }
 
@@ -1322,7 +1372,7 @@ heap_beginscan_strat(Relation relation, Snapshot snapshot,
 					 int nkeys, ScanKey key,
 					 bool allow_strat, bool allow_sync)
 {
-	return heap_beginscan_internal(relation, snapshot, nkeys, key,
+	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
 								   allow_strat, allow_sync, false, false);
 }
 
@@ -1330,13 +1380,14 @@ HeapScanDesc
 heap_beginscan_bm(Relation relation, Snapshot snapshot,
 				  int nkeys, ScanKey key)
 {
-	return heap_beginscan_internal(relation, snapshot, nkeys, key,
+	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
 								   false, false, true, false);
 }
 
 static HeapScanDesc
 heap_beginscan_internal(Relation relation, Snapshot snapshot,
 						int nkeys, ScanKey key,
+						ParallelHeapScanDesc parallel_scan,
 						bool allow_strat, bool allow_sync,
 						bool is_bitmapscan, bool temp_snap)
 {
@@ -1364,6 +1415,7 @@ heap_beginscan_internal(Relation relation, Snapshot snapshot,
 	scan->rs_allow_strat = allow_strat;
 	scan->rs_allow_sync = allow_sync;
 	scan->rs_temp_snap = temp_snap;
+	scan->rs_parallel = parallel_scan;
 
 	/*
 	 * we can use page-at-a-time mode if it's an MVCC-safe snapshot
@@ -1457,6 +1509,79 @@ heap_endscan(HeapScanDesc scan)
 }
 
 /* ----------------
+ *		heap_parallelscan_estimate - estimate storage for ParallelHeapScanDesc
+ *
+ *		Sadly, this doesn't reduce to a constant, because the size required
+ *		to serialize the snapshot can vary.
+ * ----------------
+ */
+Size
+heap_parallelscan_estimate(Snapshot snapshot)
+{
+	return add_size(offsetof(ParallelHeapScanDescData, phs_snapshot_data),
+					EstimateSnapshotSpace(snapshot));
+}
+
+/* ----------------
+ *		heap_parallelscan_initialize - initialize ParallelHeapScanDesc
+ *
+ *		Must allow as many bytes of shared memory as returned by
+ *		heap_parallelscan_estimate.  Call this just once in the leader
+ *		process; then, individual workers attach via heap_beginscan_parallel.
+ * ----------------
+ */
+void
+heap_parallelscan_initialize(ParallelHeapScanDesc target, Relation relation,
+							 Snapshot snapshot)
+{
+	target->phs_relid = RelationGetRelid(relation);
+	target->phs_nblocks = RelationGetNumberOfBlocks(relation);
+	SpinLockInit(&target->phs_mutex);
+	target->phs_cblock = 0;
+	SerializeSnapshot(snapshot, target->phs_snapshot_data);
+}
+/* ----------------
+ *		heap_parallelscan_nextpage - get the next page to scan
+ *
+ *		A return value greater than or equal to the number of blocks to be
+ *		scanned indicates end of scan.  Note, however, that other backends
+ *		could still be scanning if they grabbed a page and aren't done yet.
+ * ----------------
+ */
+static BlockNumber
+heap_parallelscan_nextpage(ParallelHeapScanDesc parallel_scan)
+{
+	BlockNumber	page = InvalidBlockNumber;
+
+	/* we treat InvalidBlockNumber specially here to avoid overflow */
+	SpinLockAcquire(&parallel_scan->phs_mutex);
+	if (parallel_scan->phs_cblock != InvalidBlockNumber)
+		page = parallel_scan->phs_cblock++;
+	SpinLockRelease(&parallel_scan->phs_mutex);
+
+	return page;
+}
+
+/* ----------------
+ *		heap_beginscan_parallel - join a parallel scan
+ *
+ *		Caller must hold a suitable lock on the correct relation.
+ * ----------------
+ */
+HeapScanDesc
+heap_beginscan_parallel(Relation relation, ParallelHeapScanDesc parallel_scan)
+{
+	Snapshot		snapshot;
+
+	Assert(RelationGetRelid(relation) == parallel_scan->phs_relid);
+	snapshot = RestoreSnapshot(parallel_scan->phs_snapshot_data);
+	RegisterSnapshot(snapshot);
+
+	return heap_beginscan_internal(relation, snapshot, 0, NULL, parallel_scan,
+								   true, true, false, true);
+}
+
+/* ----------------
  *		heap_getnext	- retrieve next tuple in scan
  *
  *		Fix to work with index relations.
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 939d93d..fb2b5f0 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -95,8 +95,9 @@ extern Relation heap_openrv_extended(const RangeVar *relation,
 
 #define heap_close(r,l)  relation_close(r,l)
 
-/* struct definition appears in relscan.h */
+/* struct definitions appear in relscan.h */
 typedef struct HeapScanDescData *HeapScanDesc;
+typedef struct ParallelHeapScanDescData *ParallelHeapScanDesc;
 
 /*
  * HeapScanIsValid
@@ -119,6 +120,11 @@ extern void heap_rescan(HeapScanDesc scan, ScanKey key);
 extern void heap_endscan(HeapScanDesc scan);
 extern HeapTuple heap_getnext(HeapScanDesc scan, ScanDirection direction);
 
+extern Size heap_parallelscan_estimate(Snapshot snapshot);
+extern void heap_parallelscan_initialize(ParallelHeapScanDesc target,
+							 Relation relation, Snapshot snapshot);
+extern HeapScanDesc heap_beginscan_parallel(Relation, ParallelHeapScanDesc);
+
 extern bool heap_fetch(Relation relation, Snapshot snapshot,
 		   HeapTuple tuple, Buffer *userbuf, bool keep_buf,
 		   Relation stats_relation);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 9bb6362..f459020 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -20,6 +20,15 @@
 #include "access/itup.h"
 #include "access/tupdesc.h"
 
+/* Struct for parallel scan setup */
+typedef struct ParallelHeapScanDescData
+{
+	Oid			phs_relid;
+	BlockNumber	phs_nblocks;
+	slock_t		phs_mutex;
+	BlockNumber phs_cblock;
+	char		phs_snapshot_data[FLEXIBLE_ARRAY_MEMBER];
+}	ParallelHeapScanDescData;
 
 typedef struct HeapScanDescData
 {
@@ -48,6 +57,7 @@ typedef struct HeapScanDescData
 	BlockNumber rs_cblock;		/* current block # in scan, if any */
 	Buffer		rs_cbuf;		/* current buffer in scan, if any */
 	/* NB: if rs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
+	ParallelHeapScanDesc rs_parallel; /* parallel scan information */
 
 	/* these fields only used in page-at-a-time mode and for bitmap scans */
 	int			rs_cindex;		/* current tuple's index in vistuples */
parallel-count.patch (application/x-patch)
commit 8d6ad4e1551252e17b7d7609f42f7a24921a2a31
Author: Robert Haas <rhaas@postgresql.org>
Date:   Fri Jan 30 08:39:22 2015 -0500

    contrib/parallel_count, now using heap_beginscan_parallel

diff --git a/contrib/parallel_count/Makefile b/contrib/parallel_count/Makefile
new file mode 100644
index 0000000..221c569
--- /dev/null
+++ b/contrib/parallel_count/Makefile
@@ -0,0 +1,19 @@
+MODULE_big = parallel_count
+OBJS = parallel_count.o $(WIN32RES)
+PGFILEDESC = "parallel_count - simple parallel tuple counter"
+
+EXTENSION = parallel_count
+DATA = parallel_count--1.0.sql
+
+REGRESS = parallel_count
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/parallel_count
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/parallel_count/parallel_count--1.0.sql b/contrib/parallel_count/parallel_count--1.0.sql
new file mode 100644
index 0000000..a8a6266
--- /dev/null
+++ b/contrib/parallel_count/parallel_count--1.0.sql
@@ -0,0 +1,7 @@
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION parallel_count" to load this file. \quit
+
+CREATE FUNCTION parallel_count(rel pg_catalog.regclass,
+							  nworkers pg_catalog.int4)
+    RETURNS pg_catalog.int8 STRICT
+	AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/contrib/parallel_count/parallel_count.c b/contrib/parallel_count/parallel_count.c
new file mode 100644
index 0000000..06a5ec3
--- /dev/null
+++ b/contrib/parallel_count/parallel_count.c
@@ -0,0 +1,154 @@
+/*--------------------------------------------------------------------------
+ *
+ * parallel_count.c
+ *		simple parallel tuple counter
+ *
+ * Copyright (C) 2013-2014, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		contrib/parallel_count/parallel_count.c
+ *
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/heapam.h"
+#include "access/parallel.h"
+#include "access/relscan.h"
+#include "access/xact.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "storage/spin.h"
+#include "utils/builtins.h"
+#include "utils/guc.h"
+#include "utils/snapmgr.h"
+#include "utils/tqual.h"
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(parallel_count);
+
+#define		KEY_SCAN			1
+#define		KEY_RESULT			2
+
+void		_PG_init(void);
+void		count_worker_main(dsm_segment *seg, shm_toc *toc);
+
+static void count_helper(Relation rel, ParallelHeapScanDesc scan,
+						 int64 *result);
+
+Datum
+parallel_count(PG_FUNCTION_ARGS)
+{
+	Oid			relid = PG_GETARG_OID(0);
+	int32		nworkers = PG_GETARG_INT32(1);
+	int32		i;
+	bool		already_in_parallel_mode = IsInParallelMode();
+	ParallelContext *pcxt;
+	Snapshot	snapshot;
+	Size		pscan_size;
+	ParallelHeapScanDesc pscan;
+	Relation	rel;
+	int64	   *result = NULL;
+	int64		total = 0;
+
+	if (nworkers < 0)
+		ereport(ERROR,
+				(errmsg("number of parallel workers must be non-negative")));
+
+	rel = relation_open(relid, AccessShareLock);
+
+	if (!already_in_parallel_mode)
+		EnterParallelMode();
+
+	pcxt = CreateParallelContextForExtension("parallel_count",
+											 "count_worker_main",
+											 nworkers);
+
+	snapshot = GetActiveSnapshot();
+	pscan_size = heap_parallelscan_estimate(snapshot);
+	shm_toc_estimate_chunk(&pcxt->estimator, pscan_size);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+	if (nworkers > 0)
+	{
+		shm_toc_estimate_chunk(&pcxt->estimator, nworkers * sizeof(int64));
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+	}
+
+	InitializeParallelDSM(pcxt);
+
+	pscan = shm_toc_allocate(pcxt->toc, pscan_size);
+	heap_parallelscan_initialize(pscan, rel, snapshot);
+	shm_toc_insert(pcxt->toc, KEY_SCAN, pscan);
+	if (nworkers > 0)
+	{
+		result = shm_toc_allocate(pcxt->toc, nworkers * sizeof(int64));
+		shm_toc_insert(pcxt->toc, KEY_RESULT, result);
+	}
+
+	LaunchParallelWorkers(pcxt);
+
+	/* here's where we do the "real work" ... */
+	count_helper(rel, pscan, &total);
+
+	WaitForParallelWorkersToFinish(pcxt);
+
+	for (i = 0; i < nworkers; ++i)
+		total += result[i];
+
+	DestroyParallelContext(pcxt);
+
+	relation_close(rel, AccessShareLock);
+
+	if (!already_in_parallel_mode)
+		ExitParallelMode();
+
+	PG_RETURN_INT64(total);
+}
+
+void
+count_worker_main(dsm_segment *seg, shm_toc *toc)
+{
+	ParallelHeapScanDesc	pscan;
+	int64	   *result;
+	Relation	rel;
+
+	pscan = shm_toc_lookup(toc, KEY_SCAN);
+	Assert(pscan != NULL);
+
+	result = shm_toc_lookup(toc, KEY_RESULT);
+	Assert(result != NULL);
+
+	rel = relation_open(pscan->phs_relid, AccessShareLock);
+	count_helper(rel, pscan, &result[ParallelWorkerNumber]);
+	relation_close(rel, AccessShareLock);
+}
+
+static void
+count_helper(Relation rel, ParallelHeapScanDesc pscan, int64 *result)
+{
+	int64		ntuples = 0;
+	HeapScanDesc	scan;
+
+	scan = heap_beginscan_parallel(rel, pscan);
+
+	for (;;)
+	{
+		HeapTuple	tuple;
+
+		CHECK_FOR_INTERRUPTS();
+
+		tuple = heap_getnext(scan, ForwardScanDirection);
+		if (!HeapTupleIsValid(tuple))
+			break;
+
+		++ntuples;
+	}
+
+	heap_endscan(scan);
+
+	*result = ntuples;
+	elog(NOTICE, "PID %d counted " INT64_FORMAT " tuples", MyProcPid, ntuples);
+}
diff --git a/contrib/parallel_count/parallel_count.control b/contrib/parallel_count/parallel_count.control
new file mode 100644
index 0000000..76f332d
--- /dev/null
+++ b/contrib/parallel_count/parallel_count.control
@@ -0,0 +1,4 @@
+comment = 'simple parallel tuple counter'
+default_version = '1.0'
+module_pathname = '$libdir/parallel_count'
+relocatable = true
#155Andres Freund
andres@2ndquadrant.com
In reply to: Robert Haas (#154)
Re: Parallel Seq Scan

On 2015-02-06 22:57:43 -0500, Robert Haas wrote:

On Fri, Feb 6, 2015 at 2:13 PM, Robert Haas <robertmhaas@gmail.com> wrote:

My first comment here is that I think we should actually teach
heapam.c about parallelism.

I coded this up; see attached. I'm also attaching an updated version
of the parallel count code revised to use this API. It's now called
"parallel_count" rather than "parallel_dummy" and I removed some
stupid stuff from it. I'm curious to see what other people think, but
this seems much cleaner to me. With the old approach, the
parallel-count code was duplicating some of the guts of heapam.c and
dropping the rest on the floor; now it just asks for a parallel scan
and away it goes. Similarly, if your parallel-seqscan patch wanted to
scan block-by-block rather than splitting the relation into equal
parts, or if it wanted to participate in the synchronized-seqscan
stuff, there was no clean way to do that. With this approach, those
decisions are - as they quite properly should be - isolated within
heapam.c, rather than creeping into the executor.

I'm not convinced that that reasoning is generally valid. While it may
work out nicely for seqscans - which might be useful enough on its own -
the more stuff we parallelize the *more* the executor will have to know
about it to make it sane. To actually scale nicely e.g. a parallel sort
will have to execute the nodes below it on each backend, instead of
doing that in one as a separate step, ferrying over all tuples to
individual backends through queues, and only then parallelizing the
sort.

Now. None of that is likely to matter immediately, but I think starting
to build the infrastructure at the points where we'll later need it does
make some sense.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#156Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#155)
Re: Parallel Seq Scan

On Sat, Feb 7, 2015 at 4:30 PM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2015-02-06 22:57:43 -0500, Robert Haas wrote:

On Fri, Feb 6, 2015 at 2:13 PM, Robert Haas <robertmhaas@gmail.com> wrote:

My first comment here is that I think we should actually teach
heapam.c about parallelism.

I coded this up; see attached. I'm also attaching an updated version
of the parallel count code revised to use this API. It's now called
"parallel_count" rather than "parallel_dummy" and I removed some
stupid stuff from it. I'm curious to see what other people think, but
this seems much cleaner to me. With the old approach, the
parallel-count code was duplicating some of the guts of heapam.c and
dropping the rest on the floor; now it just asks for a parallel scan
and away it goes. Similarly, if your parallel-seqscan patch wanted to
scan block-by-block rather than splitting the relation into equal
parts, or if it wanted to participate in the synchronized-seqscan
stuff, there was no clean way to do that. With this approach, those
decisions are - as they quite properly should be - isolated within
heapam.c, rather than creeping into the executor.

I'm not convinced that that reasoning is generally valid. While it may
work out nicely for seqscans - which might be useful enough on its own -
the more stuff we parallelize the *more* the executor will have to know
about it to make it sane. To actually scale nicely e.g. a parallel sort
will have to execute the nodes below it on each backend, instead of
doing that in one as a separate step, ferrying over all tuples to
individual backends through queues, and only then parallelizing the
sort.

Now. None of that is likely to matter immediately, but I think starting
to build the infrastructure at the points where we'll later need it does
make some sense.

Well, I agree with you, but I'm not really sure what that has to do
with the issue at hand. I mean, if we were to apply Amit's patch,
we'd be in a situation where, for a non-parallel heap scan, heapam.c
decides the order in which blocks get scanned, but for a parallel heap
scan, nodeParallelSeqscan.c makes that decision. Maybe I'm an old
fuddy-duddy[1] but that seems like an abstraction violation to me. I
think the executor should see a parallel scan as a stream of tuples
that streams into a bunch of backends in parallel, without really
knowing how heapam.c is dividing up the work. That's how it's
modularized today, and I don't see a reason to change it. Do you?
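
For contrast, a static split of the kind described at the start of this thread (equal shares, with the last worker taking whatever remains) might look roughly like the sketch below. This is only an illustration of a policy that has to be computed outside the scan code; BlockRange and static_block_range are hypothetical names and do not come from either patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of one worker's fixed share of the relation. */
typedef struct BlockRange
{
    uint32_t    start;          /* first block this worker scans */
    uint32_t    count;          /* number of blocks this worker scans */
} BlockRange;

/*
 * Divide nblocks evenly among nworkers.  Any blocks left over after the
 * integer division are given to the last worker.
 */
BlockRange
static_block_range(uint32_t nblocks, uint32_t nworkers, uint32_t worker)
{
    BlockRange  range;
    uint32_t    per_worker;

    assert(nworkers > 0 && worker < nworkers);

    per_worker = nblocks / nworkers;
    range.start = worker * per_worker;
    range.count = per_worker;
    if (worker == nworkers - 1)
        range.count = nblocks - range.start;    /* remainder goes here */
    return range;
}
```

With 10 blocks and 3 workers this yields ranges of 3, 3 and 4 blocks. Whoever calls such a function is deciding the access pattern, which is the decision being argued over here.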

Regarding tuple flow between backends, I've thought about that before,
I agree that we need it, and I don't think I know how to do it. I can
see how to have a group of processes executing a single node in
parallel, or a single process executing a group of nodes we break off
from the query tree and push down to it, but what you're talking about
here is a group of processes executing a group of nodes jointly. That
seems like an excellent idea, but I don't know how to design it.
Actually routing the tuples between whichever backends we want to
exchange them between is easy enough, but how do we decide whether to
generate such a plan? What does the actual plan tree look like?
Maybe we designate nodes as can-generate-multiple-tuple-streams (seq
scan, mostly, I would think) and can-absorb-parallel-tuple-streams
(sort, hash, materialize), or something like that, but I'm really
fuzzy on the details.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1]: Actually, there's not really any "maybe" about this.


#157Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#152)
Re: Parallel Seq Scan

On Sat, Feb 7, 2015 at 12:43 AM, Robert Haas <robertmhaas@gmail.com> wrote:

The complicated part here seems to me to figure out what we need to
pass from the parallel leader to the parallel worker to create enough
state for quals and projection. If we want to be able to call
ExecScan() without modification, which seems like a good goal, we're
going to need a ScanState node, which is going to need to contain
valid pointers to (at least) a ProjectionInfo, an ExprContext, and a
List of quals. That in turn is going to require an ExecutorState.
Serializing those things directly doesn't seem very practical; what we
instead want to do is figure out what we can pass that will allow easy
reconstruction of those data structures. Right now, you're passing
the target list, the qual list, the range table, and the params, but
the range table doesn't seem to be getting used anywhere. I wonder if
we need it.

The range table is used by the executor for processing qualifications; one
example is ExecEvalWholeRowVar(). I don't think we can process quals
without the range table. Apart from the things mentioned above, we need to
pass the Instrumentation structure, which each worker needs to update;
this is required for the Explain statement.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#158Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#156)
Re: Parallel Seq Scan

On Sun, Feb 8, 2015 at 3:46 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Sat, Feb 7, 2015 at 4:30 PM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2015-02-06 22:57:43 -0500, Robert Haas wrote:

On Fri, Feb 6, 2015 at 2:13 PM, Robert Haas <robertmhaas@gmail.com> wrote:

My first comment here is that I think we should actually teach
heapam.c about parallelism.

I coded this up; see attached. I'm also attaching an updated version
of the parallel count code revised to use this API. It's now called
"parallel_count" rather than "parallel_dummy" and I removed some
stupid stuff from it. I'm curious to see what other people think, but
this seems much cleaner to me. With the old approach, the
parallel-count code was duplicating some of the guts of heapam.c and
dropping the rest on the floor; now it just asks for a parallel scan
and away it goes. Similarly, if your parallel-seqscan patch wanted to
scan block-by-block rather than splitting the relation into equal
parts, or if it wanted to participate in the synchronized-seqscan
stuff, there was no clean way to do that. With this approach, those
decisions are - as they quite properly should be - isolated within
heapam.c, rather than creeping into the executor.

I'm not convinced that that reasoning is generally valid. While it may
work out nicely for seqscans - which might be useful enough on its own -
the more stuff we parallelize the *more* the executor will have to know
about it to make it sane. To actually scale nicely e.g. a parallel sort
will have to execute the nodes below it on each backend, instead of
doing that in one as a separate step, ferrying over all tuples to
individual backends through queues, and only then parallelizing the
sort.

Now. None of that is likely to matter immediately, but I think starting
to build the infrastructure at the points where we'll later need it does
make some sense.

I think doing it for parallel seq scan as well makes processing in the
worker much easier, for example the handling of prepared queries
(bind parameters), the Explain statement, qualification,
projection, and the decision about processing junk entries.

Well, I agree with you, but I'm not really sure what that has to do
with the issue at hand. I mean, if we were to apply Amit's patch,
we'd be in a situation where, for a non-parallel heap scan, heapam.c
decides the order in which blocks get scanned, but for a parallel heap
scan, nodeParallelSeqscan.c makes that decision.

I think other places also decide the order or way in which heapam.c has
to scan. For example, the order in which rows/pages have to be traversed is
decided at the portal/executor layer and passed down to the heap, and
in the case of an index, the scan limits (heap_setscanlimits()) are decided
outside heapam.c; something similar is done for parallel seq scan.
In general, the scan is driven by the scan descriptor, which is constructed
at the upper level, and there are some APIs exposed to drive the scan.
If you are not happy with the current way nodeParallelSeqscan has
set the scan limits, we can have some form of callback that does the
required work, and this callback can be called from heapam.c.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#159Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#158)
Re: Parallel Seq Scan

On Sat, Feb 7, 2015 at 10:36 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Well, I agree with you, but I'm not really sure what that has to do
with the issue at hand. I mean, if we were to apply Amit's patch,
we'd be in a situation where, for a non-parallel heap scan, heapam.c
decides the order in which blocks get scanned, but for a parallel heap
scan, nodeParallelSeqscan.c makes that decision.

I think other places also decide the order or way in which heapam.c has
to scan. For example, the order in which rows/pages have to be traversed is
decided at the portal/executor layer and passed down to the heap, and
in the case of an index, the scan limits (heap_setscanlimits()) are decided
outside heapam.c; something similar is done for parallel seq scan.
In general, the scan is driven by the scan descriptor, which is constructed
at the upper level, and there are some APIs exposed to drive the scan.
If you are not happy with the current way nodeParallelSeqscan has
set the scan limits, we can have some form of callback that does the
required work, and this callback can be called from heapam.c.

I thought about a callback, but what's the benefit of doing that vs.
hard-coding it in heapam.c? If the upper-layer wants to impose a TID
qual or similar then heap_setscanlimits() makes sense, but that's
effectively a filter condition, not a policy decision about the access
pattern.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#160Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#153)
Re: Parallel Seq Scan

On Sat, Feb 7, 2015 at 2:30 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Feb 6, 2015 at 2:13 PM, Robert Haas <robertmhaas@gmail.com> wrote:

The complicated part here seems to me to figure out what we need to
pass from the parallel leader to the parallel worker to create enough
state for quals and projection. If we want to be able to call
ExecScan() without modification, which seems like a good goal, we're
going to need a ScanState node, which is going to need to contain
valid pointers to (at least) a ProjectionInfo, an ExprContext, and a
List of quals. That in turn is going to require an ExecutorState.
Serializing those things directly doesn't seem very practical; what we
instead want to do is figure out what we can pass that will allow easy
reconstruction of those data structures. Right now, you're passing
the target list, the qual list, the range table, and the params, but
the range table doesn't seem to be getting used anywhere. I wonder if
we need it. If we could get away with just passing the target list
and qual list, and params, we'd be doing pretty well, I think. But
I'm not sure exactly what that looks like.

IndexBuildHeapRangeScan shows how to do qual evaluation with
relatively little setup:

I think that even to make quals work, we need to do a few extra things,
such as setting up the param list and range table. Also, for quals we need
to fix function IDs by calling fix_opfuncids() and do what the
ExecInit*() functions do for quals. I think these extra things
will be required for processing the qualification of a seq scan.

Then we need to construct the projection info from the target list
(basically doing what the ExecInit*() functions do). After constructing
the projection info, we can call ExecProject().

Here we need to take care that the functions that collect instrumentation
information, such as InstrStartNode(), InstrStopNode(), and
InstrCountFiltered1(), are called at the appropriate places, so that we can
collect this information for the Explain statement when requested by the
master backend.

Finally, after sending the tuples, we need to destroy all the execution
state constructed for fetching tuples.

So to make this work, we basically need to do all the important work
that the executor does in three different phases: initialization of the
node, execution of the node, and ending the node. Ideally, we could make
this work with code specific to just the execution of a sequential scan;
however, it seems to me that we would again need more code of this kind
(extracted from the core part of the executor) to parallelize
other functionality such as aggregation, partition seq scan, etc.
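
As a rough picture of those three phases on the worker side, here is a toy, self-contained sketch. It does not use any executor API: WorkerScanState, worker_scan_init, worker_scan_next, worker_scan_end, is_even and times_ten are all invented for illustration, with an int array standing in for the heap, a function pointer for the qual, another for the projection, and a counter for the instrumentation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical worker-side state, rebuilt from what the leader passed. */
typedef struct WorkerScanState
{
    const int  *rows;                   /* stand-in for the relation */
    size_t      nrows;
    size_t      pos;                    /* next row to examine */
    bool      (*qual) (int row);        /* stand-in for the qual list */
    int       (*project) (int row);     /* stand-in for the target list */
    size_t      nfiltered;              /* instrumentation: rows the qual rejected */
    bool        active;
} WorkerScanState;

/* Phase 1: initialization of the node. */
void
worker_scan_init(WorkerScanState *st, const int *rows, size_t nrows,
                 bool (*qual) (int), int (*project) (int))
{
    st->rows = rows;
    st->nrows = nrows;
    st->pos = 0;
    st->qual = qual;
    st->project = project;
    st->nfiltered = 0;
    st->active = true;
}

/* Phase 2: execution.  Returns false once the input is exhausted. */
bool
worker_scan_next(WorkerScanState *st, int *out)
{
    assert(st->active);
    while (st->pos < st->nrows)
    {
        int     row = st->rows[st->pos++];

        if (!st->qual(row))
        {
            st->nfiltered++;            /* what Explain would report as filtered */
            continue;
        }
        *out = st->project(row);
        return true;
    }
    return false;
}

/* Phase 3: ending the node; hands the instrumentation back to the caller. */
size_t
worker_scan_end(WorkerScanState *st)
{
    st->active = false;
    st->rows = NULL;
    return st->nfiltered;
}

/* Example qual and projection, used only to exercise the sketch. */
bool is_even(int row) { return row % 2 == 0; }
int  times_ten(int row) { return row * 10; }
```

Even in this toy form, everything except the row source is generic, which is the reason for expecting the same scaffolding to be needed again for other parallel nodes.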

Another idea is to use Executor-level interfaces (like ExecutorStart(),
ExecutorRun(), ExecutorEnd()) for execution rather than Portal-level
interfaces. I used Portal-level interfaces with the
thought that we could reuse the existing Portal infrastructure to
support parallel execution of scrollable cursors, but as per my analysis
it is not so easy to support them, especially backward scan, absolute/
relative fetch, etc., so Executor-level interfaces seem more appealing
to me (something like how the Explain statement works (ExplainOnePlan)).
Using Executor-level interfaces has the advantage that we can reuse them
for other parallel functionality. In this approach, we need to take
care of constructing the relevant structures (from the information passed
by the master backend) required by the Executor interfaces, but I think
this should be less work than what we need in the previous approach
(extracting seqscan-specific stuff from the executor).

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#161Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#151)
Re: Parallel Seq Scan

On Fri, Feb 6, 2015 at 11:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Feb 6, 2015 at 9:43 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Here is the latest patch, which fixes the reported issues and supports
Prepared Statements and the Explain Statement for parallel sequential
scan.

The main purpose is to get feedback, if possible, on the overall
structure/design of the code before I go ahead.

2. InitiateWorkers() is entirely specific to the concerns of parallel
sequential scan. After looking this over, I think there are three
categories of things that need to be clearly separated. Some stuff is
going to be needed for any parallel query; some stuff is going to be
needed only for parallel scans but will be needed for any type of
parallel scan, not just parallel sequential scan[1]; some stuff is
needed for any type of node that returns tuples but not for nodes that
don't return tuples (e.g. needed for ParallelSeqScan and
ParallelHashJoin, but not needed for ParallelHash); and some stuff is
only going to be needed for parallel sequential scan specifically.
This patch mixes all of those concerns together in a single function.
That won't do; this needs to be easily extensible to whatever someone
wants to parallelize next.

The master backend shares the Targetlist, Qual, Scanrelid, Rangetable,
Bind Params, Info about Scan range (Blocks), Tuple queues, and
Instrumentation Info with the worker. Going by your suggestion, I think
we can separate them as below:

1. parallel query - Target list, Qual, Bind Params, Instrumentation Info
2. parallel scan and nodes that return tuples - scanrelid, range table,
Tuple Queues
3. parallel sequential scan specific - Info about Scan range (Blocks)

This is as per the current list of things which the master backend shares
with the worker; if more things are required, then we can decide which
category they fall into and add them accordingly.
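
A minimal sketch of how the three categories above could be layered as
nested structs in the shared segment. Every type and field name here is
hypothetical and chosen only for illustration; the patch under discussion
may lay this out quite differently, and the serialized node trees are
represented only by their lengths.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout; none of these names come from the patch. */

/* 1. Needed by any parallel query. */
typedef struct ParallelQueryShared
{
    size_t      targetlist_len;   /* bytes of serialized target list */
    size_t      qual_len;         /* bytes of serialized qual */
    size_t      params_len;       /* bytes of serialized bind params */
    int         instrument_opts;  /* instrumentation options */
} ParallelQueryShared;

/* 2. Needed by any parallel scan or other tuple-returning node. */
typedef struct ParallelScanShared
{
    ParallelQueryShared query;    /* category 1 comes first */
    uint32_t    scanrelid;        /* index into the range table */
    size_t      rangetable_len;   /* bytes of serialized range table */
    int         ntuple_queues;    /* one tuple queue per worker */
} ParallelScanShared;

/* 3. Needed only by parallel sequential scan. */
typedef struct ParallelSeqScanShared
{
    ParallelScanShared scan;      /* categories 1 and 2 come first */
    uint32_t    start_block;      /* first block of this worker's range */
    uint32_t    nblocks;          /* number of blocks in the range */
} ParallelSeqScanShared;

/* Generic code can treat a pointer to the outer struct as the inner one. */
static_assert(offsetof(ParallelSeqScanShared, scan) == 0,
              "scan-level state must be the first member");
static_assert(offsetof(ParallelScanShared, query) == 0,
              "query-level state must be the first member");

/* One past the last block this worker is responsible for. */
uint32_t
seqscan_end_block(const ParallelSeqScanShared *s)
{
    return s->start_block + s->nblocks;
}
```

With this layering, code that only cares about the query-level or
scan-level fields never needs to know which concrete scan type it was
handed.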

Is this similar to what you have in mind?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#162Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#159)
Re: Parallel Seq Scan

On Sun, Feb 8, 2015 at 11:03 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Sat, Feb 7, 2015 at 10:36 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Well, I agree with you, but I'm not really sure what that has to do
with the issue at hand. I mean, if we were to apply Amit's patch,
we'd be in a situation where, for a non-parallel heap scan, heapam.c
decides the order in which blocks get scanned, but for a parallel heap
scan, nodeParallelSeqscan.c makes that decision.

I think other places also decide the order/way in which heapam.c has
to scan. For example, the order in which rows/pages have to be traversed
is decided at the portal/executor layer and passed down to the heap, and
in the case of an index, the scan limits (heap_setscanlimits()) are decided
outside heapam.c; something similar is done for parallel seq scan.
In general, the scan is driven by the scan descriptor, which is constructed
at the upper level, and there are some APIs exposed to drive the scan.
If you are not happy with the current way nodeParallelSeqscan has
set the scan limits, we can have some form of callback which does the
required work, and this callback can be called from heapam.c.

I thought about a callback, but what's the benefit of doing that vs.
hard-coding it in heapam.c?

Basically I want to address your concern about setting the scan limit via
the sequence scan node. One way could be to pass a callback_function
and callback_state to heap_beginscan, which will remember that information
in HeapScanDesc and then use it in heap_getnext(); callback_state will
have info about the next page, which will be updated by callback_function.
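
The callback idea can be sketched roughly as follows. This is a
standalone illustration, not PostgreSQL code: ScanDescSketch stands in for
the relevant part of HeapScanDesc, and NextPageCallback, RangeState,
range_next_page and scan_next_block are hypothetical names. Only
BlockNumber and InvalidBlockNumber mirror real PostgreSQL definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/* Hypothetical: returns the next block to scan, or InvalidBlockNumber. */
typedef BlockNumber (*NextPageCallback) (void *callback_state);

/* Stand-in for the fields a scan descriptor would need to remember. */
typedef struct ScanDescSketch
{
    NextPageCallback next_page;   /* NULL for an ordinary scan */
    void       *callback_state;   /* opaque state owned by the caller */
    BlockNumber seq_next;         /* next block for an ordinary scan */
    BlockNumber nblocks;          /* relation size in blocks */
} ScanDescSketch;

/* Example callback state: a half-open block range [next, end). */
typedef struct RangeState
{
    BlockNumber next;
    BlockNumber end;
} RangeState;

/* Example callback: hand out blocks from the range, updating the state. */
BlockNumber
range_next_page(void *callback_state)
{
    RangeState *st = callback_state;

    if (st->next >= st->end)
        return InvalidBlockNumber;
    return st->next++;
}

/* What the scan's "get next block" step would do. */
BlockNumber
scan_next_block(ScanDescSketch *scan)
{
    assert(scan != NULL);

    /* Parallel worker: the callback decides which block comes next. */
    if (scan->next_page != NULL)
        return scan->next_page(scan->callback_state);

    /* Non-parallel case: unchanged front-to-back order. */
    if (scan->seq_next >= scan->nblocks)
        return InvalidBlockNumber;
    return scan->seq_next++;
}
```

Because the callback pointer is NULL unless a parallel worker sets it, the
ordinary scan path takes one extra branch and is otherwise untouched.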

We can remember the callback_function and callback_state information in
estate, which will be set only by a parallel worker, which means it won't
affect the non-parallel case. I think this will be helpful in the future as
well, where we want a particular scan or sort to use that information to
behave as a parallel scan or sort.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#163Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#160)
Re: Parallel Seq Scan

On Mon, Feb 9, 2015 at 2:31 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Another idea is to use Executor level interfaces (like ExecutorStart(),
ExecutorRun(), ExecutorEnd()) for execution rather than using Portal
level interfaces. I have used Portal level interfaces with the
thought that we can reuse the existing infrastructure of Portal to
enable parallel execution of scrollable cursors, but as per my analysis
it is not so easy to support them, especially backward scan, absolute/
relative fetch, etc., so Executor level interfaces seem more appealing
to me (something like how the Explain statement works (ExplainOnePlan)).
Using Executor level interfaces has the advantage that we can reuse them
for other parallel functionality. In this approach, we need to take
care of constructing the relevant structures (with the information passed by
the master backend) required for the Executor interfaces, but I think this
should be less work than what we need in the previous approach (extracting
seqscan specific stuff from the executor).

I think using the executor-level interfaces instead of the
portal-level interfaces is a good idea. That would possibly let us
altogether prohibit access to the portal layer from within a parallel
worker, which seems like it might be a good sanity check to add. But
that seems to still require us to have a PlannedStmt and a QueryDesc,
and I'm not sure whether that's going to be too much of a pain. We
might need to think about an alternative API for starting the Executor
like ExecutorStartParallel() or ExecutorStartExtended(). But I'm not
sure. If you can revise things to go through the executor interfaces
I think that would be a good start, and then perhaps after that we can
see what else makes sense to do.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#164Andres Freund
andres@2ndquadrant.com
In reply to: Robert Haas (#156)
Re: Parallel Seq Scan

On 2015-02-07 17:16:12 -0500, Robert Haas wrote:

On Sat, Feb 7, 2015 at 4:30 PM, Andres Freund <andres@2ndquadrant.com> wrote:

[ criticism of Amit's heapam integration ]

I'm not convinced that that reasoning is generally valid. While it may
work out nicely for seqscans - which might be useful enough on its own -
the more stuff we parallelize the *more* the executor will have to know
about it to make it sane. To actually scale nicely e.g. a parallel sort
will have to execute the nodes below it on each backend, instead of
doing that in one as a separate step, ferrying over all tuples to
individual backends through queues, and only then parallelizing the
sort.

Now. None of that is likely to matter immediately, but I think starting
to build the infrastructure at the points where we'll later need it does
make some sense.

Well, I agree with you, but I'm not really sure what that has to do
with the issue at hand. I mean, if we were to apply Amit's patch,
we'd be in a situation where, for a non-parallel heap scan, heapam.c
decides the order in which blocks get scanned, but for a parallel heap
scan, nodeParallelSeqscan.c makes that decision. Maybe I'm an old
fuddy-duddy[1] but that seems like an abstraction violation to me. I
think the executor should see a parallel scan as a stream of tuples
that streams into a bunch of backends in parallel, without really
knowing how heapam.c is dividing up the work. That's how it's
modularized today, and I don't see a reason to change it. Do you?

I don't really agree. Normally heapam just sequentially scans the heap in
one go, not much logic to that. Ok, then there's also the synchronized
seqscan stuff - which just about every user of heapscans but the
executor promptly disables again. I don't think a heap_scan_page() or
similar API will constitute a relevant layering violation over what we
already have.

Note that I'm not saying that Amit's patch is right - I haven't read it
- but that I don't think a 'scan this range of pages' heapscan API would
be a bad idea. Not even just for parallelism, but for a bunch of
use cases.

Regarding tuple flow between backends, I've thought about that before,
I agree that we need it, and I don't think I know how to do it. I can
see how to have a group of processes executing a single node in
parallel, or a single process executing a group of nodes we break off
from the query tree and push down to it, but what you're talking about
here is a group of processes executing a group of nodes jointly.

I don't think it really is that. I think you'd do it essentially by
introducing a couple more nodes. Something like

               SomeUpperLayerNode
                       |
                       |
                AggCombinerNode
                 /           \
                /             \
               /               \
PartialHashAggNode     PartialHashAggNode .... PartialHashAggNode ...
        |                      |
        |                      |
        |                      |
        |                      |
 PartialSeqScan         PartialSeqScan

The only thing that might potentially need to end up working
jointly would be the block selection of the individual PartialSeqScans
to avoid having to wait for stragglers for too long. E.g. each might
just ask for a range of 16 megabytes or so that it scans sequentially.
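
A rough sketch of that kind of light coordination, assuming the default
8 kB block size (so a 16 MB range is 2048 blocks) and using C11
<stdatomic.h> instead of PostgreSQL's own shared-memory primitives.
SharedScanState, shared_scan_init and claim_chunk are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BLCKSZ  8192u                          /* assumed block size */
#define CHUNK_BYTES    (16u * 1024u * 1024u)          /* 16 MB per request */
#define CHUNK_BLOCKS   (CHUNK_BYTES / SKETCH_BLCKSZ)  /* 2048 blocks */

/* State that would live in shared memory, visible to every worker. */
typedef struct SharedScanState
{
    _Atomic uint64_t next_block;  /* 64-bit so overshoot cannot wrap */
    uint32_t    nblocks;          /* relation size in blocks */
} SharedScanState;

void
shared_scan_init(SharedScanState *s, uint32_t nblocks)
{
    atomic_init(&s->next_block, 0);
    s->nblocks = nblocks;
}

/*
 * Claim the next chunk as the half-open range [*start, *end).
 * Returns false once the whole relation has been handed out.
 */
bool
claim_chunk(SharedScanState *s, uint32_t *start, uint32_t *end)
{
    uint64_t    first;
    uint64_t    last;

    assert(s != NULL && start != NULL && end != NULL);

    /* One atomic add per 2048 blocks, not one per page. */
    first = atomic_fetch_add(&s->next_block, CHUNK_BLOCKS);
    if (first >= s->nblocks)
        return false;

    last = first + CHUNK_BLOCKS;
    if (last > s->nblocks)
        last = s->nblocks;        /* final chunk may be short */

    *start = (uint32_t) first;
    *end = (uint32_t) last;
    return true;
}
```

Each worker scans its claimed range sequentially and comes back for
another, so a slow worker delays the others by at most one chunk.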

In such a plan - a pretty sensible and not that uncommon thing for
parallelized aggregates - you'd need to be able to tell the heap scans
which blocks to scan. Right?

That seems like an excellent idea, but I don't know how to design it.
Actually routing the tuples between whichever backends we want to
exchange them between is easy enough, but how do we decide whether to
generate such a plan? What does the actual plan tree look like?

I described above what I think it'd roughly look like. Whether to
generate it probably would be dependent on the cardinality (not much
point to do the above if all groups are distinct) and possibly the
aggregates in use (if we have a parallelizable sum/count/avg etc).

Maybe we designate nodes as can-generate-multiple-tuple-streams (seq
scan, mostly, I would think) and can-absorb-parallel-tuple-streams
(sort, hash, materialize), or something like that, but I'm really
fuzzy on the details.

I don't think we really should have individual nodes that produce
multiple streams - that seems like it'd end up being really
complicated. I'd more say that we have distinct nodes (like the
PartialSeqScan ones above) that do a teensy bit of coordination about
which work to perform.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#165Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#164)
Re: Parallel Seq Scan

On Tue, Feb 10, 2015 at 2:48 AM, Andres Freund <andres@2ndquadrant.com> wrote:

Note that I'm not saying that Amit's patch is right - I haven't read it
- but that I don't think a 'scan this range of pages' heapscan API would
be a bad idea. Not even just for parallelism, but for a bunch of
use cases.

We do have that, already. heap_setscanlimits(). I'm just not
convinced that that's the right way to split up a parallel scan.
There's too much risk of ending up with a very-uneven distribution of
work.
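
To make the concern concrete, here is a small sketch of a static split,
where each worker is given one fixed block range up front and the last
worker also takes the remainder. static_block_range is a hypothetical
helper, not a function from the patch or from heapam.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t BlockNumber;

/*
 * Compute the fixed range [*start, *start + *count) for one worker.
 * Workers are numbered 0 .. nworkers - 1.
 */
void
static_block_range(BlockNumber nblocks, uint32_t nworkers, uint32_t worker,
                   BlockNumber *start, BlockNumber *count)
{
    BlockNumber per_worker;

    assert(nworkers > 0 && worker < nworkers);
    assert(start != NULL && count != NULL);

    per_worker = nblocks / nworkers;
    *start = worker * per_worker;

    /* The last worker absorbs whatever does not divide evenly. */
    if (worker == nworkers - 1)
        *count = nblocks - *start;
    else
        *count = per_worker;
}
```

The block counts come out nearly equal, but the work may not: if most
rows passing an expensive filter sit in one range, that worker finishes
long after the others, and nothing in a static split lets them help.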

Regarding tuple flow between backends, I've thought about that before,
I agree that we need it, and I don't think I know how to do it. I can
see how to have a group of processes executing a single node in
parallel, or a single process executing a group of nodes we break off
from the query tree and push down to it, but what you're talking about
here is a group of processes executing a group of nodes jointly.

I don't think it really is that. I think you'd do it essentially by
introducing a couple more nodes. Something like

               SomeUpperLayerNode
                       |
                       |
                AggCombinerNode
                 /           \
                /             \
               /               \
PartialHashAggNode     PartialHashAggNode .... PartialHashAggNode ...
        |                      |
        |                      |
        |                      |
        |                      |
 PartialSeqScan         PartialSeqScan

The only thing that might potentially need to end up working
jointly would be the block selection of the individual PartialSeqScans
to avoid having to wait for stragglers for too long. E.g. each might
just ask for a range of 16 megabytes or so that it scans sequentially.

In such a plan - a pretty sensible and not that uncommon thing for
parallelized aggregates - you'd need to be able to tell the heap scans
which blocks to scan. Right?

For this case, what I would imagine is that there is one parallel heap
scan, and each PartialSeqScan attaches to it. The executor says "give
me a tuple" and heapam.c provides one. Details like the chunk size
are managed down inside heapam.c, and the executor does not know about
them. It just knows that it can establish a parallel scan and then
pull tuples from it.

Maybe we designate nodes as can-generate-multiple-tuple-streams (seq
scan, mostly, I would think) and can-absorb-parallel-tuple-streams
(sort, hash, materialize), or something like that, but I'm really
fuzzy on the details.

I don't think we really should have individual nodes that produce
multiple streams - that seems like it'd end up being really
complicated. I'd more say that we have distinct nodes (like the
PartialSeqScan ones above) that do a teensy bit of coordination about
which work to perform.

I think we're in violent agreement here, except for some
terminological confusion. Are there N PartialSeqScan nodes, one
running in each node, or is there one ParallelSeqScan node, which is
copied and run jointly across N nodes? You can talk about either way
and have it make sense, but we haven't had enough conversations about
this on this list to have settled on a consistent set of vocabulary
yet.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#166Andres Freund
andres@2ndquadrant.com
In reply to: Robert Haas (#165)
Re: Parallel Seq Scan

On 2015-02-10 08:52:09 -0500, Robert Haas wrote:

On Tue, Feb 10, 2015 at 2:48 AM, Andres Freund <andres@2ndquadrant.com> wrote:

Note that I'm not saying that Amit's patch is right - I haven't read it
- but that I don't think a 'scan this range of pages' heapscan API would
be a bad idea. Not even just for parallelism, but for a bunch of
use cases.

We do have that, already. heap_setscanlimits(). I'm just not
convinced that that's the right way to split up a parallel scan.
There's too much risk of ending up with a very-uneven distribution of
work.

If you make the chunks small enough, and then coordinate only the chunk
distribution, not really.

Regarding tuple flow between backends, I've thought about that before,
I agree that we need it, and I don't think I know how to do it. I can
see how to have a group of processes executing a single node in
parallel, or a single process executing a group of nodes we break off
from the query tree and push down to it, but what you're talking about
here is a group of processes executing a group of nodes jointly.

I don't think it really is that. I think you'd do it essentially by
introducing a couple more nodes. Something like

               SomeUpperLayerNode
                       |
                       |
                AggCombinerNode
                 /           \
                /             \
               /               \
PartialHashAggNode     PartialHashAggNode .... PartialHashAggNode ...
        |                      |
        |                      |
        |                      |
        |                      |
 PartialSeqScan         PartialSeqScan

The only thing that might potentially need to end up working
jointly would be the block selection of the individual PartialSeqScans
to avoid having to wait for stragglers for too long. E.g. each might
just ask for a range of 16 megabytes or so that it scans sequentially.

In such a plan - a pretty sensible and not that uncommon thing for
parallelized aggregates - you'd need to be able to tell the heap scans
which blocks to scan. Right?

For this case, what I would imagine is that there is one parallel heap
scan, and each PartialSeqScan attaches to it. The executor says "give
me a tuple" and heapam.c provides one. Details like the chunk size
are managed down inside heapam.c, and the executor does not know about
them. It just knows that it can establish a parallel scan and then
pull tuples from it.

I think that's a horrible approach that'll end up with far more
entangled pieces than what you're trying to avoid. Unless the tuple flow
is organized to only happen in the necessary cases the performance will
be horrible. And good chunk sizes et al depend on higher layers,
selectivity estimates and such. And that's planner/executor work, not
the physical layer (which heapam.c pretty much is).

An individual heap scan's state lives in process-private memory. And if
the results inside the separate workers should directly be used in
these workers without shipping over the network it'd be horrible to have
the logic in the heapscan. How would you otherwise model an executor
tree that does the seqscan and aggregation combined in multiple
processes at the same time?

Maybe we designate nodes as can-generate-multiple-tuple-streams (seq
scan, mostly, I would think) and can-absorb-parallel-tuple-streams
(sort, hash, materialize), or something like that, but I'm really
fuzzy on the details.

I don't think we really should have individual nodes that produce
multiple streams - that seems like it'd end up being really
complicated. I'd more say that we have distinct nodes (like the
PartialSeqScan ones above) that do a teensy bit of coordination about
which work to perform.

I think we're in violent agreement here, except for some
terminological confusion. Are there N PartialSeqScan nodes, one
running in each node, or is there one ParallelSeqScan node, which is
copied and run jointly across N nodes? You can talk about either way
and have it make sense, but we haven't had enough conversations about
this on this list to have settled on a consistent set of vocabulary
yet.

I pretty strongly believe that it has to be independent scan nodes. Both
from an implementation and a conversational POV. They might have some
very light cooperation between them (e.g. coordinating block ranges or
such), but everything else should be separate. From an implementation
POV it seems pretty awful to have an executor node that's accessed by
multiple separate backends - that'd mean it has to be concurrency safe,
have state in shared memory and everything.

Now, there'll be a node that needs to do some parallel magic - but in
the above example that should be the AggCombinerNode, which would not
only ask for tuples from one of the children at a time, but ask multiple
ones in parallel. But even then it doesn't have to deal with concurrency
around its own state.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#167Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#166)
Re: Parallel Seq Scan

On Tue, Feb 10, 2015 at 9:08 AM, Andres Freund <andres@2ndquadrant.com> wrote:

If you make the chunks small enough, and then coordinate only the chunk
distribution, not really.

True, but why do you want to do that in the executor instead of in the heapam?

For this case, what I would imagine is that there is one parallel heap
scan, and each PartialSeqScan attaches to it. The executor says "give
me a tuple" and heapam.c provides one. Details like the chunk size
are managed down inside heapam.c, and the executor does not know about
them. It just knows that it can establish a parallel scan and then
pull tuples from it.

I think that's a horrible approach that'll end up with far more
entangled pieces than what you're trying to avoid. Unless the tuple flow
is organized to only happen in the necessary cases the performance will
be horrible.

I can't understand this at all. A parallel heap scan, as I've coded
it up, involves no tuple flow at all. All that's happening at the
heapam.c layer is that we're coordinating which blocks to scan. Not
to be disrespectful, but have you actually looked at the patch?

And good chunk sizes et al depend on higher layers,
selectivity estimates and such. And that's planner/executor work, not
the physical layer (which heapam.c pretty much is).

If it's true that a good chunk size depends on the higher layers, then
that would be a good argument for doing this differently, or at least
exposing an API for the higher layers to tell heapam.c what chunk size
they want. I hadn't considered that possibility - can you elaborate
on why you think we might want to vary the chunk size?

An individual heap scan's state lives in process-private memory. And if
the results inside the separate workers should directly be used in
these workers without shipping over the network it'd be horrible to have
the logic in the heapscan. How would you otherwise model an executor
tree that does the seqscan and aggregation combined in multiple
processes at the same time?

Again, the heap scan is not shipping anything anywhere ever in any
design of any patch proposed or written. The results *are* directly
used inside each individual worker.

I think we're in violent agreement here, except for some
terminological confusion. Are there N PartialSeqScan nodes, one
running in each node, or is there one ParallelSeqScan node, which is
copied and run jointly across N nodes? You can talk about either way
and have it make sense, but we haven't had enough conversations about
this on this list to have settled on a consistent set of vocabulary
yet.

I pretty strongly believe that it has to be independent scan nodes. Both
from an implementation and a conversational POV. They might have some
very light cooperation between them (e.g. coordinating block ranges or
such), but everything else should be separate. From an implementation
POV it seems pretty awful to have an executor node that's accessed by
multiple separate backends - that'd mean it has to be concurrency safe,
have state in shared memory and everything.

I don't agree with that, but again I think it's a terminological
dispute. I think what will happen is that you will have a single node
that gets copied into multiple backends, and in some cases a small
portion of its state will live in shared memory. That's more or less
what you're thinking of too, I think.

But what I don't want is - if we've got a parallel scan-and-aggregate
happening in N nodes, EXPLAIN shows N copies of all of that - not only
because it's display clutter, but also because a plan to do that thing
with 3 workers is fundamentally the same as a plan to do it with 30
workers. Those plans shouldn't look different, except perhaps for a
line some place that says "Number of Workers: N".

Now, there'll be a node that needs to do some parallel magic - but in
the above example that should be the AggCombinerNode, which would not
only ask for tuples from one of the children at a time, but ask multiple
ones in parallel. But even then it doesn't have to deal with concurrency
around its own state.

Sure, we clearly want to minimize the amount of coordination between nodes.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#168Andres Freund
andres@2ndquadrant.com
In reply to: Robert Haas (#167)
Re: Parallel Seq Scan

On 2015-02-10 09:23:02 -0500, Robert Haas wrote:

On Tue, Feb 10, 2015 at 9:08 AM, Andres Freund <andres@2ndquadrant.com> wrote:

And good chunk sizes et al depend on higher layers,
selectivity estimates and such. And that's planner/executor work, not
the physical layer (which heapam.c pretty much is).

If it's true that a good chunk size depends on the higher layers, then
that would be a good argument for doing this differently, or at least
exposing an API for the higher layers to tell heapam.c what chunk size
they want. I hadn't considered that possibility - can you elaborate
on why you think we might want to vary the chunk size?

Because things like chunk size depend on the shape of the entire
plan. If you have a 1TB table and want to sequentially scan it in
parallel with 10 workers you'd better use some rather large chunks. That
way readahead will be efficient in a cpu/socket local manner,
i.e. directly reading in the pages into the directly connected memory of
that cpu. Important for performance on a NUMA system, otherwise you'll
constantly have everything go over the shared bus. But if you instead
have a plan where the sequential scan goes over a 1GB table, perhaps
with some relatively expensive filters, you'll really want a small
chunk size to avoid waiting. The chunk size will also really depend on
what other nodes are doing, at least if they can run in the same worker.

Even without things like NUMA and readahead I'm pretty sure that you'll
want a chunk size a good bit above one page. The locks we acquire for
the buffercache lookup and for reading the page are already quite bad
for performance/scalability; even if we don't always/often hit the same
lock. Making 20 processes that scan pages in parallel acquire yet
another lock (that's shared between all of them!) for every single page
won't be fun, especially without filters or with fast ones.

For this case, what I would imagine is that there is one parallel heap
scan, and each PartialSeqScan attaches to it. The executor says "give
me a tuple" and heapam.c provides one. Details like the chunk size
are managed down inside heapam.c, and the executor does not know about
them. It just knows that it can establish a parallel scan and then
pull tuples from it.

I think that's a horrible approach that'll end up with far more
entangled pieces than what you're trying to avoid. Unless the tuple flow
is organized to only happen in the necessary cases the performance will
be horrible.

I can't understand this at all. A parallel heap scan, as I've coded
it up, involves no tuple flow at all. All that's happening at the
heapam.c layer is that we're coordinating which blocks to scan. Not
to be disrespectful, but have you actually looked at the patch?

No, and I said so upthread. I started commenting because you argued that
architecturally parallelism belongs in heapam.c instead of upper layers,
and I can't agree with that. I now have, and it looks less bad than I
had assumed, sorry.

Unfortunately I still think it's the wrong approach, also sorry.

As pointed out above (moved there after reading the patch...) I don't
think a chunk size of 1 or any other constant size can make sense. I
don't even believe it'll necessarily be constant across an entire query
execution (big initially, small at the end). Now, we could move
determining that before the query execution into executor
initialization, but then we won't yet know how many workers we're going
to get. We could add a function setting that at runtime, but that'd mix
up responsibilities quite a bit.
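
As an illustration of a chunk size that is "big initially, small at the
end", here is one possible heuristic. It is an assumption made for this
sketch and not something proposed in the thread: each request takes a
share of the blocks still remaining, clamped to a minimum and a maximum.
next_chunk_size and its parameters are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Pick the size of the next chunk, in blocks.
 *
 * remaining_blocks: blocks not yet handed out to any worker
 * nworkers:         workers actually running (known only at run time)
 * min_chunk:        floor, to keep coordination cost per block low
 * max_chunk:        ceiling, e.g. 2048 blocks for 16 MB of 8 kB pages
 */
uint32_t
next_chunk_size(uint32_t remaining_blocks, uint32_t nworkers,
                uint32_t min_chunk, uint32_t max_chunk)
{
    uint64_t    chunk;

    assert(nworkers > 0);
    assert(min_chunk > 0 && min_chunk <= max_chunk);

    /* Aim for roughly two more requests per worker at the current size. */
    chunk = (uint64_t) remaining_blocks / (2u * (uint64_t) nworkers);

    if (chunk > max_chunk)
        chunk = max_chunk;        /* early in a large scan */
    if (chunk < min_chunk)
        chunk = min_chunk;        /* near the end of the scan */
    if (chunk > remaining_blocks)
        chunk = remaining_blocks; /* never hand out more than is left */

    return (uint32_t) chunk;
}
```

Since the worker count is an argument rather than a plan-time constant,
the function can be called with whatever number of workers was actually
obtained, though deciding who owns min_chunk and max_chunk is exactly the
layering question being argued here.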

I also can't agree with having a static snapshot in shared memory put
there by the initialization function. For one it's quite awkward to end
up with several equivalent snapshots at various places in shared
memory. Right now the entire query execution can share one snapshot,
this way we'd end up with several of them. Imo for actual parallel
query execution the plan should be shared once and then be reused for
everything done in the name of the query.

Without the need to do that you end up pretty much only with setup
for infrastructure so that heap_parallelscan_nextpage is called. How about
instead renaming heap_beginscan_internal() to _extended and offering an
option to provide a callback + state that determines the next page?
Additionally provide some separate functions managing a simple
implementation of such a callback + state?

Btw, using an atomic uint32 you'd end up without the spinlock and just
about the same amount of code... Just do an atomic_fetch_add_until32(var,
1, InvalidBlockNumber)... ;)
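
The operation named above is not spelled out in the message, so the
following is only a guess at its intended semantics: a fetch-and-add that
saturates at a limit instead of wrapping around. It is written with C11
<stdatomic.h> rather than PostgreSQL's atomics layer, and
fetch_add_until32 and parallel_next_page are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/*
 * Atomically add "add" to *var, but never move it past "limit".
 * Returns the value *var held before the operation.
 */
uint32_t
fetch_add_until32(_Atomic uint32_t *var, uint32_t add, uint32_t limit)
{
    uint32_t    old;
    uint32_t    newval;

    assert(var != NULL);

    old = atomic_load(var);
    do
    {
        if (old >= limit)
            return old;           /* already saturated, leave it alone */
        newval = (limit - old < add) ? limit : old + add;
        /* On failure, "old" is refreshed with the current value. */
    } while (!atomic_compare_exchange_weak(var, &old, newval));

    return old;
}

/* Hand out the next block of a shared scan without taking a spinlock. */
BlockNumber
parallel_next_page(_Atomic uint32_t *next, BlockNumber nblocks)
{
    BlockNumber blk = fetch_add_until32(next, 1, InvalidBlockNumber);

    return (blk < nblocks) ? blk : InvalidBlockNumber;
}
```

Saturating at InvalidBlockNumber means that workers which keep asking
after the scan is exhausted cannot wrap the counter back to block 0.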

I think we're in violent agreement here, except for some
terminological confusion. Are there N PartialSeqScan nodes, one
running in each node, or is there one ParallelSeqScan node, which is
copied and run jointly across N nodes? You can talk about either way
and have it make sense, but we haven't had enough conversations about
this on this list to have settled on a consistent set of vocabulary
yet.

I pretty strongly believe that it has to be independent scan nodes. Both
from an implementation and a conversational POV. They might have some
very light cooperation between them (e.g. coordinating block ranges or
such), but everything else should be separate. From an implementation
POV it seems pretty awful to have an executor node that's accessed by
multiple separate backends - that'd mean it has to be concurrency safe,
have state in shared memory and everything.

I don't agree with that, but again I think it's a terminological
dispute. I think what will happen is that you will have a single node
that gets copied into multiple backends, and in some cases a small
portion of its state will live in shared memory. That's more or less
what you're thinking of too, I think.

Well, let me put it that way, I think that the tuple flow has to be
pretty much like I'd ascii-art'ed earlier. And that only very few nodes
will need to coordinate between query execution happening in different
workers. With that I mean it has to be possible to have queries like:

     ParallelismDrivingNode
               |
     ---------------- Parallelism boundary
               |
            NestLoop
            /      \
     CSeqScan      IndexScan

Where the 'coordinated seqscan' scans a relation so that each tuple
eventually gets returned once across all nodes, but the nested loop (and
through it the index scan) will just run normally, without any
coordination and parallelism. But everything below --- would happen
in multiple nodes. If you agree, yes, then we're in violent agreement
;). The "single node that gets copied" bit above makes me a bit unsure
whether we are though.

To me, given the existing executor code, it seems easiest to achieve
that by having the ParallelismDrivingNode above have a dynamic number
of nestloop children in different backends and point the coordinated
seqscan to some shared state. As you point out, the number of these
children cannot be known for certain (just targeted) at plan time;
that puts a certain limit on how independent they are. But since a
large number of them can be independent between workers it seems awkward
to generally treat them as being the same node across workers. But maybe
that's just an issue with my mental model.

But what I don't want is - if we've got a parallel scan-and-aggregate
happening in N nodes, EXPLAIN shows N copies of all of that - not only
because it's display clutter, but also because a plan to do that thing
with 3 workers is fundamentally the same as a plan to do it with 30
workers. Those plans shouldn't look different, except perhaps for a
line some place that says "Number of Workers: N".

I'm really not concerned with what explain is going to show. We can do
quite some fudging there - it's not like it's a 1:1 representation of
the query plan.

I think we're getting to the point where having a unique mapping from
the plan to the execution tree is proving to be rather limiting
anyway. Check for example the discussion about join removal. But even for
current code, showing only the custom plans for the first five EXPLAIN
EXECUTEs is pretty nasty (Try explaining that to somebody who doesn't know
pg internals. Their looks are worth gold and can kill you at the same
time) and should be done differently.

And I actually can very well imagine that you'd want an option to show
the different execution statistics for every worker in the ANALYZE case.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#169Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#168)
Re: Parallel Seq Scan

On Tue, Feb 10, 2015 at 3:56 PM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2015-02-10 09:23:02 -0500, Robert Haas wrote:

On Tue, Feb 10, 2015 at 9:08 AM, Andres Freund <andres@2ndquadrant.com> wrote:

And good chunk sizes et al depend on higher layers,
selectivity estimates and such. And that's planner/executor work, not
the physical layer (which heapam.c pretty much is).

If it's true that a good chunk size depends on the higher layers, then
that would be a good argument for doing this differently, or at least
exposing an API for the higher layers to tell heapam.c what chunk size
they want. I hadn't considered that possibility - can you elaborate
on why you think we might want to vary the chunk size?

Because things like chunk size depend on the shape of the entire
plan. If you have a 1TB table and want to sequentially scan it in
parallel with 10 workers you better use some rather large chunks. That
way readahead will be efficient in a cpu/socket local manner,
i.e. directly reading in the pages into the directly connected memory of
that cpu. Important for performance on a NUMA system, otherwise you'll
constantly have everything go over the shared bus. But if you instead
have a plan where the sequential scan goes over a 1GB table, perhaps
with some relatively expensive filters, you'll really want a small
chunk size to avoid waiting.

I see. That makes sense.

The chunk size will also really depend on
what other nodes are doing, at least if they can run in the same worker.

Example?

Even without things like NUMA and readahead I'm pretty sure that you'll
want a chunk size a good bit above one page. The locks we acquire for
the buffercache lookup and for reading the page are already quite bad
for performance/scalability; even if we don't always/often hit the same
lock. Making 20 processes that scan pages in parallel acquire yet
another lock (that's shared between all of them!) for every single page
won't be fun, especially with no filters or only fast ones.

This is possible, but I'm skeptical. If the amount of other work we
have to do for that page is so little that the additional spinlock cycle
per page causes meaningful contention, I doubt we should be
parallelizing in the first place.

No, and I said so upthread. I started commenting because you argued that
architecturally parallelism belongs in heapam.c instead of upper layers,
and I can't agree with that. I now have, and it looks less bad than I
had assumed, sorry.

OK, that's something.

Unfortunately I still think it's the wrong approach, also sorry.

As pointed out above (moved there after reading the patch...) I don't
think a chunk size of 1 or any other constant size can make sense. I
don't even believe it'll necessarily be constant across an entire query
execution (big initially, small at the end). Now, we could move
determining that before the query execution into executor
initialization, but then we won't yet know how many workers we're going
to get. We could add a function setting that at runtime, but that'd mix
up responsibilities quite a bit.

I still think this belongs in heapam.c somehow or other. If the logic
is all in the executor, then it becomes impossible for any code that
doesn't use the executor to do a parallel heap scan, and that's
probably bad. It's not hard to imagine something like CLUSTER wanting
to reuse that code, and that won't be possible if the logic is up in
some higher layer. If the logic we want is to start with a large
chunk size and then switch to a small chunk size when there's not much
of the relation left to scan, there's still no reason that can't be
encapsulated in heapam.c.
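
The "large chunks first, small chunks near the end" policy described above can be sketched as a small block allocator. This is only an illustration under assumed names: ChunkScanState, next_chunk and the remaining / (2 * nworkers) sizing rule are hypothetical and do not come from the patch. A real implementation would keep this state in dynamic shared memory and protect it with a lock or atomics.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t BlockNum;		/* stand-in for PostgreSQL's BlockNumber */

/* Hypothetical scan state; not a structure from the patch. */
typedef struct ChunkScanState
{
	BlockNum	nblocks;		/* total blocks in the relation */
	BlockNum	next_block;		/* first block not yet handed out */
	BlockNum	max_chunk;		/* upper bound on the chunk size */
	uint32_t	nworkers;		/* number of participating processes */
} ChunkScanState;

/*
 * Hand out the next chunk of blocks.  Returns the number of blocks in the
 * chunk and stores its first block in *start; returns 0 when the scan is
 * finished.  The chunk size is remaining / (2 * nworkers), clamped to
 * [1, max_chunk], so chunks shrink as the scan approaches the end.
 */
static BlockNum
next_chunk(ChunkScanState *st, BlockNum *start)
{
	BlockNum	remaining;
	BlockNum	size;

	assert(st->nworkers > 0 && st->max_chunk > 0);

	remaining = st->nblocks - st->next_block;
	if (remaining == 0)
		return 0;

	size = remaining / (2 * st->nworkers);
	if (size > st->max_chunk)
		size = st->max_chunk;
	if (size < 1)
		size = 1;

	*start = st->next_block;
	st->next_block += size;
	return size;
}
```

A caller would loop on next_chunk() until it returns 0, scanning blocks start .. start + size - 1 each time. Because the sizing rule lives entirely inside the allocator, a non-executor caller could reuse it unchanged, which is the encapsulation argument being made here.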

Btw, using an atomic uint32 you'd end up without the spinlock and just
about the same amount of code... Just do an atomic_fetch_add_until32(var,
1, InvalidBlockNumber)... ;)

I thought of that, but I think there's an overflow hazard.
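
One possible reading of the overflow concern is that a plain fetch-and-add keeps incrementing the counter after the scan has ended, so with enough calls it could eventually wrap. A bounded variant can be sketched with a compare-and-swap loop. The sketch below uses C11 <stdatomic.h> rather than PostgreSQL's own atomics API, and claim_next_block and INVALID_BLOCK are hypothetical names chosen for illustration; atomic_fetch_add_until32 as written in the message is likewise a suggested primitive, not an existing function.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define INVALID_BLOCK UINT32_MAX	/* stand-in for InvalidBlockNumber */

/*
 * Atomically claim the next block number below 'limit'.  Returns the
 * claimed block, or INVALID_BLOCK once every block has been handed out.
 * The counter is only ever advanced while it is below 'limit', so it
 * never exceeds 'limit' and cannot wrap around.
 */
static uint32_t
claim_next_block(_Atomic uint32_t *next, uint32_t limit)
{
	uint32_t	cur = atomic_load(next);

	assert(limit != INVALID_BLOCK);

	while (cur < limit)
	{
		/* On failure 'cur' is refreshed with the current value; retry. */
		if (atomic_compare_exchange_weak(next, &cur, cur + 1))
			return cur;
	}
	return INVALID_BLOCK;
}
```

The trade-off relative to a plain fetch-and-add is that the loop may retry under contention, but no spinlock is taken and repeated calls after the end of the scan leave the counter untouched.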

Where the 'coordinated seqscan' scans a relation so that each tuple
eventually gets returned once across all nodes, but the nested loop (and
through it the index scan) will just run normally, without any
coordination and parallelism. But everything below --- would happen
in multiple nodes. If you agree, yes, then we're in violent agreement
;). The "single node that gets copied" bit above makes me a bit unsure
whether we are though.

Yeah, I think we're talking about the same thing.

To me, given the existing executor code, it seems easiest to achieve
that by having the ParallelismDrivingNode above having a dynamic number
of nestloop children in different backends and point the coordinated
seqscan to some shared state. As you point out, the number of these
children cannot be certainly known (just targeted for) at plan time;
that puts a certain limit on how independent they are. But since a
large number of them can be independent between workers it seems awkward
to generally treat them as being the same node across workers. But maybe
that's just an issue with my mental model.

I think it makes sense to think of a set of tasks in which workers can
assist. So you have a query tree which is just one query tree, with no
copies of the nodes, and then there are certain places in that query
tree where a worker can jump in and assist that node. To do that, it
will have a copy of the node, but that doesn't mean that all of the
stuff inside the node becomes shared data at the code level, because
that would be stupid.
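
The split between "a copy of the node" and "shared data" can be pictured roughly as follows. This is a single-process sketch with hypothetical names (SharedScanState, LocalScanNode, claim_block), not the structures used in the patch: each participant holds its own private node and only the block cursor is common to all of them. In a real backend the shared part would live in dynamic shared memory and need a lock or atomic operations.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: the only state every participant must see. */
typedef struct SharedScanState
{
	uint32_t	nblocks;		/* blocks in the relation */
	uint32_t	next_block;		/* next block nobody has claimed yet */
} SharedScanState;

/* Hypothetical: each participant's private copy of the scan node. */
typedef struct LocalScanNode
{
	SharedScanState *shared;	/* pointer to the common state */
	uint32_t	current_block;	/* private: block being scanned */
	uint64_t	blocks_scanned;	/* private: per-participant statistics */
} LocalScanNode;

/* Claim the next block for this participant; returns 0 when none are left. */
static int
claim_block(LocalScanNode *node)
{
	SharedScanState *sh = node->shared;

	assert(sh != NULL);

	if (sh->next_block >= sh->nblocks)
		return 0;

	node->current_block = sh->next_block++;
	node->blocks_scanned++;
	return 1;
}
```

With two LocalScanNode instances pointing at the same SharedScanState, each block is claimed by exactly one of them, while fields such as blocks_scanned stay per-participant.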

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#170Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#169)
Re: Parallel Seq Scan

On Thu, Feb 12, 2015 at 2:19 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Feb 10, 2015 at 3:56 PM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2015-02-10 09:23:02 -0500, Robert Haas wrote:

On Tue, Feb 10, 2015 at 9:08 AM, Andres Freund <andres@2ndquadrant.com> wrote:

As pointed out above (moved there after reading the patch...) I don't
think a chunk size of 1 or any other constant size can make sense. I
don't even believe it'll necessarily be constant across an entire query
execution (big initially, small at the end). Now, we could move
determining that before the query execution into executor
initialization, but then we won't yet know how many workers we're going
to get. We could add a function setting that at runtime, but that'd mix
up responsibilities quite a bit.

I still think this belongs in heapam.c somehow or other. If the logic
is all in the executor, then it becomes impossible for any code that
doesn't use the executor to do a parallel heap scan, and that's
probably bad. It's not hard to imagine something like CLUSTER wanting
to reuse that code, and that won't be possible if the logic is up in
some higher layer. If the logic we want is to start with a large
chunk size and then switch to a small chunk size when there's not much
of the relation left to scan, there's still no reason that can't be
encapsulated in heapam.c.

It seems to me that we need both approaches for doing parallelism: making
the heap or other lower layers aware of parallelism, and handling it at the
executor level, using callback_function and callback_state to make it work.
TBH, I think for the purposes of this patch we can go either way and then
think more about it as we move ahead to parallelize other operations.
So what I can do is try using Robert's patch to make the heap aware of
parallelism and then see how it turns out?

Btw, using an atomic uint32 you'd end up without the spinlock and just
about the same amount of code... Just do an atomic_fetch_add_until32(var,
1, InvalidBlockNumber)... ;)

I thought of that, but I think there's an overflow hazard.

Where the 'coordinated seqscan' scans a relation so that each tuple
eventually gets returned once across all nodes, but the nested loop (and
through it the index scan) will just run normally, without any
coordination and parallelism. But everything below --- would happen
in multiple nodes. If you agree, yes, then we're in violent agreement
;). The "single node that gets copied" bit above makes me a bit unsure
whether we are though.

Yeah, I think we're talking about the same thing.

To me, given the existing executor code, it seems easiest to achieve
that by having the ParallelismDrivingNode above having a dynamic number
of nestloop children in different backends and point the coordinated
seqscan to some shared state. As you point out, the number of these
children cannot be certainly known (just targeted for) at plan time;
that puts a certain limit on how independent they are. But since a
large number of them can be independent between workers it seems awkward
to generally treat them as being the same node across workers. But maybe
that's just an issue with my mental model.

I think it makes sense to think of a set of tasks in which workers can
assist. So you have a query tree which is just one query tree, with no
copies of the nodes, and then there are certain places in that query
tree where a worker can jump in and assist that node. To do that, it
will have a copy of the node, but that doesn't mean that all of the
stuff inside the node becomes shared data at the code level, because
that would be stupid.

As per my understanding of the discussion related to this point, I think
there are 3 somewhat related ways to achieve this.

1. Both master and worker run the same node (ParallelSeqScan), where
the work done by the worker (scan chunks of the heap) for this node is a
subset of what is done by the master (coordinate the data returned by
workers + scan chunks of the heap). It seems to me Robert is advocating
this approach.
2. Master and worker use different nodes to operate. The master runs the
parallelism driving node (ParallelSeqScan - coordinate the data returned by
workers + scan chunks of the heap) and the worker runs some form of
Parallelismdriver node (PartialSeqScan - scan chunks of the heap). It seems
to me Andres is proposing this approach.
3. Same as 2, but modify the existing SeqScan node to behave as
PartialSeqScan. This is what I have done in the patch.

Correct me or add here if I have misunderstood anything.

I think going forward (for cases like aggregation) the work done in the
master and worker nodes will differ so substantially that it is better
to do the work as part of different nodes in master and worker.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#171Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#163)
1 attachment(s)
Re: Parallel Seq Scan

On Mon, Feb 9, 2015 at 7:37 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Feb 9, 2015 at 2:31 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Another idea is to use Executor level interfaces (like ExecutorStart(),
ExecutorRun(), ExecutorEnd()) for execution rather than using Portal
level interfaces. I have used Portal level interfaces with the
thought that we can reuse the existing infrastructure of Portal to
make parallel execution of scrollable cursors, but as per my analysis
it is not so easy to support them, especially backward scan, absolute/
relative fetch, etc., so Executor level interfaces seem more appealing
to me (something like how the Explain statement works (ExplainOnePlan)).
Using Executor level interfaces will have the advantage that we can reuse
them for other parallel functionalities. In this approach, we need to take
care of constructing the relevant structures (with the information passed by
the master backend) required for the Executor interfaces, but I think these
should be fewer than what we need in the previous approach (extract seqscan
specific stuff from the executor).

I think using the executor-level interfaces instead of the
portal-level interfaces is a good idea. That would possibly let us
altogether prohibit access to the portal layer from within a parallel
worker, which seems like it might be a good sanity check to add. But
that seems to still require us to have a PlannedStmt and a QueryDesc,
and I'm not sure whether that's going to be too much of a pain. We
might need to think about an alternative API for starting the Executor
like ExecutorStartParallel() or ExecutorStartExtended(). But I'm not
sure. If you can revise things to go through the executor interfaces
I think that would be a good start, and then perhaps after that we can
see what else makes sense to do.

Okay, I have modified the patch to use Executor level interfaces
rather than Portal-level interfaces. To achieve that I needed to add
a new Dest (DestRemoteBackend). For now, I have modified
printtup.c to handle this new destination type similarly to what
it does for DestRemote and DestRemoteExecute.

Apart from above, the other major changes to address your concerns
and review comments are:
a. Made InitiateWorkers() and ParallelQueryMain() (an entry function for
parallel query execution) modular
b. Adapted the parallel-heap-scan patch posted by Robert upthread
/messages/by-id/CA+TgmoYJETgeAXUsZROnA7BdtWzPtqExPJNTV1GKcaVMgSdhug@mail.gmail.com
c. Now the master and worker backends both run as part of the same node,
ParallelSeqScan (I have yet to update the copy and out funcs for the new
parameters); check if you think that is the right way to go. I still
feel it would have been better if the master and worker backends ran
as part of different nodes; however, this also looks okay for the
purpose of parallel sequential scan.

I have yet to modify the code to allow expressions in the projection
and to allow joins. I think these are related to the allow-parallel-safety
patch; I will take a look at that patch and then modify the code
accordingly.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

parallel_seqscan_v7.patch (application/octet-stream)
diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index 21721b4..823d5c3 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/access
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS	    = brin common gin gist hash heap index nbtree rmgrdesc spgist transam
+SUBDIRS	    = brin common gin gist hash heap index nbtree rmgrdesc shmmq spgist transam
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/common/printtup.c b/src/backend/access/common/printtup.c
index baed981..a032afb 100644
--- a/src/backend/access/common/printtup.c
+++ b/src/backend/access/common/printtup.c
@@ -82,7 +82,7 @@ printtup_create_DR(CommandDest dest)
 
 	/*
 	 * Send T message automatically if DestRemote, but not if
-	 * DestRemoteExecute
+	 * DestRemoteExecute or DestRemoteBackend
 	 */
 	self->sendDescrip = (dest == DestRemote);
 
@@ -95,7 +95,8 @@ printtup_create_DR(CommandDest dest)
 }
 
 /*
- * Set parameters for a DestRemote (or DestRemoteExecute) receiver
+ * Set parameters for a DestRemote (or DestRemoteExecute or DestRemoteBackend)
+ * receiver
  */
 void
 SetRemoteDestReceiverParams(DestReceiver *self, Portal portal)
@@ -103,7 +104,8 @@ SetRemoteDestReceiverParams(DestReceiver *self, Portal portal)
 	DR_printtup *myState = (DR_printtup *) self;
 
 	Assert(myState->pub.mydest == DestRemote ||
-		   myState->pub.mydest == DestRemoteExecute);
+		   myState->pub.mydest == DestRemoteExecute ||
+		   myState->pub.mydest == DestRemoteBackend);
 
 	myState->portal = portal;
 
@@ -243,7 +245,19 @@ SendRowDescriptionMessage(TupleDesc typeinfo, List *targetlist, int16 *formats)
 				pq_sendint(&buf, 0, 2);
 		}
 	}
-	pq_endmessage(&buf);
+
+	/*
+	 * Send the message via shared-memory tuple queue, if the same
+	 * is enabled.
+	 */
+	if (is_tuple_shm_mq_enabled())
+	{
+		mq_putmessage_direct(buf.cursor, buf.data, buf.len);
+		pfree(buf.data);
+		buf.data = NULL;
+	}
+	else
+		pq_endmessage(&buf);
 }
 
 /*
@@ -252,9 +266,15 @@ SendRowDescriptionMessage(TupleDesc typeinfo, List *targetlist, int16 *formats)
 static void
 printtup_prepare_info(DR_printtup *myState, TupleDesc typeinfo, int numAttrs)
 {
-	int16	   *formats = myState->portal->formats;
+	int16	   *formats;
 	int			i;
 
+	/* Remote backend always uses binary format to communicate. */
+	if (myState->pub.mydest == DestRemoteBackend)
+		formats = NULL;
+	else
+		formats = myState->portal->formats;
+
 	/* get rid of any old data */
 	if (myState->myinfo)
 		pfree(myState->myinfo);
@@ -271,7 +291,12 @@ printtup_prepare_info(DR_printtup *myState, TupleDesc typeinfo, int numAttrs)
 	for (i = 0; i < numAttrs; i++)
 	{
 		PrinttupAttrInfo *thisState = myState->myinfo + i;
-		int16		format = (formats ? formats[i] : 0);
+		int16		format;
+
+		if (myState->pub.mydest == DestRemoteBackend)
+			format = (formats ? formats[i] : 1);
+		else
+			format = (formats ? formats[i] : 0);
 
 		thisState->format = format;
 		if (format == 0)
@@ -371,7 +396,18 @@ printtup(TupleTableSlot *slot, DestReceiver *self)
 		}
 	}
 
-	pq_endmessage(&buf);
+	/*
+	 * Send the message via shared-memory tuple queue, if the same
+	 * is enabled.
+	 */
+	if (is_tuple_shm_mq_enabled())
+	{
+		mq_putmessage_direct(buf.cursor, buf.data, buf.len);
+		pfree(buf.data);
+		buf.data = NULL;
+	}
+	else
+		pq_endmessage(&buf);
 
 	/* Return to caller's context, and flush row's temporary memory */
 	MemoryContextSwitchTo(oldcontext);
diff --git a/src/backend/access/shmmq/Makefile b/src/backend/access/shmmq/Makefile
new file mode 100644
index 0000000..aeae8d9
--- /dev/null
+++ b/src/backend/access/shmmq/Makefile
@@ -0,0 +1,17 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for access/shmmq
+#
+# IDENTIFICATION
+#    src/backend/access/shmmq/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/shmmq
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = shmmqam.o 
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/shmmq/shmmqam.c b/src/backend/access/shmmq/shmmqam.c
new file mode 100644
index 0000000..116a717
--- /dev/null
+++ b/src/backend/access/shmmq/shmmqam.c
@@ -0,0 +1,339 @@
+/*-------------------------------------------------------------------------
+ *
+ * shmmqam.c
+ *	  shared memory queue access method code
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/shmmq/shmmqam.c
+ *
+ *
+ * INTERFACE ROUTINES
+ *		shm_getnext	- retrieve next tuple in queue
+ *
+ * NOTES
+ *	  This file contains the shm_ routines which retrieve tuples sent
+ *	  by worker backends to the master backend through shared memory
+ *	  message queues.
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup.h"
+#include "access/htup_details.h"
+#include "access/shmmqam.h"
+#include "access/tupdesc.h"
+#include "fmgr.h"
+#include "libpq/libpq.h"
+#include "libpq/pqformat.h"
+#include "utils/lsyscache.h"
+
+
+static bool
+HandleParallelTupleMessage(worker_result resultState, TupleDesc tupdesc,
+						   StringInfo msg, int queueId);
+static HeapTuple
+form_result_tuple(worker_result resultState, TupleDesc tupdesc,
+				  StringInfo msg, int queueId);
+
+/*
+ * shm_beginscan
+ *
+ * Initializes the shared memory scan descriptor to retrieve tuples
+ * from worker backends. 
+ */
+ShmScanDesc
+shm_beginscan(int num_queues)
+{
+	ShmScanDesc		shmscan;
+
+	shmscan = palloc(sizeof(ShmScanDescData));
+
+	shmscan->num_shm_queues = num_queues;
+	shmscan->ss_cqueue = -1;
+	shmscan->shmscan_inited	= false;
+
+	return shmscan;
+}
+
+/*
+ * ExecInitWorkerResult
+ *
+ * Initializes the result state to retrieve tuples from worker backends. 
+ */
+worker_result
+ExecInitWorkerResult(TupleDesc tupdesc, int nWorkers)
+{
+	worker_result	workerResult;
+	int				i;
+	int	natts = tupdesc->natts;
+
+	workerResult = palloc0(sizeof(worker_result_state));
+	workerResult->receive_functions = palloc(sizeof(FmgrInfo) * natts);
+	workerResult->typioparams = palloc(sizeof(Oid) * natts);
+	workerResult->num_shm_queues = nWorkers;
+	workerResult->queue_detached = palloc0(sizeof(bool) * nWorkers);
+
+	for (i = 0;	i < natts; ++i)
+	{
+		Oid	receive_function_id;
+
+		getTypeBinaryInputInfo(tupdesc->attrs[i]->atttypid,
+							   &receive_function_id,
+							   &workerResult->typioparams[i]);
+		fmgr_info(receive_function_id, &workerResult->receive_functions[i]);
+	}
+
+	return workerResult;
+}
+
+
+/*
+ * shm_getnext
+ *
+ *	Get the next tuple from shared memory queue.  This function
+ *	is responsible for fetching tuples from all the queues associated
+ *	with worker backends used in parallel sequential scan.
+ */
+HeapTuple
+shm_getnext(HeapScanDesc scanDesc, ShmScanDesc shmScan,
+			worker_result resultState, shm_mq_handle **responseq,
+			TupleDesc tupdesc, ScanDirection direction, bool *fromheap)
+{
+	shm_mq_result	res;
+	Size			nbytes;
+	void			*data;
+	StringInfoData	msg;
+	int				queueId = 0;
+
+	/*
+	 * calculate next starting queue used for fetching tuples
+	 */
+	if(!shmScan->shmscan_inited)
+	{
+		shmScan->shmscan_inited = true;
+		Assert(shmScan->num_shm_queues > 0);
+		queueId = 0;
+	}
+	else
+		queueId = shmScan->ss_cqueue;
+
+	/* Read and process messages from the shared memory queues. */
+	for(;;)
+	{
+		if (!resultState->all_queues_detached)
+		{
+			if (queueId == shmScan->num_shm_queues)
+				queueId = 0;
+
+			/*
+			 * Don't fetch from detached queue.  This loop could continue
+			 * forever, if we reach a situation such that all queues are
+			 * detached, however we won't reach here if that is the case.
+			 */
+			while (resultState->queue_detached[queueId])
+			{
+				++queueId;
+				if (queueId == shmScan->num_shm_queues)
+					queueId = 0;
+			}
+
+			for (;;)
+			{
+				/*
+				 * mark current queue used for fetching tuples, this is used
+				 * to fetch consecutive tuples from queue used in previous
+				 * fetch.
+				 */
+				shmScan->ss_cqueue = queueId;
+
+				/* Get next message. */
+				res = shm_mq_receive(responseq[queueId], &nbytes, &data, true);
+				if (res == SHM_MQ_DETACHED)
+				{
+					/*
+					 * mark the queue that got detached, so that we don't
+					 * try to fetch from it again.
+					 */
+					resultState->queue_detached[queueId] = true;
+					--resultState->num_shm_queues;
+					/*
+					 * if we have exhausted data from all worker queues, then don't
+					 * process data from queues.
+					 */
+					if (resultState->num_shm_queues <= 0)
+						resultState->all_queues_detached = true;
+					break;
+				}
+				else if (res == SHM_MQ_WOULD_BLOCK)
+					break;
+				else if (res == SHM_MQ_SUCCESS)
+				{
+					bool rettuple;
+					initStringInfo(&msg);
+					appendBinaryStringInfo(&msg, data, nbytes);
+					rettuple = HandleParallelTupleMessage(resultState, tupdesc, &msg, queueId);
+					pfree(msg.data);
+					if (rettuple)
+					{
+						*fromheap = false;
+						return resultState->tuple;
+					}
+				}
+			}
+		}
+
+		/*
+		 * if we have checked all the message queues and didn't find
+		 * any message or we have already fetched all the data from queues,
+		 * then it's time to fetch directly from heap.  Reset the current
+		 * queue as the first queue from which we need to receive tuples.
+		 */
+		if ((queueId == shmScan->num_shm_queues - 1 ||
+			 resultState->all_queues_detached) &&
+			 !resultState->all_heap_fetched)
+		{
+			HeapTuple	tuple;
+			shmScan->ss_cqueue = 0;
+			tuple = heap_getnext(scanDesc, direction);
+			if (tuple)
+			{
+				*fromheap = true;
+				return tuple;
+			}
+			else if (tuple == NULL && resultState->all_queues_detached)
+				break;
+			else
+				resultState->all_heap_fetched = true;
+		}
+		else if (resultState->all_queues_detached &&
+				 resultState->all_heap_fetched)
+			break;
+
+		/* check the data in next queue. */
+		++queueId;
+	}
+
+	return NULL;
+}
+
+/*
+ * HandleParallelTupleMessage
+ *
+ * Handle a single tuple related protocol message received from
+ * a single parallel worker.
+ */
+static bool
+HandleParallelTupleMessage(worker_result resultState, TupleDesc tupdesc,
+						   StringInfo msg, int queueId)
+{
+	char	msgtype;
+	bool	rettuple = false;
+
+	msgtype = pq_getmsgbyte(msg);
+
+	/* Dispatch on message type. */
+	switch (msgtype)
+	{
+		case 'D':
+			{
+				/* Handle DataRow message. */
+				resultState->tuple = form_result_tuple(resultState, tupdesc, msg, queueId);
+				rettuple = true;
+				break;
+			}
+		case 'C':
+			{
+				/*
+				 * Handle CommandComplete message. Ignore tags sent by
+				 * worker backend as we are anyway going to use tag of
+				 * master backend for sending the same to client.
+				 */
+				(void) pq_getmsgstring(msg);
+				break;
+			}
+		case 'G':
+		case 'H':
+		case 'W':
+			{
+				ereport(ERROR,
+						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						 errmsg("COPY protocol not allowed in worker")));
+			}
+		default:
+			elog(WARNING, "unknown message type: %c", msg->data[0]);
+			break;
+	}
+
+	return rettuple;
+}
+
+/*
+ * form_result_tuple
+ *
+ * Parse a DataRow message and form a result tuple.
+ */
+static HeapTuple
+form_result_tuple(worker_result resultState, TupleDesc tupdesc,
+				  StringInfo msg, int queueId)
+{
+	/* Handle DataRow message. */
+	int16	natts = pq_getmsgint(msg, 2);
+	int16	i;
+	Datum  *values = NULL;
+	bool   *isnull = NULL;
+	HeapTuple	tuple;
+	StringInfoData	buf;
+
+	if (natts != tupdesc->natts)
+		elog(ERROR, "malformed DataRow");
+	if (natts > 0)
+	{
+		values = palloc(natts * sizeof(Datum));
+		isnull = palloc(natts * sizeof(bool));
+	}
+	initStringInfo(&buf);
+
+	for (i = 0; i < natts; ++i)
+	{
+		int32	bytes = pq_getmsgint(msg, 4);
+
+		if (bytes < 0)
+		{
+			values[i] = ReceiveFunctionCall(&resultState->receive_functions[i],
+											NULL,
+											resultState->typioparams[i],
+											tupdesc->attrs[i]->atttypmod);
+			isnull[i] = true;
+		}
+		else
+		{
+			resetStringInfo(&buf);
+			appendBinaryStringInfo(&buf, pq_getmsgbytes(msg, bytes), bytes);
+			values[i] = ReceiveFunctionCall(&resultState->receive_functions[i],
+											&buf,
+											resultState->typioparams[i],
+											tupdesc->attrs[i]->atttypmod);
+			isnull[i] = false;
+		}
+	}
+
+	pq_getmsgend(msg);
+
+	tuple = heap_form_tuple(tupdesc, values, isnull);
+
+	/*
+	 * Release locally palloc'd space.  XXX would probably be good to pfree
+	 * values of pass-by-reference datums, as well.
+	 */
+	pfree(values);
+	pfree(isnull);
+
+	pfree(buf.data);
+
+	return tuple;
+}
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 7cfc9bb..8b85e97 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -721,6 +721,7 @@ ExplainPreScanNode(PlanState *planstate, Bitmapset **rels_used)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_ParallelSeqScan:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -917,6 +918,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_SeqScan:
 			pname = sname = "Seq Scan";
 			break;
+		case T_ParallelSeqScan:
+			pname = sname = "Parallel Seq Scan";
+			break;
 		case T_IndexScan:
 			pname = sname = "Index Scan";
 			break;
@@ -1066,6 +1070,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_ParallelSeqScan:
 		case T_BitmapHeapScan:
 		case T_TidScan:
 		case T_SubqueryScan:
@@ -1207,6 +1212,24 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	}
 
 	/*
+	 * Aggregate instrumentation information of all the backend
+	 * workers for parallel sequence scan.
+	 */
+	if (es->analyze && nodeTag(plan) == T_ParallelSeqScan)
+	{
+		int i;
+		Instrumentation *instrument_worker;
+		int nworkers = ((ParallelSeqScanState *)planstate)->pcxt->nworkers;
+		char *inst_info_workers = ((ParallelSeqScanState *)planstate)->inst_options_space;
+
+		for (i = 0; i < nworkers; i++)
+		{
+			instrument_worker = (Instrumentation *)(inst_info_workers + (i * sizeof(Instrumentation)));
+			InstrAggNode(planstate->instrument, instrument_worker);
+		}
+	}
+
+	/*
 	 * We have to forcibly clean up the instrumentation state because we
 	 * haven't done ExecutorEnd yet.  This is pretty grotty ...
 	 *
@@ -1332,6 +1355,14 @@ ExplainNode(PlanState *planstate, List *ancestors,
 				show_instrumentation_count("Rows Removed by Filter", 1,
 										   planstate, es);
 			break;
+		case T_ParallelSeqScan:
+			show_scan_qual(plan->qual, "Filter", planstate, ancestors, es);
+			if (plan->qual)
+				show_instrumentation_count("Rows Removed by Filter", 1,
+										   planstate, es);
+			ExplainPropertyInteger("Number of Workers",
+				((ParallelSeqScan *) plan)->num_workers, es);
+			break;
 		case T_FunctionScan:
 			if (es->verbose)
 			{
@@ -2224,6 +2255,7 @@ ExplainTargetRel(Plan *plan, Index rti, ExplainState *es)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_ParallelSeqScan:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index af707b0..9a8ca75 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -21,7 +21,7 @@ OBJS = execAmi.o execCurrent.o execGrouping.o execJunk.o execMain.o \
        nodeLimit.o nodeLockRows.o \
        nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
        nodeNestloop.o nodeFunctionscan.o nodeRecursiveunion.o nodeResult.o \
-       nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
+       nodeSeqscan.o nodeParallelSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
        nodeValuesscan.o nodeCtescan.o nodeWorktablescan.o \
        nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
        nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o spi.o
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 9892499..f77a77f 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -100,6 +100,7 @@
 #include "executor/nodeMergejoin.h"
 #include "executor/nodeModifyTable.h"
 #include "executor/nodeNestloop.h"
+#include "executor/nodeParallelSeqscan.h"
 #include "executor/nodeRecursiveunion.h"
 #include "executor/nodeResult.h"
 #include "executor/nodeSeqscan.h"
@@ -190,6 +191,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												   estate, eflags);
 			break;
 
+		case T_ParallelSeqScan:
+			result = (PlanState *) ExecInitParallelSeqScan((ParallelSeqScan *) node,
+														   estate, eflags);
+			break;
+
 		case T_IndexScan:
 			result = (PlanState *) ExecInitIndexScan((IndexScan *) node,
 													 estate, eflags);
@@ -406,6 +412,10 @@ ExecProcNode(PlanState *node)
 			result = ExecSeqScan((SeqScanState *) node);
 			break;
 
+		case T_ParallelSeqScanState:
+			result = ExecParallelSeqScan((ParallelSeqScanState *) node);
+			break;
+
 		case T_IndexScanState:
 			result = ExecIndexScan((IndexScanState *) node);
 			break;
@@ -644,6 +654,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSeqScan((SeqScanState *) node);
 			break;
 
+		case T_ParallelSeqScanState:
+			ExecEndParallelSeqScan((ParallelSeqScanState *) node);
+			break;
+
 		case T_IndexScanState:
 			ExecEndIndexScan((IndexScanState *) node);
 			break;
diff --git a/src/backend/executor/execScan.c b/src/backend/executor/execScan.c
index 3f0d809..229302d 100644
--- a/src/backend/executor/execScan.c
+++ b/src/backend/executor/execScan.c
@@ -191,8 +191,17 @@ ExecScan(ScanState *node,
 		 * check for non-nil qual here to avoid a function call to ExecQual()
 		 * when the qual is nil ... saves only a few cycles, but they add up
 		 * ...
+		 *
+		 * check for non-heap tuples (can get such tuples from shared memory
+		 * message queues in case of parallel query), for such tuples no need
+		 * to perform qualification as for them the same is done by worker
+		 * backend.  This case will happen only for parallel query where we push
+		 * down the qualification.
+		 * XXX - We can do this optimization for projection as well, but for
+		 * now it is okay, as we don't allow parallel query if there are
+		 * expressions involved in target list.
 		 */
-		if (!qual || ExecQual(qual, econtext, false))
+		if (!slot->tts_fromheap || !qual || ExecQual(qual, econtext, false))
 		{
 			/*
 			 * Found a satisfactory scan tuple.
diff --git a/src/backend/executor/execTuples.c b/src/backend/executor/execTuples.c
index 753754d..4c5bd88 100644
--- a/src/backend/executor/execTuples.c
+++ b/src/backend/executor/execTuples.c
@@ -123,6 +123,7 @@ MakeTupleTableSlot(void)
 	slot->tts_values = NULL;
 	slot->tts_isnull = NULL;
 	slot->tts_mintuple = NULL;
+	slot->tts_fromheap	= true;
 
 	return slot;
 }
@@ -473,6 +474,8 @@ ExecClearTuple(TupleTableSlot *slot)	/* slot in which to store tuple */
 	slot->tts_isempty = true;
 	slot->tts_nvalid = 0;
 
+	slot->tts_fromheap = true;
+
 	return slot;
 }
 
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index f5351eb..b7898a5 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -21,6 +21,8 @@ BufferUsage pgBufferUsage;
 
 static void BufferUsageAccumDiff(BufferUsage *dst,
 					 const BufferUsage *add, const BufferUsage *sub);
+static void
+BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
 
 
 /* Allocate new instrumentation structure(s) */
@@ -127,6 +129,28 @@ InstrEndLoop(Instrumentation *instr)
 	instr->tuplecount = 0;
 }
 
+/*
+ * Aggregate the instrumentation information.  This is used
+ * to aggregate the information of worker backends.  We only
+ * need to sum the buffer usage and tuple count statistics as
+ * for other timing related statistics it is sufficient to
+ * have the master backend's information.
+ */
+void
+InstrAggNode(Instrumentation *instr1, Instrumentation *instr2)
+{
+	/* count the returned tuples */
+	instr1->tuplecount += instr2->tuplecount;
+
+	instr1->nfiltered1 += instr2->nfiltered1;
+	instr1->nfiltered2 += instr2->nfiltered2;
+
+	/* Add the other backend's buffer usage to this node's totals */
+	if (instr1->need_bufusage)
+		BufferUsageAdd(&instr1->bufusage, &instr2->bufusage);
+
+}
+
 /* dst += add - sub */
 static void
 BufferUsageAccumDiff(BufferUsage *dst,
@@ -148,3 +172,21 @@ BufferUsageAccumDiff(BufferUsage *dst,
 	INSTR_TIME_ACCUM_DIFF(dst->blk_write_time,
 						  add->blk_write_time, sub->blk_write_time);
 }
+
+/* dst += add */
+static void
+BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
+{
+	dst->shared_blks_hit += add->shared_blks_hit;
+	dst->shared_blks_read += add->shared_blks_read;
+	dst->shared_blks_dirtied += add->shared_blks_dirtied;
+	dst->shared_blks_written += add->shared_blks_written;
+	dst->local_blks_hit += add->local_blks_hit;
+	dst->local_blks_read += add->local_blks_read;
+	dst->local_blks_dirtied += add->local_blks_dirtied;
+	dst->local_blks_written += add->local_blks_written;
+	dst->temp_blks_read += add->temp_blks_read;
+	dst->temp_blks_written += add->temp_blks_written;
+	INSTR_TIME_ADD(dst->blk_read_time, add->blk_read_time);
+	INSTR_TIME_ADD(dst->blk_write_time, add->blk_write_time);
+}
\ No newline at end of file
diff --git a/src/backend/executor/nodeParallelSeqscan.c b/src/backend/executor/nodeParallelSeqscan.c
new file mode 100644
index 0000000..397a47d
--- /dev/null
+++ b/src/backend/executor/nodeParallelSeqscan.c
@@ -0,0 +1,364 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeParallelSeqscan.c
+ *	  Support routines for parallel sequential scans of relations.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeParallelSeqscan.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * INTERFACE ROUTINES
+ *		ExecParallelSeqScan				scans a relation.
+ *		ParallelSeqNext					retrieve next tuple from either heap or shared memory segment.
+ *		ExecInitParallelSeqScan			creates and initializes a parallel seqscan node.
+ *		ExecEndParallelSeqScan			releases any storage allocated.
+ */
+#include "postgres.h"
+
+#include "access/relscan.h"
+#include "access/shmmqam.h"
+#include "access/xact.h"
+#include "commands/dbcommands.h"
+#include "executor/execdebug.h"
+#include "executor/nodeSeqscan.h"
+#include "executor/nodeParallelSeqscan.h"
+#include "postmaster/backendworker.h"
+#include "utils/rel.h"
+
+
+
+/* ----------------------------------------------------------------
+ *						Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ParallelSeqNext
+ *
+ *		This is a workhorse for ExecParallelSeqScan
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ParallelSeqNext(ParallelSeqScanState *node)
+{
+	HeapTuple	tuple;
+	HeapScanDesc scandesc;
+	EState	   *estate;
+	ScanDirection direction;
+	TupleTableSlot *slot;
+	bool			fromheap = true;
+
+	/*
+	 * get information from the estate and scan state
+	 */
+	scandesc = node->ss.ss_currentScanDesc;
+	estate = node->ss.ps.state;
+	direction = estate->es_direction;
+	slot = node->ss.ss_ScanTupleSlot;
+
+	if (((ParallelSeqScan *) node->ss.ps.plan)->shm_toc_key != 0)
+	{
+		/*
+		 * get the next tuple from the table
+		 */
+		tuple = heap_getnext(scandesc, direction);
+	}
+	else
+	{
+		/*
+		 * get the next tuple from the heap or from the workers' shared memory queues.
+		 */
+		tuple = shm_getnext(scandesc, node->pss_currentShmScanDesc,
+							node->pss_workerResult,
+							node->responseq,
+							node->ss.ps.ps_ResultTupleSlot->tts_tupleDescriptor,
+							direction, &fromheap);
+	}
+
+	slot->tts_fromheap = fromheap;
+
+	/*
+	 * save the tuple and the buffer returned to us by the access methods in
+	 * our scan tuple slot and return the slot.  Note: we pass '!fromheap'
+	 * as shouldFree because tuples returned by shm_getnext() are either
+	 * created with palloc() (and so must be pfree()'d) or are pointers onto
+	 * disk pages (and so must not be).  Note also that ExecStoreTuple will
+	 * increment the refcount of the buffer; the refcount will not be dropped
+	 * until the tuple table slot is cleared.
+	 */
+	if (tuple)
+		ExecStoreTuple(tuple,	/* tuple to store */
+					   slot,	/* slot to store in */
+					   fromheap ? scandesc->rs_cbuf : InvalidBuffer, /* buffer associated with this
+																	  * tuple */
+					   !fromheap);	/* pfree this pointer if not from heap */
+	else
+		ExecClearTuple(slot);
+
+	return slot;
+}
+
+/*
+ * ParallelSeqRecheck -- access method routine to recheck a tuple in EvalPlanQual
+ */
+static bool
+ParallelSeqRecheck(ParallelSeqScanState *node, TupleTableSlot *slot)
+{
+	/*
+	 * Note that unlike IndexScan, ParallelSeqScan never uses keys in
+	 * shm_beginscan/heap_beginscan (and this is very bad) - so, here
+	 * we do not check whether the keys are ok or not.
+	 */
+	return true;
+}
+
+/* ----------------------------------------------------------------
+ *		InitParallelScanRelation
+ *
+ *		Set up to access the scan relation.
+ * ----------------------------------------------------------------
+ */
+static void
+InitParallelScanRelation(ParallelSeqScanState *node, EState *estate, int eflags)
+{
+	Relation	currentRelation;
+	HeapScanDesc currentScanDesc;
+	ParallelHeapScanDesc pscan = NULL;
+
+	/*
+	 * get the relation object id from the relid'th entry in the range table,
+	 * open that relation and acquire appropriate lock on it.
+	 */
+	currentRelation = ExecOpenScanRelation(estate,
+										   ((SeqScan *) node->ss.ps.plan)->scanrelid,
+										   eflags);
+
+	/*
+	 * For an Explain statement, we don't want to initialize workers as
+	 * those are mainly needed to execute the plan; however, the scan descriptor
+	 * still needs to be initialized for the purpose of InitNode functionality
+	 * (as EndNode functionality assumes that the scan descriptor and scan
+	 * relation must be initialized; probably we can change that, but that will
+	 * make the code in ExecEndParallelSeqScan look different from other nodes'
+	 * end functionality).
+	 *
+	 * XXX - If we want ExecutorStart to initialize workers as well, then we
+	 * need to have a provision for waiting till all the workers get started,
+	 * otherwise while doing endscan, it will try to wait for termination of
+	 * workers which are not even started (and never will be).
+	 */
+	if (eflags & EXEC_FLAG_EXPLAIN_ONLY)
+	{
+		/* initialize a heapscan */
+		currentScanDesc = heap_beginscan(currentRelation,
+										 estate->es_snapshot,
+										 0,
+										 NULL);
+	}
+	else
+	{
+		/*
+		 * Parallel scan descriptor is initialized and stored in dynamic shared
+		 * memory segment by master backend and parallel workers retrieve it
+		 * from shared memory.
+		 */
+		if (((ParallelSeqScan *) node->ss.ps.plan)->shm_toc_key != 0)
+		{
+			Assert(!pscan);
+
+			pscan = shm_toc_lookup(((ParallelSeqScan *) node->ss.ps.plan)->toc,
+								   ((ParallelSeqScan *) node->ss.ps.plan)->shm_toc_key);
+		}
+		else
+		{
+			/* Initialize the workers required to perform parallel scan. */
+			InitializeParallelWorkers(((SeqScan *) node->ss.ps.plan)->scanrelid,
+									  node->ss.ps.plan->targetlist,
+									  node->ss.ps.plan->qual,
+									  estate,
+									  currentRelation,
+									  &node->inst_options_space,
+									  &node->responseq,
+									  &node->pcxt,
+									  &pscan,
+									  ((ParallelSeqScan *)(node->ss.ps.plan))->num_workers);
+		}
+
+		currentScanDesc = heap_beginscan_parallel(currentRelation, pscan);
+	}
+
+	node->ss.ss_currentRelation = currentRelation;
+	node->ss.ss_currentScanDesc = currentScanDesc;
+
+	/* and report the scan tuple slot's rowtype */
+	ExecAssignScanType(&node->ss, RelationGetDescr(currentRelation));
+}
+
+/* ----------------------------------------------------------------
+ *		InitShmScan
+ *
+ *		Set up to access the scan for shared memory segment.
+ * ----------------------------------------------------------------
+ */
+static void
+InitShmScan(ParallelSeqScanState *node)
+{
+	ShmScanDesc			 currentShmScanDesc;
+	worker_result		 workerResult;
+
+	/*
+	 * Shared memory scan needs to be initialized only for
+	 * master backend as worker backend scans only heap.
+	 */
+	if (((ParallelSeqScan *) node->ss.ps.plan)->shm_toc_key == 0)
+	{
+		/*
+		 * Use result tuple descriptor to fetch data from shared memory queues
+		 * as the worker backends would have put the data after projection.
+		 * Number of queues must be equal to number of worker backends.
+		 */
+		currentShmScanDesc = shm_beginscan(node->pcxt->nworkers);
+		workerResult = ExecInitWorkerResult(node->ss.ps.ps_ResultTupleSlot->tts_tupleDescriptor,
+											node->pcxt->nworkers);
+
+		node->pss_currentShmScanDesc = currentShmScanDesc;
+		node->pss_workerResult	= workerResult;
+	}
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitParallelSeqScan
+ * ----------------------------------------------------------------
+ */
+ParallelSeqScanState *
+ExecInitParallelSeqScan(ParallelSeqScan *node, EState *estate, int eflags)
+{
+	ParallelSeqScanState *parallelscanstate;
+
+	/*
+	 * Once upon a time it was possible to have an outerPlan of a SeqScan, but
+	 * not any more.
+	 */
+	Assert(outerPlan(node) == NULL);
+	Assert(innerPlan(node) == NULL);
+
+	/*
+	 * create state structure
+	 */
+	parallelscanstate = makeNode(ParallelSeqScanState);
+	parallelscanstate->ss.ps.plan = (Plan *) node;
+	parallelscanstate->ss.ps.state = estate;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * create expression context for node
+	 */
+	ExecAssignExprContext(estate, &parallelscanstate->ss.ps);
+
+	/*
+	 * initialize child expressions
+	 */
+	parallelscanstate->ss.ps.targetlist = (List *)
+		ExecInitExpr((Expr *) node->scan.plan.targetlist,
+					 (PlanState *) parallelscanstate);
+	parallelscanstate->ss.ps.qual = (List *)
+		ExecInitExpr((Expr *) node->scan.plan.qual,
+					 (PlanState *) parallelscanstate);
+
+	/*
+	 * tuple table initialization
+	 */
+	ExecInitResultTupleSlot(estate, &parallelscanstate->ss.ps);
+	ExecInitScanTupleSlot(estate, &parallelscanstate->ss);
+
+	InitParallelScanRelation(parallelscanstate, estate, eflags);
+
+	parallelscanstate->ss.ps.ps_TupFromTlist = false;
+
+	/*
+	 * Initialize result tuple type and projection info.
+	 */
+	ExecAssignResultTypeFromTL(&parallelscanstate->ss.ps);
+	ExecAssignScanProjectionInfo(&parallelscanstate->ss);
+
+	/*
+	 * For Explain, we don't initialize the parallel workers, so
+	 * accordingly don't need to initialize the shared memory scan.
+	 */
+	if (!(eflags & EXEC_FLAG_EXPLAIN_ONLY))
+		InitShmScan(parallelscanstate);
+
+	return parallelscanstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecParallelSeqScan(node)
+ *
+ *		Scans the relation via multiple workers and returns
+ *		the next qualifying tuple.
+ *		We call the ExecScan() routine and pass it the appropriate
+ *		access method functions.
+ * ----------------------------------------------------------------
+ */
+TupleTableSlot *
+ExecParallelSeqScan(ParallelSeqScanState *node)
+{
+	return ExecScan((ScanState *) &node->ss,
+					(ExecScanAccessMtd) ParallelSeqNext,
+					(ExecScanRecheckMtd) ParallelSeqRecheck);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndParallelSeqScan
+ *
+ *		frees any storage allocated through C routines.
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndParallelSeqScan(ParallelSeqScanState *node)
+{
+	Relation	relation;
+	HeapScanDesc scanDesc;
+
+	/*
+	 * get information from node
+	 */
+	relation = node->ss.ss_currentRelation;
+	scanDesc = node->ss.ss_currentScanDesc;
+
+	/*
+	 * Free the exprcontext
+	 */
+	ExecFreeExprContext(&node->ss.ps);
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+
+	/*
+	 * close heap scan
+	 */
+	heap_endscan(scanDesc);
+
+	/*
+	 * close the heap relation.
+	 */
+	ExecCloseScanRelation(relation);
+
+	if (node->pcxt)
+	{
+		/* destroy parallel context. */
+		DestroyParallelContext(node->pcxt);
+
+		ExitParallelMode();
+	}
+}
diff --git a/src/backend/libpq/pqmq.c b/src/backend/libpq/pqmq.c
index f12f2d5..cfab8b5 100644
--- a/src/backend/libpq/pqmq.c
+++ b/src/backend/libpq/pqmq.c
@@ -26,6 +26,8 @@ static bool pq_mq_busy = false;
 static pid_t pq_mq_parallel_master_pid = 0;
 static pid_t pq_mq_parallel_master_backend_id = InvalidBackendId;
 
+static shm_mq_handle *pq_mq_tuple_handle = NULL;
+
 static void mq_comm_reset(void);
 static int	mq_flush(void);
 static int	mq_flush_if_writable(void);
@@ -61,6 +63,26 @@ pq_redirect_to_shm_mq(shm_mq *mq, shm_mq_handle *mqh)
 }
 
 /*
+ * Arrange to send some frontend/backend protocol messages to a shared-memory
+ * tuple message queue.
+ */
+void
+pq_redirect_to_tuple_shm_mq(shm_mq_handle *mqh)
+{
+	pq_mq_tuple_handle = mqh;
+}
+
+/*
+ * Check if tuples can be sent through tuple shared-memory
+ * message queue.
+ */
+bool
+is_tuple_shm_mq_enabled(void)
+{
+	return pq_mq_tuple_handle != NULL;
+}
+
+/*
  * Arrange to SendProcSignal() to the parallel master each time we transmit
  * message data via the shm_mq.
  */
@@ -161,6 +183,42 @@ mq_putmessage(char msgtype, const char *s, size_t len)
 	return 0;
 }
 
+/*
+ * Transmit a libpq protocol message to the shared memory message queue
+ * via pq_mq_tuple_handle.  We don't include a length word, because the
+ * receiver will know the length of the message from shm_mq_receive().
+ */
+int
+mq_putmessage_direct(char msgtype, const char *s, size_t len)
+{
+	shm_mq_iovec	iov[2];
+	shm_mq_result	result;
+
+	iov[0].data = &msgtype;
+	iov[0].len = 1;
+	iov[1].data = s;
+	iov[1].len = len;
+
+	Assert(pq_mq_tuple_handle != NULL);
+
+	for (;;)
+	{
+		result = shm_mq_sendv(pq_mq_tuple_handle, iov, 2, true);
+
+		if (result != SHM_MQ_WOULD_BLOCK)
+			break;
+
+		WaitLatch(&MyProc->procLatch, WL_LATCH_SET, 0);
+		CHECK_FOR_INTERRUPTS();
+		ResetLatch(&MyProc->procLatch);
+	}
+
+	Assert(result == SHM_MQ_SUCCESS || result == SHM_MQ_DETACHED);
+	if (result != SHM_MQ_SUCCESS)
+		return EOF;
+	return 0;
+}
+
 static void
 mq_putmessage_noblock(char msgtype, const char *s, size_t len)
 {
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index f1a24f5..5846f22 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -352,6 +352,27 @@ _copySeqScan(const SeqScan *from)
 }
 
 /*
+ * _copyParallelSeqScan
+ */
+static ParallelSeqScan *
+_copyParallelSeqScan(const ParallelSeqScan *from)
+{
+	ParallelSeqScan    *newnode = makeNode(ParallelSeqScan);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopyScanFields((const Scan *) from, (Scan *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(num_workers);
+
+	return newnode;
+}
+
+/*
  * _copyIndexScan
  */
 static IndexScan *
@@ -4039,6 +4060,9 @@ copyObject(const void *from)
 		case T_SeqScan:
 			retval = _copySeqScan(from);
 			break;
+		case T_ParallelSeqScan:
+			retval = _copyParallelSeqScan(from);
+			break;
 		case T_IndexScan:
 			retval = _copyIndexScan(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index dd1278b..35c2e1e 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -437,6 +437,16 @@ _outSeqScan(StringInfo str, const SeqScan *node)
 }
 
 static void
+_outParallelSeqScan(StringInfo str, const ParallelSeqScan *node)
+{
+	WRITE_NODE_TYPE("PARALLELSEQSCAN");
+
+	_outScanInfo(str, (const Scan *) node);
+
+	WRITE_UINT_FIELD(num_workers);
+}
+
+static void
 _outIndexScan(StringInfo str, const IndexScan *node)
 {
 	WRITE_NODE_TYPE("INDEXSCAN");
@@ -2851,6 +2861,9 @@ _outNode(StringInfo str, const void *obj)
 			case T_SeqScan:
 				_outSeqScan(str, obj);
 				break;
+			case T_ParallelSeqScan:
+				_outParallelSeqScan(str, obj);
+				break;
 			case T_IndexScan:
 				_outIndexScan(str, obj);
 				break;
diff --git a/src/backend/nodes/params.c b/src/backend/nodes/params.c
index 2f2f5ed..7ecaa7f 100644
--- a/src/backend/nodes/params.c
+++ b/src/backend/nodes/params.c
@@ -16,9 +16,22 @@
 #include "postgres.h"
 
 #include "nodes/params.h"
+#include "storage/shmem.h"
 #include "utils/datum.h"
 #include "utils/lsyscache.h"
 
+/*
+ * for each bind parameter, pass this structure followed by value
+ * except for pass-by-value parameters.
+ */
+typedef struct SerializedParamExternData
+{
+	Datum		value;			/* pass-by-value datums are stored directly */
+	Size		length;			/* length of parameter value */
+	bool		isnull;			/* is it NULL? */
+	uint16		pflags;			/* flag bits, see nodes/params.h */
+	Oid			ptype;			/* parameter's datatype, or 0 */
+} SerializedParamExternData;
 
 /*
  * Copy a ParamListInfo structure.
@@ -74,3 +87,187 @@ copyParamList(ParamListInfo from)
 
 	return retval;
 }
+
+/*
+ * Estimate the amount of space required to serialize the bound
+ * parameters.
+ */
+Size
+EstimateBoundParametersSpace(ParamListInfo paramInfo)
+{
+	Size		size;
+	int			i;
+
+	/* Add space required for saving numParams */
+	size = sizeof(int);
+
+	if (paramInfo)
+	{
+		/* Add space required for saving the param data */
+		for (i = 0; i < paramInfo->numParams; i++)
+		{
+			/*
+			 * for each parameter, calculate the size of fixed part
+			 * of parameter (SerializedParamExternData) and length of
+			 * parameter value.
+			 */
+			ParamExternData *oprm;
+			int16		typLen;
+			bool		typByVal;
+			Size		length;
+
+			length = sizeof(SerializedParamExternData);
+
+			oprm = &paramInfo->params[i];
+
+			get_typlenbyval(oprm->ptype, &typLen, &typByVal);
+
+			/*
+			 * pass-by-value parameters are directly stored in
+			 * SerializedParamExternData, so no need of additional
+			 * space for them.
+			 */
+			if (!(typByVal || oprm->isnull))
+			{
+				length += datumGetSize(oprm->value, typByVal, typLen);
+				size = add_size(size, length);
+
+				/* Allow space for terminating zero-byte */
+				size = add_size(size, 1);
+			}
+			else
+				size = add_size(size, length);
+		}
+	}
+
+	return size;
+}
+
+/*
+ * Serialize the bind parameters into the memory, beginning at start_address.
+ * maxsize should be at least as large as the value returned by
+ * EstimateBoundParametersSpace.
+ */
+void
+SerializeBoundParams(ParamListInfo paramInfo, Size maxsize, char *start_address)
+{
+	char	   *curptr;
+	SerializedParamExternData *retval;
+	int i;
+
+	/*
+	 * First, we store the number of bind parameters, if there is
+	 * no bind parameter then no need to store any more information.
+	 */
+	if (paramInfo && paramInfo->numParams > 0)
+		* (int *) start_address = paramInfo->numParams;
+	else
+	{
+		* (int *) start_address = 0;
+		return;
+	}
+	curptr = start_address + sizeof(int);
+
+
+	for (i = 0; i < paramInfo->numParams; i++)
+	{
+		ParamExternData *oprm;
+		int16		typLen;
+		bool		typByVal;
+		Size		datumlength, length;
+		const char	*s;
+
+		Assert(curptr <= start_address + maxsize);
+		retval = (SerializedParamExternData *) curptr;
+		oprm = &paramInfo->params[i];
+
+		retval->isnull = oprm->isnull;
+		retval->pflags = oprm->pflags;
+		retval->ptype = oprm->ptype;
+		retval->value = oprm->value;
+
+		curptr = curptr + sizeof(SerializedParamExternData);
+
+		if (retval->isnull)
+			continue;
+
+		get_typlenbyval(oprm->ptype, &typLen, &typByVal);
+
+		if (!typByVal)
+		{
+			datumlength = datumGetSize(oprm->value, typByVal, typLen);
+			s = (char *) DatumGetPointer(oprm->value);
+			memcpy(curptr, s, datumlength);
+			length = datumlength;
+			curptr[length] = '\0';
+			retval->length = length;
+			curptr += length + 1;
+		}
+	}
+}
+
+/*
+ * RestoreBoundParams
+ *		Restore bind parameters from the specified address.
+ *
+ * The params are palloc'd in CurrentMemoryContext.
+ */
+ParamListInfo
+RestoreBoundParams(char *start_address)
+{
+	ParamListInfo retval;
+	Size		size;
+	int			num_params,i;
+	char	   *curptr;
+
+	num_params = * (int *) start_address;
+
+	if (num_params <= 0)
+		return NULL;
+
+	/* sizeof(ParamListInfoData) includes the first array element */
+	size = sizeof(ParamListInfoData) +
+		(num_params - 1) * sizeof(ParamExternData);
+	retval = (ParamListInfo) palloc(size);
+	retval->paramFetch = NULL;
+	retval->paramFetchArg = NULL;
+	retval->parserSetup = NULL;
+	retval->parserSetupArg = NULL;
+	retval->numParams = num_params;
+
+	curptr = start_address + sizeof(int);
+
+	for (i = 0; i < num_params; i++)
+	{
+		SerializedParamExternData *nprm;
+		char	*s;
+		int16		typLen;
+		bool		typByVal;
+
+		nprm = (SerializedParamExternData *) curptr;
+
+		/* copy the parameter info */
+		retval->params[i].isnull = nprm->isnull;
+		retval->params[i].pflags = nprm->pflags;
+		retval->params[i].ptype = nprm->ptype;
+		retval->params[i].value = nprm->value;
+
+		curptr = curptr + sizeof(SerializedParamExternData);
+
+		if (nprm->isnull)
+			continue;
+
+		get_typlenbyval(nprm->ptype, &typLen, &typByVal);
+
+		if (!typByVal)
+		{
+			s = palloc(nprm->length + 1);
+			memcpy(s, curptr, nprm->length + 1);
+			retval->params[i].value = PointerGetDatum(s);
+
+			curptr += nprm->length + 1;
+		}
+	}
+
+	return retval;
+}
diff --git a/src/backend/optimizer/path/Makefile b/src/backend/optimizer/path/Makefile
index 6864a62..6e462b1 100644
--- a/src/backend/optimizer/path/Makefile
+++ b/src/backend/optimizer/path/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = allpaths.o clausesel.o costsize.o equivclass.o indxpath.o \
-       joinpath.o joinrels.o pathkeys.o tidpath.o
+       joinpath.o joinrels.o pathkeys.o parallelpath.o tidpath.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 58d78e6..528727c 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -410,6 +410,9 @@ set_plain_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
 	/* Consider sequential scan */
 	add_path(rel, create_seqscan_path(root, rel, required_outer));
 
+	/* Consider parallel scans */
+	create_parallelscan_paths(root, rel);
+
 	/* Consider index scans */
 	create_index_paths(root, rel);
 
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 020558b..4abfd25 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -11,6 +11,9 @@
  *	cpu_tuple_cost		Cost of typical CPU time to process a tuple
  *	cpu_index_tuple_cost  Cost of typical CPU time to process an index tuple
  *	cpu_operator_cost	Cost of CPU time to execute an operator or function
+ *	cpu_tuple_comm_cost	Cost of CPU time to pass a tuple from worker to master backend
+ *	parallel_setup_cost	Cost of setting up shared memory for parallelism
+ *	parallel_startup_cost  Cost of starting up parallel workers
  *
  * We expect that the kernel will typically do some amount of read-ahead
  * optimization; this in conjunction with seek costs means that seq_page_cost
@@ -101,11 +104,16 @@ double		random_page_cost = DEFAULT_RANDOM_PAGE_COST;
 double		cpu_tuple_cost = DEFAULT_CPU_TUPLE_COST;
 double		cpu_index_tuple_cost = DEFAULT_CPU_INDEX_TUPLE_COST;
 double		cpu_operator_cost = DEFAULT_CPU_OPERATOR_COST;
+double		cpu_tuple_comm_cost = DEFAULT_CPU_TUPLE_COMM_COST;
+double		parallel_setup_cost = DEFAULT_PARALLEL_SETUP_COST;
+double		parallel_startup_cost = DEFAULT_PARALLEL_STARTUP_COST;
 
 int			effective_cache_size = DEFAULT_EFFECTIVE_CACHE_SIZE;
 
 Cost		disable_cost = 1.0e10;
 
+int	parallel_seqscan_degree = 0;
+
 bool		enable_seqscan = true;
 bool		enable_indexscan = true;
 bool		enable_indexonlyscan = true;
@@ -219,6 +227,73 @@ cost_seqscan(Path *path, PlannerInfo *root,
 }
 
 /*
+ * cost_parallelseqscan
+ *	  Determines and returns the cost of scanning a relation in parallel.
+ *
+ * 'baserel' is the relation to be scanned
+ * 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
+ */
+void
+cost_parallelseqscan(ParallelSeqPath *path, PlannerInfo *root,
+			 RelOptInfo *baserel, ParamPathInfo *param_info, int nWorkers)
+{
+	Cost		startup_cost = 0;
+	Cost		run_cost = 0;
+	double		spc_seq_page_cost;
+	QualCost	qpqual_cost;
+	Cost		cpu_per_tuple;
+
+	/* Should only be applied to base relations */
+	Assert(baserel->relid > 0);
+	Assert(baserel->rtekind == RTE_RELATION);
+
+	/* Mark the path with the correct row estimate */
+	if (param_info)
+		path->path.rows = param_info->ppi_rows;
+	else
+		path->path.rows = baserel->rows;
+
+	if (!enable_seqscan)
+		startup_cost += disable_cost;
+
+	/* fetch estimated page cost for tablespace containing table */
+	get_tablespace_page_costs(baserel->reltablespace,
+							  NULL,
+							  &spc_seq_page_cost);
+
+	/*
+	 * disk costs
+	 */
+	run_cost += spc_seq_page_cost * baserel->pages;
+
+	/* CPU costs */
+	get_restriction_qual_cost(root, baserel, param_info, &qpqual_cost);
+
+	startup_cost += qpqual_cost.startup;
+	cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple;
+	run_cost += cpu_per_tuple * baserel->tuples;
+
+	/*
+	 * Runtime cost will be equally shared by all workers.
+	 * The assumption here is that disk access cost will also be
+	 * equally shared between workers, which is generally true
+	 * unless there are too many workers working on a relatively
+	 * small number of blocks.  If we come across any such case,
+	 * then we can think of changing the current cost model for
+	 * parallel sequential scan.
+	 */
+	run_cost = run_cost / (nWorkers + 1);
+
+	/* Parallel setup and communication cost. */
+	startup_cost += parallel_setup_cost;
+	startup_cost += parallel_startup_cost * nWorkers;
+	run_cost += cpu_tuple_comm_cost * baserel->tuples;
+
+	path->path.startup_cost = startup_cost;
+	path->path.total_cost = (startup_cost + run_cost);
+}
+
+/*
  * cost_index
  *	  Determines and returns the cost of scanning a relation using an index.
  *
diff --git a/src/backend/optimizer/path/parallelpath.c b/src/backend/optimizer/path/parallelpath.c
new file mode 100644
index 0000000..fda6f40
--- /dev/null
+++ b/src/backend/optimizer/path/parallelpath.c
@@ -0,0 +1,148 @@
+/*-------------------------------------------------------------------------
+ *
+ * parallelpath.c
+ *	  Routines to determine which conditions are usable for scanning
+ *	  a given relation, and create ParallelPaths accordingly.
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/optimizer/path/parallelpath.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heapam.h"
+#include "nodes/relation.h"
+#include "optimizer/cost.h"
+#include "optimizer/pathnode.h"
+#include "optimizer/paths.h"
+#include "optimizer/restrictinfo.h"
+#include "optimizer/clauses.h"
+#include "parser/parsetree.h"
+#include "utils/rel.h"
+
+
+/*
+ *	IsTargetListContainNonVars -
+ *		Check if target list contains non-var entries.
+ */
+static bool
+IsTargetListContainNonVars(List *targetlist)
+{
+	ListCell   *l;
+
+	foreach(l, targetlist)
+	{
+		TargetEntry *te = (TargetEntry *) lfirst(l);
+
+		if (!IsA(te, TargetEntry))
+			continue;			/* probably should never happen */
+		if (!IsA(te->expr, Var))
+			return true;
+	}
+	return false;
+}
+
+/*
+ *	check_simple_qual -
+ *		Check if qual is made only of simple things we can
+ *		hand out directly to backend worker for execution.
+ *
+ *		XXX - Currently we don't allow pushing down an expression
+ *		if it contains a volatile function; however, eventually we
+ *		need a mechanism (proisparallel) with which we can distinguish
+ *		the functions that can be pushed down for execution by a parallel
+ *		worker.
+ */
+static bool
+check_simple_qual(Node *node)
+{
+	if (node == NULL)
+		return true;
+
+	if (contain_volatile_functions(node))
+		return false;
+
+	return true;
+}
+
+/*
+ * create_parallelscan_paths
+ *	  Create paths corresponding to parallel scans of the given rel.
+ *	  Currently we only support parallel sequential scan.
+ *
+ *	  Candidate paths are added to the rel's pathlist (using add_path).
+ */
+void
+create_parallelscan_paths(PlannerInfo *root, RelOptInfo *rel)
+{
+	int num_parallel_workers = 0;
+	Oid			reloid;
+	Relation	relation;
+
+	/*
+	 * parallel scan is possible only if user has set
+	 * parallel_seqscan_degree to value greater than 0.
+	 */
+	if (parallel_seqscan_degree <= 0)
+		return;
+
+	/*
+	 * parallel scan is not supported for joins.
+	 */
+	if (root->simple_rel_array_size > 2)
+		return;
+
+	/* parallel scan is supported only for Select statements. */
+	if (root->parse->commandType != CMD_SELECT)
+		return;
+
+	reloid = planner_rt_fetch(rel->relid, root)->relid;
+
+	relation = heap_open(reloid, NoLock);
+
+	/*
+	 * Temporary relations can't be scanned by parallel workers as
+	 * they are visible only to local sessions.
+	 */
+	if (RelationUsesLocalBuffers(relation))
+	{
+		heap_close(relation, NoLock);
+		return;
+	}
+
+	heap_close(relation, NoLock);
+
+	/*
+	 * parallel scan is not supported for non-var target list.
+	 *
+	 * XXX - This is to keep the implementation simple; we can do this
+	 * in future.  Here we are checking by passing root->parse->targetList
+	 * instead of rel->reltargetlist because rel->reltargetlist always contains
+	 * Vars (refer build_base_rel_tlists).
+	 */
+	if (IsTargetListContainNonVars(root->parse->targetList))
+		return;
+
+	/*
+	 * parallel scan is not supported for quals containing volatile functions
+	 */
+	if (!check_simple_qual((Node*) extract_actual_clauses(rel->baserestrictinfo, false)))
+		return;
+
+	/*
+	 * There should be at least one page to scan for each worker.
+	 */
+	if (parallel_seqscan_degree <= rel->pages)
+		num_parallel_workers = parallel_seqscan_degree;
+	else
+		num_parallel_workers = rel->pages;
+
+	add_path(rel, (Path *) create_parallelseqscan_path(root, rel,
+													   num_parallel_workers));
+}
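
A note for readers of this hunk: the worker-count rule at the end of create_parallelscan_paths can be read as a clamp of the configured degree to the relation's page count. Below is a minimal standalone sketch of that rule. The helper name clamp_parallel_workers is hypothetical (it is not part of the patch), and plain C types stand in for PostgreSQL's BlockNumber.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not in the patch): choose a worker count such that
 * every worker has at least one page to scan.
 */
static int
clamp_parallel_workers(int degree, uint32_t pages)
{
	if (degree <= 0)
		return 0;				/* parallel scan disabled */
	if ((uint32_t) degree <= pages)
		return degree;			/* enough pages for every worker */
	return (int) pages;			/* fewer pages than workers */
}
```

With an empty relation (zero pages) this yields zero workers. As far as this hunk shows, the patch would still add a parallel path in that case, so a caller might want to check for a zero result before calling add_path.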
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 655be81..a8a626e 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -58,6 +58,9 @@ static Material *create_material_plan(PlannerInfo *root, MaterialPath *best_path
 static Plan *create_unique_plan(PlannerInfo *root, UniquePath *best_path);
 static SeqScan *create_seqscan_plan(PlannerInfo *root, Path *best_path,
 					List *tlist, List *scan_clauses);
+static Scan *create_parallelseqscan_plan(PlannerInfo *root,
+										 ParallelSeqPath *best_path,
+										 List *tlist, List *scan_clauses);
 static Scan *create_indexscan_plan(PlannerInfo *root, IndexPath *best_path,
 					  List *tlist, List *scan_clauses, bool indexonly);
 static BitmapHeapScan *create_bitmap_scan_plan(PlannerInfo *root,
@@ -100,6 +103,9 @@ static List *order_qual_clauses(PlannerInfo *root, List *clauses);
 static void copy_path_costsize(Plan *dest, Path *src);
 static void copy_plan_costsize(Plan *dest, Plan *src);
 static SeqScan *make_seqscan(List *qptlist, List *qpqual, Index scanrelid);
+static ParallelSeqScan *make_parallelseqscan(List *qptlist, List *qpqual,
+											 Index scanrelid, int nworkers,
+											 shm_toc *toc, uint64 shm_toc_key);
 static IndexScan *make_indexscan(List *qptlist, List *qpqual, Index scanrelid,
 			   Oid indexid, List *indexqual, List *indexqualorig,
 			   List *indexorderby, List *indexorderbyorig,
@@ -228,6 +234,7 @@ create_plan_recurse(PlannerInfo *root, Path *best_path)
 	switch (best_path->pathtype)
 	{
 		case T_SeqScan:
+		case T_ParallelSeqScan:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -343,6 +350,13 @@ create_scan_plan(PlannerInfo *root, Path *best_path)
 												scan_clauses);
 			break;
 
+		case T_ParallelSeqScan:
+			plan = (Plan *) create_parallelseqscan_plan(root,
+														(ParallelSeqPath *) best_path,
+														tlist,
+														scan_clauses);
+			break;
+
 		case T_IndexScan:
 			plan = (Plan *) create_indexscan_plan(root,
 												  (IndexPath *) best_path,
@@ -546,6 +560,7 @@ disuse_physical_tlist(PlannerInfo *root, Plan *plan, Path *path)
 	switch (path->pathtype)
 	{
 		case T_SeqScan:
+		case T_ParallelSeqScan:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -1133,6 +1148,70 @@ create_seqscan_plan(PlannerInfo *root, Path *best_path,
 }
 
 /*
+ * create_worker_seqscan_plan
+ *
+ * Returns a parallel seqscan plan for the base relation scanned
+ * by worker with restriction clauses 'qual' and targetlist 'tlist'.
+ */
+Scan *
+create_worker_seqscan_plan(ParallelScanStmt *parallelscan)
+{
+	Scan	   *scan_plan;
+
+	scan_plan = (Scan*) make_parallelseqscan(parallelscan->targetList,
+											 parallelscan->qual,
+											 parallelscan->scanrelId,
+											 0,
+											 parallelscan->toc,
+											 parallelscan->shm_toc_scan_key);
+
+	return scan_plan;
+}
+
+/*
+ * create_parallelseqscan_plan
+ *
+ * Returns a parallel seqscan plan for the base relation scanned by
+ * 'best_path' with restriction clauses 'scan_clauses' and targetlist
+ * 'tlist'.
+ */
+static Scan *
+create_parallelseqscan_plan(PlannerInfo *root, ParallelSeqPath *best_path,
+							List *tlist, List *scan_clauses)
+{
+	Scan    *scan_plan;
+	Index		scan_relid = best_path->path.parent->relid;
+
+	/* it should be a base rel... */
+	Assert(scan_relid > 0);
+	Assert(best_path->path.parent->rtekind == RTE_RELATION);
+
+	/* Sort clauses into best execution order */
+	scan_clauses = order_qual_clauses(root, scan_clauses);
+
+	/* Reduce RestrictInfo list to bare expressions; ignore pseudoconstants */
+	scan_clauses = extract_actual_clauses(scan_clauses, false);
+
+	/* Replace any outer-relation variables with nestloop params */
+	if (best_path->path.param_info)
+	{
+		scan_clauses = (List *)
+			replace_nestloop_params(root, (Node *) scan_clauses);
+	}
+
+	scan_plan = (Scan *) make_parallelseqscan(tlist,
+											  scan_clauses,
+											  scan_relid,
+											  best_path->num_workers,
+											  NULL,
+											  0);
+
+	copy_path_costsize(&scan_plan->plan, &best_path->path);
+
+	return scan_plan;
+}
+
+/*
  * create_indexscan_plan
  *	  Returns an indexscan plan for the base relation scanned by 'best_path'
  *	  with restriction clauses 'scan_clauses' and targetlist 'tlist'.
@@ -3318,6 +3397,30 @@ make_seqscan(List *qptlist,
 	return node;
 }
 
+static ParallelSeqScan *
+make_parallelseqscan(List *qptlist,
+			   List *qpqual,
+			   Index scanrelid,
+			   int nworkers,
+			   shm_toc *toc,
+			   uint64 shm_toc_key)
+{
+	ParallelSeqScan *node = makeNode(ParallelSeqScan);
+	Plan	   *plan = &node->scan.plan;
+
+	/* cost should be inserted by caller */
+	plan->targetlist = qptlist;
+	plan->qual = qpqual;
+	plan->lefttree = NULL;
+	plan->righttree = NULL;
+	node->scan.scanrelid = scanrelid;
+	node->num_workers = nworkers;
+	node->toc = toc;
+	node->shm_toc_key = shm_toc_key;
+
+	return node;
+}
+
 static IndexScan *
 make_indexscan(List *qptlist,
 			   List *qpqual,
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 9cbbcfb..f9be27f 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -260,6 +260,74 @@ standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
 	return result;
 }
 
+/*
+ * create_worker_seqscan_plannedstmt
+ *
+ *	Returns a planned statement to be used by a worker for execution.
+ *	Instead of the master backend forming and passing the planned
+ *	statement to each worker, it passes just the required information
+ *	and the PlannedStmt is then constructed by the worker.  The reason
+ *	for doing so is that the master backend's plan doesn't contain the
+ *	subplans for each worker.  In future, if there is a need for it, we
+ *	might want to change the implementation so that the master backend
+ *	passes the planned statement directly.
+ */
+PlannedStmt	*
+create_worker_seqscan_plannedstmt(ParallelScanStmt *parallelscan)
+{
+	Plan    *scan_plan;
+	PlannedStmt	*result;
+	ListCell   *tlist;
+	Oid			reloid;
+
+	/* get the relid to save the same as part of planned statement. */
+	reloid = getrelid(parallelscan->scanrelId, parallelscan->rangetableList);
+
+	/* Fill in opfuncid values if missing */
+	fix_opfuncids((Node*) parallelscan->qual);
+	fix_opfuncids((Node*) parallelscan->targetList);
+
+	/*
+	 * Avoid removing junk entries in worker as those are
+	 * required by upper nodes in master backend.
+	 */
+	foreach(tlist, parallelscan->targetList)
+	{
+		TargetEntry *tle = (TargetEntry *) lfirst(tlist);
+
+		tle->resjunk = false;
+	}
+
+	scan_plan = (Plan*) create_worker_seqscan_plan(parallelscan);
+
+	/* build the PlannedStmt result */
+	result = makeNode(PlannedStmt);
+
+	result->commandType = CMD_SELECT;
+	result->queryId = 0;
+	result->hasReturning = false;
+	result->hasModifyingCTE = false;
+	result->canSetTag = true;
+	result->transientPlan = false;
+	result->planTree = (Plan*) scan_plan;
+	result->rtable = parallelscan->rangetableList;
+	result->resultRelations = NIL;
+	result->utilityStmt = NULL;
+	result->subplans = NIL;
+	result->rewindPlanIDs = NULL;
+	result->rowMarks = NIL;
+	result->relationOids = lappend_oid(result->relationOids, reloid);
+	result->invalItems = NIL;
+	result->nParamExec = 0;
+	/*
+	 * Don't bother to get hasRowSecurity passed from master
+	 * backend as this is used only for invalidation and in
+	 * worker backend plans are not saved, so can't be invalidated.
+	 */
+	result->hasRowSecurity = false;
+
+	return result;
+}
 
 /*--------------------
  * subquery_planner
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 7703946..3a44aef 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -436,6 +436,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_ParallelSeqScan:
 			{
 				SeqScan    *splan = (SeqScan *) plan;
 
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 78fb6b1..c35f934 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2163,6 +2163,7 @@ finalize_plan(PlannerInfo *root, Plan *plan, Bitmapset *valid_params,
 			break;
 
 		case T_SeqScan:
+		case T_ParallelSeqScan:
 			context.paramids = bms_add_members(context.paramids, scan_params);
 			break;
 
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 1395a21..ea3b865 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -706,6 +706,30 @@ create_seqscan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer)
 }
 
 /*
+ * create_parallelseqscan_path
+ *
+ *	  Creates a path corresponding to a parallel sequential scan, returning the
+ *	  pathnode.
+ */
+ParallelSeqPath *
+create_parallelseqscan_path(PlannerInfo *root, RelOptInfo *rel, int nWorkers)
+{
+	ParallelSeqPath	   *pathnode = makeNode(ParallelSeqPath);
+
+	pathnode->path.pathtype = T_ParallelSeqScan;
+	pathnode->path.parent = rel;
+	pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
+													 false);
+	pathnode->path.pathkeys = NIL;	/* seqscan has unordered result */
+
+	pathnode->num_workers = nWorkers;
+
+	cost_parallelseqscan(pathnode, root, rel, pathnode->path.param_info, nWorkers);
+
+	return pathnode;
+}
+
+/*
  * create_index_path
  *	  Creates a path node for an index scan.
  *
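
For context on the cost_parallelseqscan call above: its body is not part of this hunk, but the cover mail describes the model as dividing the CPU and disk cost by the number of worker backends. The following is a sketch under that assumption only. The function name and parameters are hypothetical, and worker startup and shared memory setup costs are deliberately left out, as the cover mail says they are not yet modeled.

```c
/*
 * Hypothetical illustration of the simple cost model from the cover mail:
 * the CPU and disk run cost is divided evenly by the worker count.
 */
static double
parallel_seqscan_run_cost(double cpu_run_cost, double disk_run_cost,
						  int nworkers)
{
	if (nworkers <= 0)
		return cpu_run_cost + disk_run_cost;	/* plain serial scan */
	return (cpu_run_cost + disk_run_cost) / nworkers;
}
```

Because nothing is charged for launching workers, a model like this will always prefer more workers; that is the reason the cover mail flags startup cost as something to add later.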
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 71c2321..f056bd5 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -12,7 +12,8 @@ subdir = src/backend/postmaster
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = autovacuum.o bgworker.o bgwriter.o checkpointer.o fork_process.o \
-	pgarch.o pgstat.o postmaster.o startup.o syslogger.o walwriter.o
+OBJS = autovacuum.o backendworker.o bgworker.o bgwriter.o checkpointer.o \
+	fork_process.o pgarch.o pgstat.o postmaster.o startup.o syslogger.o \
+	walwriter.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/backendworker.c b/src/backend/postmaster/backendworker.c
new file mode 100644
index 0000000..890a0d5
--- /dev/null
+++ b/src/backend/postmaster/backendworker.c
@@ -0,0 +1,562 @@
+/*-------------------------------------------------------------------------
+ *
+ * backendworker.c
+ *	  Support routines for setting up backend workers.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/postmaster/backendworker.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * INTERFACE ROUTINES
+ *		InitializeParallelWorkers				Setup dynamic shared memory and parallel backend workers.
+ */
+#include "postgres.h"
+
+#include "access/xact.h"
+#include "access/parallel.h"
+#include "commands/dbcommands.h"
+#include "commands/async.h"
+#include "executor/nodeParallelSeqscan.h"
+#include "miscadmin.h"
+#include "nodes/parsenodes.h"
+#include "postmaster/backendworker.h"
+#include "storage/ipc.h"
+#include "storage/procsignal.h"
+#include "storage/procarray.h"
+#include "storage/shm_toc.h"
+#include "storage/spin.h"
+#include "tcop/tcopprot.h"
+#include "utils/builtins.h"
+#include "utils/memutils.h"
+#include "utils/resowner.h"
+
+
+#define PARALLEL_TUPLE_QUEUE_SIZE					65536
+
+/* Table-of-contents constants for our dynamic shared memory segment. */
+#define PARALLEL_KEY_SCANRELID		0
+#define PARALLEL_KEY_TARGETLIST		1
+#define PARALLEL_KEY_QUAL			2
+#define PARALLEL_KEY_RANGETBL		3
+#define PARALLEL_KEY_PARAMS			4
+#define PARALLEL_KEY_INST_OPTIONS	5
+#define PARALLEL_KEY_INST_INFO		6
+#define PARALLEL_KEY_TUPLE_QUEUE	7
+#define PARALLEL_KEY_SCAN			8
+#define PARALLEL_KEY_OPERATION		9
+
+static void ParallelQueryMain(dsm_segment *seg, shm_toc *toc);
+static void RestoreAndExecuteParallelScan(dsm_segment *seg, shm_toc *toc);
+static void
+EstimateParallelQueryElemsSpace(ParallelContext *pcxt,
+								char *targetlist_str, char *qual_str,
+								Size *targetlist_len, Size *qual_len);
+static void
+StoreParallelQueryElems(ParallelContext *pcxt,
+						char *targetlist_str, char *qual_str,
+						Size targetlist_len, Size qual_len);
+static void
+EstimateParallelSupportInfoSpace(ParallelContext *pcxt, ParamListInfo params,
+								 int instOptions, Size *params_len);
+static void
+StoreParallelSupportInfo(ParallelContext *pcxt, ParamListInfo params,
+						 int instOptions, int params_len,
+						 char **inst_options_space);
+static void
+EstimateParallelSeqScanSpace(ParallelContext *pcxt, EState *estate,
+							 Index scanrelId, char *rangetbl_str,
+							 Size *rangetbl_len, Size *pscan_size);
+static void
+StoreParallelSeqScan(ParallelContext *pcxt, EState *estate, Relation rel,
+					 Index scanrelId, char *rangetbl_str,
+					 ParallelHeapScanDesc *pscan,
+					 Size rangetbl_len, Size pscan_size);
+static void EstimateResponseQueueSpace(ParallelContext *pcxt);
+static void
+StoreResponseQueueAndStartWorkers(ParallelContext *pcxt,
+								  shm_mq_handle ***responseqp);
+static void
+GetParallelQueryElems(shm_toc *toc, List **targetList, List **qual);
+static void
+GetParallelSupportInfo(shm_toc *toc, ParamListInfo *params,
+					   int *inst_options, char **instrument);
+static void
+GetParallelSeqScanInfo(shm_toc *toc, Index *scanrelId,
+					   List **rangeTableList);
+static void
+SetupResponseQueue(dsm_segment *seg, shm_toc *toc, shm_mq **mq);
+
+/*
+ * EstimateParallelQueryElemsSpace
+ *
+ * Estimate the amount of space required to record information of
+ * query elements that need to be copied to parallel workers.
+ */
+void
+EstimateParallelQueryElemsSpace(ParallelContext *pcxt,
+								char *targetlist_str, char *qual_str,
+								Size *targetlist_len, Size *qual_len)
+{
+	*targetlist_len = strlen(targetlist_str) + 1;
+	shm_toc_estimate_chunk(&pcxt->estimator, *targetlist_len);
+
+	*qual_len = strlen(qual_str) + 1;
+	shm_toc_estimate_chunk(&pcxt->estimator, *qual_len);
+
+	/* keys for parallel query elements. */
+	shm_toc_estimate_keys(&pcxt->estimator, 2);
+}
+
+/*
+ * StoreParallelQueryElems
+ *
+ * Sets up target list and qualification required for parallel
+ * execution.
+ */
+void
+StoreParallelQueryElems(ParallelContext *pcxt,
+						char *targetlist_str, char *qual_str,
+						Size targetlist_len, Size qual_len)
+{
+	char	   *targetlistdata;
+	char	   *qualdata;
+
+	/* Store target list in dynamic shared memory. */
+	targetlistdata = shm_toc_allocate(pcxt->toc, targetlist_len);
+	memcpy(targetlistdata, targetlist_str, targetlist_len);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_TARGETLIST, targetlistdata);
+
+	/* Store qual list in dynamic shared memory. */
+	qualdata = shm_toc_allocate(pcxt->toc, qual_len);
+	memcpy(qualdata, qual_str, qual_len);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_QUAL, qualdata);
+}
+
+/*
+ * EstimateParallelSupportInfoSpace
+ *
+ * Estimate the amount of space required to record information of
+ * bind parameters and instrumentation information that need to be
+ * retrieved from parallel workers.
+ */
+void
+EstimateParallelSupportInfoSpace(ParallelContext *pcxt, ParamListInfo params,
+								 int instOptions, Size *params_len)
+{
+	*params_len = EstimateBoundParametersSpace(params);
+	shm_toc_estimate_chunk(&pcxt->estimator, *params_len);
+
+	/* account for instrumentation options. */
+	shm_toc_estimate_chunk(&pcxt->estimator, sizeof(int));
+
+	/*
+	 * We expect each worker to populate the instrumentation structure
+	 * allocated by master backend and then master backend will aggregate
+	 * all the information, so account for it once per worker.
+	 */
+	if (instOptions)
+	{
+		shm_toc_estimate_chunk(&pcxt->estimator,
+							   sizeof(Instrumentation) * pcxt->nworkers);
+		/* key for per-worker instrumentation information. */
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+	}
+
+	/* keys for bind parameters and instrumentation options. */
+	shm_toc_estimate_keys(&pcxt->estimator, 2);
+}
+
+/*
+ * StoreParallelSupportInfo
+ *
+ * Sets up the bind parameters and instrumentation information
+ * required for parallel execution.
+ */
+void
+StoreParallelSupportInfo(ParallelContext *pcxt, ParamListInfo params,
+						 int instOptions, int params_len,
+						 char **inst_options_space)
+{
+	char	*paramsdata;
+	int		*inst_options;
+
+	/*
+	 * Store the bind parameter list in dynamic shared memory.  This is
+	 * used for parameters in prepared queries.
+	 */
+	paramsdata = shm_toc_allocate(pcxt->toc, params_len);
+	SerializeBoundParams(params, params_len, paramsdata);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_PARAMS, paramsdata);
+
+	/* Store instrument options in dynamic shared memory. */
+	inst_options = shm_toc_allocate(pcxt->toc, sizeof(int));
+	*inst_options = instOptions;
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_INST_OPTIONS, inst_options);
+
+	/*
+	 * Allocate space for instrumentation information to be filled by
+	 * each worker.
+	 */
+	if (instOptions)
+	{
+		*inst_options_space =
+			shm_toc_allocate(pcxt->toc, sizeof(Instrumentation) * pcxt->nworkers);
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_INST_INFO, *inst_options_space);
+	}
+}
+
+/*
+ * EstimateParallelSeqScanSpace
+ *
+ * Estimate the amount of space required to record information of
+ * scanrelId, rangetable and parallel heap scan descriptor that need
+ * to be copied to parallel workers.
+ */
+void
+EstimateParallelSeqScanSpace(ParallelContext *pcxt, EState *estate,
+							 Index scanrelId, char *rangetbl_str,
+							 Size *rangetbl_len, Size *pscan_size)
+{
+	/* Estimate space for parallel seq. scan specific contents. */
+	shm_toc_estimate_chunk(&pcxt->estimator, sizeof(NodeTag));
+	shm_toc_estimate_chunk(&pcxt->estimator, sizeof(scanrelId));
+
+	*rangetbl_len = strlen(rangetbl_str) + 1;
+	shm_toc_estimate_chunk(&pcxt->estimator, *rangetbl_len);
+
+	*pscan_size = heap_parallelscan_estimate(estate->es_snapshot);
+	shm_toc_estimate_chunk(&pcxt->estimator, *pscan_size);
+
+	/* keys for parallel seq. scan specific contents. */
+	shm_toc_estimate_keys(&pcxt->estimator, 4);
+}
+
+/*
+ * StoreParallelSeqScan
+ *
+ * Sets up the scanrelid, range table entries and parallel heap scan
+ * descriptor for parallel sequential scan.
+ */
+void
+StoreParallelSeqScan(ParallelContext *pcxt, EState *estate, Relation rel,
+					 Index scanrelId, char *rangetbl_str,
+					 ParallelHeapScanDesc *pscan,
+					 Size rangetbl_len, Size pscan_size)
+{
+	NodeTag		*nodetype;
+	Index		*scanreliddata;
+	char		*rangetbldata;
+
+	/* Store sequence scan Nodetag in dynamic shared memory. */
+	nodetype = shm_toc_allocate(pcxt->toc, sizeof(NodeTag));
+	*nodetype = T_ParallelSeqScan;
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_OPERATION, nodetype);
+
+	/* Store scan relation id in dynamic shared memory. */
+	scanreliddata = shm_toc_allocate(pcxt->toc, sizeof(Index));
+	*scanreliddata = scanrelId;
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_SCANRELID, scanreliddata);
+
+	/* Store range table list in dynamic shared memory. */
+	rangetbldata = shm_toc_allocate(pcxt->toc, rangetbl_len);
+	memcpy(rangetbldata, rangetbl_str, rangetbl_len);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_RANGETBL, rangetbldata);
+
+	/* Store parallel heap scan descriptor in dynamic shared memory. */
+	*pscan = shm_toc_allocate(pcxt->toc, pscan_size);
+	heap_parallelscan_initialize(*pscan, rel, estate->es_snapshot);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_SCAN, *pscan);
+}
+
+/*
+ * EstimateResponseQueueSpace
+ *
+ * Estimate the amount of space required to record information of
+ * tuple queues that need to be established between parallel workers
+ * and master backend.
+ */
+void
+EstimateResponseQueueSpace(ParallelContext *pcxt)
+{
+	/* Estimate space for one tuple queue per worker. */
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   (Size) PARALLEL_TUPLE_QUEUE_SIZE * pcxt->nworkers);
+
+	/* keys for response queue. */
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/*
+ * StoreResponseQueueAndStartWorkers
+ *
+ * Sets up the response queues that backend workers use to return
+ * tuples to the main backend, and starts the workers.
+ * This function must be called after setting up all the other
+ * necessary parallel execution related information, as it starts
+ * the workers, after which we can't initialize or pass the parallel
+ * state information.
+ */
+void
+StoreResponseQueueAndStartWorkers(ParallelContext *pcxt,
+								  shm_mq_handle ***responseqp)
+{
+	shm_mq		*mq;
+	char		*tuple_queue_space;
+	int			i;
+
+	/* Allocate memory for shared memory queue handles. */
+	*responseqp = (shm_mq_handle**) palloc(pcxt->nworkers * sizeof(shm_mq_handle*));
+
+	/*
+	 * Establish one message queue per worker in dynamic shared memory.
+	 * These queues should be used to transmit tuple data.
+	 */
+	tuple_queue_space =
+		shm_toc_allocate(pcxt->toc, (Size) PARALLEL_TUPLE_QUEUE_SIZE * pcxt->nworkers);
+	for (i = 0; i < pcxt->nworkers; ++i)
+	{
+		mq = shm_mq_create(tuple_queue_space + i * PARALLEL_TUPLE_QUEUE_SIZE,
+						   (Size) PARALLEL_TUPLE_QUEUE_SIZE);
+
+		shm_mq_set_receiver(mq, MyProc);
+
+		/*
+		 * Attach the queue before launching a worker, so that we'll automatically
+		 * detach the queue if we error out.  (Otherwise, the worker might sit
+		 * there trying to write the queue long after we've gone away.)
+		 */
+		(*responseqp)[i] = shm_mq_attach(mq, pcxt->seg, NULL);
+	}
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_TUPLE_QUEUE, tuple_queue_space);
+
+	/* Register backend workers. */
+	LaunchParallelWorkers(pcxt);
+
+	for (i = 0; i < pcxt->nworkers; ++i)
+		shm_mq_set_handle((*responseqp)[i], pcxt->worker[i].bgwhandle);
+}
+
+/*
+ * InitializeParallelWorkers
+ *
+ *	Sets up the required infrastructure for backend workers to
+ *	perform execution and return results to the main backend.
+ */
+void
+InitializeParallelWorkers(Index scanrelId, List *targetList, List *qual,
+						  EState *estate, Relation rel, char **inst_options_space,
+						  shm_mq_handle ***responseqp, ParallelContext **pcxtp,
+						  ParallelHeapScanDesc *pscan, int nWorkers)
+{
+	bool		already_in_parallel_mode = IsInParallelMode();
+	Size		targetlist_len, qual_len, rangetbl_len, params_len, pscan_size;
+	char	   *targetlist_str;
+	char	   *qual_str;
+	char	   *rangetbl_str;
+	ParallelContext *pcxt;
+
+	if (!already_in_parallel_mode)
+		EnterParallelMode();
+
+	pcxt = CreateParallelContext(ParallelQueryMain, nWorkers);
+
+	/* Estimate space for parallel seq. scan specific contents. */
+	targetlist_str = nodeToString(targetList);
+	qual_str = nodeToString(qual);
+	EstimateParallelQueryElemsSpace(pcxt, targetlist_str, qual_str,
+									&targetlist_len, &qual_len);
+
+	rangetbl_str = nodeToString(estate->es_range_table);
+	EstimateParallelSeqScanSpace(pcxt, estate, scanrelId, rangetbl_str,
+								 &rangetbl_len, &pscan_size);
+	EstimateParallelSupportInfoSpace(pcxt, estate->es_param_list_info,
+									 estate->es_instrument, &params_len);
+	EstimateResponseQueueSpace(pcxt);
+
+	InitializeParallelDSM(pcxt);
+
+	StoreParallelQueryElems(pcxt, targetlist_str, qual_str,
+							targetlist_len, qual_len);
+	StoreParallelSeqScan(pcxt, estate, rel, scanrelId, rangetbl_str,
+						 pscan, rangetbl_len, pscan_size);
+	StoreParallelSupportInfo(pcxt, estate->es_param_list_info,
+							 estate->es_instrument,
+							 params_len, inst_options_space);
+	StoreResponseQueueAndStartWorkers(pcxt, responseqp);
+
+	/* Return results to caller. */
+	*pcxtp = pcxt;
+}
+
+/*
+ * GetParallelQueryElems
+ *
+ * Look up based on keys in dynamic shared memory segment
+ * and get the targetlist and qualification list required
+ * to perform parallel operation.
+ */
+void
+GetParallelQueryElems(shm_toc *toc, List **targetList, List **qual)
+{
+	char	    *targetlistdata;
+	char		*qualdata;
+
+	targetlistdata = shm_toc_lookup(toc, PARALLEL_KEY_TARGETLIST);
+	qualdata = shm_toc_lookup(toc, PARALLEL_KEY_QUAL);
+
+	*targetList = (List *) stringToNode(targetlistdata);
+	*qual = (List *) stringToNode(qualdata);
+}
+
+/*
+ * GetParallelSupportInfo
+ *
+ * Look up based on keys in dynamic shared memory segment
+ * and get the bind parameters and instrumentation information
+ * required to perform parallel operation.
+ */
+void
+GetParallelSupportInfo(shm_toc *toc, ParamListInfo *params,
+					   int *inst_options, char **instrument)
+{
+	char		*paramsdata;
+	char		*inst_options_space;
+	int			*instoptions;
+
+	paramsdata = shm_toc_lookup(toc, PARALLEL_KEY_PARAMS);
+	instoptions	= shm_toc_lookup(toc, PARALLEL_KEY_INST_OPTIONS);
+
+	*params = RestoreBoundParams(paramsdata);
+
+	*inst_options = *instoptions;
+	if (*inst_options)
+	{
+		inst_options_space = shm_toc_lookup(toc, PARALLEL_KEY_INST_INFO);
+		*instrument = (inst_options_space +
+			ParallelWorkerNumber * sizeof(Instrumentation));
+	}
+}
+
+/*
+ * GetParallelSeqScanInfo
+ *
+ * Look up based on keys in dynamic shared memory segment
+ * and get the scanrelId and rangeTable required to perform
+ * parallel sequential scan.
+ */
+void
+GetParallelSeqScanInfo(shm_toc *toc, Index *scanrelId,
+					   List **rangeTableList)
+{
+	char		*rangetbldata;
+	Index		*scanrel;
+
+	scanrel = shm_toc_lookup(toc, PARALLEL_KEY_SCANRELID);
+	rangetbldata = shm_toc_lookup(toc, PARALLEL_KEY_RANGETBL);
+
+	*scanrelId = *scanrel;
+	*rangeTableList = (List *) stringToNode(rangetbldata);
+}
+
+/*
+ * SetupResponseQueue
+ *
+ * Look up based on keys in dynamic shared memory segment
+ * and get the tuple queue information for a particular worker,
+ * attach to the queue and redirect all further responses from
+ * worker backend via that queue.
+ */
+void
+SetupResponseQueue(dsm_segment *seg, shm_toc *toc, shm_mq **mq)
+{
+	char		*tuple_queue_space;
+	shm_mq_handle *responseq;
+
+	tuple_queue_space = shm_toc_lookup(toc, PARALLEL_KEY_TUPLE_QUEUE);
+	*mq = (shm_mq *) (tuple_queue_space +
+		ParallelWorkerNumber * PARALLEL_TUPLE_QUEUE_SIZE);
+
+	shm_mq_set_sender(*mq, MyProc);
+	responseq = shm_mq_attach(*mq, seg, NULL);
+
+	/* Redirect protocol messages to responseq. */
+	pq_redirect_to_tuple_shm_mq(responseq);
+}
+
+/*
+ * ParallelQueryMain
+ *
+ * Execute the operation to return the tuples or other information
+ * to parallelism driving node.
+ */
+void
+ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
+{
+	NodeTag		*nodetype;
+
+	nodetype = shm_toc_lookup(toc, PARALLEL_KEY_OPERATION);
+
+	switch (*nodetype)
+	{
+		case T_ParallelSeqScan:
+			RestoreAndExecuteParallelScan(seg, toc);
+			break;
+		default:
+			elog(ERROR, "unrecognized node type: %d", (int) *nodetype);
+			break;
+	}
+}
+
+/*
+ * RestoreAndExecuteParallelScan
+ *
+ * Look up the parallel sequential scan related parameters
+ * from dynamic shared memory segment and set up the
+ * statement to execute the scan.
+ */
+void
+RestoreAndExecuteParallelScan(dsm_segment *seg, shm_toc *toc)
+{
+	shm_mq		*mq;
+	List		*targetList = NIL;
+	List		*qual = NIL;
+	List		*rangeTableList = NIL;
+	ParamListInfo params;
+	int			inst_options;
+	char		*instrument = NULL;
+	Index		scanrelId;
+	ParallelScanStmt	*parallelscan;
+
+	SetupResponseQueue(seg, toc, &mq);
+
+	GetParallelQueryElems(toc, &targetList, &qual);
+	GetParallelSeqScanInfo(toc, &scanrelId, &rangeTableList);
+	GetParallelSupportInfo(toc, &params, &inst_options, &instrument);
+
+	parallelscan = palloc(sizeof(ParallelScanStmt));
+
+	parallelscan->scanrelId = scanrelId;
+	parallelscan->targetList = targetList;
+	parallelscan->qual = qual;
+	parallelscan->rangetableList = rangeTableList;
+	parallelscan->params	= params;
+	parallelscan->inst_options = inst_options;
+	parallelscan->instrument = instrument;
+	parallelscan->toc = toc;
+	parallelscan->shm_toc_scan_key = PARALLEL_KEY_SCAN;
+
+	/* Execute the worker command. */
+	exec_parallel_scan(parallelscan);
+
+	/*
+	 * Once we are done with sending tuples, detach from
+	 * shared memory message queue used to send tuples.
+	 */
+	shm_mq_detach(mq);
+}
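
A note on the structure of backendworker.c above: every piece of shared state goes through two passes, an Estimate* call that adds its size to a running total before the segment exists, and a Store* call that carves the same amount out of the segment afterwards. Workers then find their own tuple queue or instrumentation slot by offsetting into one shared chunk by their worker number. The sketch below imitates that pattern with a toy bump allocator over malloc. All names in it are hypothetical; it does not use or reproduce the real shm_toc API, whose alignment and layout rules differ.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for a shared segment; every name here is hypothetical. */
typedef struct ToySegment
{
	char	   *base;			/* start of the segment */
	size_t		size;			/* bytes reserved by the estimate pass */
	size_t		used;			/* bytes handed out so far */
} ToySegment;

/* Pass 1: add one chunk to the running size estimate (8-byte aligned). */
static void
toy_estimate_chunk(size_t *estimate, size_t len)
{
	*estimate += (len + 7) & ~(size_t) 7;
}

/* Create the segment once every chunk has been estimated. */
static int
toy_segment_create(ToySegment *seg, size_t estimate)
{
	seg->base = malloc(estimate > 0 ? estimate : 1);
	seg->size = estimate;
	seg->used = 0;
	return seg->base != NULL;
}

/* Pass 2: carve out a chunk; returns NULL if the estimate was too small. */
static void *
toy_allocate(ToySegment *seg, size_t len)
{
	size_t		aligned = (len + 7) & ~(size_t) 7;
	void	   *ptr;

	if (aligned > seg->size - seg->used)
		return NULL;
	ptr = seg->base + seg->used;
	seg->used += aligned;
	return ptr;
}

/* Worker side: locate one worker's fixed-size slot inside a shared chunk. */
static char *
toy_worker_slot(char *chunk, int worker, size_t slot_size)
{
	return chunk + (size_t) worker * slot_size;
}
```

The point of the two passes is that the segment size is fixed at creation, so an Estimate* function and its Store* counterpart have to agree on both the chunk sizes and the number of keys. Overestimating only wastes space, while underestimating makes the later allocation fail.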
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index ac431e5..4c303dd 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -103,6 +103,7 @@
 #include "miscadmin.h"
 #include "pg_getopt.h"
 #include "pgstat.h"
+#include "optimizer/cost.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/bgworker_internals.h"
 #include "postmaster/fork_process.h"
@@ -835,6 +836,12 @@ PostmasterMain(int argc, char *argv[])
 		ereport(ERROR,
 				(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"archive\", \"hot_standby\", or \"logical\"")));
 
+	if (parallel_seqscan_degree >= MaxConnections)
+	{
+		write_stderr("%s: parallel_seqscan_degree must be less than max_connections\n", progname);
+		ExitPostmaster(1);
+	}
+
 	/*
 	 * Other one-time internal sanity checks can go here, if they are fast.
 	 * (Put any slow processing further down, after postmaster.pid creation.)
diff --git a/src/backend/tcop/dest.c b/src/backend/tcop/dest.c
index bcf3895..e7ebc1f 100644
--- a/src/backend/tcop/dest.c
+++ b/src/backend/tcop/dest.c
@@ -104,6 +104,7 @@ CreateDestReceiver(CommandDest dest)
 	{
 		case DestRemote:
 		case DestRemoteExecute:
+		case DestRemoteBackend:
 			return printtup_create_DR(dest);
 
 		case DestNone:
@@ -146,12 +147,22 @@ EndCommand(const char *commandTag, CommandDest dest)
 	{
 		case DestRemote:
 		case DestRemoteExecute:
+		case DestRemoteBackend:
 
 			/*
-			 * We assume the commandTag is plain ASCII and therefore requires
-			 * no encoding conversion.
+			 * Send the message via shared-memory tuple queue, if the same
+			 * is enabled.
 			 */
-			pq_putmessage('C', commandTag, strlen(commandTag) + 1);
+			if (is_tuple_shm_mq_enabled())
+				mq_putmessage_direct('C', commandTag, strlen(commandTag) + 1);
+			else
+			{
+				/*
+				 * We assume the commandTag is plain ASCII and therefore requires
+				 * no encoding conversion.
+				 */
+				pq_putmessage('C', commandTag, strlen(commandTag) + 1);
+			}
 			break;
 
 		case DestNone:
@@ -204,6 +215,7 @@ NullCommand(CommandDest dest)
 		case DestCopyOut:
 		case DestSQLFunction:
 		case DestTransientRel:
+		case DestRemoteBackend:
 			break;
 	}
 }
@@ -248,6 +260,7 @@ ReadyForQuery(CommandDest dest)
 		case DestCopyOut:
 		case DestSQLFunction:
 		case DestTransientRel:
+		case DestRemoteBackend:
 			break;
 	}
 }
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 8899448..2e42aa2 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -55,6 +55,7 @@
 #include "pg_getopt.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/postmaster.h"
+#include "postmaster/backendworker.h"
 #include "replication/slot.h"
 #include "replication/walsender.h"
 #include "rewrite/rewriteHandler.h"
@@ -1191,6 +1192,90 @@ exec_simple_query(const char *query_string)
 }
 
 /*
+ * exec_parallel_scan
+ *
+ * Execute the plan for backend worker.
+ */
+void
+exec_parallel_scan(ParallelScanStmt *parallelscan)
+{
+	DestReceiver *receiver;
+	QueryDesc	*queryDesc;
+	PlannedStmt	*planned_stmt;
+	MemoryContext oldcontext;
+	MemoryContext	plancontext;
+
+	set_ps_display("SELECT", false);
+	BeginCommand("SELECT", DestNone);
+
+	/*
+	 * Unlike exec_simple_query(), in backend worker we won't allow
+	 * transaction control statements, so we can allow plancontext
+	 * to be created in TopTransaction context.
+	 */
+	plancontext = AllocSetContextCreate(CurrentMemoryContext,
+										 "worker plan",
+										 ALLOCSET_DEFAULT_MINSIZE,
+										 ALLOCSET_DEFAULT_INITSIZE,
+										 ALLOCSET_DEFAULT_MAXSIZE);
+
+	oldcontext = MemoryContextSwitchTo(plancontext);
+
+	planned_stmt = create_worker_seqscan_plannedstmt(parallelscan);
+
+	if (parallelscan->inst_options)
+		receiver = None_Receiver;
+	else
+	{
+		receiver = CreateDestReceiver(DestRemoteBackend);
+		SetRemoteDestReceiverParams(receiver, NULL);
+	}
+
+	/* Create a QueryDesc for the query */
+	queryDesc = CreateQueryDesc(planned_stmt, "",
+								GetActiveSnapshot(), InvalidSnapshot,
+								receiver, parallelscan->params,
+								parallelscan->inst_options);
+
+	PushActiveSnapshot(queryDesc->snapshot);
+
+	/* call ExecutorStart to prepare the plan for execution */
+	ExecutorStart(queryDesc, 0);
+
+	/* run the plan */
+	ExecutorRun(queryDesc, ForwardScanDirection, 0L);
+
+	/* run cleanup too */
+	ExecutorFinish(queryDesc);
+
+	/*
+	 * copy instrumentation information into shared memory if requested
+	 * by master backend.
+	 */
+	if (parallelscan->inst_options)
+		memcpy(parallelscan->instrument,
+			   queryDesc->planstate->instrument,
+			   sizeof(Instrumentation));
+
+	ExecutorEnd(queryDesc);
+
+	PopActiveSnapshot();
+
+	FreeQueryDesc(queryDesc);
+
+	if (!parallelscan->inst_options)
+		(*receiver->rDestroy) (receiver);
+
+	/*
+	 * Send appropriate CommandComplete to client.  There is no
+	 * need to send completion tag from worker as that won't be
+	 * of any use considering the completion tag of master backend
+	 * will be used for sending to client.
+	 */
+	EndCommand("", DestRemoteBackend);
+}
+
+/*
  * exec_parse_message
  *
  * Execute a "Parse" protocol message.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index de988ba..b348bad 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -622,6 +622,8 @@ const char *const config_group_names[] =
 	gettext_noop("Statistics / Query and Index Statistics Collector"),
 	/* AUTOVACUUM */
 	gettext_noop("Autovacuum"),
+	/* PARALLEL_QUERY */
+	gettext_noop("parallel_seqscan_degree"),
 	/* CLIENT_CONN */
 	gettext_noop("Client Connection Defaults"),
 	/* CLIENT_CONN_STATEMENT */
@@ -2437,6 +2439,16 @@ static struct config_int ConfigureNamesInt[] =
 	},
 
 	{
+		{"parallel_seqscan_degree", PGC_SUSET, PARALLEL_QUERY,
+			gettext_noop("Sets the maximum number of simultaneously running backend worker processes."),
+			NULL
+		},
+		&parallel_seqscan_degree,
+		0, 0, MAX_BACKENDS,
+		NULL, NULL, NULL
+	},
+
+	{
 		{"autovacuum_work_mem", PGC_SIGHUP, RESOURCES_MEM,
 			gettext_noop("Sets the maximum memory to be used by each autovacuum worker process."),
 			NULL,
@@ -2624,6 +2636,36 @@ static struct config_real ConfigureNamesReal[] =
 		DEFAULT_CPU_OPERATOR_COST, 0, DBL_MAX,
 		NULL, NULL, NULL
 	},
+	{
+		{"cpu_tuple_comm_cost", PGC_USERSET, QUERY_TUNING_COST,
+			gettext_noop("Sets the planner's estimate of the cost of "
+						 "passing each tuple (row) from worker to master backend."),
+			NULL
+		},
+		&cpu_tuple_comm_cost,
+		DEFAULT_CPU_TUPLE_COMM_COST, 0, DBL_MAX,
+		NULL, NULL, NULL
+	},
+	{
+		{"parallel_setup_cost", PGC_USERSET, QUERY_TUNING_COST,
+			gettext_noop("Sets the planner's estimate of the cost of "
+						 "setting up environment (shared memory) for parallelism."),
+			NULL
+		},
+		&parallel_setup_cost,
+		DEFAULT_PARALLEL_SETUP_COST, 0, DBL_MAX,
+		NULL, NULL, NULL
+	},
+	{
+		{"parallel_startup_cost", PGC_USERSET, QUERY_TUNING_COST,
+			gettext_noop("Sets the planner's estimate of the cost of "
+						 "starting parallel workers."),
+			NULL
+		},
+		&parallel_startup_cost,
+		DEFAULT_PARALLEL_STARTUP_COST, 0, DBL_MAX,
+		NULL, NULL, NULL
+	},
 
 	{
 		{"cursor_tuple_fraction", PGC_USERSET, QUERY_TUNING_OTHER,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index b053659..784cfe0 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -287,6 +287,9 @@
 #cpu_tuple_cost = 0.01			# same scale as above
 #cpu_index_tuple_cost = 0.005		# same scale as above
 #cpu_operator_cost = 0.0025		# same scale as above
+#cpu_tuple_comm_cost = 0.1		# same scale as above
+#parallel_setup_cost = 0.0	# same scale as above
+#parallel_startup_cost = 0.0	# same scale as above
 #effective_cache_size = 4GB
 
 # - Genetic Query Optimizer -
@@ -497,6 +500,11 @@
 					# autovacuum, -1 means use
 					# vacuum_cost_limit
 
+#------------------------------------------------------------------------------
+# PARALLEL_QUERY PARAMETERS
+#------------------------------------------------------------------------------
+
+#parallel_seqscan_degree = 0		# max number of worker backend subprocesses
 
 #------------------------------------------------------------------------------
 # CLIENT CONNECTION DEFAULTS
diff --git a/src/include/access/parallel.h b/src/include/access/parallel.h
index 0685e64..9d3d5e5 100644
--- a/src/include/access/parallel.h
+++ b/src/include/access/parallel.h
@@ -47,6 +47,8 @@ typedef struct ParallelContext
 extern bool ParallelMessagePending;
 extern int ParallelWorkerNumber;
 
+extern int ParallelWorkerNumber;
+
 extern ParallelContext *CreateParallelContext(parallel_worker_main_type entrypoint, int nworkers);
 extern ParallelContext *CreateParallelContextForExternalFunction(char *library_name, char *function_name, int nworkers);
 extern void InitializeParallelDSM(ParallelContext *);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index f459020..7a7bf75 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -115,4 +115,13 @@ typedef struct SysScanDescData
 	Snapshot	snapshot;		/* snapshot to unregister at end of scan */
 }	SysScanDescData;
 
+/* struct for scanning shared memory queues */
+typedef struct ShmScanDescData
+{
+	/* scan current state */
+	int			num_shm_queues;	/* number of shared memory queues used in scan. */
+	int			ss_cqueue;		/* current queue # in scan, if any */
+	bool		shmscan_inited;		/* false = scan not init'd yet */
+}	ShmScanDescData;
+
 #endif   /* RELSCAN_H */
diff --git a/src/include/access/shmmqam.h b/src/include/access/shmmqam.h
new file mode 100644
index 0000000..f3668ae
--- /dev/null
+++ b/src/include/access/shmmqam.h
@@ -0,0 +1,43 @@
+/*-------------------------------------------------------------------------
+ *
+ * shmmqam.h
+ *	  POSTGRES shared memory queue access method definitions.
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/shmmqam.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef SHMMQAM_H
+#define SHMMQAM_H
+
+#include "access/relscan.h"
+#include "libpq/pqmq.h"
+
+
+/* Private state maintained across calls to shm_getnext. */
+typedef struct worker_result_state
+{
+	FmgrInfo   *receive_functions;
+	Oid		   *typioparams;
+	HeapTuple  tuple;
+	int		   num_shm_queues;
+	bool	   *queue_detached;
+	bool	   all_queues_detached;
+	bool	   all_heap_fetched;
+} worker_result_state;
+
+typedef struct worker_result_state *worker_result;
+
+typedef struct ShmScanDescData *ShmScanDesc;
+
+extern worker_result ExecInitWorkerResult(TupleDesc tupdesc, int nWorkers);
+extern ShmScanDesc shm_beginscan(int num_queues);
+extern HeapTuple shm_getnext(HeapScanDesc scanDesc, ShmScanDesc shmScan,
+							 worker_result resultState, shm_mq_handle **responseq,
+							 TupleDesc tupdesc, ScanDirection direction, bool *fromheap);
+
+#endif   /* SHMMQAM_H */
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 1c3b2b0..e8522fe 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -69,5 +69,6 @@ extern Instrumentation *InstrAlloc(int n, int instrument_options);
 extern void InstrStartNode(Instrumentation *instr);
 extern void InstrStopNode(Instrumentation *instr, double nTuples);
 extern void InstrEndLoop(Instrumentation *instr);
+extern void InstrAggNode(Instrumentation *instr1, Instrumentation *instr2);
 
 #endif   /* INSTRUMENT_H */
diff --git a/src/include/executor/nodeParallelSeqscan.h b/src/include/executor/nodeParallelSeqscan.h
new file mode 100644
index 0000000..b638a24
--- /dev/null
+++ b/src/include/executor/nodeParallelSeqscan.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeParallelSeqscan.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeParallelSeqscan.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEPARALLELSEQSCAN_H
+#define NODEPARALLELSEQSCAN_H
+
+#include "nodes/execnodes.h"
+
+extern ParallelSeqScanState *ExecInitParallelSeqScan(ParallelSeqScan *node, EState *estate, int eflags);
+extern TupleTableSlot *ExecParallelSeqScan(ParallelSeqScanState *node);
+extern void ExecEndParallelSeqScan(ParallelSeqScanState *node);
+
+extern Size EstimateScanRelationIdSpace(Oid relId);
+extern void SerializeScanRelationId(Oid relId, Size maxsize,
+									char *start_address);
+extern void RestoreScanRelationId(Oid *relId, char *start_address);
+
+extern Size EstimateTargetListSpace(List *targetList);
+extern void SerializeTargetList(List *targetList, Size maxsize,
+								char *start_address);
+extern void RestoreTargetList(List **targetList, char *start_address);
+
+#endif   /* NODEPARALLELSEQSCAN_H */
diff --git a/src/include/executor/tuptable.h b/src/include/executor/tuptable.h
index 48f84bf..e5dec1e 100644
--- a/src/include/executor/tuptable.h
+++ b/src/include/executor/tuptable.h
@@ -127,6 +127,8 @@ typedef struct TupleTableSlot
 	MinimalTuple tts_mintuple;	/* minimal tuple, or NULL if none */
 	HeapTupleData tts_minhdr;	/* workspace for minimal-tuple-only case */
 	long		tts_off;		/* saved state for slot_deform_tuple */
+	bool		tts_fromheap;	/* indicates whether the tuple is fetched from
+								   heap or shared memory message queue */
 } TupleTableSlot;
 
 #define TTS_HAS_PHYSICAL_TUPLE(slot)  \
diff --git a/src/include/libpq/pqmq.h b/src/include/libpq/pqmq.h
index ad7589d..067edbe 100644
--- a/src/include/libpq/pqmq.h
+++ b/src/include/libpq/pqmq.h
@@ -19,6 +19,13 @@
 extern void	pq_redirect_to_shm_mq(shm_mq *, shm_mq_handle *);
 extern void pq_set_parallel_master(pid_t pid, BackendId backend_id);
 
+extern int
+mq_putmessage_direct(char msgtype, const char *s, size_t len);
+extern void
+pq_redirect_to_tuple_shm_mq(shm_mq_handle *mqh);
+extern bool
+is_tuple_shm_mq_enabled(void);
+
 extern void pq_parse_errornotice(StringInfo str, ErrorData *edata);
 
 #endif   /* PQMQ_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 41288ed..844a9eb 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -16,9 +16,12 @@
 
 #include "access/genam.h"
 #include "access/heapam.h"
+#include "access/parallel.h"
+#include "access/shmmqam.h"
 #include "executor/instrument.h"
 #include "nodes/params.h"
 #include "nodes/plannodes.h"
+#include "storage/shm_mq.h"
 #include "utils/reltrigger.h"
 #include "utils/sortsupport.h"
 #include "utils/tuplestore.h"
@@ -1212,6 +1215,24 @@ typedef struct ScanState
 typedef ScanState SeqScanState;
 
 /*
+ * ParallelScanState extends ScanState by storing additional information
+ * related to parallel workers.
+ *		dsm_segment		dynamic shared memory segment to setup worker queues
+ *		responseq		shared memory queues to receive data from workers
+ */
+typedef struct ParallelScanState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	ParallelContext *pcxt;
+	shm_mq_handle **responseq;
+	ShmScanDesc pss_currentShmScanDesc;
+	worker_result	pss_workerResult;
+	char	*inst_options_space;
+} ParallelScanState;
+
+typedef ParallelScanState ParallelSeqScanState;
+
+/*
  * These structs store information about index quals that don't have simple
  * constant right-hand sides.  See comments for ExecIndexBuildScanKeys()
  * for discussion.
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 97ef0fc..b6f1493 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -51,6 +51,7 @@ typedef enum NodeTag
 	T_BitmapOr,
 	T_Scan,
 	T_SeqScan,
+	T_ParallelSeqScan,
 	T_IndexScan,
 	T_IndexOnlyScan,
 	T_BitmapIndexScan,
@@ -97,6 +98,7 @@ typedef enum NodeTag
 	T_BitmapOrState,
 	T_ScanState,
 	T_SeqScanState,
+	T_ParallelSeqScanState,
 	T_IndexScanState,
 	T_IndexOnlyScanState,
 	T_BitmapIndexScanState,
@@ -217,6 +219,7 @@ typedef enum NodeTag
 	T_IndexOptInfo,
 	T_ParamPathInfo,
 	T_Path,
+	T_ParallelSeqPath,
 	T_IndexPath,
 	T_BitmapHeapPath,
 	T_BitmapAndPath,
diff --git a/src/include/nodes/params.h b/src/include/nodes/params.h
index 5b096c5..eb8c86a 100644
--- a/src/include/nodes/params.h
+++ b/src/include/nodes/params.h
@@ -103,4 +103,9 @@ typedef struct ParamExecData
 /* Functions found in src/backend/nodes/params.c */
 extern ParamListInfo copyParamList(ParamListInfo from);
 
+extern Size
+EstimateBoundParametersSpace(ParamListInfo params);
+extern void
+SerializeBoundParams(ParamListInfo params, Size maxsize, char *start_address);
+extern ParamListInfo RestoreBoundParams(char *start_address);
 #endif   /* PARAMS_H */
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index b1dfa85..929937d 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -20,9 +20,14 @@
 #ifndef PARSENODES_H
 #define PARSENODES_H
 
+#include "executor/instrument.h"
 #include "nodes/bitmapset.h"
+#include "nodes/params.h"
 #include "nodes/primnodes.h"
 #include "nodes/value.h"
+#include "nodes/params.h"
+#include "storage/block.h"
+#include "storage/shm_toc.h"
 #include "utils/lockwaitpolicy.h"
 
 /* Possible sources of a Query */
@@ -156,6 +161,19 @@ typedef struct Query
 								 * depends on to be semantically valid */
 } Query;
 
+/* worker statement required for execution. */
+typedef struct ParallelScanStmt
+{
+	Index		scanrelId;
+	List		*targetList;
+	List		*qual;
+	List		*rangetableList;
+	ParamListInfo params;
+	shm_toc		*toc;
+	uint64		shm_toc_scan_key;
+	int			inst_options;
+	char		*instrument;
+} ParallelScanStmt;
 
 /****************************************************************************
  *	Supporting data structures for Parse Trees
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 316c9ce..2ae52dd 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -18,6 +18,8 @@
 #include "lib/stringinfo.h"
 #include "nodes/bitmapset.h"
 #include "nodes/primnodes.h"
+#include "storage/block.h"
+#include "storage/shm_toc.h"
 #include "utils/lockwaitpolicy.h"
 
 
@@ -278,6 +280,23 @@ typedef struct Scan
 typedef Scan SeqScan;
 
 /* ----------------
+ *		parallel sequential scan node
+ * ----------------
+ */
+typedef struct ParallelSeqScan
+{
+	Scan		scan;
+	/*
+	 * Non-zero values of toc and shm_toc_key indicate that this
+	 * node will be used for execution of parallel scan in worker
+	 * backend.
+	 */
+	shm_toc		*toc;
+	uint64		shm_toc_key;
+	int			num_workers;
+} ParallelSeqScan;
+
+/* ----------------
  *		index scan node
  *
  * indexqualorig is an implicitly-ANDed list of index qual expressions, each
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 6845a40..c5eb319 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -737,6 +737,12 @@ typedef struct Path
 	/* pathkeys is a List of PathKey nodes; see above */
 } Path;
 
+typedef struct ParallelSeqPath
+{
+	Path		path;
+	int			num_workers;
+} ParallelSeqPath;
+
 /* Macro for extracting a path's parameterization relids; beware double eval */
 #define PATH_REQ_OUTER(path)  \
 	((path)->param_info ? (path)->param_info->ppi_req_outer : (Relids) NULL)
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 9c2000b..0b6a469 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -26,6 +26,14 @@
 #define DEFAULT_CPU_TUPLE_COST	0.01
 #define DEFAULT_CPU_INDEX_TUPLE_COST 0.005
 #define DEFAULT_CPU_OPERATOR_COST  0.0025
+#define DEFAULT_CPU_TUPLE_COMM_COST 0.1
+/*
+ * XXX - We need some experiments to know what could be
+ * appropriate default values for parallel setup and startup
+ * cost.
+ */
+#define	DEFAULT_PARALLEL_SETUP_COST  0.0
+#define	DEFAULT_PARALLEL_STARTUP_COST  0.0
 
 #define DEFAULT_EFFECTIVE_CACHE_SIZE  524288	/* measured in pages */
 
@@ -48,8 +56,12 @@ extern PGDLLIMPORT double random_page_cost;
 extern PGDLLIMPORT double cpu_tuple_cost;
 extern PGDLLIMPORT double cpu_index_tuple_cost;
 extern PGDLLIMPORT double cpu_operator_cost;
+extern PGDLLIMPORT double cpu_tuple_comm_cost;
+extern PGDLLIMPORT double parallel_setup_cost;
+extern PGDLLIMPORT double parallel_startup_cost;
 extern PGDLLIMPORT int effective_cache_size;
 extern Cost disable_cost;
+extern int	parallel_seqscan_degree;
 extern bool enable_seqscan;
 extern bool enable_indexscan;
 extern bool enable_indexonlyscan;
@@ -68,6 +80,8 @@ extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,
 					double index_pages, PlannerInfo *root);
 extern void cost_seqscan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 			 ParamPathInfo *param_info);
+extern void cost_parallelseqscan(ParallelSeqPath *path, PlannerInfo *root,
+			 RelOptInfo *baserel, ParamPathInfo *param_info, int nWorkers);
 extern void cost_index(IndexPath *path, PlannerInfo *root,
 		   double loop_count);
 extern void cost_bitmap_heap_scan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 9923f0e..32c3e0d 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -32,6 +32,8 @@ extern bool add_path_precheck(RelOptInfo *parent_rel,
 
 extern Path *create_seqscan_path(PlannerInfo *root, RelOptInfo *rel,
 					Relids required_outer);
+extern ParallelSeqPath *create_parallelseqscan_path(PlannerInfo *root,
+					RelOptInfo *rel, int nWorkers);
 extern IndexPath *create_index_path(PlannerInfo *root,
 				  IndexOptInfo *index,
 				  List *indexclauses,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 6cad92e..391d519 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -46,6 +46,13 @@ extern void debug_print_rel(PlannerInfo *root, RelOptInfo *rel);
 #endif
 
 /*
+ * parallelpath.c
+ *	  routines to generate parallel scan paths
+ */
+
+extern void create_parallelscan_paths(PlannerInfo *root, RelOptInfo *rel);
+
+/*
  * indxpath.c
  *	  routines to generate index paths
  */
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index 082f7d7..eb6be5a 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -41,6 +41,8 @@ extern Plan *optimize_minmax_aggregates(PlannerInfo *root, List *tlist,
  * prototypes for plan/createplan.c
  */
 extern Plan *create_plan(PlannerInfo *root, Path *best_path);
+extern SeqScan *
+create_worker_seqscan_plan(ParallelScanStmt *parallelscan);
 extern SubqueryScan *make_subqueryscan(List *qptlist, List *qpqual,
 				  Index scanrelid, Plan *subplan);
 extern ForeignScan *make_foreignscan(List *qptlist, List *qpqual,
diff --git a/src/include/optimizer/planner.h b/src/include/optimizer/planner.h
index cd62aec..c2aa875 100644
--- a/src/include/optimizer/planner.h
+++ b/src/include/optimizer/planner.h
@@ -14,6 +14,7 @@
 #ifndef PLANNER_H
 #define PLANNER_H
 
+#include "nodes/parsenodes.h"
 #include "nodes/plannodes.h"
 #include "nodes/relation.h"
 
@@ -29,6 +30,8 @@ extern PlannedStmt *planner(Query *parse, int cursorOptions,
 		ParamListInfo boundParams);
 extern PlannedStmt *standard_planner(Query *parse, int cursorOptions,
 				 ParamListInfo boundParams);
+extern PlannedStmt *
+create_worker_seqscan_plannedstmt(ParallelScanStmt *parallelscan);
 
 extern Plan *subquery_planner(PlannerGlobal *glob, Query *parse,
 				 PlannerInfo *parent_root,
diff --git a/src/include/postmaster/backendworker.h b/src/include/postmaster/backendworker.h
new file mode 100644
index 0000000..6d0b590
--- /dev/null
+++ b/src/include/postmaster/backendworker.h
@@ -0,0 +1,32 @@
+/*--------------------------------------------------------------------
+ * backendworker.h
+ *		POSTGRES backend workers interface
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/postmaster/backendworker.h
+ *--------------------------------------------------------------------
+ */
+#ifndef BACKENDWORKER_H
+#define BACKENDWORKER_H
+
+/*---------------------------------------------------------------------
+ * External module API.
+ *---------------------------------------------------------------------
+ */
+
+#include "libpq/pqmq.h"
+
+extern int	parallel_seqscan_degree;
+
+extern void InitializeParallelWorkers(Index scanrelId, List *targetList,
+									  List *qual, EState *estate,
+									  Relation rel, char **inst_options_space,
+									  shm_mq_handle ***responseqp,
+									  ParallelContext **pcxtp,
+									  ParallelHeapScanDesc *pscan,
+									  int nWorkers);
+
+#endif   /* BACKENDWORKER_H */
diff --git a/src/include/tcop/dest.h b/src/include/tcop/dest.h
index 5bcca3f..dd176b5 100644
--- a/src/include/tcop/dest.h
+++ b/src/include/tcop/dest.h
@@ -89,6 +89,7 @@ typedef enum
 	DestDebug,					/* results go to debugging output */
 	DestRemote,					/* results sent to frontend process */
 	DestRemoteExecute,			/* sent to frontend, in Execute command */
+	DestRemoteBackend,			/* parallel worker sends results to master backend */
 	DestSPI,					/* results sent to SPI manager */
 	DestTuplestore,				/* results sent to Tuplestore */
 	DestIntoRel,				/* results sent to relation (SELECT INTO) */
diff --git a/src/include/tcop/tcopprot.h b/src/include/tcop/tcopprot.h
index 3e17770..9eebc51 100644
--- a/src/include/tcop/tcopprot.h
+++ b/src/include/tcop/tcopprot.h
@@ -84,5 +84,6 @@ extern void set_debug_options(int debug_flag,
 extern bool set_plan_disabling_options(const char *arg,
 						   GucContext context, GucSource source);
 extern const char *get_stats_option_name(const char *arg);
+extern void exec_parallel_scan(ParallelScanStmt *parallelscan);
 
 #endif   /* TCOPPROT_H */
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index cf319af..38855e5 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -85,6 +85,7 @@ enum config_group
 	STATS_MONITORING,
 	STATS_COLLECTOR,
 	AUTOVACUUM,
+	PARALLEL_QUERY,
 	CLIENT_CONN,
 	CLIENT_CONN_STATEMENT,
 	CLIENT_CONN_LOCALE,
#172Andres Freund
andres@2ndquadrant.com
In reply to: Robert Haas (#169)
Re: Parallel Seq Scan

On 2015-02-11 15:49:17 -0500, Robert Haas wrote:

On Tue, Feb 10, 2015 at 3:56 PM, Andres Freund <andres@2ndquadrant.com> wrote:

On Tue, Feb 10, 2015 at 9:08 AM, Andres Freund <andres@2ndquadrant.com> wrote:

And good chunk sizes et al depend on higher layers,
selectivity estimates and such. And that's planner/executor work, not
the physical layer (which heapam.c pretty much is).

If it's true that a good chunk size depends on the higher layers, then
that would be a good argument for doing this differently, or at least
exposing an API for the higher layers to tell heapam.c what chunk size
they want. I hadn't considered that possibility - can you elaborate
on why you think we might want to vary the chunk size?

Because things like chunk size depend on the shape of the entire
plan. If you have a 1TB table and want to sequentially scan it in
parallel with 10 workers you better use some rather large chunks. That
way readahead will be efficient in a cpu/socket local manner,
i.e. directly reading in the pages into the directly connected memory of
that cpu. Important for performance on a NUMA system, otherwise you'll
constantly have everything go over the shared bus. But if you instead
have a plan where the sequential scan goes over a 1GB table, perhaps
with some relatively expensive filters, you'll really want a small
chunk size to avoid waiting.

I see. That makes sense.

The chunk size will also really depend on
what other nodes are doing, at least if they can run in the same worker.

Example?

A query whose runtime is dominated by a sequential scan (+ attached
filter) is certainly going to require a bigger prefetch size than one
that does other expensive stuff.

Imagine parallelizing
SELECT * FROM largetable WHERE col = low_cardinality_value;
and
SELECT *
FROM largetable JOIN gigantic_table ON (index_nestloop_condition)
WHERE col = high_cardinality_value;

The first query will be a simple sequential scan and disk reads on largetable
will be the major cost of executing it. In contrast the second query
might very well sensibly be planned as a parallel sequential scan with
the nested loop executing in the same worker. But the cost of the
sequential scan itself will likely be completely drowned out by the
nestloop execution - index probes are expensive/unpredictable.

My guess is that the batch size will have to be computed based on the
fraction of the total cost that the parallelized work accounts for.
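
To make that guess concrete, here is a minimal sketch (not part of the patch) of deriving a chunk size from the share of the subtree's cost that the sequential scan itself accounts for. The function name, the page bounds and the linear interpolation are all assumptions picked for illustration; sensible values would need experiments.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bounds, in pages; not taken from the patch. */
#define SKETCH_MIN_CHUNK_PAGES	1
#define SKETCH_MAX_CHUNK_PAGES	1025

/*
 * Hypothetical helper: pick a chunk size that grows with the fraction of
 * the parallelized subtree's cost spent in the sequential scan itself.
 * A scan-dominated plan gets large chunks; a plan dominated by other
 * work (e.g. nestloop index probes) gets small ones.
 */
static uint32_t
chunk_pages_for_cost(double scan_cost, double total_cost)
{
	double		fraction;

	assert(scan_cost >= 0.0 && total_cost > 0.0);

	fraction = scan_cost / total_cost;
	if (fraction > 1.0)
		fraction = 1.0;			/* guard against inconsistent estimates */

	/* Linear interpolation between the two assumed bounds. */
	return (uint32_t) (SKETCH_MIN_CHUNK_PAGES +
					   (SKETCH_MAX_CHUNK_PAGES - SKETCH_MIN_CHUNK_PAGES) * fraction);
}
```

With these assumed bounds, the first example query (scan cost close to the total) would land near the upper bound, while the nestloop query would land near one page.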

Even without things like NUMA and readahead I'm pretty sure that you'll
want a chunk size a good bit above one page. The locks we acquire for
the buffercache lookup and for reading the page are already quite bad
for performance/scalability; even if we don't always/often hit the same
lock. Making 20 processes that scan pages in parallel acquire yet
another lock (that's shared between all of them!) for every single page
won't be fun, especially without filters or with fast ones.

This is possible, but I'm skeptical. If the amount of other work we
have to do for that page is so little that the additional spinlock cycle
per page causes meaningful contention, I doubt we should be
parallelizing in the first place.

It's easy to see contention of buffer mapping (many workloads), buffer
content and buffer header (especially btree roots and small foreign key
target tables) locks. And for most of them we already avoid acquiring
the same spinlock in all backends.

Right now to process a page in a sequential scan we acquire a
nonblocking buffer mapping lock (which doesn't use a spinlock anymore
*because* it proved to be a bottleneck), a nonblocking content lock and
the buffer header spinlock. All of those are essentially partitioned -
another spinlock shared between all workers will show up.

As pointed out above (moved there after reading the patch...) I don't
think a chunk size of 1 or any other constant size can make sense. I
don't even believe it'll necessarily be constant across an entire query
execution (big initially, small at the end). Now, we could move
determining that before the query execution into executor
initialization, but then we won't yet know how many workers we're going
to get. We could add a function setting that at runtime, but that'd mix
up responsibilities quite a bit.

I still think this belongs in heapam.c somehow or other. If the logic
is all in the executor, then it becomes impossible for any code that
doesn't use the executor to do a parallel heap scan, and that's
probably bad. It's not hard to imagine something like CLUSTER wanting
to reuse that code, and that won't be possible if the logic is up in
some higher layer.

Yea.

If the logic we want is to start with a large chunk size and then
switch to a small chunk size when there's not much of the relation
left to scan, there's still no reason that can't be encapsulated in
heapam.c.
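
As a rough illustration of that idea (a sketch under assumed constants, not code from the patch), the chunk size could be derived from the number of pages still unscanned, so that it is large at the start and shrinks toward the end of the relation. The names next_chunk_size, SKETCH_MAX_CHUNK and SKETCH_CHUNKS_PER_WORKER are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical tuning knobs; not taken from the patch. */
#define SKETCH_MAX_CHUNK			1024	/* upper bound, in pages */
#define SKETCH_CHUNKS_PER_WORKER	4		/* chunks each worker should still get */

/*
 * Hypothetical helper: return how many pages the next requesting worker
 * should be handed, given how much of the relation is left.  Returns 0
 * once the scan is exhausted.
 */
static uint32_t
next_chunk_size(uint32_t remaining_pages, uint32_t nworkers)
{
	uint64_t	chunk;

	assert(nworkers > 0);

	if (remaining_pages == 0)
		return 0;

	/* Leave enough work that every worker can come back a few more times. */
	chunk = remaining_pages / ((uint64_t) nworkers * SKETCH_CHUNKS_PER_WORKER);

	if (chunk > SKETCH_MAX_CHUNK)
		chunk = SKETCH_MAX_CHUNK;	/* big chunks early in the scan */
	if (chunk < 1)
		chunk = 1;					/* single pages near the end */

	return (uint32_t) chunk;
}
```

Nothing here depends on the executor, so a caller such as CLUSTER could in principle use the same policy; whether the policy itself should live in heapam.c or be supplied by the caller is exactly the open question in this thread.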

I don't mind having some logic in there, but I think you put in too
much. The snapshot stuff should imo go, and the next page logic should
be caller provided.

Btw, using an atomic uint32 you'd end up without the spinlock and just
about the same amount of code... Just do an atomic_fetch_add_until32(var,
1, InvalidBlockNumber)... ;)

I thought of that, but I think there's an overflow hazard.

That's why I said atomic_fetch_add_until32 - which can't overflow ;). I
now remember that that was actually pulled on Heikki's request from the
committed patch until a user shows up, but we can easily add it
back. compare/exchange makes such things simple luckily.
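
For reference, a saturating fetch-and-add of this kind can be built from a compare/exchange loop roughly as follows. This is a minimal sketch using C11 <stdatomic.h> rather than PostgreSQL's own atomics layer, and fetch_add_until_u32 is a hypothetical name, not an existing function.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/*
 * Hypothetical helper: atomically add "add" to *ptr, but never let the
 * stored value exceed "until".  Returns the value seen before the update,
 * so a returned value equal to "until" means nothing was left to hand out.
 */
static uint32_t
fetch_add_until_u32(_Atomic uint32_t *ptr, uint32_t add, uint32_t until)
{
	uint32_t	old = atomic_load(ptr);

	for (;;)
	{
		uint32_t	newval;

		if (old >= until)
			return old;			/* already saturated */

		/* Saturate instead of wrapping; until - old cannot underflow here. */
		newval = (until - old < add) ? until : old + add;

		/* On failure "old" is refreshed with the current value; retry. */
		if (atomic_compare_exchange_weak(ptr, &old, newval))
			return old;
	}
}
```

A scan would call this with add = 1 (or a chunk size) and until set to the number of blocks, or to InvalidBlockNumber as suggested above, and treat a return value equal to the limit as "no more blocks". Because the counter never passes the limit, it cannot wrap around.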

To me, given the existing executor code, it seems easiest to achieve
that by having the ParallelismDrivingNode above having a dynamic number
of nestloop children in different backends and point the coordinated
seqscan to some shared state. As you point out, the number of these
children cannot be known for certain (just targeted for) at plan time;
that puts a certain limit on how independent they are. But since a
large number of them can be independent between workers it seems awkward
to generally treat them as being the same node across workers. But maybe
that's just an issue with my mental model.

I think it makes sense to think of a set of tasks in which workers can
assist. So you have a query tree which is just one query tree, with no
copies of the nodes, and then there are certain places in that query
tree where a worker can jump in and assist that node. To do that, it
will have a copy of the node, but that doesn't mean that all of the
stuff inside the node becomes shared data at the code level, because
that would be stupid.

My only "problem" with that description is that I think workers will
have to work on more than one node - it'll be entire subtrees of the
executor tree.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#173Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#172)
Re: Parallel Seq Scan

On Tue, Feb 17, 2015 at 9:52 PM, Andres Freund <andres@2ndquadrant.com>
wrote:

On 2015-02-11 15:49:17 -0500, Robert Haas wrote:

A query whose runtime is dominated by a sequential scan (+ attached
filter) is certainly going to require a bigger prefetch size than one
that does other expensive stuff.

Imagine parallelizing
SELECT * FROM largetable WHERE col = low_cardinality_value;
and
SELECT *
FROM largetable JOIN gigantic_table ON (index_nestloop_condition)
WHERE col = high_cardinality_value;

The first query will be a simple sequential scan and disk reads on largetable
will be the major cost of executing it. In contrast the second query
might very well sensibly be planned as a parallel sequential scan with
the nested loop executing in the same worker. But the cost of the
sequential scan itself will likely be completely drowned out by the
nestloop execution - index probes are expensive/unpredictable.

I think the work/task given to each worker should be as granular
as possible to make it more predictable.
I think the better way to parallelize such work (a join query) is for
the first worker to do the sequential scan and filtering on the large
table and then pass the result to the next worker, which does the join
with gigantic_table.

I think it makes sense to think of a set of tasks in which workers can
assist. So you have a query tree which is just one query tree, with no
copies of the nodes, and then there are certain places in that query
tree where a worker can jump in and assist that node. To do that, it
will have a copy of the node, but that doesn't mean that all of the
stuff inside the node becomes shared data at the code level, because
that would be stupid.

My only "problem" with that description is that I think workers will
have to work on more than one node - it'll be entire subtrees of the
executor tree.

There could be some cases where it is beneficial for a worker
to process a sub-tree, but I think there will be more cases where
it will just work on part of a node and send the result back to either
the master backend or another worker for further processing.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#174Andres Freund
andres@2ndquadrant.com
In reply to: Amit Kapila (#173)
Re: Parallel Seq Scan

On 2015-02-18 16:59:26 +0530, Amit Kapila wrote:

On Tue, Feb 17, 2015 at 9:52 PM, Andres Freund <andres@2ndquadrant.com>
wrote:

A query whose runtime is dominated by a sequential scan (+ attached
filter) is certainly going to require a bigger prefetch size than one
that does other expensive stuff.

Imagine parallelizing
SELECT * FROM largetable WHERE col = low_cardinality_value;
and
SELECT *
FROM largetable JOIN gigantic_table ON (index_nestloop_condition)
WHERE col = high_cardinality_value;

The first query will be a simple sequential scan, and disk reads on largetable
will be the major cost of executing it. In contrast the second query
might very well sensibly be planned as a parallel sequential scan with
the nested loop executing in the same worker. But the cost of the
sequential scan itself will likely be completely drowned out by the
nestloop execution - index probes are expensive/unpredictable.

I think the work/task given to each worker should be as granular
as possible to make it more predictable.
I think the better way to parallelize such work (a join query) is for
the first worker to do the sequential scan and filtering on the large table
and then pass the result to the next worker to join with gigantic_table.

I'm pretty sure that'll result in rather horrible performance. IPC is
rather expensive, you want to do as little of it as possible.

I think it makes sense to think of a set of tasks in which workers can
assist. So you have a query tree which is just one query tree, with no
copies of the nodes, and then there are certain places in that query
tree where a worker can jump in and assist that node. To do that, it
will have a copy of the node, but that doesn't mean that all of the
stuff inside the node becomes shared data at the code level, because
that would be stupid.

My only "problem" with that description is that I think workers will
have to work on more than one node - it'll be entire subtrees of the
executor tree.

There could be some cases where it could be beneficial for worker
to process a sub-tree, but I think there will be more cases where
it will just work on a part of node and send the result back to either
master backend or another worker for further processing.

I think many parallelism projects start out that way, and then notice
that it doesn't parallelize very efficiently.

The most extreme example, but common, is aggregation over large amounts
of data - unless you want to ship huge amounts of data between processes
to parallelize it, you have to do the sequential scan and the
pre-aggregate step (that e.g. selects count() and sum() to implement an
avg over all the workers) inside one worker.
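
The count()/sum() pre-aggregation described above can be sketched in standalone C. This is only an illustration, not PostgreSQL code: the names PartialAvg, partial_avg_accumulate, partial_avg_combine and partial_avg_final are hypothetical. The point is that each worker reduces its share of rows to a single (count, sum) pair, so only that pair has to cross the process boundary.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-worker transition state for avg(): a count and a sum. */
typedef struct PartialAvg
{
    long    count;
    double  sum;
} PartialAvg;

/* Worker side: fold one chunk of scanned values into a partial state. */
PartialAvg
partial_avg_accumulate(const double *values, size_t n)
{
    PartialAvg  st = {0, 0.0};
    size_t      i;

    assert(values != NULL || n == 0);
    for (i = 0; i < n; i++)
    {
        st.count++;
        st.sum += values[i];
    }
    return st;
}

/* Master side: merge one worker's partial state into the running total. */
void
partial_avg_combine(PartialAvg *total, const PartialAvg *part)
{
    total->count += part->count;
    total->sum += part->sum;
}

/* Master side: final average; 0.0 stands in for SQL NULL in this sketch. */
double
partial_avg_final(const PartialAvg *total)
{
    return total->count > 0 ? total->sum / (double) total->count : 0.0;
}
```

With this split, the amount of data shipped per worker is constant regardless of how many rows the worker scanned.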

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#175Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#174)
Re: Parallel Seq Scan

On Wed, Feb 18, 2015 at 6:44 PM, Andres Freund <andres@2ndquadrant.com>
wrote:

On 2015-02-18 16:59:26 +0530, Amit Kapila wrote:

There could be some cases where it could be beneficial for worker
to process a sub-tree, but I think there will be more cases where
it will just work on a part of node and send the result back to either
master backend or another worker for further processing.

I think many parallelism projects start out that way, and then notice
that it doesn't parallelize very efficiently.

The most extreme example, but common, is aggregation over large amounts
of data - unless you want to ship huge amounts of data between processes
to parallelize it, you have to do the sequential scan and the
pre-aggregate step (that e.g. selects count() and sum() to implement an
avg over all the workers) inside one worker.

OTOH if someone wants to parallelize scan (including expensive qual) and
sort then it will be better to perform scan (or part of scan by one worker)
and sort by other worker.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#176Haribabu Kommi
kommi.haribabu@gmail.com
In reply to: Amit Kapila (#175)
Re: Parallel Seq Scan

On Sat, Feb 21, 2015 at 12:57 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Feb 18, 2015 at 6:44 PM, Andres Freund <andres@2ndquadrant.com>
wrote:

On 2015-02-18 16:59:26 +0530, Amit Kapila wrote:

There could be some cases where it could be beneficial for worker
to process a sub-tree, but I think there will be more cases where
it will just work on a part of node and send the result back to either
master backend or another worker for further processing.

I think many parallelism projects start out that way, and then notice
that it doesn't parallelize very efficiently.

The most extreme example, but common, is aggregation over large amounts
of data - unless you want to ship huge amounts of data between processes
to parallelize it, you have to do the sequential scan and the
pre-aggregate step (that e.g. selects count() and sum() to implement an
avg over all the workers) inside one worker.

OTOH if someone wants to parallelize scan (including expensive qual) and
sort then it will be better to perform scan (or part of scan by one worker)
and sort by other worker.

There is a performance problem if we perform the SCAN in one worker
and the SORT in another worker, because each tuple then has to be
transferred twice between worker and worker/backend, which is costly.
It is better to combine the SCAN and SORT operations into a single
worker job. This can be targeted once the parallel scan code is stable.

Regards,
Hari Babu
Fujitsu Australia

#177Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#172)
Re: Parallel Seq Scan

On Tue, Feb 17, 2015 at 11:22 AM, Andres Freund <andres@2ndquadrant.com> wrote:

I still think this belongs in heapam.c somehow or other. If the logic
is all in the executor, then it becomes impossible for any code that
doesn't use the executor to do a parallel heap scan, and that's
probably bad. It's not hard to imagine something like CLUSTER wanting
to reuse that code, and that won't be possible if the logic is up in
some higher layer.

Yea.

If the logic we want is to start with a large chunk size and then
switch to a small chunk size when there's not much of the relation
left to scan, there's still no reason that can't be encapsulated in
heapam.c.

I don't mind having some logic in there, but I think you put in too
much. The snapshot stuff should imo go, and the next page logic should
be caller provided.

If we need to provide a way for the caller to provide the next-page
logic, then I think that should be done via configuration arguments or
flags, not a callback. There's just no way that the needs of the
executor are going to be so radically different from a utility command
that only a callback will do.
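
As a rough illustration of driving the next-page logic through configuration arguments rather than a callback, here is a minimal standalone C sketch. ChunkPolicy, ChunkScan and chunk_scan_next are hypothetical names, not part of heapam.c or of any posted patch, and the synchronization a shared scan position would need is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical knobs a caller could pass instead of a callback. */
typedef struct ChunkPolicy
{
    uint32_t    large_chunk;    /* blocks handed out per request early on */
    uint32_t    small_chunk;    /* blocks handed out near the end */
    uint32_t    tail_blocks;    /* use small_chunk once this few remain */
} ChunkPolicy;

/* Hypothetical scan position; a real parallel scan would share this. */
typedef struct ChunkScan
{
    uint32_t    next_block;
    uint32_t    nblocks;
} ChunkScan;

/*
 * Hand out the next range of blocks.  Returns the number of blocks in the
 * range (0 when the scan is finished) and stores its first block in *start.
 */
uint32_t
chunk_scan_next(ChunkScan *scan, const ChunkPolicy *policy, uint32_t *start)
{
    uint32_t    remaining;
    uint32_t    want;

    assert(scan->next_block <= scan->nblocks);
    remaining = scan->nblocks - scan->next_block;
    if (remaining == 0)
        return 0;

    want = (remaining <= policy->tail_blocks) ? policy->small_chunk
                                              : policy->large_chunk;
    if (want == 0)
        want = 1;               /* guard against a zero chunk size */
    if (want > remaining)
        want = remaining;

    *start = scan->next_block;
    scan->next_block += want;
    return want;
}
```

An executor node and a utility command could then differ only in the ChunkPolicy values they pass.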

I think it makes sense to think of a set of tasks in which workers can
assist. So you have a query tree which is just one query tree, with no
copies of the nodes, and then there are certain places in that query
tree where a worker can jump in and assist that node. To do that, it
will have a copy of the node, but that doesn't mean that all of the
stuff inside the node becomes shared data at the code level, because
that would be stupid.

My only "problem" with that description is that I think workers will
have to work on more than one node - it'll be entire subtrees of the
executor tree.

Amit and I had a long discussion about this on Friday while in Boston
together. I previously argued that the master and the slave should be
executing the same node, ParallelSeqScan. However, Amit argued
persuasively that what the master is doing is really pretty different
from what the worker is doing, and that they really ought to be
running two different nodes. This led us to cast about for a better
design, and we came up with something that I think will be much
better.

The basic idea is to introduce a new node called Funnel. A Funnel
node will have a left child but no right child, and its job will be to
fire up a given number of workers. Each worker will execute the plan
which is the left child of the funnel. The funnel node itself will
pull tuples from all of those workers, and can also (if there are no
tuples available from any worker) execute the plan itself. So a
parallel sequential scan will look something like this:

Funnel
Workers: 4
-> Partial Heap Scan on xyz

What this is saying is that each worker is going to scan part of the
heap for xyz; together, they will scan the whole thing.
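
The control flow described for the Funnel node can be sketched in standalone C as follows. This is an assumption-laden toy, not the patch's implementation: WorkerQueue, FunnelSketch and funnel_next are hypothetical names, tuples are plain ints, and the worker queues are modelled as pre-filled arrays, whereas real workers would fill their queues concurrently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one worker's tuple queue. */
typedef struct WorkerQueue
{
    const int  *tuples;         /* tuples already produced by the worker */
    size_t      ntuples;
    size_t      next;           /* next tuple to hand out */
} WorkerQueue;

/* Hypothetical funnel state: worker queues plus the master's own scan. */
typedef struct FunnelSketch
{
    WorkerQueue *queues;
    size_t      nqueues;
    size_t      cursor;         /* round-robin position */
    const int  *local_tuples;   /* the master's share of the scan */
    size_t      nlocal;
    size_t      local_next;
} FunnelSketch;

/*
 * Fetch the next tuple.  Worker queues are polled first, round-robin; the
 * master runs its own copy of the plan only when no worker has a tuple
 * ready.  Returns false once every source is exhausted.
 */
bool
funnel_next(FunnelSketch *f, int *out)
{
    size_t      i;

    assert(f != NULL && out != NULL);
    for (i = 0; i < f->nqueues; i++)
    {
        WorkerQueue *q = &f->queues[(f->cursor + i) % f->nqueues];

        if (q->next < q->ntuples)
        {
            *out = q->tuples[q->next++];
            f->cursor = (f->cursor + i + 1) % f->nqueues;
            return true;
        }
    }

    if (f->local_next < f->nlocal)
    {
        *out = f->local_tuples[f->local_next++];
        return true;
    }
    return false;
}
```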

The neat thing about this way of separating things out is that we can
eventually write code to push more stuff down into the funnel. For
example, consider this:

Nested Loop
-> Seq Scan on foo
-> Index Scan on bar
Index Cond: bar.x = foo.x

Now, if a parallel sequential scan is cheaper than a regular
sequential scan, we can instead do this:

Nested Loop
-> Funnel
-> Partial Heap Scan on foo
-> Index Scan on bar
Index Cond: bar.x = foo.x

The problem with this is that the nested loop/index scan is happening
entirely in the master. But we can have logic that fixes that by
knowing that a nested loop can be pushed through a funnel, yielding
this:

Funnel
-> Nested Loop
-> Partial Heap Scan on foo
-> Index Scan on bar
Index Cond: bar.x = foo.x

Now that's pretty neat. One can also imagine doing this with
aggregates. Consider:

HashAggregate
-> Funnel
-> Partial Heap Scan on foo
Filter: x = 1

Here, we can't just push the HashAggregate through the filter, but
given infrastructure for it, we could convert that to something like this:

HashAggregateFinish
-> Funnel
-> HashAggregatePartial
-> Partial Heap Scan on foo
Filter: x = 1

That'd be swell.

You can see that something like this will also work for breaking off
an entire plan tree and shoving it down into a worker. The master
can't participate in the computation in that case, but it's otherwise
the same idea.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#178Kohei KaiGai
kaigai@kaigai.gr.jp
In reply to: Robert Haas (#177)
Re: Parallel Seq Scan

Amit and I had a long discussion about this on Friday while in Boston
together. I previously argued that the master and the slave should be
executing the same node, ParallelSeqScan. However, Amit argued
persuasively that what the master is doing is really pretty different
from what the worker is doing, and that they really ought to be
running two different nodes. This led us to cast about for a better
design, and we came up with something that I think will be much
better.

The basic idea is to introduce a new node called Funnel. A Funnel
node will have a left child but no right child, and its job will be to
fire up a given number of workers. Each worker will execute the plan
which is the left child of the funnel. The funnel node itself will
pull tuples from all of those workers, and can also (if there are no
tuples available from any worker) execute the plan itself. So a
parallel sequential scan will look something like this:

Funnel
Workers: 4
-> Partial Heap Scan on xyz

What this is saying is that each worker is going to scan part of the
heap for xyz; together, they will scan the whole thing.

What is the best way to determine the number of workers?
A fixed number is one idea. It may also make sense to add a new common field
to the Path node to indicate what portion of the node's execution can be
parallelized, or that it cannot run in parallel at all.
Rather than at plan time, we may be able to determine the number according to
the number of concurrent workers and the number of CPU cores.
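
One possible reading of that last idea, as a minimal standalone C sketch: cap the requested worker count by the cores not already occupied by other workers. The function choose_worker_count and its parameters are hypothetical, and the heuristic itself is an assumption rather than anything proposed in the thread.

```c
#include <assert.h>

/*
 * Hypothetical heuristic: never use more workers than the plan asks for,
 * and never more than the cores not already occupied by other workers.
 * Returns 0 when the scan should fall back to running without workers.
 */
int
choose_worker_count(int planned_workers, int cpu_cores, int busy_workers)
{
    int     free_cores;

    assert(busy_workers >= 0);
    free_cores = cpu_cores - busy_workers;

    if (planned_workers <= 0 || free_cores <= 0)
        return 0;
    return planned_workers < free_cores ? planned_workers : free_cores;
}
```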

The neat thing about this way of separating things out is that we can
eventually write code to push more stuff down into the funnel. For
example, consider this:

Nested Loop
-> Seq Scan on foo
-> Index Scan on bar
Index Cond: bar.x = foo.x

Now, if a parallel sequential scan is cheaper than a regular
sequential scan, we can instead do this:

Nested Loop
-> Funnel
-> Partial Heap Scan on foo
-> Index Scan on bar
Index Cond: bar.x = foo.x

The problem with this is that the nested loop/index scan is happening
entirely in the master. But we can have logic that fixes that by
knowing that a nested loop can be pushed through a funnel, yielding
this:

Funnel
-> Nested Loop
-> Partial Heap Scan on foo
-> Index Scan on bar
Index Cond: bar.x = foo.x

Now that's pretty neat. One can also imagine doing this with
aggregates. Consider:

I guess the planner enhancement would live around add_paths_to_joinrel().
When any of the underlying join paths supports multi-node execution,
the new code would add a Funnel node on top of those join paths. Just my thought.

HashAggregate
-> Funnel
-> Partial Heap Scan on foo
Filter: x = 1

Here, we can't just push the HashAggregate through the filter, but
given infrastructure for it, we could convert that to something like this:

HashAggregateFinish
-> Funnel
-> HashAggregatePartial
-> Partial Heap Scan on foo
Filter: x = 1

That'd be swell.

You can see that something like this will also work for breaking off
an entire plan tree and shoving it down into a worker. The master
can't participate in the computation in that case, but it's otherwise
the same idea.

I believe the entire vision we've discussed in the combining aggregate
function thread is covered above, although people primarily consider applying
this feature to aggregate push-down across joins.

One key piece of infrastructure may be the capability to define a combining
function for aggregates. It informs the planner that a given aggregate supports
two-stage execution. In addition to this, we may need a planner enhancement
to inject the partial aggregate node during path construction.

Probably, we have to set a flag to inform the later stage (which will construct
the Agg plan) that the underlying scan/join node performs partial aggregation,
so the final aggregation has to expect state data instead of the usual
row-by-row arguments.

Also, I think a HashJoin with a very large outer relation and a much smaller
inner relation is a good candidate to distribute across multiple nodes.
Even if a multi-node HashJoin has to read the small inner relation N times,
splitting the very large outer relation will yield a gain.

Thanks,
--
KaiGai Kohei <kaigai@kaigai.gr.jp>

#179Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#177)
1 attachment(s)
Re: Parallel Seq Scan

On Sun, Feb 22, 2015 at 6:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Feb 17, 2015 at 11:22 AM, Andres Freund <andres@2ndquadrant.com>
wrote:

My only "problem" with that description is that I think workers will
have to work on more than one node - it'll be entire subtrees of the
executor tree.

Amit and I had a long discussion about this on Friday while in Boston
together. I previously argued that the master and the slave should be
executing the same node, ParallelSeqScan. However, Amit argued
persuasively that what the master is doing is really pretty different
from what the worker is doing, and that they really ought to be
running two different nodes. This led us to cast about for a better
design, and we came up with something that I think will be much
better.

The basic idea is to introduce a new node called Funnel. A Funnel
node will have a left child but no right child, and its job will be to
fire up a given number of workers. Each worker will execute the plan
which is the left child of the funnel. The funnel node itself will
pull tuples from all of those workers, and can also (if there are no
tuples available from any worker) execute the plan itself.

I have modified the patch to introduce a Funnel node (with a PartialSeqScan
node as its left child). Apart from that, some other noticeable
changes based on feedback include:
a) The master backend forms the planned stmt and sends it to each worker;
the earlier patch used to send individual elements and form the planned
stmt in each worker.
b) Tuples are passed directly via the tuple queue instead of going via
the FE-BE protocol.
c) Removed the restriction on expressions in the target list.
d) Introduced a parallelmodeneeded flag in the plannerglobal structure
and set it for the Funnel plan.

There is still some work left like integrating with
access-parallel-safety patch (use parallelmodeok flag to decide
whether parallel path can be generated, Enter/Exit parallel mode is still
done during execution of funnel node).

I think these are minor points which can be fixed once we decide
on the other major parts of patch. Find modified patch attached with
this mail.

Note -
This patch is based on Head (commit-id: d1479011) +
parallel-mode-v6.patch [1] + parallel-heap-scan.patch [2]

[1]: /messages/by-id/CA+TgmobCMwFOz-9=hFv=hJ4SH7p=5X6Ga5V=WtT8=huzE6C+Mg@mail.gmail.com
[2]: /messages/by-id/CA+TgmoYJETgeAXUsZROnA7BdtWzPtqExPJNTV1GKcaVMgSdhug@mail.gmail.com

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

parallel_seqscan_v8.patch (application/octet-stream)
diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index 21721b4..823d5c3 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/access
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS	    = brin common gin gist hash heap index nbtree rmgrdesc spgist transam
+SUBDIRS	    = brin common gin gist hash heap index nbtree rmgrdesc shmmq spgist transam
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/shmmq/Makefile b/src/backend/access/shmmq/Makefile
new file mode 100644
index 0000000..aeae8d9
--- /dev/null
+++ b/src/backend/access/shmmq/Makefile
@@ -0,0 +1,17 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for access/shmmq
+#
+# IDENTIFICATION
+#    src/backend/access/shmmq/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/shmmq
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = shmmqam.o 
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/shmmq/shmmqam.c b/src/backend/access/shmmq/shmmqam.c
new file mode 100644
index 0000000..d8bd596
--- /dev/null
+++ b/src/backend/access/shmmq/shmmqam.c
@@ -0,0 +1,92 @@
+/*-------------------------------------------------------------------------
+ *
+ * shmmqam.c
+ *	  shared memory queue access method code
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/shmmq/shmmqam.c
+ *
+ *
+ * INTERFACE ROUTINES
+ *		shm_getnext	- retrieve next tuple in queue
+ *
+ * NOTES
+ *	  This file contains the shmmq_ routines which implement
+ *	  the POSTGRES shared memory access method used for all POSTGRES
+ *	  relations.
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup.h"
+#include "access/htup_details.h"
+#include "access/shmmqam.h"
+//#include "access/tupdesc.h"
+#include "fmgr.h"
+#include "libpq/libpq.h"
+#include "libpq/pqformat.h"
+#include "utils/lsyscache.h"
+
+
+
+/*
+ * ExecInitWorkerResult
+ *
+ * Initializes the result state to retrieve tuples from worker backends. 
+ */
+worker_result
+ExecInitWorkerResult(void)
+{
+	worker_result	workerResult;
+
+	workerResult = palloc0(sizeof(worker_result_state));
+
+	return workerResult;
+}
+
+/*
+ * shm_getnext
+ *
+ *	Get the next tuple from shared memory queue.  This function
+ *	is responsible for fetching tuples from all the queues associated
+ *	with worker backends used in parallel sequential scan.
+ */
+HeapTuple
+shm_getnext(HeapScanDesc scanDesc, worker_result resultState,
+			TupleQueueFunnel *funnel, ScanDirection direction,
+			bool *fromheap)
+{
+	HeapTuple	tup;
+
+	while (!resultState->all_workers_done || !resultState->local_scan_done)
+	{
+		if (!resultState->all_workers_done)
+		{
+			/* wait only if local scan is done */
+			tup = TupleQueueFunnelNext(funnel, !resultState->local_scan_done,
+									   &resultState->all_workers_done);
+			if (HeapTupleIsValid(tup))
+			{
+				*fromheap = false;
+				return tup;
+			}
+		}
+		if (!resultState->local_scan_done)
+		{
+			tup = heap_getnext(scanDesc, direction);
+			if (HeapTupleIsValid(tup))
+			{
+				*fromheap = true;
+				return tup;
+			}
+			resultState->local_scan_done = true;
+		}
+	}
+
+	return NULL;
+}
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index a951c55..8410afa 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -721,6 +721,7 @@ ExplainPreScanNode(PlanState *planstate, Bitmapset **rels_used)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_Funnel:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -916,6 +917,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_SeqScan:
 			pname = sname = "Seq Scan";
 			break;
+		case T_Funnel:
+			pname = sname = "Funnel";
+			break;
 		case T_IndexScan:
 			pname = sname = "Index Scan";
 			break;
@@ -1065,6 +1069,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_Funnel:
 		case T_BitmapHeapScan:
 		case T_TidScan:
 		case T_SubqueryScan:
@@ -1206,6 +1211,24 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	}
 
 	/*
+	 * Aggregate instrumentation information of all the backend
+	 * workers for parallel sequence scan.
+	 */
+	if (es->analyze && nodeTag(plan) == T_Funnel)
+	{
+		int i;
+		Instrumentation *instrument_worker;
+		int nworkers = ((FunnelState *)planstate)->pcxt->nworkers;
+		char *inst_info_workers = ((FunnelState *)planstate)->inst_options_space;
+
+		for (i = 0; i < nworkers; i++)
+		{
+			instrument_worker = (Instrumentation *)(inst_info_workers + (i * sizeof(Instrumentation)));
+			InstrAggNode(planstate->instrument, instrument_worker);
+		}
+	}
+
+	/*
 	 * We have to forcibly clean up the instrumentation state because we
 	 * haven't done ExecutorEnd yet.  This is pretty grotty ...
 	 *
@@ -1331,6 +1354,14 @@ ExplainNode(PlanState *planstate, List *ancestors,
 				show_instrumentation_count("Rows Removed by Filter", 1,
 										   planstate, es);
 			break;
+		case T_Funnel:
+			show_scan_qual(plan->qual, "Filter", planstate, ancestors, es);
+			if (plan->qual)
+				show_instrumentation_count("Rows Removed by Filter", 1,
+										   planstate, es);
+			ExplainPropertyInteger("Number of Workers",
+				((Funnel *) plan)->num_workers, es);
+			break;
 		case T_FunctionScan:
 			if (es->verbose)
 			{
@@ -2214,6 +2245,7 @@ ExplainTargetRel(Plan *plan, Index rti, ExplainState *es)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_Funnel:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index af707b0..991ff51 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -16,14 +16,15 @@ OBJS = execAmi.o execCurrent.o execGrouping.o execJunk.o execMain.o \
        execProcnode.o execQual.o execScan.o execTuples.o \
        execUtils.o functions.o instrument.o nodeAppend.o nodeAgg.o \
        nodeBitmapAnd.o nodeBitmapOr.o \
-       nodeBitmapHeapscan.o nodeBitmapIndexscan.o nodeCustom.o nodeHash.o \
-       nodeHashjoin.o nodeIndexscan.o nodeIndexonlyscan.o \
+       nodeBitmapHeapscan.o nodeBitmapIndexscan.o nodeCustom.o nodeFunnel.o \
+       nodeHash.o nodeHashjoin.o nodeIndexscan.o nodeIndexonlyscan.o \
        nodeLimit.o nodeLockRows.o \
        nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
        nodeNestloop.o nodeFunctionscan.o nodeRecursiveunion.o nodeResult.o \
-       nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
-       nodeValuesscan.o nodeCtescan.o nodeWorktablescan.o \
+       nodeSeqscan.o nodePartialSeqscan.o nodeSetOp.o nodeSort.o \
+       nodeUnique.o nodeValuesscan.o nodeCtescan.o nodeWorktablescan.o \
        nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
-       nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o spi.o
+       nodeForeignscan.o nodeWindowAgg.o tqueue.o tstoreReceiver.o \
+       spi.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/executor/execCurrent.c b/src/backend/executor/execCurrent.c
index 1c8be25..f13b7bc 100644
--- a/src/backend/executor/execCurrent.c
+++ b/src/backend/executor/execCurrent.c
@@ -261,6 +261,8 @@ search_plan_tree(PlanState *node, Oid table_oid)
 			 * Relation scan nodes can all be treated alike
 			 */
 		case T_SeqScanState:
+		case T_PartialSeqScanState:
+		case T_FunnelState:
 		case T_IndexScanState:
 		case T_IndexOnlyScanState:
 		case T_BitmapHeapScanState:
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 07526e8..9a3e285 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -181,6 +181,8 @@ standard_ExecutorStart(QueryDesc *queryDesc, int eflags)
 		estate->es_param_exec_vals = (ParamExecData *)
 			palloc0(queryDesc->plannedstmt->nParamExec * sizeof(ParamExecData));
 
+	estate->toc = queryDesc->toc;
+
 	/*
 	 * If non-read-only query, set the command ID to mark output tuples with
 	 */
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 9892499..1a1275c 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -100,6 +100,8 @@
 #include "executor/nodeMergejoin.h"
 #include "executor/nodeModifyTable.h"
 #include "executor/nodeNestloop.h"
+#include "executor/nodePartialSeqscan.h"
+#include "executor/nodeFunnel.h"
 #include "executor/nodeRecursiveunion.h"
 #include "executor/nodeResult.h"
 #include "executor/nodeSeqscan.h"
@@ -190,6 +192,16 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												   estate, eflags);
 			break;
 
+		case T_PartialSeqScan:
+			result = (PlanState *) ExecInitPartialSeqScan((PartialSeqScan *) node,
+														  estate, eflags);
+			break;
+
+		case T_Funnel:
+			result = (PlanState *) ExecInitFunnel((Funnel *) node,
+												  estate, eflags);
+			break;
+
 		case T_IndexScan:
 			result = (PlanState *) ExecInitIndexScan((IndexScan *) node,
 													 estate, eflags);
@@ -406,6 +418,14 @@ ExecProcNode(PlanState *node)
 			result = ExecSeqScan((SeqScanState *) node);
 			break;
 
+		case T_PartialSeqScanState:
+			result = ExecPartialSeqScan((PartialSeqScanState *) node);
+			break;
+
+		case T_FunnelState:
+			result = ExecFunnel((FunnelState *) node);
+			break;
+
 		case T_IndexScanState:
 			result = ExecIndexScan((IndexScanState *) node);
 			break;
@@ -644,6 +664,14 @@ ExecEndNode(PlanState *node)
 			ExecEndSeqScan((SeqScanState *) node);
 			break;
 
+		case T_PartialSeqScanState:
+			ExecEndPartialSeqScan((PartialSeqScanState *) node);
+			break;
+
+		case T_FunnelState:
+			ExecEndFunnel((FunnelState *) node);
+			break;
+
 		case T_IndexScanState:
 			ExecEndIndexScan((IndexScanState *) node);
 			break;
diff --git a/src/backend/executor/execScan.c b/src/backend/executor/execScan.c
index 3f0d809..caf9855 100644
--- a/src/backend/executor/execScan.c
+++ b/src/backend/executor/execScan.c
@@ -191,13 +191,20 @@ ExecScan(ScanState *node,
 		 * check for non-nil qual here to avoid a function call to ExecQual()
 		 * when the qual is nil ... saves only a few cycles, but they add up
 		 * ...
+		 *
+		 * check for non-heap tuples (can get such tuples from shared memory
+		 * message queues in case of parallel query), for such tuples no need
+		 * to perform qualification or projection as for them the same is done
+		 * by worker backend.  This case will happen only for parallel query
+		 * where we push down the qualification and projection (targetlist)
+		 * information.
 		 */
-		if (!qual || ExecQual(qual, econtext, false))
+		if (!slot->tts_fromheap || !qual || ExecQual(qual, econtext, false))
 		{
 			/*
 			 * Found a satisfactory scan tuple.
 			 */
-			if (projInfo)
+			if (projInfo && slot->tts_fromheap)
 			{
 				/*
 				 * Form a projection tuple, store it in the result tuple slot
@@ -211,6 +218,23 @@ ExecScan(ScanState *node,
 					return resultSlot;
 				}
 			}
+			else if (projInfo && !slot->tts_fromheap)
+			{
+				/*
+				 * Store the tuple we got from shared memory tuple queue
+				 * in projection slot as the worker backend takes care
+				 * of doing projection.  We don't need to free this tuple
+				 * as this is pointing to scan tuple slot which will take
+				 * care of freeing it.
+				 */
+				ExecStoreTuple(econtext->ecxt_scantuple->tts_tuple,	/* tuple to store */
+							   projInfo->pi_slot,	/* slot to store in */
+							   InvalidBuffer, /* buffer associated with this
+											   * tuple */
+							   false);	/* pfree this pointer */
+
+				return projInfo->pi_slot;
+			}
 			else
 			{
 				/*
diff --git a/src/backend/executor/execTuples.c b/src/backend/executor/execTuples.c
index 753754d..4c5bd88 100644
--- a/src/backend/executor/execTuples.c
+++ b/src/backend/executor/execTuples.c
@@ -123,6 +123,7 @@ MakeTupleTableSlot(void)
 	slot->tts_values = NULL;
 	slot->tts_isnull = NULL;
 	slot->tts_mintuple = NULL;
+	slot->tts_fromheap	= true;
 
 	return slot;
 }
@@ -473,6 +474,8 @@ ExecClearTuple(TupleTableSlot *slot)	/* slot in which to store tuple */
 	slot->tts_isempty = true;
 	slot->tts_nvalid = 0;
 
+	slot->tts_fromheap = true;
+
 	return slot;
 }
 
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index 022041b..79eeaee 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -145,6 +145,8 @@ CreateExecutorState(void)
 
 	estate->es_auxmodifytables = NIL;
 
+	estate->toc = NULL;
+
 	estate->es_per_tuple_exprcontext = NULL;
 
 	estate->es_epqTuple = NULL;
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index f5351eb..56e509d 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -21,6 +21,8 @@ BufferUsage pgBufferUsage;
 
 static void BufferUsageAccumDiff(BufferUsage *dst,
 					 const BufferUsage *add, const BufferUsage *sub);
+static void
+BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
 
 
 /* Allocate new instrumentation structure(s) */
@@ -127,6 +129,28 @@ InstrEndLoop(Instrumentation *instr)
 	instr->tuplecount = 0;
 }
 
+/*
+ * Aggregate the instrumentation information.  This is used
+ * to aggregate the information of worker backends.  We only
+ * need to sum the buffer usage and tuple count statistics as
+ * for other timing related statistics it is sufficient to
+ * have the master backend's information.
+ */
+void
+InstrAggNode(Instrumentation *instr1, Instrumentation *instr2)
+{
+	/* count the returned tuples */
+	instr1->tuplecount += instr2->tuplecount;
+
+	instr1->nfiltered1 += instr2->nfiltered1;
+	instr1->nfiltered2 += instr2->nfiltered2;
+
+	/* Add the worker's buffer usage to the node's totals */
+	if (instr1->need_bufusage)
+		BufferUsageAdd(&instr1->bufusage, &instr2->bufusage);
+
+}
+
 /* dst += add - sub */
 static void
 BufferUsageAccumDiff(BufferUsage *dst,
@@ -148,3 +172,21 @@ BufferUsageAccumDiff(BufferUsage *dst,
 	INSTR_TIME_ACCUM_DIFF(dst->blk_write_time,
 						  add->blk_write_time, sub->blk_write_time);
 }
+
+/* dst += add */
+static void
+BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
+{
+	dst->shared_blks_hit += add->shared_blks_hit;
+	dst->shared_blks_read += add->shared_blks_read;
+	dst->shared_blks_dirtied += add->shared_blks_dirtied;
+	dst->shared_blks_written += add->shared_blks_written;
+	dst->local_blks_hit += add->local_blks_hit;
+	dst->local_blks_read += add->local_blks_read;
+	dst->local_blks_dirtied += add->local_blks_dirtied;
+	dst->local_blks_written += add->local_blks_written;
+	dst->temp_blks_read += add->temp_blks_read;
+	dst->temp_blks_written += add->temp_blks_written;
+	INSTR_TIME_ADD(dst->blk_read_time, add->blk_read_time);
+	INSTR_TIME_ADD(dst->blk_write_time, add->blk_write_time);
+}
diff --git a/src/backend/executor/nodeFunnel.c b/src/backend/executor/nodeFunnel.c
new file mode 100644
index 0000000..71f6daa
--- /dev/null
+++ b/src/backend/executor/nodeFunnel.c
@@ -0,0 +1,301 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeFunnel.c
+ *	  Support routines for parallel sequential scans of relations.
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeFunnel.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * INTERFACE ROUTINES
+ *		ExecFunnel				scans a relation.
+ *		FunnelNext				retrieves the next tuple from either the heap or a shared memory queue.
+ *		ExecInitFunnel			creates and initializes a funnel node.
+ *		ExecEndFunnel			releases any storage allocated.
+ */
+#include "postgres.h"
+
+#include "access/relscan.h"
+#include "access/shmmqam.h"
+#include "access/xact.h"
+#include "commands/dbcommands.h"
+#include "executor/execdebug.h"
+#include "executor/nodeSeqscan.h"
+#include "executor/nodeFunnel.h"
+#include "postmaster/backendworker.h"
+#include "utils/rel.h"
+
+
+
+/* ----------------------------------------------------------------
+ *						Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		FunnelNext
+ *
+ *		This is a workhorse for ExecFunnel
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+FunnelNext(FunnelState *node)
+{
+	HeapTuple	tuple;
+	HeapScanDesc scandesc;
+	EState	   *estate;
+	ScanDirection direction;
+	TupleTableSlot *slot;
+	bool			fromheap = true;
+
+	/*
+	 * get information from the estate and scan state
+	 */
+	scandesc = node->ss.ss_currentScanDesc;
+	estate = node->ss.ps.state;
+	direction = estate->es_direction;
+	slot = node->ss.ss_ScanTupleSlot;
+
+	/*
+	 * get the next tuple, either from the heap or from a worker's tuple queue.
+	 */
+	tuple = shm_getnext(scandesc,
+						node->pss_workerResult,
+						node->funnel,
+						direction,
+						&fromheap);
+
+	slot->tts_fromheap = fromheap;
+
+	/*
+	 * save the tuple and the buffer returned to us by the access methods in
+	 * our scan tuple slot and return the slot.  Note: we pass '!fromheap'
+	 * because a tuple returned by shm_getnext() is either a pointer onto a
+	 * disk page, which must not be pfree()'d, or a palloc()'d copy read from
+	 * a worker's queue, which must be.  Note also that ExecStoreTuple will
+	 * increment the refcount of the buffer; the refcount will not be dropped
+	 * until the tuple table slot is cleared.
+	 */
+	if (tuple)
+		ExecStoreTuple(tuple,	/* tuple to store */
+					   slot,	/* slot to store in */
+					   fromheap ? scandesc->rs_cbuf : InvalidBuffer, /* buffer associated with this
+																	  * tuple */
+					   !fromheap);	/* pfree this pointer if not from heap */
+	else
+		ExecClearTuple(slot);
+
+	return slot;
+}
+
+/*
+ * FunnelRecheck -- access method routine to recheck a tuple in EvalPlanQual
+ */
+static bool
+FunnelRecheck(FunnelState *node, TupleTableSlot *slot)
+{
+	/*
+	 * Note that unlike IndexScan, Funnel never uses keys in
+	 * heap_beginscan (and this is very bad), so there are no
+	 * scan keys to recheck here.
+	 */
+	return true;
+}
+
+/* ----------------------------------------------------------------
+ *		InitFunnelRelation
+ *
+ *		Set up to access the scan relation.
+ * ----------------------------------------------------------------
+ */
+static void
+InitFunnelRelation(FunnelState *node, EState *estate, int eflags)
+{
+	Relation	currentRelation;
+	HeapScanDesc currentScanDesc;
+	ParallelHeapScanDesc pscan;
+
+	/*
+	 * get the relation object id from the relid'th entry in the range table,
+	 * open that relation and acquire appropriate lock on it.
+	 */
+	currentRelation = ExecOpenScanRelation(estate,
+										   ((SeqScan *) node->ss.ps.plan)->scanrelid,
+										   eflags);
+
+	/* Initialize the workers required to perform parallel scan. */
+	InitializeParallelWorkers(node->ss.ps.plan->lefttree,
+								estate,
+								currentRelation,
+								&node->inst_options_space,
+								&node->responseq,
+								&node->pcxt,
+								&pscan,
+								((Funnel *)(node->ss.ps.plan))->num_workers);
+
+	currentScanDesc = heap_beginscan_parallel(currentRelation, pscan);
+
+	node->ss.ss_currentRelation = currentRelation;
+	node->ss.ss_currentScanDesc = currentScanDesc;
+
+	/* and report the scan tuple slot's rowtype */
+	ExecAssignScanType(&node->ss, RelationGetDescr(currentRelation));
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitFunnel
+ * ----------------------------------------------------------------
+ */
+FunnelState *
+ExecInitFunnel(Funnel *node, EState *estate, int eflags)
+{
+	FunnelState *funnelstate;
+
+	/*
+	 * Once upon a time it was possible to have an outerPlan of a SeqScan, but
+	 * not any more.
+	 */
+	Assert(outerPlan(node) == NULL);
+	Assert(innerPlan(node) == NULL);
+
+	/*
+	 * create state structure
+	 */
+	funnelstate = makeNode(FunnelState);
+	funnelstate->ss.ps.plan = (Plan *) node;
+	funnelstate->ss.ps.state = estate;
+	funnelstate->fs_workersReady = false;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * create expression context for node
+	 */
+	ExecAssignExprContext(estate, &funnelstate->ss.ps);
+
+	/*
+	 * initialize child expressions
+	 */
+	funnelstate->ss.ps.targetlist = (List *)
+		ExecInitExpr((Expr *) node->scan.plan.targetlist,
+					 (PlanState *) funnelstate);
+	funnelstate->ss.ps.qual = (List *)
+		ExecInitExpr((Expr *) node->scan.plan.qual,
+					 (PlanState *) funnelstate);
+
+	/*
+	 * tuple table initialization
+	 */
+	ExecInitResultTupleSlot(estate, &funnelstate->ss.ps);
+	ExecInitScanTupleSlot(estate, &funnelstate->ss);
+
+	InitFunnelRelation(funnelstate, estate, eflags);
+
+	funnelstate->ss.ps.ps_TupFromTlist = false;
+
+	/*
+	 * Initialize result tuple type and projection info.
+	 */
+	ExecAssignResultTypeFromTL(&funnelstate->ss.ps);
+	ExecAssignScanProjectionInfo(&funnelstate->ss);
+
+	funnelstate->pss_workerResult = ExecInitWorkerResult();
+
+	return funnelstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecFunnel(node)
+ *
+ *		Scans the relation via multiple workers and returns
+ *		the next qualifying tuple.
+ *		We call the ExecScan() routine and pass it the appropriate
+ *		access method functions.
+ * ----------------------------------------------------------------
+ */
+TupleTableSlot *
+ExecFunnel(FunnelState *node)
+{
+	int			i;
+
+	/*
+	 * if parallel context is set and workers are not
+	 * registered, register them now.
+	 */
+	if (node->pcxt && !node->fs_workersReady)
+	{
+		/* Register backend workers. */
+		LaunchParallelWorkers(node->pcxt);
+
+		node->funnel = CreateTupleQueueFunnel();
+
+		for (i = 0; i < node->pcxt->nworkers; ++i)
+		{
+			shm_mq_set_handle((node->responseq)[i], node->pcxt->worker[i].bgwhandle);
+			RegisterTupleQueueOnFunnel(node->funnel, (node->responseq)[i]);
+		}
+
+		node->fs_workersReady = true;
+	}
+
+	return ExecScan((ScanState *) &node->ss,
+					(ExecScanAccessMtd) FunnelNext,
+					(ExecScanRecheckMtd) FunnelRecheck);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndFunnel
+ *
+ *		frees any storage allocated through C routines.
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndFunnel(FunnelState *node)
+{
+	Relation	relation;
+	HeapScanDesc scanDesc;
+
+	/*
+	 * get information from node
+	 */
+	relation = node->ss.ss_currentRelation;
+	scanDesc = node->ss.ss_currentScanDesc;
+
+	/*
+	 * Free the exprcontext
+	 */
+	ExecFreeExprContext(&node->ss.ps);
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+
+	/*
+	 * close heap scan
+	 */
+	heap_endscan(scanDesc);
+
+	/*
+	 * close the heap relation.
+	 */
+	ExecCloseScanRelation(relation);
+
+	if (node->pcxt)
+	{
+		/* destroy the tuple queue */
+		DestroyTupleQueueFunnel(node->funnel);
+
+		/* destroy parallel context. */
+		DestroyParallelContext(node->pcxt);
+
+		ExitParallelMode();
+	}
+}
diff --git a/src/backend/executor/nodePartialSeqscan.c b/src/backend/executor/nodePartialSeqscan.c
new file mode 100644
index 0000000..fb4efa3
--- /dev/null
+++ b/src/backend/executor/nodePartialSeqscan.c
@@ -0,0 +1,259 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodePartialSeqscan.c
+ *	  Support routines for partial sequential scans of relations.
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodePartialSeqscan.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * INTERFACE ROUTINES
+ *		ExecPartialSeqScan				scans a relation.
+ *		PartialSeqNext					retrieves the next tuple from the heap.
+ *		ExecInitPartialSeqScan			creates and initializes a partial seqscan node.
+ *		ExecEndPartialSeqScan			releases any storage allocated.
+ */
+#include "postgres.h"
+
+#include "access/relscan.h"
+#include "access/shmmqam.h"
+#include "access/xact.h"
+#include "commands/dbcommands.h"
+#include "executor/execdebug.h"
+#include "executor/nodeSeqscan.h"
+#include "executor/nodePartialSeqscan.h"
+#include "postmaster/backendworker.h"
+#include "utils/rel.h"
+
+
+
+/* ----------------------------------------------------------------
+ *						Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		PartialSeqNext
+ *
+ *		This is a workhorse for ExecPartialSeqScan
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+PartialSeqNext(PartialSeqScanState *node)
+{
+	HeapTuple	tuple;
+	HeapScanDesc scandesc;
+	EState	   *estate;
+	ScanDirection direction;
+	TupleTableSlot *slot;
+
+	/*
+	 * get information from the estate and scan state
+	 */
+	scandesc = node->ss_currentScanDesc;
+	estate = node->ps.state;
+	direction = estate->es_direction;
+	slot = node->ss_ScanTupleSlot;
+
+	/*
+	 * get the next tuple from the table
+	 */
+	tuple = heap_getnext(scandesc, direction);
+
+	/*
+	 * save the tuple and the buffer returned to us by the access methods in
+	 * our scan tuple slot and return the slot.  Note: we pass 'false' because
+	 * tuples returned by heap_getnext() are pointers onto disk pages and were
+	 * not created with palloc() and so should not be pfree()'d.  Note also
+	 * that ExecStoreTuple will increment the refcount of the buffer; the
+	 * refcount will not be dropped until the tuple table slot is cleared.
+	 */
+	if (tuple)
+		ExecStoreTuple(tuple,	/* tuple to store */
+					   slot,	/* slot to store in */
+					   scandesc->rs_cbuf,		/* buffer associated with this
+												 * tuple */
+					   false);	/* don't pfree this pointer */
+	else
+		ExecClearTuple(slot);
+
+	return slot;
+}
+
+/*
+ * PartialSeqRecheck -- access method routine to recheck a tuple in EvalPlanQual
+ */
+static bool
+PartialSeqRecheck(PartialSeqScanState *node, TupleTableSlot *slot)
+{
+	/*
+	 * Note that unlike IndexScan, PartialSeqScan never uses keys in
+	 * heap_beginscan (and this is very bad), so there are no scan
+	 * keys to recheck here.
+	 */
+	return true;
+}
+
+/* ----------------------------------------------------------------
+ *		InitPartialScanRelation
+ *
+ *		Set up to access the scan relation.
+ * ----------------------------------------------------------------
+ */
+static void
+InitPartialScanRelation(PartialSeqScanState *node, EState *estate, int eflags)
+{
+	Relation	currentRelation;
+	HeapScanDesc currentScanDesc;
+	ParallelHeapScanDesc pscan;
+
+	/*
+	 * get the relation object id from the relid'th entry in the range table,
+	 * open that relation and acquire appropriate lock on it.
+	 */
+	currentRelation = ExecOpenScanRelation(estate,
+										   ((Scan *) node->ps.plan)->scanrelid,
+										   eflags);
+
+	/*
+	 * Parallel scan descriptor is initialized and stored in dynamic shared
+	 * memory segment by master backend and parallel workers retrieve it
+	 * from shared memory.
+	 */
+	Assert(estate->toc);
+
+	pscan = shm_toc_lookup(estate->toc, PARALLEL_KEY_SCAN);
+
+	currentScanDesc = heap_beginscan_parallel(currentRelation, pscan);
+
+	node->ss_currentRelation = currentRelation;
+	node->ss_currentScanDesc = currentScanDesc;
+
+	/* and report the scan tuple slot's rowtype */
+	ExecAssignScanType(node, RelationGetDescr(currentRelation));
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitPartialSeqScan
+ * ----------------------------------------------------------------
+ */
+PartialSeqScanState *
+ExecInitPartialSeqScan(PartialSeqScan *node, EState *estate, int eflags)
+{
+	PartialSeqScanState *scanstate;
+
+	/*
+	 * Once upon a time it was possible to have an outerPlan of a SeqScan, but
+	 * not any more.
+	 */
+	Assert(outerPlan(node) == NULL);
+	Assert(innerPlan(node) == NULL);
+
+	/*
+	 * create state structure
+	 */
+	scanstate = makeNode(PartialSeqScanState);
+	scanstate->ps.plan = (Plan *) node;
+	scanstate->ps.state = estate;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * create expression context for node
+	 */
+	ExecAssignExprContext(estate, &scanstate->ps);
+
+	/*
+	 * initialize child expressions
+	 */
+	scanstate->ps.targetlist = (List *)
+		ExecInitExpr((Expr *) node->plan.targetlist,
+					 (PlanState *) scanstate);
+	scanstate->ps.qual = (List *)
+		ExecInitExpr((Expr *) node->plan.qual,
+					 (PlanState *) scanstate);
+
+	/*
+	 * tuple table initialization
+	 */
+	ExecInitResultTupleSlot(estate, &scanstate->ps);
+	ExecInitScanTupleSlot(estate, scanstate);
+
+	/*
+	 * initialize scan relation
+	 */
+	InitPartialScanRelation(scanstate, estate, eflags);
+
+	scanstate->ps.ps_TupFromTlist = false;
+
+	/*
+	 * Initialize result tuple type and projection info.
+	 */
+	ExecAssignResultTypeFromTL(&scanstate->ps);
+	ExecAssignScanProjectionInfo(scanstate);
+
+	return scanstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecPartialSeqScan(node)
+ *
+ *		Scans this backend's share of the relation and returns
+ *		the next qualifying tuple.
+ *		We call the ExecScan() routine and pass it the appropriate
+ *		access method functions.
+ * ----------------------------------------------------------------
+ */
+TupleTableSlot *
+ExecPartialSeqScan(PartialSeqScanState *node)
+{
+	return ExecScan((ScanState *) node,
+					(ExecScanAccessMtd) PartialSeqNext,
+					(ExecScanRecheckMtd) PartialSeqRecheck);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndPartialSeqScan
+ *
+ *		frees any storage allocated through C routines.
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndPartialSeqScan(PartialSeqScanState *node)
+{
+	Relation	relation;
+	HeapScanDesc scanDesc;
+
+	/*
+	 * get information from node
+	 */
+	relation = node->ss_currentRelation;
+	scanDesc = node->ss_currentScanDesc;
+
+	/*
+	 * Free the exprcontext
+	 */
+	ExecFreeExprContext(&node->ps);
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ps.ps_ResultTupleSlot);
+	ExecClearTuple(node->ss_ScanTupleSlot);
+
+	/*
+	 * close heap scan
+	 */
+	heap_endscan(scanDesc);
+
+	/*
+	 * close the heap relation.
+	 */
+	ExecCloseScanRelation(relation);
+}
diff --git a/src/backend/executor/tqueue.c b/src/backend/executor/tqueue.c
new file mode 100644
index 0000000..ee4e03e
--- /dev/null
+++ b/src/backend/executor/tqueue.c
@@ -0,0 +1,272 @@
+/*-------------------------------------------------------------------------
+ *
+ * tqueue.c
+ *	  Use shm_mq to send & receive tuples between parallel backends
+ *
+ * A DestReceiver of type DestTupleQueue, which is a TQueueDestReceiver
+ * under the hood, writes tuples from the executor to a shm_mq.
+ *
+ * A TupleQueueFunnel helps manage the process of reading tuples from
+ * one or more shm_mq objects being used as tuple queues.
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/tqueue.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/tqueue.h"
+#include "miscadmin.h"
+
+typedef struct
+{
+	DestReceiver pub;
+	shm_mq_handle *handle;
+} TQueueDestReceiver;
+
+struct TupleQueueFunnel
+{
+	int		nqueues;
+	int		maxqueues;
+	int		nextqueue;
+	shm_mq_handle **queue;
+};
+
+/*
+ * Receive a tuple.
+ */
+static void
+tqueueReceiveSlot(TupleTableSlot *slot, DestReceiver *self)
+{
+	TQueueDestReceiver *tqueue = (TQueueDestReceiver *) self;
+	HeapTuple	tuple;
+	shm_mq_result	result;
+
+	tuple = ExecMaterializeSlot(slot);
+	result = shm_mq_send(tqueue->handle, tuple->t_len, tuple->t_data, false);
+
+	if (result != SHM_MQ_SUCCESS)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("unable to send tuples")));
+}
+
+/*
+ * Prepare to receive tuples from executor.
+ */
+static void
+tqueueStartupReceiver(DestReceiver *self, int operation, TupleDesc typeinfo)
+{
+	/* do nothing */
+}
+
+/*
+ * Clean up at end of an executor run
+ */
+static void
+tqueueShutdownReceiver(DestReceiver *self)
+{
+	/* do nothing */
+}
+
+/*
+ * Destroy receiver when done with it
+ */
+static void
+tqueueDestroyReceiver(DestReceiver *self)
+{
+	pfree(self);
+}
+
+/*
+ * Create a DestReceiver that writes tuples to a tuple queue.
+ */
+DestReceiver *
+CreateTupleQueueDestReceiver(void)
+{
+	TQueueDestReceiver *self;
+
+	self = (TQueueDestReceiver *) palloc0(sizeof(TQueueDestReceiver));
+
+	self->pub.receiveSlot = tqueueReceiveSlot;
+	self->pub.rStartup = tqueueStartupReceiver;
+	self->pub.rShutdown = tqueueShutdownReceiver;
+	self->pub.rDestroy = tqueueDestroyReceiver;
+	self->pub.mydest = DestTupleQueue;
+
+	/* private fields will be set by SetTupleQueueDestReceiverParams */
+
+	return (DestReceiver *) self;
+}
+
+/*
+ * Set parameters for a TupleQueueDestReceiver
+ */
+void
+SetTupleQueueDestReceiverParams(DestReceiver *self,
+								shm_mq_handle *handle)
+{
+	TQueueDestReceiver *myState = (TQueueDestReceiver *) self;
+
+	myState->handle = handle;
+}
+
+/*
+ * Create a tuple queue funnel.
+ */
+TupleQueueFunnel *
+CreateTupleQueueFunnel(void)
+{
+	TupleQueueFunnel *funnel = palloc0(sizeof(TupleQueueFunnel));
+
+	funnel->maxqueues = 8;
+	funnel->queue = palloc(funnel->maxqueues * sizeof(shm_mq_handle *));
+
+	return funnel;
+}
+
+/*
+ * Destroy a tuple queue funnel.
+ */
+void
+DestroyTupleQueueFunnel(TupleQueueFunnel *funnel)
+{
+	if (funnel)
+	{
+		pfree(funnel->queue);
+		pfree(funnel);
+	}
+}
+
+/*
+ * Remember the shared memory queue handle in funnel.
+ */
+void
+RegisterTupleQueueOnFunnel(TupleQueueFunnel *funnel, shm_mq_handle *handle)
+{
+	if (funnel->nqueues < funnel->maxqueues)
+	{
+		funnel->queue[funnel->nqueues++] = handle;
+		return;
+	}
+
+	if (funnel->nqueues >= funnel->maxqueues)
+	{
+		int newsize = funnel->nqueues * 2;
+
+		Assert(funnel->nqueues == funnel->maxqueues);
+
+		funnel->queue = repalloc(funnel->queue,
+								 newsize * sizeof(shm_mq_handle *));
+		funnel->maxqueues = newsize;
+	}
+
+	funnel->queue[funnel->nqueues++] = handle;
+}
+
+/*
+ * Fetch a tuple from a tuple queue funnel.
+ *
+ * We try to read from the queues in round-robin fashion so as to avoid
+ * the situation where some workers get their tuples read promptly while
+ * others are barely ever serviced.
+ *
+ * Even when nowait = false, we read from the individual queues in
+ * non-blocking mode.  Even when shm_mq_receive() returns SHM_MQ_WOULD_BLOCK,
+ * it can still accumulate bytes from a partially-read message, so doing it
+ * this way should outperform doing a blocking read on each queue in turn.
+ *
+ * The return value is NULL if there are no remaining queues or if
+ * nowait = true and no queue returned a tuple without blocking.  *done, if
+ * not NULL, is set to true when there are no remaining queues and false in
+ * any other case.
+ */
+HeapTuple
+TupleQueueFunnelNext(TupleQueueFunnel *funnel, bool nowait, bool *done)
+{
+	int	waitpos = funnel->nextqueue;
+
+	/* Corner case: called before adding any queues, or after all are gone. */
+	if (funnel->nqueues == 0)
+	{
+		if (done != NULL)
+			*done = true;
+		return NULL;
+	}
+
+	if (done != NULL)
+		*done = false;
+
+	for (;;)
+	{
+		shm_mq_handle *mqh = funnel->queue[funnel->nextqueue];
+		shm_mq_result result;
+		Size	nbytes;
+		void   *data;
+
+		/* Attempt to read a message. */
+		result = shm_mq_receive(mqh, &nbytes, &data, true);
+
+		/*
+		 * Normally, we advance funnel->nextqueue to the next queue at this
+		 * point, but if we're pointing to a queue that we've just discovered
+		 * is detached, then forget that queue and leave the pointer where it
+		 * is.
+		 */
+		if (result != SHM_MQ_DETACHED)
+			funnel->nextqueue = (funnel->nextqueue + 1) % funnel->nqueues;
+		else
+		{
+			--funnel->nqueues;
+			if (funnel->nqueues == 0)
+			{
+				if (done != NULL)
+					*done = true;
+				return NULL;
+			}
+			memmove(&funnel->queue[funnel->nextqueue],
+					&funnel->queue[funnel->nextqueue + 1],
+					sizeof(shm_mq_handle *)
+						* (funnel->nqueues - funnel->nextqueue));
+			if (funnel->nextqueue < waitpos)
+				--waitpos;
+		}
+
+		/* If we got a message, return it. */
+		if (result == SHM_MQ_SUCCESS)
+		{
+			HeapTupleData htup;
+
+			/*
+			 * The tuple data we just read from the queue is only valid
+			 * until we again attempt to read from it.  Copy the tuple into
+			 * a single palloc'd chunk as callers will expect.
+			 */
+			ItemPointerSetInvalid(&htup.t_self);
+			htup.t_tableOid = InvalidOid;
+			htup.t_len = nbytes;
+			htup.t_data = data;
+			return heap_copytuple(&htup);
+		}
+
+		/*
+		 * If we've visited all of the queues, then we should either give up
+		 * and return NULL (if we're in non-blocking mode) or wait for the
+		 * process latch to be set (otherwise).
+		 */
+		if (funnel->nextqueue == waitpos)
+		{
+			if (nowait)
+				return NULL;
+			WaitLatch(MyLatch, WL_LATCH_SET, 0);
+			CHECK_FOR_INTERRUPTS();
+			ResetLatch(MyLatch);
+		}
+	}
+}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 9fe8008..e51fc38 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -354,6 +354,43 @@ _copySeqScan(const SeqScan *from)
 }
 
 /*
+ * _copyPartialSeqScan
+ */
+static PartialSeqScan *
+_copyPartialSeqScan(const PartialSeqScan *from)
+{
+	PartialSeqScan    *newnode = makeNode(PartialSeqScan);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopyScanFields((const Scan *) from, (Scan *) newnode);
+
+	return newnode;
+}
+
+/*
+ * _copyFunnel
+ */
+static Funnel *
+_copyFunnel(const Funnel *from)
+{
+	Funnel    *newnode = makeNode(Funnel);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopyScanFields((const Scan *) from, (Scan *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(num_workers);
+
+	return newnode;
+}
+
+/*
  * _copyIndexScan
  */
 static IndexScan *
@@ -4044,6 +4081,12 @@ copyObject(const void *from)
 		case T_SeqScan:
 			retval = _copySeqScan(from);
 			break;
+		case T_PartialSeqScan:
+			retval = _copyPartialSeqScan(from);
+			break;
+		case T_Funnel:
+			retval = _copyFunnel(from);
+			break;
 		case T_IndexScan:
 			retval = _copyIndexScan(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 775f482..3382ab2 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -439,6 +439,24 @@ _outSeqScan(StringInfo str, const SeqScan *node)
 }
 
 static void
+_outPartialSeqScan(StringInfo str, const PartialSeqScan *node)
+{
+	WRITE_NODE_TYPE("PARTIALSEQSCAN");
+
+	_outScanInfo(str, (const Scan *) node);
+}
+
+static void
+_outFunnel(StringInfo str, const Funnel *node)
+{
+	WRITE_NODE_TYPE("FUNNEL");
+
+	_outScanInfo(str, (const Scan *) node);
+
+	WRITE_UINT_FIELD(num_workers);
+}
+
+static void
 _outIndexScan(StringInfo str, const IndexScan *node)
 {
 	WRITE_NODE_TYPE("INDEXSCAN");
@@ -2886,6 +2904,12 @@ _outNode(StringInfo str, const void *obj)
 			case T_SeqScan:
 				_outSeqScan(str, obj);
 				break;
+			case T_PartialSeqScan:
+				_outPartialSeqScan(str, obj);
+				break;
+			case T_Funnel:
+				_outFunnel(str, obj);
+				break;
 			case T_IndexScan:
 				_outIndexScan(str, obj);
 				break;
diff --git a/src/backend/nodes/params.c b/src/backend/nodes/params.c
index fb803f8..aa278c5 100644
--- a/src/backend/nodes/params.c
+++ b/src/backend/nodes/params.c
@@ -16,9 +16,22 @@
 #include "postgres.h"
 
 #include "nodes/params.h"
+#include "storage/shmem.h"
 #include "utils/datum.h"
 #include "utils/lsyscache.h"
 
+/*
+ * For each bind parameter, this structure is serialized, followed by the
+ * parameter value unless the parameter is pass-by-value or NULL.
+ */
+typedef struct SerializedParamExternData
+{
+	Datum		value;			/* pass-by-value datums are stored directly */
+	Size		length;			/* length of parameter value */
+	bool		isnull;			/* is it NULL? */
+	uint16		pflags;			/* flag bits, see params.h */
+	Oid			ptype;			/* parameter's datatype, or 0 */
+} SerializedParamExternData;
 
 /*
  * Copy a ParamListInfo structure.
@@ -73,3 +86,187 @@ copyParamList(ParamListInfo from)
 
 	return retval;
 }
+
+/*
+ * Estimate the amount of space required to serialize the bound
+ * parameters.
+ */
+Size
+EstimateBoundParametersSpace(ParamListInfo paramInfo)
+{
+	Size		size;
+	int			i;
+
+	/* Add space required for saving numParams */
+	size = sizeof(int);
+
+	if (paramInfo)
+	{
+		/* Add space required for saving the param data */
+		for (i = 0; i < paramInfo->numParams; i++)
+		{
+			/*
+			 * for each parameter, calculate the size of fixed part
+			 * of parameter (SerializedParamExternData) and length of
+			 * parameter value.
+			 */
+			ParamExternData *oprm;
+			int16		typLen;
+			bool		typByVal;
+			Size		length;
+
+			length = sizeof(SerializedParamExternData);
+
+			oprm = &paramInfo->params[i];
+
+			get_typlenbyval(oprm->ptype, &typLen, &typByVal);
+
+			/*
+			 * pass-by-value parameters are directly stored in
+			 * SerializedParamExternData, so no need of additional
+			 * space for them.
+			 */
+			if (!(typByVal || oprm->isnull))
+			{
+				length += datumGetSize(oprm->value, typByVal, typLen);
+				size = add_size(size, length);
+
+				/* Allow space for terminating zero-byte */
+				size = add_size(size, 1);
+			}
+			else
+				size = add_size(size, length);
+		}
+	}
+
+	return size;
+}
+
+/*
+ * Serialize the bind parameters into memory, beginning at start_address.
+ * maxsize should be at least as large as the value returned by
+ * EstimateBoundParametersSpace.
+ */
+void
+SerializeBoundParams(ParamListInfo paramInfo, Size maxsize, char *start_address)
+{
+	char	   *curptr;
+	SerializedParamExternData *retval;
+	int i;
+
+	/*
+	 * First, we store the number of bind parameters, if there is
+	 * no bind parameter then no need to store any more information.
+	 */
+	if (paramInfo && paramInfo->numParams > 0)
+		* (int *) start_address = paramInfo->numParams;
+	else
+	{
+		* (int *) start_address = 0;
+		return;
+	}
+	curptr = start_address + sizeof(int);
+
+
+	for (i = 0; i < paramInfo->numParams; i++)
+	{
+		ParamExternData *oprm;
+		int16		typLen;
+		bool		typByVal;
+		Size		datumlength, length;
+		const char	*s;
+
+		Assert(curptr <= start_address + maxsize);
+		retval = (SerializedParamExternData *) curptr;
+		oprm = &paramInfo->params[i];
+
+		retval->isnull = oprm->isnull;
+		retval->pflags = oprm->pflags;
+		retval->ptype = oprm->ptype;
+		retval->value = oprm->value;
+
+		curptr = curptr + sizeof(SerializedParamExternData);
+
+		if (retval->isnull)
+			continue;
+
+		get_typlenbyval(oprm->ptype, &typLen, &typByVal);
+
+		if (!typByVal)
+		{
+			datumlength = datumGetSize(oprm->value, typByVal, typLen);
+			s = (char *) DatumGetPointer(oprm->value);
+			memcpy(curptr, s, datumlength);
+			length = datumlength;
+			curptr[length] = '\0';
+			retval->length = length;
+			curptr += length + 1;
+		}
+	}
+}
+
+/*
+ * RestoreBoundParams
+ *		Restore bind parameters from the specified address.
+ *
+ * The params are palloc'd in CurrentMemoryContext.
+ */
+ParamListInfo
+RestoreBoundParams(char *start_address)
+{
+	ParamListInfo retval;
+	Size		size;
+	int			num_params,i;
+	char	   *curptr;
+
+	num_params = * (int *) start_address;
+
+	if (num_params <= 0)
+		return NULL;
+
+	/* sizeof(ParamListInfoData) includes the first array element */
+	size = sizeof(ParamListInfoData) +
+		(num_params - 1) * sizeof(ParamExternData);
+	retval = (ParamListInfo) palloc(size);
+	retval->paramFetch = NULL;
+	retval->paramFetchArg = NULL;
+	retval->parserSetup = NULL;
+	retval->parserSetupArg = NULL;
+	retval->numParams = num_params;
+
+	curptr = start_address + sizeof(int);
+
+	for (i = 0; i < num_params; i++)
+	{
+		SerializedParamExternData *nprm;
+		char	*s;
+		int16		typLen;
+		bool		typByVal;
+
+		nprm = (SerializedParamExternData *) curptr;
+
+		/* copy the parameter info */
+		retval->params[i].isnull = nprm->isnull;
+		retval->params[i].pflags = nprm->pflags;
+		retval->params[i].ptype = nprm->ptype;
+		retval->params[i].value = nprm->value;
+
+		curptr = curptr + sizeof(SerializedParamExternData);
+
+		if (nprm->isnull)
+			continue;
+
+		get_typlenbyval(nprm->ptype, &typLen, &typByVal);
+
+		if (!typByVal)
+		{
+			s = palloc(nprm->length + 1);
+			memcpy(s, curptr, nprm->length + 1);
+			retval->params[i].value = PointerGetDatum(s);
+
+			curptr += nprm->length + 1;
+		}
+	}
+
+	return retval;
+}
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 563209c..2bae475 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1280,6 +1280,91 @@ _readRangeTblFunction(void)
 	READ_DONE();
 }
 
+/*
+ * _readPlanInvalItem
+ */
+static PlanInvalItem *
+_readPlanInvalItem(void)
+{
+	READ_LOCALS(PlanInvalItem);
+
+	READ_INT_FIELD(cacheId);
+	READ_UINT_FIELD(hashValue);
+
+	READ_DONE();
+}
+
+/*
+ * _readPlannedStmt
+ */
+static PlannedStmt *
+_readPlannedStmt(void)
+{
+	READ_LOCALS(PlannedStmt);
+
+	READ_ENUM_FIELD(commandType, CmdType);
+	READ_UINT_FIELD(queryId);
+	READ_BOOL_FIELD(hasReturning);
+	READ_BOOL_FIELD(hasModifyingCTE);
+	READ_BOOL_FIELD(canSetTag);
+	READ_BOOL_FIELD(transientPlan);
+	READ_NODE_FIELD(planTree);
+	READ_NODE_FIELD(rtable);
+	READ_NODE_FIELD(resultRelations);
+	READ_NODE_FIELD(utilityStmt);
+	READ_NODE_FIELD(subplans);
+	READ_BITMAPSET_FIELD(rewindPlanIDs);
+	READ_NODE_FIELD(rowMarks);
+	READ_NODE_FIELD(relationOids);
+	READ_NODE_FIELD(invalItems);
+	READ_INT_FIELD(nParamExec);
+	READ_BOOL_FIELD(hasRowSecurity);
+
+	READ_DONE();
+}
+
+static Plan *
+_readPlan(void)
+{
+	READ_LOCALS(Plan);
+
+	READ_FLOAT_FIELD(startup_cost);
+	READ_FLOAT_FIELD(total_cost);
+	READ_FLOAT_FIELD(plan_rows);
+	READ_INT_FIELD(plan_width);
+	READ_NODE_FIELD(targetlist);
+	READ_NODE_FIELD(qual);
+	READ_NODE_FIELD(lefttree);
+	READ_NODE_FIELD(righttree);
+	READ_NODE_FIELD(initPlan);
+	READ_BITMAPSET_FIELD(extParam);
+	READ_BITMAPSET_FIELD(allParam);
+
+	READ_DONE();
+}
+
+static Scan *
+_readScan(void)
+{
+	Plan *local_plan;
+	READ_LOCALS(PartialSeqScan);
+
+	local_plan = _readPlan();
+	local_node->plan.startup_cost = local_plan->startup_cost;
+	local_node->plan.total_cost = local_plan->total_cost;
+	local_node->plan.plan_rows = local_plan->plan_rows;
+	local_node->plan.plan_width = local_plan->plan_width;
+	local_node->plan.targetlist = local_plan->targetlist;
+	local_node->plan.qual = local_plan->qual;
+	local_node->plan.lefttree = local_plan->lefttree;
+	local_node->plan.righttree = local_plan->righttree;
+	local_node->plan.initPlan = local_plan->initPlan;
+	local_node->plan.extParam = local_plan->extParam;
+	local_node->plan.allParam = local_plan->allParam;
+	READ_UINT_FIELD(scanrelid);
+
+	READ_DONE();
+}
 
 /*
  * parseNodeString
@@ -1409,6 +1494,12 @@ parseNodeString(void)
 		return_value = _readNotifyStmt();
 	else if (MATCH("DECLARECURSOR", 13))
 		return_value = _readDeclareCursorStmt();
+	else if (MATCH("PLANINVALITEM", 13))
+		return_value = _readPlanInvalItem();
+	else if (MATCH("PLANNEDSTMT", 11))
+		return_value = _readPlannedStmt();
+	else if (MATCH("PARTIALSEQSCAN", 14))
+		return_value = _readScan();
 	else
 	{
 		elog(ERROR, "badly formatted node string \"%.32s\"...", token);
diff --git a/src/backend/optimizer/path/Makefile b/src/backend/optimizer/path/Makefile
index 6864a62..6e462b1 100644
--- a/src/backend/optimizer/path/Makefile
+++ b/src/backend/optimizer/path/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = allpaths.o clausesel.o costsize.o equivclass.o indxpath.o \
-       joinpath.o joinrels.o pathkeys.o tidpath.o
+       joinpath.o joinrels.o pathkeys.o parallelpath.o tidpath.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 58d78e6..528727c 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -410,6 +410,9 @@ set_plain_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
 	/* Consider sequential scan */
 	add_path(rel, create_seqscan_path(root, rel, required_outer));
 
+	/* Consider parallel scans */
+	create_parallelscan_paths(root, rel);
+
 	/* Consider index scans */
 	create_index_paths(root, rel);
 
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 78ef229..5f5980f 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -11,6 +11,9 @@
  *	cpu_tuple_cost		Cost of typical CPU time to process a tuple
  *	cpu_index_tuple_cost  Cost of typical CPU time to process an index tuple
  *	cpu_operator_cost	Cost of CPU time to execute an operator or function
+ *  cpu_tuple_comm_cost	Cost of CPU time to pass a tuple from worker to master backend
+ *  parallel_setup_cost Cost of setting up shared memory for parallelism
+ *  parallel_startup_cost  Cost of starting up parallel workers
  *
  * We expect that the kernel will typically do some amount of read-ahead
  * optimization; this in conjunction with seek costs means that seq_page_cost
@@ -101,11 +104,16 @@ double		random_page_cost = DEFAULT_RANDOM_PAGE_COST;
 double		cpu_tuple_cost = DEFAULT_CPU_TUPLE_COST;
 double		cpu_index_tuple_cost = DEFAULT_CPU_INDEX_TUPLE_COST;
 double		cpu_operator_cost = DEFAULT_CPU_OPERATOR_COST;
+double		cpu_tuple_comm_cost = DEFAULT_CPU_TUPLE_COMM_COST;
+double		parallel_setup_cost = DEFAULT_PARALLEL_SETUP_COST;
+double		parallel_startup_cost = DEFAULT_PARALLEL_STARTUP_COST;
 
 int			effective_cache_size = DEFAULT_EFFECTIVE_CACHE_SIZE;
 
 Cost		disable_cost = 1.0e10;
 
+int	parallel_seqscan_degree = 0;
+
 bool		enable_seqscan = true;
 bool		enable_indexscan = true;
 bool		enable_indexonlyscan = true;
@@ -219,6 +227,55 @@ cost_seqscan(Path *path, PlannerInfo *root,
 }
 
 /*
+ * cost_funnel
+ *	  Determines and returns the cost of scanning a relation in parallel.
+ *
+ * 'baserel' is the relation to be scanned
+ * 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
+ */
+void
+cost_funnel(FunnelPath *path, PlannerInfo *root,
+			RelOptInfo *baserel, ParamPathInfo *param_info,
+			int nWorkers)
+{
+	Cost		startup_cost = 0;
+	Cost		run_cost = 0;
+
+	/* Should only be applied to base relations */
+	Assert(baserel->relid > 0);
+	Assert(baserel->rtekind == RTE_RELATION);
+
+	/* Mark the path with the correct row estimate */
+	if (param_info)
+		path->path.rows = param_info->ppi_rows;
+	else
+		path->path.rows = baserel->rows;
+
+	startup_cost = path->subpath->startup_cost;
+
+	run_cost = path->subpath->total_cost - path->subpath->startup_cost;
+
+	/*
+	 * Run cost is assumed to be shared equally by all workers.
+	 * This also assumes that disk access cost is shared equally
+	 * between workers, which is generally true unless too many
+	 * workers are working on a relatively small number of
+	 * blocks.  If we come across any such case, we can think of
+	 * changing the current cost model for parallel sequential
+	 * scan.
+	 */
+	run_cost = run_cost / (nWorkers + 1);
+
+	/* Parallel setup and communication cost. */
+	startup_cost += parallel_setup_cost;
+	startup_cost += parallel_startup_cost * nWorkers;
+	run_cost += cpu_tuple_comm_cost * baserel->tuples;
+
+	path->path.startup_cost = startup_cost;
+	path->path.total_cost = (startup_cost + run_cost);
+}
+
+/*
  * cost_index
  *	  Determines and returns the cost of scanning a relation using an index.
  *
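Note on cost_funnel above: the arithmetic can be checked by hand with a small standalone sketch. The struct and function names below are hypothetical, and the input values used with it are made up for illustration; they are not the patch's defaults, which are not shown in this part of the diff.

```c
#include <assert.h>

/* Hypothetical inputs mirroring the quantities cost_funnel reads. */
typedef struct FunnelCostInput
{
	double	sub_startup;		/* subpath->startup_cost */
	double	sub_total;			/* subpath->total_cost */
	double	tuples;				/* baserel->tuples */
	int		nworkers;			/* nWorkers */
	double	setup_cost;			/* parallel_setup_cost */
	double	worker_startup_cost;	/* parallel_startup_cost */
	double	tuple_comm_cost;	/* cpu_tuple_comm_cost */
} FunnelCostInput;

typedef struct FunnelCost
{
	double	startup;
	double	total;
} FunnelCost;

/* Same arithmetic as cost_funnel, on plain doubles. */
static FunnelCost
funnel_cost(const FunnelCostInput *in)
{
	FunnelCost	out;
	double		run;

	assert(in->nworkers >= 0);

	/* The subpath's run cost is divided by nworkers + 1. */
	run = (in->sub_total - in->sub_startup) / (in->nworkers + 1);

	/* One-time setup plus a per-worker startup charge. */
	out.startup = in->sub_startup + in->setup_cost +
		in->worker_startup_cost * in->nworkers;

	/* The per-tuple communication charge is not divided among workers. */
	run += in->tuple_comm_cost * in->tuples;

	out.total = out.startup + run;
	return out;
}
```

With made-up inputs of sub_total = 1000, tuples = 1000, nworkers = 3, setup_cost = 100, worker_startup_cost = 50 and tuple_comm_cost = 0.5, the result is startup = 250 and total = 1000, which is exactly the serial cost. Because the communication term scales with baserel->tuples and is not divided by the worker count, it can dominate the estimate for large relations.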
diff --git a/src/backend/optimizer/path/parallelpath.c b/src/backend/optimizer/path/parallelpath.c
new file mode 100644
index 0000000..3149247
--- /dev/null
+++ b/src/backend/optimizer/path/parallelpath.c
@@ -0,0 +1,121 @@
+/*-------------------------------------------------------------------------
+ *
+ * parallelpath.c
+ *	  Routines to determine which conditions are usable for scanning
+ *	  a given relation, and create ParallelPaths accordingly.
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/optimizer/path/parallelpath.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heapam.h"
+#include "nodes/relation.h"
+#include "optimizer/cost.h"
+#include "optimizer/pathnode.h"
+#include "optimizer/paths.h"
+#include "optimizer/restrictinfo.h"
+#include "optimizer/clauses.h"
+#include "parser/parsetree.h"
+#include "utils/rel.h"
+
+
+/*
+ *	check_simple_qual -
+ *		Check if qual is made only of simple things we can
+ *		hand out directly to backend worker for execution.
+ *
+ *		XXX - Currently we don't push down an expression if it
+ *		contains a volatile function; eventually we need a
+ *		mechanism (proisparallel) with which we can distinguish
+ *		the functions that can be pushed down for execution by a
+ *		parallel worker.
+ */
+static bool
+check_simple_qual(Node *node)
+{
+	if (node == NULL)
+		return TRUE;
+
+	if (contain_volatile_functions(node))
+		return FALSE;
+
+	return TRUE;
+}
+
+/*
+ * create_parallelscan_paths
+ *	  Create paths corresponding to parallel scans of the given rel.
+ *	  Currently we only support parallel sequential scan.
+ *
+ *	  Candidate paths are added to the rel's pathlist (using add_path).
+ */
+void
+create_parallelscan_paths(PlannerInfo *root, RelOptInfo *rel)
+{
+	int num_parallel_workers = 0;
+	Oid			reloid;
+	Relation	relation;
+	Path		*subpath;
+
+	/*
+	 * parallel scan is possible only if user has set
+	 * parallel_seqscan_degree to value greater than 0.
+	 */
+	if (parallel_seqscan_degree <= 0)
+		return;
+
+	/*
+	 * parallel scan is not supported for joins.
+	 */
+	if (root->simple_rel_array_size > 2)
+		return;
+
+	/* parallel scan is supported only for SELECT statements. */
+	if (root->parse->commandType != CMD_SELECT)
+		return;
+
+	reloid = planner_rt_fetch(rel->relid, root)->relid;
+
+	relation = heap_open(reloid, NoLock);
+
+	/*
+	 * Temporary relations can't be scanned by parallel workers as
+	 * they are visible only to local sessions.
+	 */
+	if (RelationUsesLocalBuffers(relation))
+	{
+		heap_close(relation, NoLock);
+		return;
+	}
+
+	heap_close(relation, NoLock);
+
+	/*
+	 * parallel scan is not supported for quals containing volatile functions
+	 */
+	if (!check_simple_qual((Node*) extract_actual_clauses(rel->baserestrictinfo, false)))
+		return;
+
+	/*
+	 * There should be at least one page to scan for each worker.
+	 */
+	if (parallel_seqscan_degree <= rel->pages)
+		num_parallel_workers = parallel_seqscan_degree;
+	else
+		num_parallel_workers = rel->pages;
+
+	/* Create the partial scan path which each worker needs to execute. */
+	subpath = create_partialseqscan_path(root, rel, NULL);
+
+	/* Create the parallel scan path which master needs to execute. */
+	add_path(rel, (Path *) create_funnel_path(root, rel, subpath,
+											  num_parallel_workers));
+}
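Note on create_parallelscan_paths above: the worker count is the configured degree clamped to the number of pages in the relation. A minimal sketch of that clamp, using a hypothetical helper name and plain integer types in place of the planner structures:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns the number of workers the patch would
 * request for a relation with rel_pages pages when the GUC
 * parallel_seqscan_degree is set to degree.
 */
static int
choose_worker_count(int degree, uint32_t rel_pages)
{
	/* The patch returns early (no parallel path) when degree <= 0. */
	if (degree <= 0)
		return 0;

	/* At most one worker per page. */
	if ((uint32_t) degree <= rel_pages)
		return degree;

	return (int) rel_pages;
}
```

As the patch is written, a relation with zero pages yields zero workers while a funnel path is still added; whether that is intended is not clear from this part of the diff.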
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 76ba1bf..744e652 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -58,6 +58,11 @@ static Material *create_material_plan(PlannerInfo *root, MaterialPath *best_path
 static Plan *create_unique_plan(PlannerInfo *root, UniquePath *best_path);
 static SeqScan *create_seqscan_plan(PlannerInfo *root, Path *best_path,
 					List *tlist, List *scan_clauses);
+static Scan *create_partialseqscan_plan(PlannerInfo *root, Path *best_path,
+							List *tlist, List *scan_clauses);
+static Scan *create_funnel_plan(PlannerInfo *root,
+								FunnelPath *best_path,
+								List *tlist, List *scan_clauses);
 static Scan *create_indexscan_plan(PlannerInfo *root, IndexPath *best_path,
 					  List *tlist, List *scan_clauses, bool indexonly);
 static BitmapHeapScan *create_bitmap_scan_plan(PlannerInfo *root,
@@ -100,6 +105,12 @@ static List *order_qual_clauses(PlannerInfo *root, List *clauses);
 static void copy_path_costsize(Plan *dest, Path *src);
 static void copy_plan_costsize(Plan *dest, Plan *src);
 static SeqScan *make_seqscan(List *qptlist, List *qpqual, Index scanrelid);
+static PartialSeqScan *make_partialseqscan(List *qptlist,
+										   List *qpqual,
+										   Index scanrelid);
+static Funnel *make_funnel(List *qptlist, List *qpqual,
+						   Index scanrelid, int nworkers,
+						   Plan *subplan);
 static IndexScan *make_indexscan(List *qptlist, List *qpqual, Index scanrelid,
 			   Oid indexid, List *indexqual, List *indexqualorig,
 			   List *indexorderby, List *indexorderbyorig,
@@ -228,6 +239,8 @@ create_plan_recurse(PlannerInfo *root, Path *best_path)
 	switch (best_path->pathtype)
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
+		case T_Funnel:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -343,6 +356,20 @@ create_scan_plan(PlannerInfo *root, Path *best_path)
 												scan_clauses);
 			break;
 
+		case T_PartialSeqScan:
+			plan = (Plan *) create_partialseqscan_plan(root,
+													   best_path,
+													   tlist,
+													   scan_clauses);
+			break;
+
+		case T_Funnel:
+			plan = (Plan *) create_funnel_plan(root,
+											   (FunnelPath *) best_path,
+											   tlist,
+											   scan_clauses);
+			break;
+
 		case T_IndexScan:
 			plan = (Plan *) create_indexscan_plan(root,
 												  (IndexPath *) best_path,
@@ -546,6 +573,8 @@ disuse_physical_tlist(PlannerInfo *root, Plan *plan, Path *path)
 	switch (path->pathtype)
 	{
 		case T_SeqScan:
+		case T_Funnel:
+		case T_PartialSeqScan:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -1133,6 +1162,87 @@ create_seqscan_plan(PlannerInfo *root, Path *best_path,
 }
 
 /*
+ * create_partialseqscan_plan
+ *
+ * Returns a partial seqscan plan for the base relation scanned by
+ * 'best_path' with restriction clauses 'scan_clauses' and targetlist
+ * 'tlist'.
+ */
+static Scan *
+create_partialseqscan_plan(PlannerInfo *root, Path *best_path,
+						   List *tlist, List *scan_clauses)
+{
+	Scan    *scan_plan;
+	Index		scan_relid = best_path->parent->relid;
+
+	/* it should be a base rel... */
+	Assert(scan_relid > 0);
+	Assert(best_path->parent->rtekind == RTE_RELATION);
+
+	/* Sort clauses into best execution order */
+	scan_clauses = order_qual_clauses(root, scan_clauses);
+
+	/* Reduce RestrictInfo list to bare expressions; ignore pseudoconstants */
+	scan_clauses = extract_actual_clauses(scan_clauses, false);
+
+	/* Replace any outer-relation variables with nestloop params */
+	if (best_path->param_info)
+	{
+		scan_clauses = (List *)
+			replace_nestloop_params(root, (Node *) scan_clauses);
+	}
+
+	scan_plan = (Scan *) make_partialseqscan(tlist,
+											 scan_clauses,
+											 scan_relid);
+
+	copy_path_costsize(&scan_plan->plan, best_path);
+
+	return scan_plan;
+}
+
+/*
+ * create_funnel_plan
+ *
+ * Returns a funnel plan for the base relation scanned by
+ * 'best_path' with restriction clauses 'scan_clauses' and targetlist
+ * 'tlist'.
+ */
+static Scan *
+create_funnel_plan(PlannerInfo *root, FunnelPath *best_path,
+				   List *tlist, List *scan_clauses)
+{
+	Scan    *scan_plan;
+	Plan	   *subplan;
+	Index		scan_relid = best_path->path.parent->relid;
+
+	/* it should be a base rel... */
+	Assert(scan_relid > 0);
+	Assert(best_path->path.parent->rtekind == RTE_RELATION);
+
+	subplan = create_plan_recurse(root, best_path->subpath);
+
+	/*
+	 * The quals for the subplan and the top-level plan are the
+	 * same, because either all the quals are pushed to the
+	 * subplan (partialseqscan plan) or the parallel plan won't
+	 * be chosen.
+	 */
+	scan_plan = (Scan *) make_funnel(tlist,
+									 subplan->qual,
+									 scan_relid,
+									 best_path->num_workers,
+									 subplan);
+
+	copy_path_costsize(&scan_plan->plan, &best_path->path);
+
+	/* use parallel mode for parallel plans. */
+	root->glob->parallelModeNeeded = true;
+
+	return scan_plan;
+}
+
+/*
  * create_indexscan_plan
  *	  Returns an indexscan plan for the base relation scanned by 'best_path'
  *	  with restriction clauses 'scan_clauses' and targetlist 'tlist'.
@@ -3318,6 +3428,45 @@ make_seqscan(List *qptlist,
 	return node;
 }
 
+static PartialSeqScan *
+make_partialseqscan(List *qptlist,
+					List *qpqual,
+					Index scanrelid)
+{
+	PartialSeqScan *node = makeNode(PartialSeqScan);
+	Plan	   *plan = &node->plan;
+
+	/* cost should be inserted by caller */
+	plan->targetlist = qptlist;
+	plan->qual = qpqual;
+	plan->lefttree = NULL;
+	plan->righttree = NULL;
+	node->scanrelid = scanrelid;
+
+	return node;
+}
+
+static Funnel *
+make_funnel(List *qptlist,
+			List *qpqual,
+			Index scanrelid,
+			int nworkers,
+			Plan *subplan)
+{
+	Funnel *node = makeNode(Funnel);
+	Plan	   *plan = &node->scan.plan;
+
+	/* cost should be inserted by caller */
+	plan->targetlist = qptlist;
+	plan->qual = qpqual;
+	plan->lefttree = subplan;
+	plan->righttree = NULL;
+	node->scan.scanrelid = scanrelid;
+	node->num_workers = nworkers;
+
+	return node;
+}
+
 static IndexScan *
 make_indexscan(List *qptlist,
 			   List *qpqual,
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index b02a107..182c70d 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -260,6 +260,50 @@ standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
 	return result;
 }
 
+PlannedStmt	*
+create_worker_scan_plannedstmt(PartialSeqScan *partialscan, List *rangetable)
+{
+	PlannedStmt	*result;
+	ListCell   *tlist;
+
+	/*
+	 * Avoid removing junk entries in worker as those are
+	 * required by upper nodes in master backend.
+	 */
+	foreach(tlist, partialscan->plan.targetlist)
+	{
+		TargetEntry *tle = (TargetEntry *) lfirst(tlist);
+
+		tle->resjunk = false;
+	}
+
+	/* build the PlannedStmt result */
+	result = makeNode(PlannedStmt);
+
+	result->commandType = CMD_SELECT;
+	result->queryId = 0;
+	result->hasReturning = false;
+	result->hasModifyingCTE = false;
+	result->canSetTag = true;
+	result->transientPlan = false;
+	result->planTree = (Plan*) partialscan;
+	result->rtable = rangetable;
+	result->resultRelations = NIL;
+	result->utilityStmt = NULL;
+	result->subplans = NIL;
+	result->rewindPlanIDs = NULL;
+	result->rowMarks = NIL;
+	result->nParamExec = 0;
+	/*
+	 * Don't bother to set parameters used for invalidation as
+	 * worker backend plans are not saved, so can't be invalidated.
+	 */
+	result->relationOids = NIL;
+	result->invalItems = NIL;
+	result->hasRowSecurity = false;
+
+	return result;
+}
 
 /*--------------------
  * subquery_planner
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index ec828cd..1b63f23 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -435,6 +435,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
 			{
 				SeqScan    *splan = (SeqScan *) plan;
 
@@ -445,6 +446,24 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 					fix_scan_list(root, splan->plan.qual, rtoffset);
 			}
 			break;
+		case T_Funnel:
+			{
+				Funnel    *splan = (Funnel *) plan;
+
+				splan->scan.scanrelid += rtoffset;
+				splan->scan.plan.targetlist = 
+					fix_scan_list(root, splan->scan.plan.targetlist, rtoffset);
+				splan->scan.plan.qual =
+					fix_scan_list(root, splan->scan.plan.qual, rtoffset);
+
+				/*
+				 * target list for partial sequence scan (lefttree of funnel scan)
+				 * should be same as for funnel scan as both nodes need to produce
+				 * same projection.
+				 */
+				splan->scan.plan.lefttree->targetlist = splan->scan.plan.targetlist;
+			}
+			break;
 		case T_IndexScan:
 			{
 				IndexScan  *splan = (IndexScan *) plan;
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 5a1d539..8ea91ec 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2163,6 +2163,8 @@ finalize_plan(PlannerInfo *root, Plan *plan, Bitmapset *valid_params,
 			break;
 
 		case T_SeqScan:
+		case T_PartialSeqScan:
+		case T_Funnel:
 			context.paramids = bms_add_members(context.paramids, scan_params);
 			break;
 
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 1395a21..c1ffe78 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -706,6 +706,53 @@ create_seqscan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer)
 }
 
 /*
+ * create_partialseqscan_path
+ *	  Creates a path corresponding to a partial sequential scan, returning the
+ *	  pathnode.
+ */
+Path *
+create_partialseqscan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer)
+{
+	Path	   *pathnode = makeNode(Path);
+
+	pathnode->pathtype = T_PartialSeqScan;
+	pathnode->parent = rel;
+	pathnode->param_info = get_baserel_parampathinfo(root, rel,
+													 required_outer);
+	pathnode->pathkeys = NIL;	/* seqscan has unordered result */
+
+	cost_seqscan(pathnode, root, rel, pathnode->param_info);
+
+	return pathnode;
+}
+
+/*
+ * create_funnel_path
+ *
+ *	  Creates a path corresponding to a funnel scan, returning the
+ *	  pathnode.
+ */
+FunnelPath *
+create_funnel_path(PlannerInfo *root, RelOptInfo *rel,
+							Path* subpath, int nWorkers)
+{
+	FunnelPath	   *pathnode = makeNode(FunnelPath);
+
+	pathnode->path.pathtype = T_Funnel;
+	pathnode->path.parent = rel;
+	pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
+													 NULL);
+	pathnode->path.pathkeys = NIL;	/* seqscan has unordered result */
+
+	pathnode->subpath = subpath;
+	pathnode->num_workers = nWorkers;
+
+	cost_funnel(pathnode, root, rel, pathnode->path.param_info, nWorkers);
+
+	return pathnode;
+}
+
+/*
  * create_index_path
  *	  Creates a path node for an index scan.
  *
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 71c2321..f056bd5 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -12,7 +12,8 @@ subdir = src/backend/postmaster
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = autovacuum.o bgworker.o bgwriter.o checkpointer.o fork_process.o \
-	pgarch.o pgstat.o postmaster.o startup.o syslogger.o walwriter.o
+OBJS = autovacuum.o backendworker.o bgworker.o bgwriter.o checkpointer.o \
+	fork_process.o pgarch.o pgstat.o postmaster.o startup.o syslogger.o \
+	walwriter.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/backendworker.c b/src/backend/postmaster/backendworker.c
new file mode 100644
index 0000000..0c38e60
--- /dev/null
+++ b/src/backend/postmaster/backendworker.c
@@ -0,0 +1,410 @@
+/*-------------------------------------------------------------------------
+ *
+ * backendworker.c
+ *	  Support routines for setting up backend workers.
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/postmaster/backendworker.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * INTERFACE ROUTINES
+ *		InitializeParallelWorkers				Setup dynamic shared memory and parallel backend workers.
+ */
+#include "postgres.h"
+
+#include "access/xact.h"
+#include "access/parallel.h"
+#include "commands/dbcommands.h"
+#include "commands/async.h"
+#include "executor/nodeFunnel.h"
+#include "miscadmin.h"
+#include "nodes/parsenodes.h"
+#include "optimizer/planmain.h"
+#include "optimizer/planner.h"
+#include "postmaster/backendworker.h"
+#include "storage/ipc.h"
+#include "storage/procsignal.h"
+#include "storage/procarray.h"
+#include "storage/shm_toc.h"
+#include "storage/spin.h"
+#include "tcop/tcopprot.h"
+#include "utils/builtins.h"
+#include "utils/memutils.h"
+#include "utils/resowner.h"
+
+
+#define PARALLEL_TUPLE_QUEUE_SIZE					65536
+
+static void ParallelQueryMain(dsm_segment *seg, shm_toc *toc);
+static void
+EstimateParallelSupportInfoSpace(ParallelContext *pcxt, ParamListInfo params,
+								 int instOptions, Size *params_size);
+static void
+StoreParallelSupportInfo(ParallelContext *pcxt, ParamListInfo params,
+						 int instOptions, int params_size,
+						 char **inst_options_space);
+static void
+EstimatePartialSeqScanSpace(ParallelContext *pcxt, EState *estate,
+							char *plannedstmt_str, Size *plannedstmt_len,
+							Size *pscan_size);
+static void
+StorePartialSeqScan(ParallelContext *pcxt, EState *estate, Relation rel,
+					 char *plannedstmt_str, ParallelHeapScanDesc *pscan,
+					 Size plannedstmt_size, Size pscan_size);
+static void EstimateResponseQueueSpace(ParallelContext *pcxt);
+static void
+StoreResponseQueue(ParallelContext *pcxt,
+				   shm_mq_handle ***responseqp);
+static void
+GetPlannedStmt(shm_toc *toc, PlannedStmt **plannedstmt);
+static void
+GetParallelSupportInfo(shm_toc *toc, ParamListInfo *params,
+					   int *inst_options, char **instrument);
+static void
+SetupResponseQueue(dsm_segment *seg, shm_toc *toc, shm_mq **mq,
+				   shm_mq_handle **responseq);
+
+
+/*
+ * EstimateParallelSupportInfoSpace
+ *
+ * Estimate the amount of space required to record information of
+ * bind parameters and instrumentation information that need to be
+ * retrieved from parallel workers.
+ */
+void
+EstimateParallelSupportInfoSpace(ParallelContext *pcxt, ParamListInfo params,
+								 int instOptions, Size *params_size)
+{
+	*params_size = EstimateBoundParametersSpace(params);
+	shm_toc_estimate_chunk(&pcxt->estimator, *params_size);
+
+	/* account for instrumentation options. */
+	shm_toc_estimate_chunk(&pcxt->estimator, sizeof(int));
+
+	/*
+	 * We expect each worker to populate the instrumentation structure
+	 * allocated by master backend and then master backend will aggregate
+	 * all the information, so account it for each worker.
+	 */
+	if (instOptions)
+	{
+		shm_toc_estimate_chunk(&pcxt->estimator,
+							   sizeof(Instrumentation) * pcxt->nworkers);
+		/* keys for parallel support information. */
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+	}
+
+	/* keys for parallel support information. */
+	shm_toc_estimate_keys(&pcxt->estimator, 2);
+}
+
+/*
+ * StoreParallelSupportInfo
+ * 
+ * Sets up the bind parameters and instrumentation information
+ * required for parallel execution.
+ */
+void
+StoreParallelSupportInfo(ParallelContext *pcxt, ParamListInfo params,
+						 int instOptions, int params_size,
+						 char **inst_options_space)
+{
+	char	*paramsdata;
+	int		*inst_options;
+
+	/*
+	 * Store the bind parameter list in dynamic shared memory.  This is
+	 * used for parameters in prepared queries.
+	 */
+	paramsdata = shm_toc_allocate(pcxt->toc, params_size);
+	SerializeBoundParams(params, params_size, paramsdata);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_PARAMS, paramsdata);
+
+	/* Store instrument options in dynamic shared memory. */
+	inst_options = shm_toc_allocate(pcxt->toc, sizeof(int));
+	*inst_options = instOptions;
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_INST_OPTIONS, inst_options);
+
+	/*
+	 * Allocate space for instrumentation information to be filled by
+	 * each worker.
+	 */
+	if (instOptions)
+	{
+		*inst_options_space =
+			shm_toc_allocate(pcxt->toc, sizeof(Instrumentation) * pcxt->nworkers);
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_INST_INFO, *inst_options_space);
+	}
+}
+
+/*
+ * EstimatePartialSeqScanSpace
+ *
+ * Estimate the amount of space required to record information of
+ * planned statement and parallel heap scan descriptor that need
+ * to be copied to parallel workers.
+ */
+void
+EstimatePartialSeqScanSpace(ParallelContext *pcxt, EState *estate,
+							char *plannedstmt_str, Size *plannedstmt_len,
+							Size *pscan_size)
+{
+	/* Estimate space for partial seq. scan specific contents. */
+	*plannedstmt_len = strlen(plannedstmt_str) + 1;
+	shm_toc_estimate_chunk(&pcxt->estimator, *plannedstmt_len);
+
+	*pscan_size = heap_parallelscan_estimate(estate->es_snapshot);
+	shm_toc_estimate_chunk(&pcxt->estimator, *pscan_size);
+
+	/* keys for parallel support information. */
+	shm_toc_estimate_keys(&pcxt->estimator, 2);
+}
+
+/*
+ * StorePartialSeqScan
+ * 
+ * Sets up the planned statement and the parallel heap scan descriptor
+ * for parallel sequential scan.
+ */
+void
+StorePartialSeqScan(ParallelContext *pcxt, EState *estate, Relation rel,
+					 char *plannedstmt_str, ParallelHeapScanDesc *pscan,
+					 Size plannedstmt_size, Size pscan_size)
+{
+	char		*plannedstmtdata;
+
+	/* Store the planned statement string in dynamic shared memory. */
+	plannedstmtdata = shm_toc_allocate(pcxt->toc, plannedstmt_size);
+	memcpy(plannedstmtdata, plannedstmt_str, plannedstmt_size);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_PLANNEDSTMT, plannedstmtdata);
+
+	/* Store parallel heap scan descriptor in dynamic shared memory. */
+	*pscan = shm_toc_allocate(pcxt->toc, pscan_size);
+	heap_parallelscan_initialize(*pscan, rel, estate->es_snapshot);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_SCAN, *pscan);
+}
+
+/*
+ * EstimateResponseQueueSpace
+ *
+ * Estimate the amount of space required to record information of
+ * tuple queues that need to be established between parallel workers
+ * and master backend.
+ */
+void
+EstimateResponseQueueSpace(ParallelContext *pcxt)
+{
+	/* Estimate space for the per-worker tuple queues. */
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   (Size) PARALLEL_TUPLE_QUEUE_SIZE * pcxt->nworkers);
+
+	/* keys for response queue. */
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/*
+ * StoreResponseQueue
+ * 
+ * Sets up the response queues that backend workers use to
+ * return tuples to the main backend.
+ */
+void
+StoreResponseQueue(ParallelContext *pcxt,
+				   shm_mq_handle ***responseqp)
+{
+	shm_mq		*mq;
+	char		*tuple_queue_space;
+	int			i;
+
+	/* Allocate memory for shared memory queue handles. */
+	*responseqp = (shm_mq_handle**) palloc(pcxt->nworkers * sizeof(shm_mq_handle*));
+
+	/*
+	 * Establish one message queue per worker in dynamic shared memory.
+	 * These queues should be used to transmit tuple data.
+	 */
+	tuple_queue_space =
+	   shm_toc_allocate(pcxt->toc, PARALLEL_TUPLE_QUEUE_SIZE * pcxt->nworkers);
+	for (i = 0; i < pcxt->nworkers; ++i)
+	{
+		mq = shm_mq_create(tuple_queue_space + i * PARALLEL_TUPLE_QUEUE_SIZE,
+						   (Size) PARALLEL_TUPLE_QUEUE_SIZE);
+		
+		shm_mq_set_receiver(mq, MyProc);
+
+		/*
+		 * Attach the queue before launching a worker, so that we'll automatically
+		 * detach the queue if we error out.  (Otherwise, the worker might sit
+		 * there trying to write the queue long after we've gone away.)
+		 */
+		(*responseqp)[i] = shm_mq_attach(mq, pcxt->seg, NULL);
+	}
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_TUPLE_QUEUE, tuple_queue_space);
+}
+
+/*
+ * InitializeParallelWorkers
+ *
+ *	Sets up the required infrastructure for backend workers to
+ *	perform execution and return results to the main backend.
+ */
+void
+InitializeParallelWorkers(Plan *plan, EState *estate, Relation rel,
+						  char **inst_options_space,
+						  shm_mq_handle ***responseqp, ParallelContext **pcxtp,
+						  ParallelHeapScanDesc *pscan, int nWorkers)
+{
+	bool		already_in_parallel_mode = IsInParallelMode();
+	Size		params_size, pscan_size, plannedstmt_size;
+	char	   *plannedstmt_str;
+	PlannedStmt	*plannedstmt;
+	ParallelContext *pcxt;
+
+	if (!already_in_parallel_mode)
+		EnterParallelMode();
+
+	pcxt = CreateParallelContext(ParallelQueryMain, nWorkers);
+
+	plannedstmt = create_worker_scan_plannedstmt((PartialSeqScan *)plan,
+												 estate->es_range_table);
+	plannedstmt_str = nodeToString(plannedstmt);
+
+	EstimatePartialSeqScanSpace(pcxt, estate, plannedstmt_str,
+								&plannedstmt_size, &pscan_size);
+	EstimateParallelSupportInfoSpace(pcxt, estate->es_param_list_info,
+									 estate->es_instrument, &params_size);
+	EstimateResponseQueueSpace(pcxt);
+
+	InitializeParallelDSM(pcxt);
+	
+	StorePartialSeqScan(pcxt, estate, rel, plannedstmt_str,
+						pscan, plannedstmt_size, pscan_size);
+
+	StoreParallelSupportInfo(pcxt, estate->es_param_list_info,
+							 estate->es_instrument,
+							 params_size, inst_options_space);
+	StoreResponseQueue(pcxt, responseqp);
+
+	/* Return results to caller. */
+	*pcxtp = pcxt;
+}
+
+/*
+ * GetParallelSupportInfo
+ *
+ * Look up based on keys in dynamic shared memory segment
+ * and get the bind parameters and instrumentation information
+ * required to perform parallel operation.
+ */
+void
+GetParallelSupportInfo(shm_toc *toc, ParamListInfo *params,
+					   int *inst_options, char **instrument)
+{
+	char		*paramsdata;
+	char		*inst_options_space;
+	int			*instoptions;
+
+	paramsdata = shm_toc_lookup(toc, PARALLEL_KEY_PARAMS);
+	instoptions	= shm_toc_lookup(toc, PARALLEL_KEY_INST_OPTIONS);
+
+	*params = RestoreBoundParams(paramsdata);
+
+	*inst_options = *instoptions;
+	if (*inst_options)
+	{
+		inst_options_space = shm_toc_lookup(toc, PARALLEL_KEY_INST_INFO);
+		*instrument = (inst_options_space +
+			ParallelWorkerNumber * sizeof(Instrumentation));
+	}
+}
+
+/*
+ * GetPlannedStmt
+ *
+ * Look up based on keys in dynamic shared memory segment
+ * and get the planned statement required to perform
+ * parallel operation.
+ */
+void
+GetPlannedStmt(shm_toc *toc, PlannedStmt **plannedstmt)
+{
+	char		*plannedstmtdata;
+
+	plannedstmtdata = shm_toc_lookup(toc, PARALLEL_KEY_PLANNEDSTMT);
+
+	*plannedstmt = (PlannedStmt *) stringToNode(plannedstmtdata);
+
+	/* Fill in opfuncid values if missing */
+	fix_opfuncids((Node*) (*plannedstmt)->planTree->qual);
+	fix_opfuncids((Node*) (*plannedstmt)->planTree->targetlist);
+}
+
+/*
+ * SetupResponseQueue
+ *
+ * Look up based on keys in dynamic shared memory segment
+ * and get the tuple queue information for a particular worker,
+ * attach to the queue and redirect all further responses from
+ * worker backend via that queue.
+ */
+void
+SetupResponseQueue(dsm_segment *seg, shm_toc *toc, shm_mq **mq,
+				   shm_mq_handle **responseq)
+{
+	char		*tuple_queue_space;
+
+	tuple_queue_space = shm_toc_lookup(toc, PARALLEL_KEY_TUPLE_QUEUE);
+	*mq = (shm_mq *) (tuple_queue_space +
+		ParallelWorkerNumber * PARALLEL_TUPLE_QUEUE_SIZE);
+
+	shm_mq_set_sender(*mq, MyProc);
+	*responseq = shm_mq_attach(*mq, seg, NULL);
+}
+
+/*
+ * ParallelQueryMain
+ *
+ * Execute the operation to return the tuples or other information
+ * to parallelism driving node.
+ */
+void
+ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
+{
+	shm_mq			*mq;
+	shm_mq_handle	*responseq;
+	PlannedStmt		*plannedstmt;
+	ParamListInfo	params;
+	int				inst_options;
+	char			*instrument = NULL;
+	ParallelStmt	*parallelstmt;
+
+	SetupResponseQueue(seg, toc, &mq, &responseq);
+
+	GetPlannedStmt(toc, &plannedstmt);
+	GetParallelSupportInfo(toc, &params, &inst_options, &instrument);
+
+	parallelstmt = palloc(sizeof(ParallelStmt));
+
+	parallelstmt->plannedstmt = plannedstmt;
+	parallelstmt->params	= params;
+	parallelstmt->inst_options = inst_options;
+	parallelstmt->instrument = instrument;
+	parallelstmt->toc = toc;
+	parallelstmt->responseq = responseq;
+
+	/* Execute the worker command. */
+	exec_parallel_stmt(parallelstmt);
+
+	/*
+	 * Once we are done with sending tuples, detach from
+	 * shared memory message queue used to send tuples.
+	 */
+	shm_mq_detach(mq);
+}
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index ac431e5..4c303dd 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -103,6 +103,7 @@
 #include "miscadmin.h"
 #include "pg_getopt.h"
 #include "pgstat.h"
+#include "optimizer/cost.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/bgworker_internals.h"
 #include "postmaster/fork_process.h"
@@ -835,6 +836,12 @@ PostmasterMain(int argc, char *argv[])
 		ereport(ERROR,
 				(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"archive\", \"hot_standby\", or \"logical\"")));
 
+	if (parallel_seqscan_degree >= MaxConnections)
+	{
+		write_stderr("%s: parallel_seqscan_degree must be less than max_connections\n", progname);
+		ExitPostmaster(1);
+	}
+
 	/*
 	 * Other one-time internal sanity checks can go here, if they are fast.
 	 * (Put any slow processing further down, after postmaster.pid creation.)
diff --git a/src/backend/tcop/dest.c b/src/backend/tcop/dest.c
index bcf3895..7a9ce3e 100644
--- a/src/backend/tcop/dest.c
+++ b/src/backend/tcop/dest.c
@@ -34,6 +34,7 @@
 #include "commands/createas.h"
 #include "commands/matview.h"
 #include "executor/functions.h"
+#include "executor/tqueue.h"
 #include "executor/tstoreReceiver.h"
 #include "libpq/libpq.h"
 #include "libpq/pqformat.h"
@@ -129,6 +130,9 @@ CreateDestReceiver(CommandDest dest)
 
 		case DestTransientRel:
 			return CreateTransientRelDestReceiver(InvalidOid);
+
+		case DestTupleQueue:
+			return CreateTupleQueueDestReceiver();
 	}
 
 	/* should never get here */
@@ -162,6 +166,7 @@ EndCommand(const char *commandTag, CommandDest dest)
 		case DestCopyOut:
 		case DestSQLFunction:
 		case DestTransientRel:
+		case DestTupleQueue:
 			break;
 	}
 }
@@ -204,6 +209,7 @@ NullCommand(CommandDest dest)
 		case DestCopyOut:
 		case DestSQLFunction:
 		case DestTransientRel:
+		case DestTupleQueue:
 			break;
 	}
 }
@@ -248,6 +254,7 @@ ReadyForQuery(CommandDest dest)
 		case DestCopyOut:
 		case DestSQLFunction:
 		case DestTransientRel:
+		case DestTupleQueue:
 			break;
 	}
 }
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index ea2a432..17f322f 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -42,6 +42,7 @@
 #include "catalog/pg_type.h"
 #include "commands/async.h"
 #include "commands/prepare.h"
+#include "executor/tqueue.h"
 #include "libpq/libpq.h"
 #include "libpq/pqformat.h"
 #include "libpq/pqsignal.h"
@@ -55,6 +56,7 @@
 #include "pg_getopt.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/postmaster.h"
+#include "postmaster/backendworker.h"
 #include "replication/slot.h"
 #include "replication/walsender.h"
 #include "rewrite/rewriteHandler.h"
@@ -1191,6 +1193,80 @@ exec_simple_query(const char *query_string)
 }
 
 /*
+ * exec_parallel_stmt
+ *
+ * Execute the plan for backend worker.
+ */
+void
+exec_parallel_stmt(ParallelStmt *parallelstmt)
+{
+	DestReceiver *receiver;
+	QueryDesc	*queryDesc;
+	MemoryContext oldcontext;
+	MemoryContext	plancontext;
+
+	set_ps_display("SELECT", false);
+
+	/*
+	 * Unlike exec_simple_query(), in backend worker we won't allow
+	 * transaction control statements, so we can allow plancontext
+	 * to be created in TopTransaction context.
+	 */
+	plancontext = AllocSetContextCreate(CurrentMemoryContext,
+										 "worker plan",
+										 ALLOCSET_DEFAULT_MINSIZE,
+										 ALLOCSET_DEFAULT_INITSIZE,
+										 ALLOCSET_DEFAULT_MAXSIZE);
+
+	oldcontext = MemoryContextSwitchTo(plancontext);
+
+	if (parallelstmt->inst_options)
+		receiver = None_Receiver;
+	else
+	{
+		receiver = CreateDestReceiver(DestTupleQueue);
+		SetTupleQueueDestReceiverParams(receiver, parallelstmt->responseq);
+	}
+
+	/* Create a QueryDesc for the query */
+	queryDesc = CreateQueryDesc(parallelstmt->plannedstmt, "",
+								GetActiveSnapshot(), InvalidSnapshot,
+								receiver, parallelstmt->params,
+								parallelstmt->inst_options);
+
+	queryDesc->toc = parallelstmt->toc;
+
+	PushActiveSnapshot(queryDesc->snapshot);
+
+	/* call ExecutorStart to prepare the plan for execution */
+	ExecutorStart(queryDesc, 0);
+
+	/* run the plan */
+	ExecutorRun(queryDesc, ForwardScanDirection, 0L);
+
+	/* run cleanup too */
+	ExecutorFinish(queryDesc);
+
+	/*
+	 * copy instrumentation information into shared memory if requested
+	 * by master backend.
+	 */
+	if (parallelstmt->inst_options)
+		memcpy(parallelstmt->instrument,
+			   queryDesc->planstate->instrument,
+			   sizeof(Instrumentation));
+
+	ExecutorEnd(queryDesc);
+
+	PopActiveSnapshot();
+
+	FreeQueryDesc(queryDesc);
+
+	if (!parallelstmt->inst_options)
+		(*receiver->rDestroy) (receiver);
+}
+
+/*
  * exec_parse_message
  *
  * Execute a "Parse" protocol message.
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 9c14e8a..0bbc67b 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -80,6 +80,7 @@ CreateQueryDesc(PlannedStmt *plannedstmt,
 	qd->params = params;		/* parameter values passed into query */
 	qd->instrument_options = instrument_options;		/* instrumentation
 														 * wanted? */
+	qd->toc = NULL;		/* need to be set by the caller before ExecutorStart */
 
 	/* null these fields until set by ExecutorStart */
 	qd->tupDesc = NULL;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 791543e..abc2b8f 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -608,6 +608,8 @@ const char *const config_group_names[] =
 	gettext_noop("Statistics / Query and Index Statistics Collector"),
 	/* AUTOVACUUM */
 	gettext_noop("Autovacuum"),
+	/* PARALLEL_QUERY */
+	gettext_noop("Parallel Query"),
 	/* CLIENT_CONN */
 	gettext_noop("Client Connection Defaults"),
 	/* CLIENT_CONN_STATEMENT */
@@ -2537,6 +2539,16 @@ static struct config_int ConfigureNamesInt[] =
 	},
 
 	{
+		{"parallel_seqscan_degree", PGC_SUSET, PARALLEL_QUERY,
+			gettext_noop("Sets the maximum number of simultaneously running backend worker processes."),
+			NULL
+		},
+		&parallel_seqscan_degree,
+		0, 0, MAX_BACKENDS,
+		NULL, NULL, NULL
+	},
+
+	{
 		{"autovacuum_work_mem", PGC_SIGHUP, RESOURCES_MEM,
 			gettext_noop("Sets the maximum memory to be used by each autovacuum worker process."),
 			NULL,
@@ -2724,6 +2736,36 @@ static struct config_real ConfigureNamesReal[] =
 		DEFAULT_CPU_OPERATOR_COST, 0, DBL_MAX,
 		NULL, NULL, NULL
 	},
+	{
+		{"cpu_tuple_comm_cost", PGC_USERSET, QUERY_TUNING_COST,
+			gettext_noop("Sets the planner's estimate of the cost of "
+						 "passing each tuple (row) from worker to master backend."),
+			NULL
+		},
+		&cpu_tuple_comm_cost,
+		DEFAULT_CPU_TUPLE_COMM_COST, 0, DBL_MAX,
+		NULL, NULL, NULL
+	},
+	{
+		{"parallel_setup_cost", PGC_USERSET, QUERY_TUNING_COST,
+			gettext_noop("Sets the planner's estimate of the cost of "
+						 "setting up environment (shared memory) for parallelism."),
+			NULL
+		},
+		&parallel_setup_cost,
+		DEFAULT_PARALLEL_SETUP_COST, 0, DBL_MAX,
+		NULL, NULL, NULL
+	},
+	{
+		{"parallel_startup_cost", PGC_USERSET, QUERY_TUNING_COST,
+			gettext_noop("Sets the planner's estimate of the cost of "
+						 "starting parallel workers."),
+			NULL
+		},
+		&parallel_startup_cost,
+		DEFAULT_PARALLEL_STARTUP_COST, 0, DBL_MAX,
+		NULL, NULL, NULL
+	},
 
 	{
 		{"cursor_tuple_fraction", PGC_USERSET, QUERY_TUNING_OTHER,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index f8f9ce1..fbe6042 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -290,6 +290,9 @@
 #cpu_tuple_cost = 0.01			# same scale as above
 #cpu_index_tuple_cost = 0.005		# same scale as above
 #cpu_operator_cost = 0.0025		# same scale as above
+#cpu_tuple_comm_cost = 0.1		# same scale as above
+#parallel_setup_cost = 0.0	# same scale as above
+#parallel_startup_cost = 0.0	# same scale as above
 #effective_cache_size = 4GB
 
 # - Genetic Query Optimizer -
@@ -500,6 +503,11 @@
 					# autovacuum, -1 means use
 					# vacuum_cost_limit
 
+#------------------------------------------------------------------------------
+# PARALLEL_QUERY PARAMETERS
+#------------------------------------------------------------------------------
+
+#parallel_seqscan_degree = 0		# max number of worker backend subprocesses
 
 #------------------------------------------------------------------------------
 # CLIENT CONNECTION DEFAULTS
diff --git a/src/include/access/parallel.h b/src/include/access/parallel.h
index 0685e64..9d3d5e5 100644
--- a/src/include/access/parallel.h
+++ b/src/include/access/parallel.h
@@ -47,6 +47,8 @@ typedef struct ParallelContext
 extern bool ParallelMessagePending;
 extern int ParallelWorkerNumber;
 
+extern int ParallelWorkerNumber;
+
 extern ParallelContext *CreateParallelContext(parallel_worker_main_type entrypoint, int nworkers);
 extern ParallelContext *CreateParallelContextForExternalFunction(char *library_name, char *function_name, int nworkers);
 extern void InitializeParallelDSM(ParallelContext *);
diff --git a/src/include/access/shmmqam.h b/src/include/access/shmmqam.h
new file mode 100644
index 0000000..80d06ac
--- /dev/null
+++ b/src/include/access/shmmqam.h
@@ -0,0 +1,36 @@
+/*-------------------------------------------------------------------------
+ *
+ * shmmqam.h
+ *	  POSTGRES shared memory queue access method definitions.
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/shmmqam.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef SHMMQAM_H
+#define SHMMQAM_H
+
+#include "access/relscan.h"
+#include "executor/tqueue.h"
+#include "libpq/pqmq.h"
+
+
+/* Private state maintained across calls to shm_getnext. */
+typedef struct worker_result_state
+{
+	bool		all_workers_done;
+	bool		local_scan_done;
+} worker_result_state;
+
+typedef struct worker_result_state *worker_result;
+
+extern worker_result ExecInitWorkerResult(void);
+extern HeapTuple shm_getnext(HeapScanDesc scanDesc, worker_result resultState,
+							 TupleQueueFunnel *funnel, ScanDirection direction,
+							 bool *fromheap);
+
+#endif   /* SHMMQAM_H */
diff --git a/src/include/executor/execdesc.h b/src/include/executor/execdesc.h
index a2381cd..56b7c75 100644
--- a/src/include/executor/execdesc.h
+++ b/src/include/executor/execdesc.h
@@ -42,6 +42,7 @@ typedef struct QueryDesc
 	DestReceiver *dest;			/* the destination for tuple output */
 	ParamListInfo params;		/* param values being passed in */
 	int			instrument_options;		/* OR of InstrumentOption flags */
+	shm_toc		*toc;			/* to fetch the information from dsm */
 
 	/* These fields are set by ExecutorStart */
 	TupleDesc	tupDesc;		/* descriptor for result tuples */
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 1c3b2b0..e8522fe 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -69,5 +69,6 @@ extern Instrumentation *InstrAlloc(int n, int instrument_options);
 extern void InstrStartNode(Instrumentation *instr);
 extern void InstrStopNode(Instrumentation *instr, double nTuples);
 extern void InstrEndLoop(Instrumentation *instr);
+extern void InstrAggNode(Instrumentation *instr1, Instrumentation *instr2);
 
 #endif   /* INSTRUMENT_H */
diff --git a/src/include/executor/nodeFunnel.h b/src/include/executor/nodeFunnel.h
new file mode 100644
index 0000000..df7e11e
--- /dev/null
+++ b/src/include/executor/nodeFunnel.h
@@ -0,0 +1,24 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeFunnel.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeFunnel.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEFUNNEL_H
+#define NODEFUNNEL_H
+
+#include "nodes/execnodes.h"
+
+extern FunnelState *ExecInitFunnel(Funnel *node, EState *estate, int eflags);
+extern TupleTableSlot *ExecFunnel(FunnelState *node);
+extern void ExecEndFunnel(FunnelState *node);
+
+
+#endif   /* NODEFUNNEL_H */
diff --git a/src/include/executor/nodePartialSeqscan.h b/src/include/executor/nodePartialSeqscan.h
new file mode 100644
index 0000000..f02bcca
--- /dev/null
+++ b/src/include/executor/nodePartialSeqscan.h
@@ -0,0 +1,23 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodePartialSeqscan.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodePartialSeqscan.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEPARTIALSEQSCAN_H
+#define NODEPARTIALSEQSCAN_H
+
+#include "nodes/execnodes.h"
+
+extern PartialSeqScanState *ExecInitPartialSeqScan(PartialSeqScan *node, EState *estate, int eflags);
+extern TupleTableSlot *ExecPartialSeqScan(PartialSeqScanState *node);
+extern void ExecEndPartialSeqScan(PartialSeqScanState *node);
+
+#endif   /* NODEPARTIALSEQSCAN_H */
diff --git a/src/include/executor/tqueue.h b/src/include/executor/tqueue.h
new file mode 100644
index 0000000..c979233
--- /dev/null
+++ b/src/include/executor/tqueue.h
@@ -0,0 +1,34 @@
+/*-------------------------------------------------------------------------
+ *
+ * tqueue.h
+ *	  Use shm_mq to send & receive tuples between parallel backends
+ *
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/tqueue.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef TQUEUE_H
+#define TQUEUE_H
+
+#include "storage/shm_mq.h"
+#include "tcop/dest.h"
+
+/* Use this to send tuples to a shm_mq. */
+extern DestReceiver *CreateTupleQueueDestReceiver(void);
+extern void SetTupleQueueDestReceiverParams(DestReceiver *self,
+						shm_mq_handle *handle);
+
+/* Use these to receive tuples from a shm_mq. */
+typedef struct TupleQueueFunnel TupleQueueFunnel;
+extern TupleQueueFunnel *CreateTupleQueueFunnel(void);
+extern void DestroyTupleQueueFunnel(TupleQueueFunnel *funnel);
+extern void RegisterTupleQueueOnFunnel(TupleQueueFunnel *, shm_mq_handle *);
+extern HeapTuple TupleQueueFunnelNext(TupleQueueFunnel *, bool nowait,
+					 bool *done);
+
+#endif   /* TQUEUE_H */
diff --git a/src/include/executor/tuptable.h b/src/include/executor/tuptable.h
index 48f84bf..e5dec1e 100644
--- a/src/include/executor/tuptable.h
+++ b/src/include/executor/tuptable.h
@@ -127,6 +127,8 @@ typedef struct TupleTableSlot
 	MinimalTuple tts_mintuple;	/* minimal tuple, or NULL if none */
 	HeapTupleData tts_minhdr;	/* workspace for minimal-tuple-only case */
 	long		tts_off;		/* saved state for slot_deform_tuple */
+	bool		tts_fromheap;	/* indicates whether the tuple is fetched from
+								   heap or shared memory message queue */
 } TupleTableSlot;
 
 #define TTS_HAS_PHYSICAL_TUPLE(slot)  \
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 59b17f3..323b35b 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -16,9 +16,13 @@
 
 #include "access/genam.h"
 #include "access/heapam.h"
+#include "access/parallel.h"
+#include "access/shmmqam.h"
 #include "executor/instrument.h"
+#include "executor/tqueue.h"
 #include "nodes/params.h"
 #include "nodes/plannodes.h"
+#include "storage/shm_mq.h"
 #include "utils/reltrigger.h"
 #include "utils/sortsupport.h"
 #include "utils/tuplestore.h"
@@ -389,6 +393,12 @@ typedef struct EState
 	List	   *es_auxmodifytables;		/* List of secondary ModifyTableStates */
 
 	/*
+	 * This is required for parallel plan execution to fetch the
+	 * information from dsm.
+	 */
+	shm_toc		*toc;
+
+	/*
 	 * this ExprContext is for per-output-tuple operations, such as constraint
 	 * checks and index-value computations.  It will be reset for each output
 	 * tuple.  Note that it will be created only if needed.
@@ -1213,6 +1223,29 @@ typedef struct ScanState
 typedef ScanState SeqScanState;
 
 /*
+ * PartialSeqScan uses a bare SeqScanState as its state node, since
+ * it needs no additional fields.
+ */
+typedef SeqScanState PartialSeqScanState;
+
+/*
+ * FunnelState extends ScanState by storing additional information
+ * related to parallel workers.
+ *		dsm_segment		dynamic shared memory segment to setup worker queues
+ *		responseq		shared memory queues to receive data from workers
+ */
+typedef struct FunnelState
+{
+	ScanState		ss;				/* its first field is NodeTag */
+	ParallelContext *pcxt;
+	shm_mq_handle	**responseq;
+	worker_result	pss_workerResult;
+	TupleQueueFunnel *funnel;
+	char			*inst_options_space;
+	bool			fs_workersReady;
+} FunnelState;
+
+/*
  * These structs store information about index quals that don't have simple
  * constant right-hand sides.  See comments for ExecIndexBuildScanKeys()
  * for discussion.
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 97ef0fc..6acbe67 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -51,6 +51,8 @@ typedef enum NodeTag
 	T_BitmapOr,
 	T_Scan,
 	T_SeqScan,
+	T_PartialSeqScan,
+	T_Funnel,
 	T_IndexScan,
 	T_IndexOnlyScan,
 	T_BitmapIndexScan,
@@ -97,6 +99,8 @@ typedef enum NodeTag
 	T_BitmapOrState,
 	T_ScanState,
 	T_SeqScanState,
+	T_PartialSeqScanState,
+	T_FunnelState,
 	T_IndexScanState,
 	T_IndexOnlyScanState,
 	T_BitmapIndexScanState,
@@ -217,6 +221,7 @@ typedef enum NodeTag
 	T_IndexOptInfo,
 	T_ParamPathInfo,
 	T_Path,
+	T_FunnelPath,
 	T_IndexPath,
 	T_BitmapHeapPath,
 	T_BitmapAndPath,
diff --git a/src/include/nodes/params.h b/src/include/nodes/params.h
index a0f7dd0..65b60a0 100644
--- a/src/include/nodes/params.h
+++ b/src/include/nodes/params.h
@@ -103,4 +103,9 @@ typedef struct ParamExecData
 /* Functions found in src/backend/nodes/params.c */
 extern ParamListInfo copyParamList(ParamListInfo from);
 
+extern Size
+EstimateBoundParametersSpace(ParamListInfo params);
+extern void
+SerializeBoundParams(ParamListInfo params, Size maxsize, char *start_address);
+extern ParamListInfo RestoreBoundParams(char *start_address);
 #endif   /* PARAMS_H */
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index ac13302..ea8e240 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -20,9 +20,16 @@
 #ifndef PARSENODES_H
 #define PARSENODES_H
 
+#include "executor/instrument.h"
 #include "nodes/bitmapset.h"
+#include "nodes/params.h"
+#include "nodes/plannodes.h"
 #include "nodes/primnodes.h"
 #include "nodes/value.h"
+#include "nodes/params.h"
+#include "storage/block.h"
+#include "storage/shm_toc.h"
+#include "storage/shm_mq.h"
 #include "utils/lockwaitpolicy.h"
 
 /* Possible sources of a Query */
@@ -156,6 +163,16 @@ typedef struct Query
 								 * depends on to be semantically valid */
 } Query;
 
+/* worker statement required for parallel execution. */
+typedef struct ParallelStmt
+{
+	PlannedStmt		*plannedstmt;
+	ParamListInfo	params;
+	shm_toc			*toc;
+	shm_mq_handle	*responseq;
+	int				inst_options;
+	char			*instrument;
+} ParallelStmt;
 
 /****************************************************************************
  *	Supporting data structures for Parse Trees
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index f6683f0..8099f78 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -18,6 +18,8 @@
 #include "lib/stringinfo.h"
 #include "nodes/bitmapset.h"
 #include "nodes/primnodes.h"
+#include "storage/block.h"
+#include "storage/shm_toc.h"
 #include "utils/lockwaitpolicy.h"
 
 
@@ -279,6 +281,22 @@ typedef struct Scan
 typedef Scan SeqScan;
 
 /* ----------------
+ *		partial sequential scan node
+ * ----------------
+ */
+typedef SeqScan PartialSeqScan;
+
+/* ----------------
+ *		parallel sequential scan node
+ * ----------------
+ */
+typedef struct Funnel
+{
+	Scan		scan;
+	int			num_workers;
+} Funnel;
+
+/* ----------------
  *		index scan node
  *
  * indexqualorig is an implicitly-ANDed list of index qual expressions, each
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 6845a40..df1ab5e 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -103,6 +103,8 @@ typedef struct PlannerGlobal
 
 	bool		hasRowSecurity;	/* row security applied? */
 
+	bool		parallelModeNeeded; /* parallel plans need parallelmode */
+
 } PlannerGlobal;
 
 /* macro for fetching the Plan associated with a SubPlan node */
@@ -737,6 +739,13 @@ typedef struct Path
 	/* pathkeys is a List of PathKey nodes; see above */
 } Path;
 
+typedef struct FunnelPath
+{
+	Path		path;
+	Path	    *subpath;	/* path for each worker */
+	int			num_workers;
+} FunnelPath;
+
 /* Macro for extracting a path's parameterization relids; beware double eval */
 #define PATH_REQ_OUTER(path)  \
 	((path)->param_info ? (path)->param_info->ppi_req_outer : (Relids) NULL)
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 9c2000b..11f0409 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -26,6 +26,14 @@
 #define DEFAULT_CPU_TUPLE_COST	0.01
 #define DEFAULT_CPU_INDEX_TUPLE_COST 0.005
 #define DEFAULT_CPU_OPERATOR_COST  0.0025
+#define DEFAULT_CPU_TUPLE_COMM_COST 0.1
+/*
+ * XXX - We need some experiments to know what could be
+ * appropriate default values for parallel setup and startup
+ * cost.
+ */
+#define	DEFAULT_PARALLEL_SETUP_COST  0.0
+#define	DEFAULT_PARALLEL_STARTUP_COST  0.0
 
 #define DEFAULT_EFFECTIVE_CACHE_SIZE  524288	/* measured in pages */
 
@@ -48,8 +56,12 @@ extern PGDLLIMPORT double random_page_cost;
 extern PGDLLIMPORT double cpu_tuple_cost;
 extern PGDLLIMPORT double cpu_index_tuple_cost;
 extern PGDLLIMPORT double cpu_operator_cost;
+extern PGDLLIMPORT double cpu_tuple_comm_cost;
+extern PGDLLIMPORT double parallel_setup_cost;
+extern PGDLLIMPORT double parallel_startup_cost;
 extern PGDLLIMPORT int effective_cache_size;
 extern Cost disable_cost;
+extern int	parallel_seqscan_degree;
 extern bool enable_seqscan;
 extern bool enable_indexscan;
 extern bool enable_indexonlyscan;
@@ -68,6 +80,8 @@ extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,
 					double index_pages, PlannerInfo *root);
 extern void cost_seqscan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 			 ParamPathInfo *param_info);
+extern void cost_funnel(FunnelPath *path, PlannerInfo *root,
+				RelOptInfo *baserel, ParamPathInfo *param_info, int nWorkers);
 extern void cost_index(IndexPath *path, PlannerInfo *root,
 		   double loop_count);
 extern void cost_bitmap_heap_scan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 9923f0e..7873565 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -32,6 +32,11 @@ extern bool add_path_precheck(RelOptInfo *parent_rel,
 
 extern Path *create_seqscan_path(PlannerInfo *root, RelOptInfo *rel,
 					Relids required_outer);
+extern Path *
+create_partialseqscan_path(PlannerInfo *root, RelOptInfo *rel,
+					Relids required_outer);
+extern FunnelPath *create_funnel_path(PlannerInfo *root,
+						RelOptInfo *rel, Path *subpath, int nWorkers);
 extern IndexPath *create_index_path(PlannerInfo *root,
 				  IndexOptInfo *index,
 				  List *indexclauses,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 6cad92e..391d519 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -46,6 +46,13 @@ extern void debug_print_rel(PlannerInfo *root, RelOptInfo *rel);
 #endif
 
 /*
+ * parallelpath.c
+ *	  routines to generate parallel scan paths
+ */
+
+extern void create_parallelscan_paths(PlannerInfo *root, RelOptInfo *rel);
+
+/*
  * indxpath.c
  *	  routines to generate index paths
  */
diff --git a/src/include/optimizer/planner.h b/src/include/optimizer/planner.h
index cd62aec..3b7ed92 100644
--- a/src/include/optimizer/planner.h
+++ b/src/include/optimizer/planner.h
@@ -14,6 +14,7 @@
 #ifndef PLANNER_H
 #define PLANNER_H
 
+#include "nodes/parsenodes.h"
 #include "nodes/plannodes.h"
 #include "nodes/relation.h"
 
@@ -29,6 +30,8 @@ extern PlannedStmt *planner(Query *parse, int cursorOptions,
 		ParamListInfo boundParams);
 extern PlannedStmt *standard_planner(Query *parse, int cursorOptions,
 				 ParamListInfo boundParams);
+extern PlannedStmt	*
+create_worker_scan_plannedstmt(PartialSeqScan *partialscan, List *rangetable);
 
 extern Plan *subquery_planner(PlannerGlobal *glob, Query *parse,
 				 PlannerInfo *parent_root,
diff --git a/src/include/postmaster/backendworker.h b/src/include/postmaster/backendworker.h
new file mode 100644
index 0000000..1d05d79
--- /dev/null
+++ b/src/include/postmaster/backendworker.h
@@ -0,0 +1,39 @@
+/*--------------------------------------------------------------------
+ * backendworker.h
+ *		POSTGRES backend workers interface
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/postmaster/backendworker.h
+ *--------------------------------------------------------------------
+ */
+#ifndef BACKENDWORKER_H
+#define BACKENDWORKER_H
+
+/*---------------------------------------------------------------------
+ * External module API.
+ *---------------------------------------------------------------------
+ */
+
+#include "libpq/pqmq.h"
+
+/* Table-of-contents constants for our dynamic shared memory segment. */
+#define	PARALLEL_KEY_PLANNEDSTMT	0
+#define	PARALLEL_KEY_PARAMS			1
+#define PARALLEL_KEY_INST_OPTIONS	2
+#define PARALLEL_KEY_INST_INFO		3
+#define PARALLEL_KEY_TUPLE_QUEUE	4
+#define PARALLEL_KEY_SCAN			5
+
+extern int	parallel_seqscan_degree;
+
+extern void InitializeParallelWorkers(Plan *plan, EState *estate,
+									  Relation rel, char **inst_options_space,
+									  shm_mq_handle ***responseqp,
+									  ParallelContext **pcxtp,
+									  ParallelHeapScanDesc *pscan,
+									  int nWorkers);
+
+#endif   /* BACKENDWORKER_H */
diff --git a/src/include/tcop/dest.h b/src/include/tcop/dest.h
index 5bcca3f..b560672 100644
--- a/src/include/tcop/dest.h
+++ b/src/include/tcop/dest.h
@@ -94,7 +94,8 @@ typedef enum
 	DestIntoRel,				/* results sent to relation (SELECT INTO) */
 	DestCopyOut,				/* results sent to COPY TO code */
 	DestSQLFunction,			/* results sent to SQL-language func mgr */
-	DestTransientRel			/* results sent to transient relation */
+	DestTransientRel,			/* results sent to transient relation */
+	DestTupleQueue				/* results sent to tuple queue */
 } CommandDest;
 
 /* ----------------
diff --git a/src/include/tcop/tcopprot.h b/src/include/tcop/tcopprot.h
index 3e17770..489af46 100644
--- a/src/include/tcop/tcopprot.h
+++ b/src/include/tcop/tcopprot.h
@@ -84,5 +84,6 @@ extern void set_debug_options(int debug_flag,
 extern bool set_plan_disabling_options(const char *arg,
 						   GucContext context, GucSource source);
 extern const char *get_stats_option_name(const char *arg);
+extern void exec_parallel_stmt(ParallelStmt *parallelscan);
 
 #endif   /* TCOPPROT_H */
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index cf319af..38855e5 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -85,6 +85,7 @@ enum config_group
 	STATS_MONITORING,
 	STATS_COLLECTOR,
 	AUTOVACUUM,
+	PARALLEL_QUERY,
 	CLIENT_CONN,
 	CLIENT_CONN_STATEMENT,
 	CLIENT_CONN_LOCALE,
#180Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#179)
1 attachment(s)
Re: Parallel Seq Scan

On Wed, Mar 4, 2015 at 6:17 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sun, Feb 22, 2015 at 6:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Feb 17, 2015 at 11:22 AM, Andres Freund <andres@2ndquadrant.com> wrote:

My only "problem" with that description is that I think workers will
have to work on more than one node - it'll be entire subtrees of the
executor tree.

Amit and I had a long discussion about this on Friday while in Boston
together. I previously argued that the master and the slave should be
executing the same node, ParallelSeqScan. However, Amit argued
persuasively that what the master is doing is really pretty different
from what the worker is doing, and that they really ought to be
running two different nodes. This led us to cast about for a better
design, and we came up with something that I think will be much
better.

The basic idea is to introduce a new node called Funnel. A Funnel
node will have a left child but no right child, and its job will be to
fire up a given number of workers. Each worker will execute the plan
which is the left child of the funnel. The funnel node itself will
pull tuples from all of those workers, and can also (if there are no
tuples available from any worker) execute the plan itself.
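
The pull order described above can be sketched as a small single-process toy in C. This is only an illustration under stated assumptions, not the patch's implementation: ToyQueue, ToyFunnel, toy_queue_push, toy_queue_pop and toy_funnel_next are hypothetical names, plain arrays stand in for the shm_mq tuple queues, and integers stand in for tuples. A real Funnel would also have to wait on workers that are still running, which this sketch leaves out.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_WORKERS 8
#define TOY_QUEUE_CAP   16

/* Hypothetical stand-in for one worker's tuple queue (not shm_mq). */
typedef struct ToyQueue
{
	int			items[TOY_QUEUE_CAP];
	int			head;			/* next slot to read */
	int			tail;			/* next slot to write */
} ToyQueue;

/* Hypothetical stand-in for the Funnel node's state. */
typedef struct ToyFunnel
{
	ToyQueue	queues[TOY_MAX_WORKERS];
	int			nworkers;
	int			next;			/* round-robin cursor over worker queues */
	int			local_next;		/* next "tuple" the master would scan itself */
	int			local_end;		/* one past the master's last local "tuple" */
} ToyFunnel;

/* Append a value to a queue; returns false if the queue is full. */
bool
toy_queue_push(ToyQueue *q, int value)
{
	if (q->tail >= TOY_QUEUE_CAP)
		return false;
	q->items[q->tail++] = value;
	return true;
}

/* Remove the oldest value from a queue; returns false if it is empty. */
bool
toy_queue_pop(ToyQueue *q, int *out)
{
	if (q->head == q->tail)
		return false;
	*out = q->items[q->head++];
	return true;
}

/*
 * Fetch the next tuple: try every worker queue once, starting at the
 * round-robin cursor; if no worker has a tuple ready, run the plan
 * locally.  Returns false once the queues and the local scan are empty.
 */
bool
toy_funnel_next(ToyFunnel *f, int *out)
{
	for (int i = 0; i < f->nworkers; i++)
	{
		int			idx = (f->next + i) % f->nworkers;

		if (toy_queue_pop(&f->queues[idx], out))
		{
			f->next = (idx + 1) % f->nworkers;
			return true;
		}
	}

	/* No tuple available from any worker: execute the plan ourselves. */
	if (f->local_next < f->local_end)
	{
		*out = f->local_next++;
		return true;
	}
	return false;
}
```

A caller would fill the queues (standing in for the workers), set local_next and local_end to the master's own share, and call toy_funnel_next() in a loop until it returns false. The round-robin cursor keeps one busy worker from starving the others; whether the real node should prefer worker tuples over its own scan is a design question this sketch does not settle.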

I have modified the patch to introduce a Funnel node (and left child
as PartialSeqScan node). Apart from that, some other noticeable
changes based on feedback include:
a) Master backend forms and sends the planned stmt to each worker;
the earlier patch used to send individual elements and form the planned
stmt in each worker.
b) Passed tuples directly via tuple queue instead of going via
FE-BE protocol.
c) Removed restriction of expressions in target list.
d) Introduced a parallelModeNeeded flag in the PlannerGlobal structure
and set it for Funnel plan.

There is still some work left, like integrating with the
access-parallel-safety patch (using the parallelmodeok flag to decide
whether a parallel path can be generated; Enter/Exit parallel mode is
still done during execution of the funnel node).

I think these are minor points which can be fixed once we decide
on the other major parts of the patch. Find the modified patch attached
with this mail.

Note -
This patch is based on Head (commit-id: d1479011) +
parallel-mode-v6.patch [1] + parallel-heap-scan.patch [2]

[1] /messages/by-id/CA+TgmobCMwFOz-9=hFv=hJ4SH7p=5X6Ga5V=WtT8=huzE6C+Mg@mail.gmail.com
[2] /messages/by-id/CA+TgmoYJETgeAXUsZROnA7BdtWzPtqExPJNTV1GKcaVMgSdhug@mail.gmail.com

Assuming the previous patch is in the right direction, I have enabled
join support for the patch and done some minor cleanup, which leads
to the attached new version.

It is based on commit-id: 5a2a48f0, parallel-mode-v7.patch
and parallel-heap-scan.patch.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

parallel_seqscan_v9.patch (application/octet-stream)
diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index 21721b4..823d5c3 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/access
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS	    = brin common gin gist hash heap index nbtree rmgrdesc spgist transam
+SUBDIRS	    = brin common gin gist hash heap index nbtree rmgrdesc shmmq spgist transam
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 383e15b..d384e8f 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1582,6 +1582,20 @@ heap_beginscan_parallel(Relation relation, ParallelHeapScanDesc parallel_scan)
 }
 
 /* ----------------
+ *		heap_parallel_rescan		- restart a parallel relation scan
+ * ----------------
+ */
+void
+heap_parallel_rescan(ParallelHeapScanDesc pscan,
+					 HeapScanDesc scan)
+{
+	if (pscan != NULL)
+		scan->rs_parallel = pscan;
+
+	heap_rescan(scan,			/* scan desc */
+				NULL);			/* new scan keys */
+}
+/* ----------------
  *		heap_getnext	- retrieve next tuple in scan
  *
  *		Fix to work with index relations.
diff --git a/src/backend/access/shmmq/Makefile b/src/backend/access/shmmq/Makefile
new file mode 100644
index 0000000..aeae8d9
--- /dev/null
+++ b/src/backend/access/shmmq/Makefile
@@ -0,0 +1,17 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for access/shmmq
+#
+# IDENTIFICATION
+#    src/backend/access/shmmq/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/shmmq
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = shmmqam.o 
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/shmmq/shmmqam.c b/src/backend/access/shmmq/shmmqam.c
new file mode 100644
index 0000000..9c57580
--- /dev/null
+++ b/src/backend/access/shmmq/shmmqam.c
@@ -0,0 +1,91 @@
+/*-------------------------------------------------------------------------
+ *
+ * shmmqam.c
+ *	  shared memory queue access method code
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/shmmq/shmmqam.c
+ *
+ *
+ * INTERFACE ROUTINES
+ *		shm_getnext	- retrieve next tuple in queue
+ *
+ * NOTES
+ *	  This file contains the shm_ routines which fetch tuples from
+ *	  the shared memory tuple queues filled by worker backends during
+ *	  a parallel sequential scan.
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup.h"
+#include "access/htup_details.h"
+#include "access/shmmqam.h"
+#include "fmgr.h"
+#include "libpq/libpq.h"
+#include "libpq/pqformat.h"
+#include "utils/lsyscache.h"
+
+
+
+/*
+ * ExecInitWorkerResult
+ *
+ * Initializes the result state to retrieve tuples from worker backends. 
+ */
+worker_result
+ExecInitWorkerResult(void)
+{
+	worker_result	workerResult;
+
+	workerResult = palloc0(sizeof(worker_result_state));
+
+	return workerResult;
+}
+
+/*
+ * shm_getnext
+ *
+ *	Get the next tuple from shared memory queue.  This function
+ *	is responsible for fetching tuples from all the queues associated
+ *	with worker backends used in parallel sequential scan.
+ */
+HeapTuple
+shm_getnext(HeapScanDesc scanDesc, worker_result resultState,
+			TupleQueueFunnel *funnel, ScanDirection direction,
+			bool *fromheap)
+{
+	HeapTuple	tup;
+
+	while (!resultState->all_workers_done || !resultState->local_scan_done)
+	{
+		if (!resultState->all_workers_done)
+		{
+			/* wait only if local scan is done */
+			tup = TupleQueueFunnelNext(funnel, !resultState->local_scan_done,
+									   &resultState->all_workers_done);
+			if (HeapTupleIsValid(tup))
+			{
+				*fromheap = false;
+				return tup;
+			}
+		}
+		if (!resultState->local_scan_done)
+		{
+			tup = heap_getnext(scanDesc, direction);
+			if (HeapTupleIsValid(tup))
+			{
+				*fromheap = true;
+				return tup;
+			}
+			resultState->local_scan_done = true;
+		}
+	}
+
+	return NULL;
+}
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index a951c55..8410afa 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -721,6 +721,7 @@ ExplainPreScanNode(PlanState *planstate, Bitmapset **rels_used)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_Funnel:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -916,6 +917,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_SeqScan:
 			pname = sname = "Seq Scan";
 			break;
+		case T_Funnel:
+			pname = sname = "Funnel";
+			break;
 		case T_IndexScan:
 			pname = sname = "Index Scan";
 			break;
@@ -1065,6 +1069,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_Funnel:
 		case T_BitmapHeapScan:
 		case T_TidScan:
 		case T_SubqueryScan:
@@ -1206,6 +1211,24 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	}
 
 	/*
+	 * Aggregate instrumentation information of all the backend
+	 * workers for parallel sequence scan.
+	 */
+	if (es->analyze && nodeTag(plan) == T_Funnel)
+	{
+		int i;
+		Instrumentation *instrument_worker;
+		int nworkers = ((FunnelState *)planstate)->pcxt->nworkers;
+		char *inst_info_workers = ((FunnelState *)planstate)->inst_options_space;
+
+		for (i = 0; i < nworkers; i++)
+		{
+			instrument_worker = (Instrumentation *)(inst_info_workers + (i * sizeof(Instrumentation)));
+			InstrAggNode(planstate->instrument, instrument_worker);
+		}
+	}
+
+	/*
 	 * We have to forcibly clean up the instrumentation state because we
 	 * haven't done ExecutorEnd yet.  This is pretty grotty ...
 	 *
@@ -1331,6 +1354,14 @@ ExplainNode(PlanState *planstate, List *ancestors,
 				show_instrumentation_count("Rows Removed by Filter", 1,
 										   planstate, es);
 			break;
+		case T_Funnel:
+			show_scan_qual(plan->qual, "Filter", planstate, ancestors, es);
+			if (plan->qual)
+				show_instrumentation_count("Rows Removed by Filter", 1,
+										   planstate, es);
+			ExplainPropertyInteger("Number of Workers",
+				((Funnel *) plan)->num_workers, es);
+			break;
 		case T_FunctionScan:
 			if (es->verbose)
 			{
@@ -2214,6 +2245,7 @@ ExplainTargetRel(Plan *plan, Index rti, ExplainState *es)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_Funnel:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index af707b0..991ff51 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -16,14 +16,15 @@ OBJS = execAmi.o execCurrent.o execGrouping.o execJunk.o execMain.o \
        execProcnode.o execQual.o execScan.o execTuples.o \
        execUtils.o functions.o instrument.o nodeAppend.o nodeAgg.o \
        nodeBitmapAnd.o nodeBitmapOr.o \
-       nodeBitmapHeapscan.o nodeBitmapIndexscan.o nodeCustom.o nodeHash.o \
-       nodeHashjoin.o nodeIndexscan.o nodeIndexonlyscan.o \
+       nodeBitmapHeapscan.o nodeBitmapIndexscan.o nodeCustom.o nodeFunnel.o \
+       nodeHash.o nodeHashjoin.o nodeIndexscan.o nodeIndexonlyscan.o \
        nodeLimit.o nodeLockRows.o \
        nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
        nodeNestloop.o nodeFunctionscan.o nodeRecursiveunion.o nodeResult.o \
-       nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
-       nodeValuesscan.o nodeCtescan.o nodeWorktablescan.o \
+       nodeSeqscan.o nodePartialSeqscan.o nodeSetOp.o nodeSort.o \
+       nodeUnique.o nodeValuesscan.o nodeCtescan.o nodeWorktablescan.o \
        nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
-       nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o spi.o
+       nodeForeignscan.o nodeWindowAgg.o tqueue.o tstoreReceiver.o \
+       spi.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 6ebad2f..268ee3f 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -24,6 +24,7 @@
 #include "executor/nodeCustom.h"
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeFunctionscan.h"
+#include "executor/nodeFunnel.h"
 #include "executor/nodeGroup.h"
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
@@ -155,6 +156,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSeqScan((SeqScanState *) node);
 			break;
 
+		case T_FunnelState:
+			ExecReScanFunnel((FunnelState *) node);
+			break;
+
 		case T_IndexScanState:
 			ExecReScanIndexScan((IndexScanState *) node);
 			break;
@@ -458,6 +463,11 @@ ExecSupportsBackwardScan(Plan *node)
 		case T_CteScan:
 			return TargetListSupportsBackwardScan(node->targetlist);
 
+		/* Backward scan is not supported for parallel sequential scan. */
+		case T_Funnel:
+		case T_PartialSeqScan:
+			return false;
+
 		case T_IndexScan:
 			return IndexSupportsBackwardScan(((IndexScan *) node)->indexid) &&
 				TargetListSupportsBackwardScan(node->targetlist);
diff --git a/src/backend/executor/execCurrent.c b/src/backend/executor/execCurrent.c
index 1c8be25..f13b7bcb 100644
--- a/src/backend/executor/execCurrent.c
+++ b/src/backend/executor/execCurrent.c
@@ -261,6 +261,8 @@ search_plan_tree(PlanState *node, Oid table_oid)
 			 * Relation scan nodes can all be treated alike
 			 */
 		case T_SeqScanState:
+		case T_PartialSeqScanState:
+		case T_FunnelState:
 		case T_IndexScanState:
 		case T_IndexOnlyScanState:
 		case T_BitmapHeapScanState:
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 07526e8..9a3e285 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -181,6 +181,8 @@ standard_ExecutorStart(QueryDesc *queryDesc, int eflags)
 		estate->es_param_exec_vals = (ParamExecData *)
 			palloc0(queryDesc->plannedstmt->nParamExec * sizeof(ParamExecData));
 
+	estate->toc = queryDesc->toc;
+
 	/*
 	 * If non-read-only query, set the command ID to mark output tuples with
 	 */
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 9892499..1a1275c 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -100,6 +100,8 @@
 #include "executor/nodeMergejoin.h"
 #include "executor/nodeModifyTable.h"
 #include "executor/nodeNestloop.h"
+#include "executor/nodePartialSeqscan.h"
+#include "executor/nodeFunnel.h"
 #include "executor/nodeRecursiveunion.h"
 #include "executor/nodeResult.h"
 #include "executor/nodeSeqscan.h"
@@ -190,6 +192,16 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												   estate, eflags);
 			break;
 
+		case T_PartialSeqScan:
+			result = (PlanState *) ExecInitPartialSeqScan((PartialSeqScan *) node,
+														  estate, eflags);
+			break;
+
+		case T_Funnel:
+			result = (PlanState *) ExecInitFunnel((Funnel *) node,
+												  estate, eflags);
+			break;
+
 		case T_IndexScan:
 			result = (PlanState *) ExecInitIndexScan((IndexScan *) node,
 													 estate, eflags);
@@ -406,6 +418,14 @@ ExecProcNode(PlanState *node)
 			result = ExecSeqScan((SeqScanState *) node);
 			break;
 
+		case T_PartialSeqScanState:
+			result = ExecPartialSeqScan((PartialSeqScanState *) node);
+			break;
+
+		case T_FunnelState:
+			result = ExecFunnel((FunnelState *) node);
+			break;
+
 		case T_IndexScanState:
 			result = ExecIndexScan((IndexScanState *) node);
 			break;
@@ -644,6 +664,14 @@ ExecEndNode(PlanState *node)
 			ExecEndSeqScan((SeqScanState *) node);
 			break;
 
+		case T_PartialSeqScanState:
+			ExecEndPartialSeqScan((PartialSeqScanState *) node);
+			break;
+
+		case T_FunnelState:
+			ExecEndFunnel((FunnelState *) node);
+			break;
+
 		case T_IndexScanState:
 			ExecEndIndexScan((IndexScanState *) node);
 			break;
diff --git a/src/backend/executor/execScan.c b/src/backend/executor/execScan.c
index 3f0d809..7916ea3 100644
--- a/src/backend/executor/execScan.c
+++ b/src/backend/executor/execScan.c
@@ -191,13 +191,20 @@ ExecScan(ScanState *node,
 		 * check for non-nil qual here to avoid a function call to ExecQual()
 		 * when the qual is nil ... saves only a few cycles, but they add up
 		 * ...
+		 *
+		 * Also check for non-heap tuples (such tuples come from the shared
+		 * memory message queues in case of a parallel query).  No qualification
+		 * or projection is needed for them, as the worker backend has already
+		 * done both.  This case happens only for a parallel query, where we
+		 * push down the qualification and projection (targetlist)
+		 * information.
 		 */
-		if (!qual || ExecQual(qual, econtext, false))
+		if (!slot->tts_fromheap || !qual || ExecQual(qual, econtext, false))
 		{
 			/*
 			 * Found a satisfactory scan tuple.
 			 */
-			if (projInfo)
+			if (projInfo && slot->tts_fromheap)
 			{
 				/*
 				 * Form a projection tuple, store it in the result tuple slot
@@ -211,6 +218,23 @@ ExecScan(ScanState *node,
 					return resultSlot;
 				}
 			}
+			else if (projInfo && !slot->tts_fromheap)
+			{
+				/*
+				 * Store the tuple we got from shared memory tuple queue
+				 * in projection slot as the worker backend takes care
+				 * of doing projection.  We don't need to free this tuple
+				 * as this is pointing to scan tuple slot which will take
+				 * care of freeing it.
+				 */
+				ExecStoreTuple(econtext->ecxt_scantuple->tts_tuple,	/* tuple to store */
+							   projInfo->pi_slot,	/* slot to store in */
+							   InvalidBuffer, /* buffer associated with this
+											   * tuple */
+							   false);	/* pfree this pointer */
+
+				return projInfo->pi_slot;
+			}
 			else
 			{
 				/*
diff --git a/src/backend/executor/execTuples.c b/src/backend/executor/execTuples.c
index 753754d..4c5bd88 100644
--- a/src/backend/executor/execTuples.c
+++ b/src/backend/executor/execTuples.c
@@ -123,6 +123,7 @@ MakeTupleTableSlot(void)
 	slot->tts_values = NULL;
 	slot->tts_isnull = NULL;
 	slot->tts_mintuple = NULL;
+	slot->tts_fromheap	= true;
 
 	return slot;
 }
@@ -473,6 +474,8 @@ ExecClearTuple(TupleTableSlot *slot)	/* slot in which to store tuple */
 	slot->tts_isempty = true;
 	slot->tts_nvalid = 0;
 
+	slot->tts_fromheap = true;
+
 	return slot;
 }
 
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index 022041b..79eeaee 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -145,6 +145,8 @@ CreateExecutorState(void)
 
 	estate->es_auxmodifytables = NIL;
 
+	estate->toc = NULL;
+
 	estate->es_per_tuple_exprcontext = NULL;
 
 	estate->es_epqTuple = NULL;
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index f5351eb..56e509d 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -21,6 +21,8 @@ BufferUsage pgBufferUsage;
 
 static void BufferUsageAccumDiff(BufferUsage *dst,
 					 const BufferUsage *add, const BufferUsage *sub);
+static void
+BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
 
 
 /* Allocate new instrumentation structure(s) */
@@ -127,6 +129,28 @@ InstrEndLoop(Instrumentation *instr)
 	instr->tuplecount = 0;
 }
 
+/*
+ * Aggregate the instrumentation information.  This is used
+ * to aggregate the information of worker backends.  We only
+ * need to sum the buffer usage and tuple count statistics as
+ * for other timing related statistics it is sufficient to
+ * have the master backend's information.
+ */
+void
+InstrAggNode(Instrumentation *instr1, Instrumentation *instr2)
+{
+	/* count the returned tuples */
+	instr1->tuplecount += instr2->tuplecount;
+
+	instr1->nfiltered1 += instr2->nfiltered1;
+	instr1->nfiltered2 += instr2->nfiltered2;
+
+	/* Add the worker's buffer usage to the node's totals */
+	if (instr1->need_bufusage)
+		BufferUsageAdd(&instr1->bufusage, &instr2->bufusage);
+
+}
+
 /* dst += add - sub */
 static void
 BufferUsageAccumDiff(BufferUsage *dst,
@@ -148,3 +172,21 @@ BufferUsageAccumDiff(BufferUsage *dst,
 	INSTR_TIME_ACCUM_DIFF(dst->blk_write_time,
 						  add->blk_write_time, sub->blk_write_time);
 }
+
+/* dst += add */
+static void
+BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
+{
+	dst->shared_blks_hit += add->shared_blks_hit;
+	dst->shared_blks_read += add->shared_blks_read;
+	dst->shared_blks_dirtied += add->shared_blks_dirtied;
+	dst->shared_blks_written += add->shared_blks_written;
+	dst->local_blks_hit += add->local_blks_hit;
+	dst->local_blks_read += add->local_blks_read;
+	dst->local_blks_dirtied += add->local_blks_dirtied;
+	dst->local_blks_written += add->local_blks_written;
+	dst->temp_blks_read += add->temp_blks_read;
+	dst->temp_blks_written += add->temp_blks_written;
+	INSTR_TIME_ADD(dst->blk_read_time, add->blk_read_time);
+	INSTR_TIME_ADD(dst->blk_write_time, add->blk_write_time);
+}
diff --git a/src/backend/executor/nodeFunnel.c b/src/backend/executor/nodeFunnel.c
new file mode 100644
index 0000000..74e1e44
--- /dev/null
+++ b/src/backend/executor/nodeFunnel.c
@@ -0,0 +1,423 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeFunnel.c
+ *	  Support routines for parallel sequential scans of relations.
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeFunnel.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * INTERFACE ROUTINES
+ *		ExecFunnel				scans a relation.
+ *		FunnelNext				retrieve next tuple from either heap or shared memory segment.
+ *		ExecInitFunnel			creates and initializes a parallel seqscan node.
+ *		ExecEndFunnel			releases any storage allocated.
+ */
+#include "postgres.h"
+
+#include "access/relscan.h"
+#include "access/shmmqam.h"
+#include "access/xact.h"
+#include "commands/dbcommands.h"
+#include "executor/execdebug.h"
+#include "executor/nodeSeqscan.h"
+#include "executor/nodeFunnel.h"
+#include "postmaster/backendworker.h"
+#include "utils/rel.h"
+
+
+
+/* ----------------------------------------------------------------
+ *						Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		FunnelNext
+ *
+ *		This is a workhorse for ExecFunnel
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+FunnelNext(FunnelState *node)
+{
+	HeapTuple	tuple;
+	HeapScanDesc scandesc;
+	EState	   *estate;
+	ScanDirection direction;
+	TupleTableSlot *slot;
+	bool			fromheap = true;
+
+	/*
+	 * get information from the estate and scan state
+	 */
+	scandesc = node->ss.ss_currentScanDesc;
+	estate = node->ss.ps.state;
+	direction = estate->es_direction;
+	slot = node->ss.ss_ScanTupleSlot;
+
+	/*
+	 * get the next tuple from the table based on result tuple descriptor.
+	 */
+	tuple = shm_getnext(scandesc,
+						node->pss_workerResult,
+						node->funnel,
+						direction,
+						&fromheap);
+
+	slot->tts_fromheap = fromheap;
+
+	/*
+	 * save the tuple and the buffer returned to us by the access methods in
+	 * our scan tuple slot and return the slot.  Note: we pass '!fromheap'
+	 * because tuples returned by shm_getnext() are either pointers that are
+	 * created with palloc() or are pointers onto disk pages and so it should
+	 * be pfree()'d accordingly.  Note also that ExecStoreTuple will increment
+	 * the refcount of the buffer; the refcount will not be dropped until the
+	 * tuple table slot is cleared.
+	 */
+	if (tuple)
+		ExecStoreTuple(tuple,	/* tuple to store */
+					   slot,	/* slot to store in */
+					   fromheap ? scandesc->rs_cbuf : InvalidBuffer, /* buffer associated with this
+																	  * tuple */
+					   !fromheap);	/* pfree this pointer if not from heap */
+	else
+		ExecClearTuple(slot);
+
+	return slot;
+}
+
+/*
+ * FunnelRecheck -- access method routine to recheck a tuple in EvalPlanQual
+ */
+static bool
+FunnelRecheck(SeqScanState *node, TupleTableSlot *slot)
+{
+	/*
+	 * Note that unlike IndexScan, Funnel never uses keys in
+	 * heap_beginscan (and this is very bad), so here we do
+	 * not check whether the keys are ok.
+	 */
+	return true;
+}
+
+/* ----------------------------------------------------------------
+ *		InitFunnelRelation
+ *
+ *		Set up to access the scan relation.
+ * ----------------------------------------------------------------
+ */
+static void
+InitFunnelRelation(FunnelState *node, EState *estate, int eflags)
+{
+	Relation	currentRelation;
+	HeapScanDesc currentScanDesc;
+	ParallelHeapScanDesc pscan;
+
+	/*
+	 * get the relation object id from the relid'th entry in the range table,
+	 * open that relation and acquire appropriate lock on it.
+	 */
+	currentRelation = ExecOpenScanRelation(estate,
+										   ((SeqScan *) node->ss.ps.plan)->scanrelid,
+										   eflags);
+
+	 /*
+	 * For an Explain statement, we don't want to initialize workers, as
+	 * those are mainly needed to execute the plan.  However, the scan
+	 * descriptor still needs to be initialized, because the InitNode and
+	 * EndNode functionality assumes that the scan descriptor and scan
+	 * relation are initialized.  We could probably change that, but it
+	 * would make the end-of-node code look different from that of other
+	 * nodes.
+	 */
+	if (eflags & EXEC_FLAG_EXPLAIN_ONLY)
+	{
+		/* initialize a heapscan */
+		currentScanDesc = heap_beginscan(currentRelation,
+										 estate->es_snapshot,
+										 0,
+										 NULL);
+	}
+	else
+	{
+		/* Initialize the workers required to perform parallel scan. */
+		InitializeParallelWorkers(node->ss.ps.plan->lefttree,
+									estate,
+									currentRelation,
+									&node->inst_options_space,
+									&node->responseq,
+									&node->pcxt,
+									&pscan,
+									((Funnel *)(node->ss.ps.plan))->num_workers);
+
+		currentScanDesc = heap_beginscan_parallel(currentRelation, pscan);
+	}
+
+	node->ss.ss_currentRelation = currentRelation;
+	node->ss.ss_currentScanDesc = currentScanDesc;
+
+	/* and report the scan tuple slot's rowtype */
+	ExecAssignScanType(&node->ss, RelationGetDescr(currentRelation));
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitFunnel
+ * ----------------------------------------------------------------
+ */
+FunnelState *
+ExecInitFunnel(Funnel *node, EState *estate, int eflags)
+{
+	FunnelState *funnelstate;
+
+	/*
+	 * Once upon a time it was possible to have an outerPlan of a SeqScan, but
+	 * not any more.
+	 */
+	Assert(outerPlan(node) == NULL);
+	Assert(innerPlan(node) == NULL);
+
+	/*
+	 * create state structure
+	 */
+	funnelstate = makeNode(FunnelState);
+	funnelstate->ss.ps.plan = (Plan *) node;
+	funnelstate->ss.ps.state = estate;
+	funnelstate->fs_workersReady = false;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * create expression context for node
+	 */
+	ExecAssignExprContext(estate, &funnelstate->ss.ps);
+
+	/*
+	 * initialize child expressions
+	 */
+	funnelstate->ss.ps.targetlist = (List *)
+		ExecInitExpr((Expr *) node->scan.plan.targetlist,
+					 (PlanState *) funnelstate);
+	funnelstate->ss.ps.qual = (List *)
+		ExecInitExpr((Expr *) node->scan.plan.qual,
+					 (PlanState *) funnelstate);
+
+	/*
+	 * tuple table initialization
+	 */
+	ExecInitResultTupleSlot(estate, &funnelstate->ss.ps);
+	ExecInitScanTupleSlot(estate, &funnelstate->ss);
+
+	InitFunnelRelation(funnelstate, estate, eflags);
+
+	funnelstate->ss.ps.ps_TupFromTlist = false;
+
+	/*
+	 * Initialize result tuple type and projection info.
+	 */
+	ExecAssignResultTypeFromTL(&funnelstate->ss.ps);
+	ExecAssignScanProjectionInfo(&funnelstate->ss);
+
+	funnelstate->pss_workerResult = ExecInitWorkerResult();
+
+	return funnelstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecFunnel(node)
+ *
+ *		Scans the relation via multiple workers and returns
+ *		the next qualifying tuple.
+ *		We call the ExecScan() routine and pass it the appropriate
+ *		access method functions.
+ * ----------------------------------------------------------------
+ */
+TupleTableSlot *
+ExecFunnel(FunnelState *node)
+{
+	int			i;
+
+	/*
+	 * if parallel context is set and workers are not
+	 * registered, register them now.
+	 */
+	if (node->pcxt && !node->fs_workersReady)
+	{
+		/* Register backend workers. */
+		LaunchParallelWorkers(node->pcxt);
+
+		node->funnel = CreateTupleQueueFunnel();
+
+		for (i = 0; i < node->pcxt->nworkers; ++i)
+		{
+			 shm_mq_set_handle((node->responseq)[i], node->pcxt->worker[i].bgwhandle);
+			 RegisterTupleQueueOnFunnel(node->funnel, (node->responseq)[i]);
+		}
+
+		node->fs_workersReady = true;
+	}
+
+	return ExecScan((ScanState *) &node->ss,
+					(ExecScanAccessMtd) FunnelNext,
+					(ExecScanRecheckMtd) FunnelRecheck);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndFunnel
+ *
+ *		frees any storage allocated through C routines.
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndFunnel(FunnelState *node)
+{
+	Relation	relation;
+	HeapScanDesc scanDesc;
+
+	/*
+	 * get information from node
+	 */
+	relation = node->ss.ss_currentRelation;
+	scanDesc = node->ss.ss_currentScanDesc;
+
+	/*
+	 * Free the exprcontext
+	 */
+	ExecFreeExprContext(&node->ss.ps);
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+
+	/*
+	 * close heap scan
+	 */
+	heap_endscan(scanDesc);
+
+	/*
+	 * close the heap relation.
+	 */
+	ExecCloseScanRelation(relation);
+
+	if (node->pcxt && node->fs_workersReady)
+	{
+		/*
+		 * Ensure all workers have finished before destroying the parallel
+		 * context to ensure a clean exit.
+		 */
+		WaitForParallelWorkersToFinish(node->pcxt);
+
+		/* destroy the tuple queue */
+		DestroyTupleQueueFunnel(node->funnel);
+
+		/* destroy parallel context. */
+		DestroyParallelContext(node->pcxt);
+
+		ExitParallelMode();
+	}
+	else if (node->pcxt)
+	{
+		int i;
+
+		/*
+		 * We only need to free the memory allocated to initialize
+		 * parallel workers as workers are still not started.
+		 */
+		dlist_delete(&node->pcxt->node);
+
+		for (i = 0; i < node->pcxt->nworkers; ++i)
+		{
+			if (node->pcxt->worker[i].error_mqh != NULL)
+			{
+				pfree(node->pcxt->worker[i].error_mqh);
+				node->pcxt->worker[i].error_mqh = NULL;
+			}
+		}
+		
+		/*
+		 * If we have allocated a shared memory segment, detach it.  This will
+		 * implicitly detach the error queues, and any other shared memory
+		 * queues, stored there.
+		 */
+		if (node->pcxt->seg != NULL)
+			dsm_detach(node->pcxt->seg);
+
+		/* Free the worker array itself. */
+		pfree(node->pcxt->worker);
+		node->pcxt->worker = NULL;
+
+		/* Free memory. */
+		pfree(node->pcxt);
+
+		ExitParallelMode();
+	}
+}
+
+/* ----------------------------------------------------------------
+ *						Join Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecReScanFunnel
+ *
+ *		Rescans the relation.
+ * ----------------------------------------------------------------
+ */
+void
+ExecReScanFunnel(FunnelState *node)
+{
+	HeapScanDesc scan;
+	ParallelHeapScanDesc pscan = NULL;
+	EState	   *estate = node->ss.ps.state;
+
+	/*
+	 * Re-initialize the parallel context and workers to perform
+	 * rescan of relation.
+	 */
+	if (node->fs_workersReady)
+	{
+		/*
+		 * Ensure all workers have finished before destroying the parallel
+		 * context to ensure a clean exit.
+		 */
+		WaitForParallelWorkersToFinish(node->pcxt);
+
+		/* destroy the tuple queue */
+		DestroyTupleQueueFunnel(node->funnel);
+
+		/* destroy parallel context. */
+		DestroyParallelContext(node->pcxt);
+
+		/* Initialize the workers required to perform parallel scan. */
+		InitializeParallelWorkers(node->ss.ps.plan->lefttree,
+								  estate,
+								  node->ss.ss_currentRelation,
+								  &node->inst_options_space,
+								  &node->responseq,
+								  &node->pcxt,
+								  &pscan,
+								  ((Funnel *)(node->ss.ps.plan))->num_workers);
+
+		node->fs_workersReady = false;
+
+		node->pss_workerResult->all_workers_done = 0;
+		node->pss_workerResult->local_scan_done = 0;
+	}
+
+	scan = node->ss.ss_currentScanDesc;
+
+	heap_parallel_rescan(pscan,			/* parallel scan desc */
+						 scan);			/* scan desc */
+
+	ExecScanReScan((ScanState *) node);
+}
diff --git a/src/backend/executor/nodePartialSeqscan.c b/src/backend/executor/nodePartialSeqscan.c
new file mode 100644
index 0000000..fb4efa3
--- /dev/null
+++ b/src/backend/executor/nodePartialSeqscan.c
@@ -0,0 +1,259 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodePartialSeqscan.c
+ *	  Support routines for parallel sequential scans of relations.
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodePartialSeqscan.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * INTERFACE ROUTINES
+ *		ExecPartialSeqScan				scans a relation.
+ *		PartialSeqNext					retrieve next tuple from heap.
+ *		ExecInitPartialSeqScan			creates and initializes a partial seqscan node.
+ *		ExecEndPartialSeqScan			releases any storage allocated.
+ */
+#include "postgres.h"
+
+#include "access/relscan.h"
+#include "access/shmmqam.h"
+#include "access/xact.h"
+#include "commands/dbcommands.h"
+#include "executor/execdebug.h"
+#include "executor/nodeSeqscan.h"
+#include "executor/nodePartialSeqscan.h"
+#include "postmaster/backendworker.h"
+#include "utils/rel.h"
+
+
+
+/* ----------------------------------------------------------------
+ *						Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		PartialSeqNext
+ *
+ *		This is a workhorse for ExecPartialSeqScan
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+PartialSeqNext(PartialSeqScanState *node)
+{
+	HeapTuple	tuple;
+	HeapScanDesc scandesc;
+	EState	   *estate;
+	ScanDirection direction;
+	TupleTableSlot *slot;
+
+	/*
+	 * get information from the estate and scan state
+	 */
+	scandesc = node->ss_currentScanDesc;
+	estate = node->ps.state;
+	direction = estate->es_direction;
+	slot = node->ss_ScanTupleSlot;
+
+	/*
+	 * get the next tuple from the table
+	 */
+	tuple = heap_getnext(scandesc, direction);
+
+	/*
+	 * save the tuple and the buffer returned to us by the access methods in
+	 * our scan tuple slot and return the slot.  Note: we pass 'false' because
+	 * tuples returned by heap_getnext() are pointers onto disk pages and were
+	 * not created with palloc() and so should not be pfree()'d.  Note also
+	 * that ExecStoreTuple will increment the refcount of the buffer; the
+	 * refcount will not be dropped until the tuple table slot is cleared.
+	 */
+	if (tuple)
+		ExecStoreTuple(tuple,	/* tuple to store */
+					   slot,	/* slot to store in */
+					   scandesc->rs_cbuf,		/* buffer associated with this
+												 * tuple */
+					   false);	/* don't pfree this pointer */
+	else
+		ExecClearTuple(slot);
+
+	return slot;
+}
+
+/*
+ * PartialSeqRecheck -- access method routine to recheck a tuple in EvalPlanQual
+ */
+static bool
+PartialSeqRecheck(PartialSeqScanState *node, TupleTableSlot *slot)
+{
+	/*
+	 * Note that unlike IndexScan, PartialSeqScan never uses keys in
+	 * heap_beginscan (and this is very bad), so here we do not
+	 * check whether the keys are ok.
+	 */
+	return true;
+}
+
+/* ----------------------------------------------------------------
+ *		InitPartialScanRelation
+ *
+ *		Set up to access the scan relation.
+ * ----------------------------------------------------------------
+ */
+static void
+InitPartialScanRelation(PartialSeqScanState *node, EState *estate, int eflags)
+{
+	Relation	currentRelation;
+	HeapScanDesc currentScanDesc;
+	ParallelHeapScanDesc pscan;
+
+	/*
+	 * get the relation object id from the relid'th entry in the range table,
+	 * open that relation and acquire appropriate lock on it.
+	 */
+	currentRelation = ExecOpenScanRelation(estate,
+										   ((Scan *) node->ps.plan)->scanrelid,
+										   eflags);
+
+	/*
+	 * Parallel scan descriptor is initialized and stored in dynamic shared
+	 * memory segment by master backend and parallel workers retrieve it
+	 * from shared memory.
+	 */
+	Assert(estate->toc);
+
+	pscan = shm_toc_lookup(estate->toc, PARALLEL_KEY_SCAN);
+
+	currentScanDesc = heap_beginscan_parallel(currentRelation, pscan);
+
+	node->ss_currentRelation = currentRelation;
+	node->ss_currentScanDesc = currentScanDesc;
+
+	/* and report the scan tuple slot's rowtype */
+	ExecAssignScanType(node, RelationGetDescr(currentRelation));
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitPartialSeqScan
+ * ----------------------------------------------------------------
+ */
+PartialSeqScanState *
+ExecInitPartialSeqScan(PartialSeqScan *node, EState *estate, int eflags)
+{
+	PartialSeqScanState *scanstate;
+
+	/*
+	 * Once upon a time it was possible to have an outerPlan of a SeqScan, but
+	 * not any more.
+	 */
+	Assert(outerPlan(node) == NULL);
+	Assert(innerPlan(node) == NULL);
+
+	/*
+	 * create state structure
+	 */
+	scanstate = makeNode(PartialSeqScanState);
+	scanstate->ps.plan = (Plan *) node;
+	scanstate->ps.state = estate;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * create expression context for node
+	 */
+	ExecAssignExprContext(estate, &scanstate->ps);
+
+	/*
+	 * initialize child expressions
+	 */
+	scanstate->ps.targetlist = (List *)
+		ExecInitExpr((Expr *) node->plan.targetlist,
+					 (PlanState *) scanstate);
+	scanstate->ps.qual = (List *)
+		ExecInitExpr((Expr *) node->plan.qual,
+					 (PlanState *) scanstate);
+
+	/*
+	 * tuple table initialization
+	 */
+	ExecInitResultTupleSlot(estate, &scanstate->ps);
+	ExecInitScanTupleSlot(estate, scanstate);
+
+	/*
+	 * initialize scan relation
+	 */
+	InitPartialScanRelation(scanstate, estate, eflags);
+
+	scanstate->ps.ps_TupFromTlist = false;
+
+	/*
+	 * Initialize result tuple type and projection info.
+	 */
+	ExecAssignResultTypeFromTL(&scanstate->ps);
+	ExecAssignScanProjectionInfo(scanstate);
+
+	return scanstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecPartialSeqScan(node)
+ *
+ *		Scans the relation via multiple workers and returns
+ *		the next qualifying tuple.
+ *		We call the ExecScan() routine and pass it the appropriate
+ *		access method functions.
+ * ----------------------------------------------------------------
+ */
+TupleTableSlot *
+ExecPartialSeqScan(PartialSeqScanState *node)
+{
+	return ExecScan((ScanState *) node,
+					(ExecScanAccessMtd) PartialSeqNext,
+					(ExecScanRecheckMtd) PartialSeqRecheck);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndPartialSeqScan
+ *
+ *		frees any storage allocated through C routines.
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndPartialSeqScan(PartialSeqScanState *node)
+{
+	Relation	relation;
+	HeapScanDesc scanDesc;
+
+	/*
+	 * get information from node
+	 */
+	relation = node->ss_currentRelation;
+	scanDesc = node->ss_currentScanDesc;
+
+	/*
+	 * Free the exprcontext
+	 */
+	ExecFreeExprContext(&node->ps);
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ps.ps_ResultTupleSlot);
+	ExecClearTuple(node->ss_ScanTupleSlot);
+
+	/*
+	 * close heap scan
+	 */
+	heap_endscan(scanDesc);
+
+	/*
+	 * close the heap relation.
+	 */
+	ExecCloseScanRelation(relation);
+}
diff --git a/src/backend/executor/tqueue.c b/src/backend/executor/tqueue.c
new file mode 100644
index 0000000..ee4e03e
--- /dev/null
+++ b/src/backend/executor/tqueue.c
@@ -0,0 +1,272 @@
+/*-------------------------------------------------------------------------
+ *
+ * tqueue.c
+ *	  Use shm_mq to send & receive tuples between parallel backends
+ *
+ * A DestReceiver of type DestTupleQueue, which is a TQueueDestReceiver
+ * under the hood, writes tuples from the executor to a shm_mq.
+ *
+ * A TupleQueueFunnel helps manage the process of reading tuples from
+ * one or more shm_mq objects being used as tuple queues.
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/tqueue.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/tqueue.h"
+#include "miscadmin.h"
+
+typedef struct
+{
+	DestReceiver pub;
+	shm_mq_handle *handle;
+} TQueueDestReceiver;
+
+struct TupleQueueFunnel
+{
+	int		nqueues;
+	int		maxqueues;
+	int		nextqueue;
+	shm_mq_handle **queue;
+};
+
+/*
+ * Receive a tuple.
+ */
+static void
+tqueueReceiveSlot(TupleTableSlot *slot, DestReceiver *self)
+{
+	TQueueDestReceiver *tqueue = (TQueueDestReceiver *) self;
+	HeapTuple	tuple;
+	shm_mq_result	result;
+
+	tuple = ExecMaterializeSlot(slot);
+	result = shm_mq_send(tqueue->handle, tuple->t_len, tuple->t_data, false);
+
+	if (result != SHM_MQ_SUCCESS)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("unable to send tuples")));
+}
+
+/*
+ * Prepare to receive tuples from executor.
+ */
+static void
+tqueueStartupReceiver(DestReceiver *self, int operation, TupleDesc typeinfo)
+{
+	/* do nothing */
+}
+
+/*
+ * Clean up at end of an executor run
+ */
+static void
+tqueueShutdownReceiver(DestReceiver *self)
+{
+	/* do nothing */
+}
+
+/*
+ * Destroy receiver when done with it
+ */
+static void
+tqueueDestroyReceiver(DestReceiver *self)
+{
+	pfree(self);
+}
+
+/*
+ * Create a DestReceiver that writes tuples to a tuple queue.
+ */
+DestReceiver *
+CreateTupleQueueDestReceiver(void)
+{
+	TQueueDestReceiver *self;
+
+	self = (TQueueDestReceiver *) palloc0(sizeof(TQueueDestReceiver));
+
+	self->pub.receiveSlot = tqueueReceiveSlot;
+	self->pub.rStartup = tqueueStartupReceiver;
+	self->pub.rShutdown = tqueueShutdownReceiver;
+	self->pub.rDestroy = tqueueDestroyReceiver;
+	self->pub.mydest = DestTupleQueue;
+
+	/* private fields will be set by SetTupleQueueDestReceiverParams */
+
+	return (DestReceiver *) self;
+}
+
+/*
+ * Set parameters for a TupleQueueDestReceiver
+ */
+void
+SetTupleQueueDestReceiverParams(DestReceiver *self,
+								shm_mq_handle *handle)
+{
+	TQueueDestReceiver *myState = (TQueueDestReceiver *) self;
+
+	myState->handle = handle;
+}
+
+/*
+ * Create a tuple queue funnel.
+ */
+TupleQueueFunnel *
+CreateTupleQueueFunnel(void)
+{
+	TupleQueueFunnel *funnel = palloc0(sizeof(TupleQueueFunnel));
+
+	funnel->maxqueues = 8;
+	funnel->queue = palloc(funnel->maxqueues * sizeof(shm_mq_handle *));
+
+	return funnel;
+}
+
+/*
+ * Destroy a tuple queue funnel.
+ */
+void
+DestroyTupleQueueFunnel(TupleQueueFunnel *funnel)
+{
+	if (funnel)
+	{
+		pfree(funnel->queue);
+		pfree(funnel);
+	}
+}
+
+/*
+ * Remember the shared memory queue handle in funnel.
+ */
+void
+RegisterTupleQueueOnFunnel(TupleQueueFunnel *funnel, shm_mq_handle *handle)
+{
+	if (funnel->nqueues < funnel->maxqueues)
+	{
+		funnel->queue[funnel->nqueues++] = handle;
+		return;
+	}
+
+	if (funnel->nqueues >= funnel->maxqueues)
+	{
+		int newsize = funnel->nqueues * 2;
+
+		Assert(funnel->nqueues == funnel->maxqueues);
+
+		funnel->queue = repalloc(funnel->queue,
+								 newsize * sizeof(shm_mq_handle *));
+		funnel->maxqueues = newsize;
+	}
+
+	funnel->queue[funnel->nqueues++] = handle;
+}
+
+/*
+ * Fetch a tuple from a tuple queue funnel.
+ *
+ * We try to read from the queues in round-robin fashion so as to avoid
+ * the situation where some workers get their tuples read expediently while
+ * others are barely ever serviced.
+ *
+ * Even when nowait = false, we read from the individual queues in
+ * non-blocking mode.  Even when shm_mq_receive() returns SHM_MQ_WOULD_BLOCK,
+ * it can still accumulate bytes from a partially-read message, so doing it
+ * this way should outperform doing a blocking read on each queue in turn.
+ *
+ * The return value is NULL if there are no remaining queues or if
+ * nowait = true and no queue returned a tuple without blocking.  *done, if
+ * not NULL, is set to true when there are no remaining queues and false in
+ * any other case.
+ */
+HeapTuple
+TupleQueueFunnelNext(TupleQueueFunnel *funnel, bool nowait, bool *done)
+{
+	int	waitpos = funnel->nextqueue;
+
+	/* Corner case: called before adding any queues, or after all are gone. */
+	if (funnel->nqueues == 0)
+	{
+		if (done != NULL)
+			*done = true;
+		return NULL;
+	}
+
+	if (done != NULL)
+		*done = false;
+
+	for (;;)
+	{
+		shm_mq_handle *mqh = funnel->queue[funnel->nextqueue];
+		shm_mq_result result;
+		Size	nbytes;
+		void   *data;
+
+		/* Attempt to read a message. */
+		result = shm_mq_receive(mqh, &nbytes, &data, true);
+
+		/*
+		 * Normally, we advance funnel->nextqueue to the next queue at this
+		 * point, but if we're pointing to a queue that we've just discovered
+		 * is detached, then forget that queue and leave the pointer where it
+		 * is.
+		 */
+		if (result != SHM_MQ_DETACHED)
+			funnel->nextqueue = (funnel->nextqueue + 1) % funnel->nqueues;
+		else
+		{
+			--funnel->nqueues;
+			if (funnel->nqueues == 0)
+			{
+				if (done != NULL)
+					*done = true;
+				return NULL;
+			}
+			memmove(&funnel->queue[funnel->nextqueue],
+					&funnel->queue[funnel->nextqueue + 1],
+					sizeof(shm_mq_handle *)
+						* (funnel->nqueues - funnel->nextqueue));
+			if (funnel->nextqueue < waitpos)
+				--waitpos;
+		}
+
+		/* If we got a message, return it. */
+		if (result == SHM_MQ_SUCCESS)
+		{
+			HeapTupleData htup;
+
+			/*
+			 * The tuple data we just read from the queue is only valid
+			 * until we again attempt to read from it.  Copy the tuple into
+			 * a single palloc'd chunk as callers will expect.
+			 */
+			ItemPointerSetInvalid(&htup.t_self);
+			htup.t_tableOid = InvalidOid;
+			htup.t_len = nbytes;
+			htup.t_data = data;
+			return heap_copytuple(&htup);
+		}
+
+		/*
+		 * If we've visited all of the queues, then we should either give up
+		 * and return NULL (if we're in non-blocking mode) or wait for the
+		 * process latch to be set (otherwise).
+		 */
+		if (funnel->nextqueue == waitpos)
+		{
+			if (nowait)
+				return NULL;
+			WaitLatch(MyLatch, WL_LATCH_SET, 0);
+			CHECK_FOR_INTERRUPTS();
+			ResetLatch(MyLatch);
+		}
+	}
+}
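
The round-robin read described in the TupleQueueFunnelNext comment can be tried out in isolation. The following is only a toy sketch, not the patch's code: ToyQueue, ToyFunnel, toy_receive and toy_funnel_next are hypothetical names, the queues are plain int arrays, a drained queue stands in for SHM_MQ_DETACHED, and the would-block and latch-wait paths are left out. The reset of nextqueue after a removal is an assumption of this sketch, added so the index never points past the last live queue.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a shm_mq: a fixed array of ints. */
typedef struct ToyQueue
{
	const int  *items;
	int			nitems;
	int			pos;
} ToyQueue;

/* Hypothetical stand-in for TupleQueueFunnel (fixed capacity of 8). */
typedef struct ToyFunnel
{
	int			nqueues;
	int			nextqueue;
	ToyQueue   *queue[8];
} ToyFunnel;

/* Returns 1 and sets *out, or 0 once the queue is drained ("detached"). */
int
toy_receive(ToyQueue *q, int *out)
{
	if (q->pos >= q->nitems)
		return 0;
	*out = q->items[q->pos++];
	return 1;
}

/*
 * Returns 1 and stores the next value in *out, or 0 when every queue has
 * detached.  Queues are visited in round-robin order.
 */
int
toy_funnel_next(ToyFunnel *f, int *out)
{
	while (f->nqueues > 0)
	{
		if (toy_receive(f->queue[f->nextqueue], out))
		{
			/* Got a value: advance to the next queue for fairness. */
			f->nextqueue = (f->nextqueue + 1) % f->nqueues;
			return 1;
		}

		/*
		 * Queue is detached: close the gap.  Source and destination
		 * overlap, so memmove is required rather than memcpy.
		 */
		--f->nqueues;
		memmove(&f->queue[f->nextqueue],
				&f->queue[f->nextqueue + 1],
				sizeof(ToyQueue *) * (f->nqueues - f->nextqueue));

		/* Assumption of this sketch: wrap if we removed the last slot. */
		if (f->nextqueue >= f->nqueues)
			f->nextqueue = 0;
	}
	return 0;
}
```

With three queues holding {1, 2, 3}, {10} and {20, 21}, the sketch interleaves values from all three and keeps going as the shorter queues drop out.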
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 9fe8008..e51fc38 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -354,6 +354,43 @@ _copySeqScan(const SeqScan *from)
 }
 
 /*
+ * _copyPartialSeqScan
+ */
+static PartialSeqScan *
+_copyPartialSeqScan(const PartialSeqScan *from)
+{
+	PartialSeqScan    *newnode = makeNode(PartialSeqScan);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopyScanFields((const Scan *) from, (Scan *) newnode);
+
+	return newnode;
+}
+
+/*
+ * _copyFunnel
+ */
+static Funnel *
+_copyFunnel(const Funnel *from)
+{
+	Funnel    *newnode = makeNode(Funnel);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopyScanFields((const Scan *) from, (Scan *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(num_workers);
+
+	return newnode;
+}
+
+/*
  * _copyIndexScan
  */
 static IndexScan *
@@ -4044,6 +4081,12 @@ copyObject(const void *from)
 		case T_SeqScan:
 			retval = _copySeqScan(from);
 			break;
+		case T_PartialSeqScan:
+			retval = _copyPartialSeqScan(from);
+			break;
+		case T_Funnel:
+			retval = _copyFunnel(from);
+			break;
 		case T_IndexScan:
 			retval = _copyIndexScan(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 775f482..3382ab2 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -439,6 +439,24 @@ _outSeqScan(StringInfo str, const SeqScan *node)
 }
 
 static void
+_outPartialSeqScan(StringInfo str, const PartialSeqScan *node)
+{
+	WRITE_NODE_TYPE("PARTIALSEQSCAN");
+
+	_outScanInfo(str, (const Scan *) node);
+}
+
+static void
+_outFunnel(StringInfo str, const Funnel *node)
+{
+	WRITE_NODE_TYPE("FUNNEL");
+
+	_outScanInfo(str, (const Scan *) node);
+
+	WRITE_UINT_FIELD(num_workers);
+}
+
+static void
 _outIndexScan(StringInfo str, const IndexScan *node)
 {
 	WRITE_NODE_TYPE("INDEXSCAN");
@@ -2886,6 +2904,12 @@ _outNode(StringInfo str, const void *obj)
 			case T_SeqScan:
 				_outSeqScan(str, obj);
 				break;
+			case T_PartialSeqScan:
+				_outPartialSeqScan(str, obj);
+				break;
+			case T_Funnel:
+				_outFunnel(str, obj);
+				break;
 			case T_IndexScan:
 				_outIndexScan(str, obj);
 				break;
diff --git a/src/backend/nodes/params.c b/src/backend/nodes/params.c
index fb803f8..aa278c5 100644
--- a/src/backend/nodes/params.c
+++ b/src/backend/nodes/params.c
@@ -16,9 +16,22 @@
 #include "postgres.h"
 
 #include "nodes/params.h"
+#include "storage/shmem.h"
 #include "utils/datum.h"
 #include "utils/lsyscache.h"
 
+/*
+ * For each bind parameter, this structure is stored, followed by the
+ * parameter value unless the parameter is pass-by-value.
+ */
+typedef struct SerializedParamExternData
+{
+	Datum		value;			/* pass-by-value datums are stored directly */
+	Size		length;			/* length of parameter value */
+	bool		isnull;			/* is it NULL? */
+	uint16		pflags;			/* flag bits, see above */
+	Oid			ptype;			/* parameter's datatype, or 0 */
+} SerializedParamExternData;
 
 /*
  * Copy a ParamListInfo structure.
@@ -73,3 +86,187 @@ copyParamList(ParamListInfo from)
 
 	return retval;
 }
+
+/*
+ * Estimate the amount of space required to serialize the bound
+ * parameters.
+ */
+Size
+EstimateBoundParametersSpace(ParamListInfo paramInfo)
+{
+	Size		size;
+	int			i;
+
+	/* Add space required for saving numParams */
+	size = sizeof(int);
+
+	if (paramInfo)
+	{
+		/* Add space required for saving the param data */
+		for (i = 0; i < paramInfo->numParams; i++)
+		{
+			/*
+			 * for each parameter, calculate the size of fixed part
+			 * of parameter (SerializedParamExternData) and length of
+			 * parameter value.
+			 */
+			ParamExternData *oprm;
+			int16		typLen;
+			bool		typByVal;
+			Size		length;
+
+			length = sizeof(SerializedParamExternData);
+
+			oprm = &paramInfo->params[i];
+
+			get_typlenbyval(oprm->ptype, &typLen, &typByVal);
+
+			/*
+			 * pass-by-value parameters are directly stored in
+			 * SerializedParamExternData, so no need of additional
+			 * space for them.
+			 */
+			if (!(typByVal || oprm->isnull))
+			{
+				length += datumGetSize(oprm->value, typByVal, typLen);
+				size = add_size(size, length);
+
+				/* Allow space for terminating zero-byte */
+				size = add_size(size, 1);
+			}
+			else
+				size = add_size(size, length);
+		}
+	}
+
+	return size;
+}
+
+/*
+ * Serialize the bind parameters into the memory, beginning at start_address.
+ * maxsize should be at least as large as the value returned by
+ * EstimateBoundParametersSpace.
+ */
+void
+SerializeBoundParams(ParamListInfo paramInfo, Size maxsize, char *start_address)
+{
+	char	   *curptr;
+	SerializedParamExternData *retval;
+	int i;
+
+	/*
+	 * First, we store the number of bind parameters, if there is
+	 * no bind parameter then no need to store any more information.
+	 */
+	if (paramInfo && paramInfo->numParams > 0)
+		* (int *) start_address = paramInfo->numParams;
+	else
+	{
+		* (int *) start_address = 0;
+		return;
+	}
+	curptr = start_address + sizeof(int);
+
+
+	for (i = 0; i < paramInfo->numParams; i++)
+	{
+		ParamExternData *oprm;
+		int16		typLen;
+		bool		typByVal;
+		Size		datumlength, length;
+		const char	*s;
+
+		Assert(curptr <= start_address + maxsize);
+		retval = (SerializedParamExternData *) curptr;
+		oprm = &paramInfo->params[i];
+
+		retval->isnull = oprm->isnull;
+		retval->pflags = oprm->pflags;
+		retval->ptype = oprm->ptype;
+		retval->value = oprm->value;
+
+		curptr = curptr + sizeof(SerializedParamExternData);
+
+		if (retval->isnull)
+			continue;
+
+		get_typlenbyval(oprm->ptype, &typLen, &typByVal);
+
+		if (!typByVal)
+		{
+			datumlength = datumGetSize(oprm->value, typByVal, typLen);
+			s = (char *) DatumGetPointer(oprm->value);
+			memcpy(curptr, s, datumlength);
+			length = datumlength;
+			curptr[length] = '\0';
+			retval->length = length;
+			curptr += length + 1;
+		}
+	}
+}
+
+/*
+ * RestoreBoundParams
+ *		Restore bind parameters from the specified address.
+ *
+ * The params are palloc'd in CurrentMemoryContext.
+ */
+ParamListInfo
+RestoreBoundParams(char *start_address)
+{
+	ParamListInfo retval;
+	Size		size;
+	int			num_params,i;
+	char	   *curptr;
+
+	num_params = * (int *) start_address;
+
+	if (num_params <= 0)
+		return NULL;
+
+	/* sizeof(ParamListInfoData) includes the first array element */
+	size = sizeof(ParamListInfoData) +
+		(num_params - 1) * sizeof(ParamExternData);
+	retval = (ParamListInfo) palloc(size);
+	retval->paramFetch = NULL;
+	retval->paramFetchArg = NULL;
+	retval->parserSetup = NULL;
+	retval->parserSetupArg = NULL;
+	retval->numParams = num_params;
+
+	curptr = start_address + sizeof(int);
+
+	for (i = 0; i < num_params; i++)
+	{
+		SerializedParamExternData *nprm;
+		char	*s;
+		int16		typLen;
+		bool		typByVal;
+
+		nprm = (SerializedParamExternData *) curptr;
+
+		/* copy the parameter info */
+		retval->params[i].isnull = nprm->isnull;
+		retval->params[i].pflags = nprm->pflags;
+		retval->params[i].ptype = nprm->ptype;
+		retval->params[i].value = nprm->value;
+
+		curptr = curptr + sizeof(SerializedParamExternData);
+
+		if (nprm->isnull)
+			continue;
+
+		get_typlenbyval(nprm->ptype, &typLen, &typByVal);
+
+		if (!typByVal)
+		{
+			s = palloc(nprm->length + 1);
+			memcpy(s, curptr, nprm->length + 1);
+			retval->params[i].value = PointerGetDatum(s);
+
+			curptr += nprm->length + 1;
+		}
+	}
+
+	return retval;
+}
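
The estimate / serialize / restore triple above follows a common "count, then fixed header plus optional payload per entry" layout. The following is a minimal sketch of that layout outside PostgreSQL, under stated assumptions: ToyParam, ToyHeader, toy_estimate, toy_serialize and toy_restore are hypothetical names, every non-NULL value is treated as a variable-length byte string, and the restored values point into the buffer rather than being palloc'd. Headers are copied with memcpy so that the sketch does not depend on the buffer position being aligned after a variable-length payload.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory parameter: a possibly-NULL byte string. */
typedef struct ToyParam
{
	int			isnull;
	size_t		length;			/* payload bytes; ignored when isnull */
	const char *value;
} ToyParam;

/* Hypothetical fixed-size header written in front of each payload. */
typedef struct ToyHeader
{
	size_t		length;
	int			isnull;
} ToyHeader;

/* Space needed: the count, then one header plus payload per parameter. */
size_t
toy_estimate(const ToyParam *params, int n)
{
	size_t		size = sizeof(int);
	int			i;

	for (i = 0; i < n; i++)
	{
		size += sizeof(ToyHeader);
		if (!params[i].isnull)
			size += params[i].length;
	}
	return size;
}

/* Write the count, then each header followed by its payload (if any). */
void
toy_serialize(const ToyParam *params, int n, char *buf, size_t maxsize)
{
	char	   *cur = buf;
	int			i;

	assert(toy_estimate(params, n) <= maxsize);
	memcpy(cur, &n, sizeof(int));
	cur += sizeof(int);

	for (i = 0; i < n; i++)
	{
		ToyHeader	h;

		memset(&h, 0, sizeof(h));	/* do not copy indeterminate padding */
		h.isnull = params[i].isnull;
		h.length = params[i].isnull ? 0 : params[i].length;
		memcpy(cur, &h, sizeof(h));
		cur += sizeof(h);

		if (h.length > 0)
		{
			memcpy(cur, params[i].value, h.length);
			cur += h.length;
		}
	}
}

/*
 * Read parameters back.  Values point into buf (no copy is made).
 * Returns the parameter count, or -1 if it does not fit in 'out'.
 */
int
toy_restore(const char *buf, ToyParam *out, int maxparams)
{
	const char *cur = buf;
	int			n;
	int			i;

	memcpy(&n, cur, sizeof(int));
	cur += sizeof(int);
	if (n < 0 || n > maxparams)
		return -1;

	for (i = 0; i < n; i++)
	{
		ToyHeader	h;

		memcpy(&h, cur, sizeof(h));
		cur += sizeof(h);
		out[i].isnull = h.isnull;
		out[i].length = h.length;
		out[i].value = h.isnull ? NULL : cur;
		cur += h.length;
	}
	return n;
}
```

Sizing the buffer with toy_estimate before calling toy_serialize mirrors the way EstimateBoundParametersSpace is meant to be paired with SerializeBoundParams.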
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 563209c..2bae475 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1280,6 +1280,91 @@ _readRangeTblFunction(void)
 	READ_DONE();
 }
 
+/*
+ * _readPlanInvalItem
+ */
+static PlanInvalItem *
+_readPlanInvalItem(void)
+{
+	READ_LOCALS(PlanInvalItem);
+
+	READ_INT_FIELD(cacheId);
+	READ_UINT_FIELD(hashValue);
+
+	READ_DONE();
+}
+
+/*
+ * _readPlannedStmt
+ */
+static PlannedStmt *
+_readPlannedStmt(void)
+{
+	READ_LOCALS(PlannedStmt);
+
+	READ_ENUM_FIELD(commandType, CmdType);
+	READ_UINT_FIELD(queryId);
+	READ_BOOL_FIELD(hasReturning);
+	READ_BOOL_FIELD(hasModifyingCTE);
+	READ_BOOL_FIELD(canSetTag);
+	READ_BOOL_FIELD(transientPlan);
+	READ_NODE_FIELD(planTree);
+	READ_NODE_FIELD(rtable);
+	READ_NODE_FIELD(resultRelations);
+	READ_NODE_FIELD(utilityStmt);
+	READ_NODE_FIELD(subplans);
+	READ_BITMAPSET_FIELD(rewindPlanIDs);
+	READ_NODE_FIELD(rowMarks);
+	READ_NODE_FIELD(relationOids);
+	READ_NODE_FIELD(invalItems);
+	READ_INT_FIELD(nParamExec);
+	READ_BOOL_FIELD(hasRowSecurity);
+
+	READ_DONE();
+}
+
+static Plan *
+_readPlan(void)
+{
+	READ_LOCALS(Plan);
+
+	READ_FLOAT_FIELD(startup_cost);
+	READ_FLOAT_FIELD(total_cost);
+	READ_FLOAT_FIELD(plan_rows);
+	READ_INT_FIELD(plan_width);
+	READ_NODE_FIELD(targetlist);
+	READ_NODE_FIELD(qual);
+	READ_NODE_FIELD(lefttree);
+	READ_NODE_FIELD(righttree);
+	READ_NODE_FIELD(initPlan);
+	READ_BITMAPSET_FIELD(extParam);
+	READ_BITMAPSET_FIELD(allParam);
+
+	READ_DONE();
+}
+
+static Scan *
+_readScan(void)
+{
+	Plan *local_plan;
+	READ_LOCALS(PartialSeqScan);
+
+	local_plan = _readPlan();
+	local_node->plan.startup_cost = local_plan->startup_cost;
+	local_node->plan.total_cost = local_plan->total_cost;
+	local_node->plan.plan_rows = local_plan->plan_rows;
+	local_node->plan.plan_width = local_plan->plan_width;
+	local_node->plan.targetlist = local_plan->targetlist;
+	local_node->plan.qual = local_plan->qual;
+	local_node->plan.lefttree = local_plan->lefttree;
+	local_node->plan.righttree = local_plan->righttree;
+	local_node->plan.initPlan = local_plan->initPlan;
+	local_node->plan.extParam = local_plan->extParam;
+	local_node->plan.allParam = local_plan->allParam;
+	READ_UINT_FIELD(scanrelid);
+
+	READ_DONE();
+}
 
 /*
  * parseNodeString
@@ -1409,6 +1494,12 @@ parseNodeString(void)
 		return_value = _readNotifyStmt();
 	else if (MATCH("DECLARECURSOR", 13))
 		return_value = _readDeclareCursorStmt();
+	else if (MATCH("PLANINVALITEM", 13))
+		return_value = _readPlanInvalItem();
+	else if (MATCH("PLANNEDSTMT", 11))
+		return_value = _readPlannedStmt();
+	else if (MATCH("PARTIALSEQSCAN", 14))
+		return_value = _readScan();
 	else
 	{
 		elog(ERROR, "badly formatted node string \"%.32s\"...", token);
diff --git a/src/backend/optimizer/path/Makefile b/src/backend/optimizer/path/Makefile
index 6864a62..6e462b1 100644
--- a/src/backend/optimizer/path/Makefile
+++ b/src/backend/optimizer/path/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = allpaths.o clausesel.o costsize.o equivclass.o indxpath.o \
-       joinpath.o joinrels.o pathkeys.o tidpath.o
+       joinpath.o joinrels.o pathkeys.o parallelpath.o tidpath.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 58d78e6..528727c 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -410,6 +410,9 @@ set_plain_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
 	/* Consider sequential scan */
 	add_path(rel, create_seqscan_path(root, rel, required_outer));
 
+	/* Consider parallel scans */
+	create_parallelscan_paths(root, rel);
+
 	/* Consider index scans */
 	create_index_paths(root, rel);
 
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 5a9daf0..282e5ff 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -11,6 +11,9 @@
  *	cpu_tuple_cost		Cost of typical CPU time to process a tuple
  *	cpu_index_tuple_cost  Cost of typical CPU time to process an index tuple
  *	cpu_operator_cost	Cost of CPU time to execute an operator or function
+ *	cpu_tuple_comm_cost	Cost of CPU time to pass a tuple from worker to master backend
+ *	parallel_setup_cost	Cost of setting up shared memory for parallelism
+ *	parallel_startup_cost  Cost of starting up parallel workers
  *
  * We expect that the kernel will typically do some amount of read-ahead
  * optimization; this in conjunction with seek costs means that seq_page_cost
@@ -101,11 +104,16 @@ double		random_page_cost = DEFAULT_RANDOM_PAGE_COST;
 double		cpu_tuple_cost = DEFAULT_CPU_TUPLE_COST;
 double		cpu_index_tuple_cost = DEFAULT_CPU_INDEX_TUPLE_COST;
 double		cpu_operator_cost = DEFAULT_CPU_OPERATOR_COST;
+double		cpu_tuple_comm_cost = DEFAULT_CPU_TUPLE_COMM_COST;
+double		parallel_setup_cost = DEFAULT_PARALLEL_SETUP_COST;
+double		parallel_startup_cost = DEFAULT_PARALLEL_STARTUP_COST;
 
 int			effective_cache_size = DEFAULT_EFFECTIVE_CACHE_SIZE;
 
 Cost		disable_cost = 1.0e10;
 
+int	parallel_seqscan_degree = 0;
+
 bool		enable_seqscan = true;
 bool		enable_indexscan = true;
 bool		enable_indexonlyscan = true;
@@ -220,6 +228,55 @@ cost_seqscan(Path *path, PlannerInfo *root,
 }
 
 /*
+ * cost_funnel
+ *	  Determines and returns the cost of scanning a relation in parallel.
+ *
+ * 'baserel' is the relation to be scanned
+ * 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
+ */
+void
+cost_funnel(FunnelPath *path, PlannerInfo *root,
+			RelOptInfo *baserel, ParamPathInfo *param_info,
+			int nWorkers)
+{
+	Cost		startup_cost = 0;
+	Cost		run_cost = 0;
+
+	/* Should only be applied to base relations */
+	Assert(baserel->relid > 0);
+	Assert(baserel->rtekind == RTE_RELATION);
+
+	/* Mark the path with the correct row estimate */
+	if (param_info)
+		path->path.rows = param_info->ppi_rows;
+	else
+		path->path.rows = baserel->rows;
+
+	startup_cost = path->subpath->startup_cost;
+
+	run_cost = path->subpath->total_cost - path->subpath->startup_cost;
+
+	/*
+	 * Runtime cost will be equally shared by all workers.
+	 * The assumption here is that disk access cost will also be
+	 * equally shared between workers, which is generally true
+	 * unless there are too many workers working on a relatively
+	 * small number of blocks.  If we come across any such case,
+	 * then we can think of changing the current cost model for
+	 * parallel sequential scan.
+	 */
+	run_cost = run_cost / (nWorkers + 1);
+
+	/* Parallel setup and communication cost. */
+	startup_cost += parallel_setup_cost;
+	startup_cost += parallel_startup_cost * nWorkers;
+	run_cost += cpu_tuple_comm_cost * baserel->tuples;
+
+	path->path.startup_cost = startup_cost;
+	path->path.total_cost = (startup_cost + run_cost);
+}
+
+/*
  * cost_index
  *	  Determines and returns the cost of scanning a relation using an index.
  *
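
The arithmetic in cost_funnel can be checked by hand with a standalone sketch. This is only an illustration of the formula in the hunk above: FunnelCostParams, FunnelCost and funnel_cost_sketch are hypothetical names, and the parameter values used with it are made up for the example, not the defaults behind DEFAULT_CPU_TUPLE_COMM_COST and the other macros, which this hunk does not show.

```c
#include <assert.h>

/* Hypothetical stand-ins for the three new cost GUCs. */
typedef struct FunnelCostParams
{
	double		cpu_tuple_comm_cost;
	double		parallel_setup_cost;
	double		parallel_startup_cost;
} FunnelCostParams;

typedef struct FunnelCost
{
	double		startup_cost;
	double		total_cost;
} FunnelCost;

/*
 * Mirror of the cost_funnel arithmetic: the subpath's run cost is split
 * across the workers plus the master, then setup, per-worker startup and
 * per-tuple communication costs are added on top.
 */
FunnelCost
funnel_cost_sketch(double sub_startup, double sub_total,
				   double ntuples, int nworkers,
				   const FunnelCostParams *p)
{
	FunnelCost	result;
	double		run_cost;

	assert(nworkers >= 0);
	assert(sub_total >= sub_startup);

	/* Run cost is shared by nworkers workers and the master backend. */
	run_cost = (sub_total - sub_startup) / (nworkers + 1);

	/* One-time setup plus a startup charge for each worker. */
	result.startup_cost = sub_startup
		+ p->parallel_setup_cost
		+ p->parallel_startup_cost * nworkers;

	/* Every tuple of the relation is charged a communication cost. */
	run_cost += p->cpu_tuple_comm_cost * ntuples;

	result.total_cost = result.startup_cost + run_cost;
	return result;
}
```

For example, with a subpath costing 0 to start and 1000 in total, 100 tuples, 3 workers, and illustrative parameters (0.5, 10, 100), the run cost becomes 1000 / 4 + 0.5 * 100 = 300 and the startup cost 10 + 3 * 100 = 310, for a total of 610. Under this model the startup charge grows linearly with the worker count, so a small scan can end up costlier than a plain sequential scan.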
diff --git a/src/backend/optimizer/path/parallelpath.c b/src/backend/optimizer/path/parallelpath.c
new file mode 100644
index 0000000..0b25b39
--- /dev/null
+++ b/src/backend/optimizer/path/parallelpath.c
@@ -0,0 +1,115 @@
+/*-------------------------------------------------------------------------
+ *
+ * parallelpath.c
+ *	  Routines to determine which conditions are usable for scanning
+ *	  a given relation, and create ParallelPaths accordingly.
+ *
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/optimizer/path/parallelpath.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heapam.h"
+#include "nodes/relation.h"
+#include "optimizer/cost.h"
+#include "optimizer/pathnode.h"
+#include "optimizer/paths.h"
+#include "optimizer/restrictinfo.h"
+#include "optimizer/clauses.h"
+#include "parser/parsetree.h"
+#include "utils/rel.h"
+
+
+/*
+ *	check_simple_qual -
+ *		Check if qual is made only of simple things we can
+ *		hand out directly to backend worker for execution.
+ *
+ *		XXX - Currently we don't allow pushing down an expression
+ *		if it contains a volatile function; however, eventually we
+ *		need a mechanism (proisparallel) with which we can distinguish
+ *		the functions that can be pushed down for execution by a
+ *		parallel worker.
+ */
+static bool
+check_simple_qual(Node *node)
+{
+	if (node == NULL)
+		return TRUE;
+
+	if (contain_volatile_functions(node))
+		return FALSE;
+
+	return TRUE;
+}
+
+/*
+ * create_parallelscan_paths
+ *	  Create paths corresponding to parallel scans of the given rel.
+ *	  Currently we only support parallel sequential scan.
+ *
+ *	  Candidate paths are added to the rel's pathlist (using add_path).
+ */
+void
+create_parallelscan_paths(PlannerInfo *root, RelOptInfo *rel)
+{
+	int num_parallel_workers = 0;
+	Oid			reloid;
+	Relation	relation;
+	Path		*subpath;
+
+	/*
+	 * parallel scan is possible only if user has set
+	 * parallel_seqscan_degree to value greater than 0.
+	 */
+	if (parallel_seqscan_degree <= 0)
+		return;
+
+	/* parallel scan is supported only for SELECT statements. */
+	if (root->parse->commandType != CMD_SELECT)
+		return;
+
+	reloid = planner_rt_fetch(rel->relid, root)->relid;
+
+	relation = heap_open(reloid, NoLock);
+
+	/*
+	 * Temporary relations can't be scanned by parallel workers as
+	 * they are visible only to local sessions.
+	 */
+	if (RelationUsesLocalBuffers(relation))
+	{
+		heap_close(relation, NoLock);
+		return;
+	}
+
+	heap_close(relation, NoLock);
+
+	/*
+	 * parallel scan is not supported for quals containing volatile functions
+	 */
+	if (!check_simple_qual((Node*) extract_actual_clauses(rel->baserestrictinfo, false)))
+		return;
+
+	/*
+	 * There should be at least one page to scan for each worker.
+	 */
+	if (parallel_seqscan_degree <= rel->pages)
+		num_parallel_workers = parallel_seqscan_degree;
+	else
+		num_parallel_workers = rel->pages;
+
+	/* Create the partial scan path which each worker needs to execute. */
+	subpath = create_partialseqscan_path(root, rel, false);
+
+	/* Create the parallel scan path which master needs to execute. */
+	add_path(rel, (Path *) create_funnel_path(root, rel, subpath,
+											  num_parallel_workers));
+}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index cb69c03..9f084ab 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -58,6 +58,11 @@ static Material *create_material_plan(PlannerInfo *root, MaterialPath *best_path
 static Plan *create_unique_plan(PlannerInfo *root, UniquePath *best_path);
 static SeqScan *create_seqscan_plan(PlannerInfo *root, Path *best_path,
 					List *tlist, List *scan_clauses);
+static Scan *create_partialseqscan_plan(PlannerInfo *root, Path *best_path,
+							List *tlist, List *scan_clauses);
+static Scan *create_funnel_plan(PlannerInfo *root,
+								FunnelPath *best_path,
+								List *tlist, List *scan_clauses);
 static Scan *create_indexscan_plan(PlannerInfo *root, IndexPath *best_path,
 					  List *tlist, List *scan_clauses, bool indexonly);
 static BitmapHeapScan *create_bitmap_scan_plan(PlannerInfo *root,
@@ -100,6 +105,12 @@ static List *order_qual_clauses(PlannerInfo *root, List *clauses);
 static void copy_path_costsize(Plan *dest, Path *src);
 static void copy_plan_costsize(Plan *dest, Plan *src);
 static SeqScan *make_seqscan(List *qptlist, List *qpqual, Index scanrelid);
+static PartialSeqScan *make_partialseqscan(List *qptlist,
+										   List *qpqual,
+										   Index scanrelid);
+static Funnel *make_funnel(List *qptlist, List *qpqual,
+						   Index scanrelid, int nworkers,
+						   Plan *subplan);
 static IndexScan *make_indexscan(List *qptlist, List *qpqual, Index scanrelid,
 			   Oid indexid, List *indexqual, List *indexqualorig,
 			   List *indexorderby, List *indexorderbyorig,
@@ -228,6 +239,8 @@ create_plan_recurse(PlannerInfo *root, Path *best_path)
 	switch (best_path->pathtype)
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
+		case T_Funnel:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -343,6 +356,20 @@ create_scan_plan(PlannerInfo *root, Path *best_path)
 												scan_clauses);
 			break;
 
+		case T_PartialSeqScan:
+			plan = (Plan *) create_partialseqscan_plan(root,
+													   best_path,
+													   tlist,
+													   scan_clauses);
+			break;
+
+		case T_Funnel:
+			plan = (Plan *) create_funnel_plan(root,
+											   (FunnelPath *) best_path,
+											   tlist,
+											   scan_clauses);
+			break;
+
 		case T_IndexScan:
 			plan = (Plan *) create_indexscan_plan(root,
 												  (IndexPath *) best_path,
@@ -546,6 +573,8 @@ disuse_physical_tlist(PlannerInfo *root, Plan *plan, Path *path)
 	switch (path->pathtype)
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
+		case T_Funnel:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -1133,6 +1162,87 @@ create_seqscan_plan(PlannerInfo *root, Path *best_path,
 }
 
 /*
+ * create_partialseqscan_plan
+ *
+ * Returns a partial seqscan plan for the base relation scanned by
+ * 'best_path' with restriction clauses 'scan_clauses' and targetlist
+ * 'tlist'.
+ */
+static Scan *
+create_partialseqscan_plan(PlannerInfo *root, Path *best_path,
+						   List *tlist, List *scan_clauses)
+{
+	Scan    *scan_plan;
+	Index		scan_relid = best_path->parent->relid;
+
+	/* it should be a base rel... */
+	Assert(scan_relid > 0);
+	Assert(best_path->parent->rtekind == RTE_RELATION);
+
+	/* Sort clauses into best execution order */
+	scan_clauses = order_qual_clauses(root, scan_clauses);
+
+	/* Reduce RestrictInfo list to bare expressions; ignore pseudoconstants */
+	scan_clauses = extract_actual_clauses(scan_clauses, false);
+
+	/* Replace any outer-relation variables with nestloop params */
+	if (best_path->param_info)
+	{
+		scan_clauses = (List *)
+			replace_nestloop_params(root, (Node *) scan_clauses);
+	}
+
+	scan_plan = (Scan *) make_partialseqscan(tlist,
+											 scan_clauses,
+											 scan_relid);
+
+	copy_path_costsize(&scan_plan->plan, best_path);
+
+	return scan_plan;
+}
+
+/*
+ * create_funnel_plan
+ *
+ * Returns a funnel plan for the base relation scanned by
+ * 'best_path' with restriction clauses 'scan_clauses' and targetlist
+ * 'tlist'.
+ */
+static Scan *
+create_funnel_plan(PlannerInfo *root, FunnelPath *best_path,
+				   List *tlist, List *scan_clauses)
+{
+	Scan    *scan_plan;
+	Plan	   *subplan;
+	Index		scan_relid = best_path->path.parent->relid;
+
+	/* it should be a base rel... */
+	Assert(scan_relid > 0);
+	Assert(best_path->path.parent->rtekind == RTE_RELATION);
+
+	subplan = create_plan_recurse(root, best_path->subpath);
+
+	/*
+	 * quals for subplan and top level plan are same
+	 * as either all the quals are pushed to subplan
+	 * (partialseqscan plan) or parallel plan won't be
+	 * chosen.
+	 */
+	scan_plan = (Scan *) make_funnel(tlist,
+									 subplan->qual,
+									 scan_relid,
+									 best_path->num_workers,
+									 subplan);
+
+	copy_path_costsize(&scan_plan->plan, &best_path->path);
+
+	/* use parallel mode for parallel plans. */
+	root->glob->parallelModeNeeded = true;
+
+	return scan_plan;
+}
+
+/*
  * create_indexscan_plan
  *	  Returns an indexscan plan for the base relation scanned by 'best_path'
  *	  with restriction clauses 'scan_clauses' and targetlist 'tlist'.
@@ -3321,6 +3431,45 @@ make_seqscan(List *qptlist,
 	return node;
 }
 
+static PartialSeqScan *
+make_partialseqscan(List *qptlist,
+					List *qpqual,
+					Index scanrelid)
+{
+	PartialSeqScan *node = makeNode(PartialSeqScan);
+	Plan	   *plan = &node->plan;
+
+	/* cost should be inserted by caller */
+	plan->targetlist = qptlist;
+	plan->qual = qpqual;
+	plan->lefttree = NULL;
+	plan->righttree = NULL;
+	node->scanrelid = scanrelid;
+
+	return node;
+}
+
+static Funnel *
+make_funnel(List *qptlist,
+			List *qpqual,
+			Index scanrelid,
+			int nworkers,
+			Plan *subplan)
+{
+	Funnel *node = makeNode(Funnel);
+	Plan	   *plan = &node->scan.plan;
+
+	/* cost should be inserted by caller */
+	plan->targetlist = qptlist;
+	plan->qual = qpqual;
+	plan->lefttree = subplan;
+	plan->righttree = NULL;
+	node->scan.scanrelid = scanrelid;
+	node->num_workers = nworkers;
+
+	return node;
+}
+
 static IndexScan *
 make_indexscan(List *qptlist,
 			   List *qpqual,
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index b02a107..182c70d 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -260,6 +260,50 @@ standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
 	return result;
 }
 
+PlannedStmt	*
+create_worker_scan_plannedstmt(PartialSeqScan *partialscan, List *rangetable)
+{
+	PlannedStmt	*result;
+	ListCell   *tlist;
+
+	/*
+	 * Avoid removing junk entries in worker as those are
+	 * required by upper nodes in master backend.
+	 */
+	foreach(tlist, partialscan->plan.targetlist)
+	{
+		TargetEntry *tle = (TargetEntry *) lfirst(tlist);
+
+		tle->resjunk = false;
+	}
+
+	/* build the PlannedStmt result */
+	result = makeNode(PlannedStmt);
+
+	result->commandType = CMD_SELECT;
+	result->queryId = 0;
+	result->hasReturning = 0;
+	result->hasModifyingCTE = 0;
+	result->canSetTag = 1;
+	result->transientPlan = 0;
+	result->planTree = (Plan*) partialscan;
+	result->rtable = rangetable;
+	result->resultRelations = NIL;
+	result->utilityStmt = NULL;
+	result->subplans = NIL;
+	result->rewindPlanIDs = NULL;
+	result->rowMarks = NIL;
+	result->nParamExec = 0;
+	/*
+	 * Don't bother to set parameters used for invalidation as
+	 * worker backend plans are not saved, so can't be invalidated.
+	 */
+	result->relationOids = NIL;
+	result->invalItems = NIL;
+	result->hasRowSecurity = false;
+
+	return result;
+}
 
 /*--------------------
  * subquery_planner
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index ec828cd..ef8c317 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -435,6 +435,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
 			{
 				SeqScan    *splan = (SeqScan *) plan;
 
@@ -445,6 +446,24 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 					fix_scan_list(root, splan->plan.qual, rtoffset);
 			}
 			break;
+		case T_Funnel:
+			{
+				Funnel    *splan = (Funnel *) plan;
+
+				splan->scan.scanrelid += rtoffset;
+				splan->scan.plan.targetlist =
+					fix_scan_list(root, splan->scan.plan.targetlist, rtoffset);
+				splan->scan.plan.qual =
+					fix_scan_list(root, splan->scan.plan.qual, rtoffset);
+
+				/*
+				 * target list for partial sequence scan (lefttree of funnel plan)
+				 * should be same as for funnel scan as both nodes need to produce
+				 * same projection.
+				 */
+				splan->scan.plan.lefttree->targetlist = splan->scan.plan.targetlist;
+			}
+			break;
 		case T_IndexScan:
 			{
 				IndexScan  *splan = (IndexScan *) plan;
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 5a1d539..8ea91ec 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2163,6 +2163,8 @@ finalize_plan(PlannerInfo *root, Plan *plan, Bitmapset *valid_params,
 			break;
 
 		case T_SeqScan:
+		case T_PartialSeqScan:
+		case T_Funnel:
 			context.paramids = bms_add_members(context.paramids, scan_params);
 			break;
 
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 1395a21..c1ffe78 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -706,6 +706,53 @@ create_seqscan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer)
 }
 
 /*
+ * create_partialseqscan_path
+ *	  Creates a path corresponding to a partial sequential scan, returning the
+ *	  pathnode.
+ */
+Path *
+create_partialseqscan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer)
+{
+	Path	   *pathnode = makeNode(Path);
+
+	pathnode->pathtype = T_PartialSeqScan;
+	pathnode->parent = rel;
+	pathnode->param_info = get_baserel_parampathinfo(root, rel,
+													 false);
+	pathnode->pathkeys = NIL;	/* seqscan has unordered result */
+
+	cost_seqscan(pathnode, root, rel, pathnode->param_info);
+
+	return pathnode;
+}
+
+/*
+ * create_funnel_path
+ *
+ *	  Creates a path corresponding to a funnel scan, returning the
+ *	  pathnode.
+ */
+FunnelPath *
+create_funnel_path(PlannerInfo *root, RelOptInfo *rel,
+							Path* subpath, int nWorkers)
+{
+	FunnelPath	   *pathnode = makeNode(FunnelPath);
+
+	pathnode->path.pathtype = T_Funnel;
+	pathnode->path.parent = rel;
+	pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
+													 false);
+	pathnode->path.pathkeys = NIL;	/* seqscan has unordered result */
+
+	pathnode->subpath = subpath;
+	pathnode->num_workers = nWorkers;
+
+	cost_funnel(pathnode, root, rel, pathnode->path.param_info, nWorkers);
+
+	return pathnode;
+}
+
+/*
  * create_index_path
  *	  Creates a path node for an index scan.
  *
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 71c2321..f056bd5 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -12,7 +12,8 @@ subdir = src/backend/postmaster
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = autovacuum.o bgworker.o bgwriter.o checkpointer.o fork_process.o \
-	pgarch.o pgstat.o postmaster.o startup.o syslogger.o walwriter.o
+OBJS = autovacuum.o backendworker.o bgworker.o bgwriter.o checkpointer.o \
+	fork_process.o pgarch.o pgstat.o postmaster.o startup.o syslogger.o \
+	walwriter.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/backendworker.c b/src/backend/postmaster/backendworker.c
new file mode 100644
index 0000000..28705d6
--- /dev/null
+++ b/src/backend/postmaster/backendworker.c
@@ -0,0 +1,400 @@
+/*-------------------------------------------------------------------------
+ *
+ * backendworker.c
+ *	  Support routines for setting up backend workers.
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/postmaster/backendworker.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * INTERFACE ROUTINES
+ *		InitializeParallelWorkers				Setup dynamic shared memory and parallel backend workers.
+ */
+#include "postgres.h"
+
+#include "access/xact.h"
+#include "commands/dbcommands.h"
+#include "executor/nodeFunnel.h"
+#include "miscadmin.h"
+#include "nodes/parsenodes.h"
+#include "optimizer/planmain.h"
+#include "optimizer/planner.h"
+#include "postmaster/backendworker.h"
+#include "tcop/tcopprot.h"
+
+
+#define PARALLEL_TUPLE_QUEUE_SIZE					65536
+
+static void ParallelQueryMain(dsm_segment *seg, shm_toc *toc);
+static void
+EstimateParallelSupportInfoSpace(ParallelContext *pcxt, ParamListInfo params,
+								 int instOptions, Size *params_size);
+static void
+StoreParallelSupportInfo(ParallelContext *pcxt, ParamListInfo params,
+						 int instOptions, int params_size,
+						 char **inst_options_space);
+static void
+EstimatePartialSeqScanSpace(ParallelContext *pcxt, EState *estate,
+							char *plannedstmt_str, Size *plannedstmt_len,
+							Size *pscan_size);
+static void
+StorePartialSeqScan(ParallelContext *pcxt, EState *estate, Relation rel,
+					 char *plannedstmt_str, ParallelHeapScanDesc *pscan,
+					 Size plannedstmt_size, Size pscan_size);
+static void EstimateResponseQueueSpace(ParallelContext *pcxt);
+static void
+StoreResponseQueue(ParallelContext *pcxt,
+				   shm_mq_handle ***responseqp);
+static void
+GetPlannedStmt(shm_toc *toc, PlannedStmt **plannedstmt);
+static void
+GetParallelSupportInfo(shm_toc *toc, ParamListInfo *params,
+					   int *inst_options, char **instrument);
+static void
+SetupResponseQueue(dsm_segment *seg, shm_toc *toc, shm_mq **mq,
+				   shm_mq_handle **responseq);
+
+
+/*
+ * EstimateParallelSupportInfoSpace
+ *
+ * Estimate the amount of space required to record information of
+ * bind parameters and instrumentation information that need to be
+ * retrieved from parallel workers.
+ */
+void
+EstimateParallelSupportInfoSpace(ParallelContext *pcxt, ParamListInfo params,
+								 int instOptions, Size *params_size)
+{
+	*params_size = EstimateBoundParametersSpace(params);
+	shm_toc_estimate_chunk(&pcxt->estimator, *params_size);
+
+	/* account for instrumentation options. */
+	shm_toc_estimate_chunk(&pcxt->estimator, sizeof(int));
+
+	/*
+	 * We expect each worker to populate the instrumentation structure
+	 * allocated by master backend and then master backend will aggregate
+	 * all the information, so account for it once per worker.
+	 */
+	if (instOptions)
+	{
+		shm_toc_estimate_chunk(&pcxt->estimator,
+							   sizeof(Instrumentation) * pcxt->nworkers);
+		/* key for instrumentation information. */
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+	}
+
+	/* keys for parallel support information. */
+	shm_toc_estimate_keys(&pcxt->estimator, 2);
+}
+
+/*
+ * StoreParallelSupportInfo
+ * 
+ * Sets up the bind parameters and instrumentation information
+ * required for parallel execution.
+ */
+void
+StoreParallelSupportInfo(ParallelContext *pcxt, ParamListInfo params,
+						 int instOptions, int params_size,
+						 char **inst_options_space)
+{
+	char	*paramsdata;
+	int		*inst_options;
+
+	/*
+	 * Store the bind parameter list in dynamic shared memory.  This is
+	 * used for parameters in prepared queries.
+	 */
+	paramsdata = shm_toc_allocate(pcxt->toc, params_size);
+	SerializeBoundParams(params, params_size, paramsdata);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_PARAMS, paramsdata);
+
+	/* Store instrument options in dynamic shared memory. */
+	inst_options = shm_toc_allocate(pcxt->toc, sizeof(int));
+	*inst_options = instOptions;
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_INST_OPTIONS, inst_options);
+
+	/*
+	 * Allocate space for instrumentation information to be filled by
+	 * each worker.
+	 */
+	if (instOptions)
+	{
+		*inst_options_space =
+			shm_toc_allocate(pcxt->toc, sizeof(Instrumentation) * pcxt->nworkers);
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_INST_INFO, *inst_options_space);
+	}
+}
+
+/*
+ * EstimatePartialSeqScanSpace
+ *
+ * Estimate the amount of space required to record information of
+ * planned statement and parallel heap scan descriptor that need
+ * to be copied to parallel workers.
+ */
+void
+EstimatePartialSeqScanSpace(ParallelContext *pcxt, EState *estate,
+							char *plannedstmt_str, Size *plannedstmt_len,
+							Size *pscan_size)
+{
+	/* Estimate space for partial seq. scan specific contents. */
+	*plannedstmt_len = strlen(plannedstmt_str) + 1;
+	shm_toc_estimate_chunk(&pcxt->estimator, *plannedstmt_len);
+
+	*pscan_size = heap_parallelscan_estimate(estate->es_snapshot);
+	shm_toc_estimate_chunk(&pcxt->estimator, *pscan_size);
+
+	/* keys for parallel support information. */
+	shm_toc_estimate_keys(&pcxt->estimator, 2);
+}
+
+/*
+ * StorePartialSeqScan
+ * 
+ * Sets up the planned statement and parallel heap scan descriptor
+ * for the partial sequential scan.
+ */
+void
+StorePartialSeqScan(ParallelContext *pcxt, EState *estate, Relation rel,
+					 char *plannedstmt_str, ParallelHeapScanDesc *pscan,
+					 Size plannedstmt_size, Size pscan_size)
+{
+	char		*plannedstmtdata;
+
+	/* Store the serialized planned statement in dynamic shared memory. */
+	plannedstmtdata = shm_toc_allocate(pcxt->toc, plannedstmt_size);
+	memcpy(plannedstmtdata, plannedstmt_str, plannedstmt_size);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_PLANNEDSTMT, plannedstmtdata);
+
+	/* Store parallel heap scan descriptor in dynamic shared memory. */
+	*pscan = shm_toc_allocate(pcxt->toc, pscan_size);
+	heap_parallelscan_initialize(*pscan, rel, estate->es_snapshot);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_SCAN, *pscan);
+}
+
+/*
+ * EstimateResponseQueueSpace
+ *
+ * Estimate the amount of space required to record information of
+ * tuple queues that need to be established between parallel workers
+ * and master backend.
+ */
+void
+EstimateResponseQueueSpace(ParallelContext *pcxt)
+{
+	/* Estimate space for one tuple queue per worker. */
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   (Size) PARALLEL_TUPLE_QUEUE_SIZE * pcxt->nworkers);
+
+	/* keys for response queue. */
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/*
+ * StoreResponseQueue
+ * 
+ * Sets up the response queues that backend workers use to
+ * return tuples to the main backend.
+ */
+void
+StoreResponseQueue(ParallelContext *pcxt,
+				   shm_mq_handle ***responseqp)
+{
+	shm_mq		*mq;
+	char		*tuple_queue_space;
+	int			i;
+
+	/* Allocate memory for shared memory queue handles. */
+	*responseqp = (shm_mq_handle**) palloc(pcxt->nworkers * sizeof(shm_mq_handle*));
+
+	/*
+	 * Establish one message queue per worker in dynamic shared memory.
+	 * These queues should be used to transmit tuple data.
+	 */
+	tuple_queue_space =
+	   shm_toc_allocate(pcxt->toc, PARALLEL_TUPLE_QUEUE_SIZE * pcxt->nworkers);
+	for (i = 0; i < pcxt->nworkers; ++i)
+	{
+		mq = shm_mq_create(tuple_queue_space + i * PARALLEL_TUPLE_QUEUE_SIZE,
+						   (Size) PARALLEL_TUPLE_QUEUE_SIZE);
+		
+		shm_mq_set_receiver(mq, MyProc);
+
+		/*
+		 * Attach the queue before launching a worker, so that we'll automatically
+		 * detach the queue if we error out.  (Otherwise, the worker might sit
+		 * there trying to write the queue long after we've gone away.)
+		 */
+		(*responseqp)[i] = shm_mq_attach(mq, pcxt->seg, NULL);
+	}
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_TUPLE_QUEUE, tuple_queue_space);
+}
+
+/*
+ * InitializeParallelWorkers
+ *
+ *	Sets up the required infrastructure for backend workers to
+ *	perform execution and return results to the main backend.
+ */
+void
+InitializeParallelWorkers(Plan *plan, EState *estate, Relation rel,
+						  char **inst_options_space,
+						  shm_mq_handle ***responseqp, ParallelContext **pcxtp,
+						  ParallelHeapScanDesc *pscan, int nWorkers)
+{
+	bool		already_in_parallel_mode = IsInParallelMode();
+	Size		params_size, pscan_size, plannedstmt_size;
+	char	   *plannedstmt_str;
+	PlannedStmt	*plannedstmt;
+	ParallelContext *pcxt;
+
+	if (!already_in_parallel_mode)
+		EnterParallelMode();
+
+	pcxt = CreateParallelContext(ParallelQueryMain, nWorkers);
+
+	plannedstmt = create_worker_scan_plannedstmt((PartialSeqScan *)plan,
+												 estate->es_range_table);
+	plannedstmt_str = nodeToString(plannedstmt);
+
+	EstimatePartialSeqScanSpace(pcxt, estate, plannedstmt_str,
+								&plannedstmt_size, &pscan_size);
+	EstimateParallelSupportInfoSpace(pcxt, estate->es_param_list_info,
+									 estate->es_instrument, &params_size);
+	EstimateResponseQueueSpace(pcxt);
+
+	InitializeParallelDSM(pcxt);
+	
+	StorePartialSeqScan(pcxt, estate, rel, plannedstmt_str,
+						pscan, plannedstmt_size, pscan_size);
+
+	StoreParallelSupportInfo(pcxt, estate->es_param_list_info,
+							 estate->es_instrument,
+							 params_size, inst_options_space);
+	StoreResponseQueue(pcxt, responseqp);
+
+	/* Return results to caller. */
+	*pcxtp = pcxt;
+}
+
+/*
+ * GetParallelSupportInfo
+ *
+ * Look up based on keys in dynamic shared memory segment
+ * and get the bind parameters and instrumentation information
+ * required to perform parallel operation.
+ */
+void
+GetParallelSupportInfo(shm_toc *toc, ParamListInfo *params,
+					   int *inst_options, char **instrument)
+{
+	char		*paramsdata;
+	char		*inst_options_space;
+	int			*instoptions;
+
+	paramsdata = shm_toc_lookup(toc, PARALLEL_KEY_PARAMS);
+	instoptions	= shm_toc_lookup(toc, PARALLEL_KEY_INST_OPTIONS);
+
+	*params = RestoreBoundParams(paramsdata);
+
+	*inst_options = *instoptions;
+	if (*inst_options)
+	{
+		inst_options_space = shm_toc_lookup(toc, PARALLEL_KEY_INST_INFO);
+		*instrument = (inst_options_space +
+			ParallelWorkerNumber * sizeof(Instrumentation));
+	}
+}
+
+/*
+ * GetPlannedStmt
+ *
+ * Look up based on keys in dynamic shared memory segment
+ * and get the planned statement required to perform
+ * parallel operation.
+ */
+void
+GetPlannedStmt(shm_toc *toc, PlannedStmt **plannedstmt)
+{
+	char		*plannedstmtdata;
+
+	plannedstmtdata = shm_toc_lookup(toc, PARALLEL_KEY_PLANNEDSTMT);
+
+	*plannedstmt = (PlannedStmt *) stringToNode(plannedstmtdata);
+
+	/* Fill in opfuncid values if missing */
+	fix_opfuncids((Node*) (*plannedstmt)->planTree->qual);
+	fix_opfuncids((Node*) (*plannedstmt)->planTree->targetlist);
+}
+
+/*
+ * SetupResponseQueue
+ *
+ * Look up based on keys in dynamic shared memory segment
+ * and get the tuple queue information for a particular worker,
+ * attach to the queue and redirect all further responses from
+ * worker backend via that queue.
+ */
+void
+SetupResponseQueue(dsm_segment *seg, shm_toc *toc, shm_mq **mq,
+				   shm_mq_handle **responseq)
+{
+	char		*tuple_queue_space;
+
+	tuple_queue_space = shm_toc_lookup(toc, PARALLEL_KEY_TUPLE_QUEUE);
+	*mq = (shm_mq *) (tuple_queue_space +
+		ParallelWorkerNumber * PARALLEL_TUPLE_QUEUE_SIZE);
+
+	shm_mq_set_sender(*mq, MyProc);
+	*responseq = shm_mq_attach(*mq, seg, NULL);
+}
+
+/*
+ * ParallelQueryMain
+ *
+ * Execute the operation to return the tuples or other information
+ * to the node driving parallelism.
+ */
+void
+ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
+{
+	shm_mq			*mq;
+	shm_mq_handle	*responseq;
+	PlannedStmt		*plannedstmt;
+	ParamListInfo	params;
+	int				inst_options;
+	char			*instrument = NULL;
+	ParallelStmt	*parallelstmt;
+
+	SetupResponseQueue(seg, toc, &mq, &responseq);
+
+	GetPlannedStmt(toc, &plannedstmt);
+	GetParallelSupportInfo(toc, &params, &inst_options, &instrument);
+
+	parallelstmt = palloc(sizeof(ParallelStmt));
+
+	parallelstmt->plannedstmt = plannedstmt;
+	parallelstmt->params	= params;
+	parallelstmt->inst_options = inst_options;
+	parallelstmt->instrument = instrument;
+	parallelstmt->toc = toc;
+	parallelstmt->responseq = responseq;
+
+	/* Execute the worker command. */
+	exec_parallel_stmt(parallelstmt);
+
+	/*
+	 * Once we are done with sending tuples, detach from
+	 * shared memory message queue used to send tuples.
+	 */
+	shm_mq_detach(mq);
+}
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index ac431e5..4c303dd 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -103,6 +103,7 @@
 #include "miscadmin.h"
 #include "pg_getopt.h"
 #include "pgstat.h"
+#include "optimizer/cost.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/bgworker_internals.h"
 #include "postmaster/fork_process.h"
@@ -835,6 +836,12 @@ PostmasterMain(int argc, char *argv[])
 		ereport(ERROR,
 				(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"archive\", \"hot_standby\", or \"logical\"")));
 
+	if (parallel_seqscan_degree >= MaxConnections)
+	{
+		write_stderr("%s: parallel_seqscan_degree must be less than max_connections\n", progname);
+		ExitPostmaster(1);
+	}
+
 	/*
 	 * Other one-time internal sanity checks can go here, if they are fast.
 	 * (Put any slow processing further down, after postmaster.pid creation.)
diff --git a/src/backend/tcop/dest.c b/src/backend/tcop/dest.c
index bcf3895..7a9ce3e 100644
--- a/src/backend/tcop/dest.c
+++ b/src/backend/tcop/dest.c
@@ -34,6 +34,7 @@
 #include "commands/createas.h"
 #include "commands/matview.h"
 #include "executor/functions.h"
+#include "executor/tqueue.h"
 #include "executor/tstoreReceiver.h"
 #include "libpq/libpq.h"
 #include "libpq/pqformat.h"
@@ -129,6 +130,9 @@ CreateDestReceiver(CommandDest dest)
 
 		case DestTransientRel:
 			return CreateTransientRelDestReceiver(InvalidOid);
+
+		case DestTupleQueue:
+			return CreateTupleQueueDestReceiver();
 	}
 
 	/* should never get here */
@@ -162,6 +166,7 @@ EndCommand(const char *commandTag, CommandDest dest)
 		case DestCopyOut:
 		case DestSQLFunction:
 		case DestTransientRel:
+		case DestTupleQueue:
 			break;
 	}
 }
@@ -204,6 +209,7 @@ NullCommand(CommandDest dest)
 		case DestCopyOut:
 		case DestSQLFunction:
 		case DestTransientRel:
+		case DestTupleQueue:
 			break;
 	}
 }
@@ -248,6 +254,7 @@ ReadyForQuery(CommandDest dest)
 		case DestCopyOut:
 		case DestSQLFunction:
 		case DestTransientRel:
+		case DestTupleQueue:
 			break;
 	}
 }
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index ea2a432..17f322f 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -42,6 +42,7 @@
 #include "catalog/pg_type.h"
 #include "commands/async.h"
 #include "commands/prepare.h"
+#include "executor/tqueue.h"
 #include "libpq/libpq.h"
 #include "libpq/pqformat.h"
 #include "libpq/pqsignal.h"
@@ -55,6 +56,7 @@
 #include "pg_getopt.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/postmaster.h"
+#include "postmaster/backendworker.h"
 #include "replication/slot.h"
 #include "replication/walsender.h"
 #include "rewrite/rewriteHandler.h"
@@ -1191,6 +1193,80 @@ exec_simple_query(const char *query_string)
 }
 
 /*
+ * exec_parallel_stmt
+ *
+ * Execute the plan for backend worker.
+ */
+void
+exec_parallel_stmt(ParallelStmt *parallelstmt)
+{
+	DestReceiver *receiver;
+	QueryDesc	*queryDesc;
+	MemoryContext oldcontext;
+	MemoryContext	plancontext;
+
+	set_ps_display("SELECT", false);
+
+	/*
+	 * Unlike exec_simple_query(), in backend worker we won't allow
+	 * transaction control statements, so we can allow plancontext
+	 * to be created in TopTransaction context.
+	 */
+	plancontext = AllocSetContextCreate(CurrentMemoryContext,
+										 "worker plan",
+										 ALLOCSET_DEFAULT_MINSIZE,
+										 ALLOCSET_DEFAULT_INITSIZE,
+										 ALLOCSET_DEFAULT_MAXSIZE);
+
+	oldcontext = MemoryContextSwitchTo(plancontext);
+
+	if (parallelstmt->inst_options)
+		receiver = None_Receiver;
+	else
+	{
+		receiver = CreateDestReceiver(DestTupleQueue);
+		SetTupleQueueDestReceiverParams(receiver, parallelstmt->responseq);
+	}
+
+	/* Create a QueryDesc for the query */
+	queryDesc = CreateQueryDesc(parallelstmt->plannedstmt, "",
+								GetActiveSnapshot(), InvalidSnapshot,
+								receiver, parallelstmt->params,
+								parallelstmt->inst_options);
+
+	queryDesc->toc = parallelstmt->toc;
+
+	PushActiveSnapshot(queryDesc->snapshot);
+
+	/* call ExecutorStart to prepare the plan for execution */
+	ExecutorStart(queryDesc, 0);
+
+	/* run the plan */
+	ExecutorRun(queryDesc, ForwardScanDirection, 0L);
+
+	/* run cleanup too */
+	ExecutorFinish(queryDesc);
+
+	/*
+	 * copy instrumentation information into shared memory if requested
+	 * by master backend.
+	 */
+	if (parallelstmt->inst_options)
+		memcpy(parallelstmt->instrument,
+			   queryDesc->planstate->instrument,
+			   sizeof(Instrumentation));
+
+	ExecutorEnd(queryDesc);
+
+	PopActiveSnapshot();
+
+	FreeQueryDesc(queryDesc);
+
+	if (!parallelstmt->inst_options)
+		(*receiver->rDestroy) (receiver);
+}
+
+/*
  * exec_parse_message
  *
  * Execute a "Parse" protocol message.
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 9c14e8a..0bbc67b 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -80,6 +80,7 @@ CreateQueryDesc(PlannedStmt *plannedstmt,
 	qd->params = params;		/* parameter values passed into query */
 	qd->instrument_options = instrument_options;		/* instrumentation
 														 * wanted? */
+	qd->toc = NULL;		/* needs to be set by the caller before ExecutorStart */
 
 	/* null these fields until set by ExecutorStart */
 	qd->tupDesc = NULL;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 791543e..abc2b8f 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -608,6 +608,8 @@ const char *const config_group_names[] =
 	gettext_noop("Statistics / Query and Index Statistics Collector"),
 	/* AUTOVACUUM */
 	gettext_noop("Autovacuum"),
+	/* PARALLEL_QUERY */
+	gettext_noop("Parallel Query"),
 	/* CLIENT_CONN */
 	gettext_noop("Client Connection Defaults"),
 	/* CLIENT_CONN_STATEMENT */
@@ -2537,6 +2539,16 @@ static struct config_int ConfigureNamesInt[] =
 	},
 
 	{
+		{"parallel_seqscan_degree", PGC_SUSET, PARALLEL_QUERY,
+			gettext_noop("Sets the maximum number of simultaneously running backend worker processes."),
+			NULL
+		},
+		&parallel_seqscan_degree,
+		0, 0, MAX_BACKENDS,
+		NULL, NULL, NULL
+	},
+
+	{
 		{"autovacuum_work_mem", PGC_SIGHUP, RESOURCES_MEM,
 			gettext_noop("Sets the maximum memory to be used by each autovacuum worker process."),
 			NULL,
@@ -2724,6 +2736,36 @@ static struct config_real ConfigureNamesReal[] =
 		DEFAULT_CPU_OPERATOR_COST, 0, DBL_MAX,
 		NULL, NULL, NULL
 	},
+	{
+		{"cpu_tuple_comm_cost", PGC_USERSET, QUERY_TUNING_COST,
+			gettext_noop("Sets the planner's estimate of the cost of "
+						 "passing each tuple (row) from worker to master backend."),
+			NULL
+		},
+		&cpu_tuple_comm_cost,
+		DEFAULT_CPU_TUPLE_COMM_COST, 0, DBL_MAX,
+		NULL, NULL, NULL
+	},
+	{
+		{"parallel_setup_cost", PGC_USERSET, QUERY_TUNING_COST,
+			gettext_noop("Sets the planner's estimate of the cost of "
+						 "setting up environment (shared memory) for parallelism."),
+			NULL
+		},
+		&parallel_setup_cost,
+		DEFAULT_PARALLEL_SETUP_COST, 0, DBL_MAX,
+		NULL, NULL, NULL
+	},
+	{
+		{"parallel_startup_cost", PGC_USERSET, QUERY_TUNING_COST,
+			gettext_noop("Sets the planner's estimate of the cost of "
+						 "starting parallel workers."),
+			NULL
+		},
+		&parallel_startup_cost,
+		DEFAULT_PARALLEL_STARTUP_COST, 0, DBL_MAX,
+		NULL, NULL, NULL
+	},
 
 	{
 		{"cursor_tuple_fraction", PGC_USERSET, QUERY_TUNING_OTHER,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index f8f9ce1..fbe6042 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -290,6 +290,9 @@
 #cpu_tuple_cost = 0.01			# same scale as above
 #cpu_index_tuple_cost = 0.005		# same scale as above
 #cpu_operator_cost = 0.0025		# same scale as above
+#cpu_tuple_comm_cost = 0.1		# same scale as above
+#parallel_setup_cost = 0.0	# same scale as above
+#parallel_startup_cost = 0.0	# same scale as above
 #effective_cache_size = 4GB
 
 # - Genetic Query Optimizer -
@@ -500,6 +503,11 @@
 					# autovacuum, -1 means use
 					# vacuum_cost_limit
 
+#------------------------------------------------------------------------------
+# PARALLEL_QUERY PARAMETERS
+#------------------------------------------------------------------------------
+
+#parallel_seqscan_degree = 0		# max number of worker backend subprocesses
 
 #------------------------------------------------------------------------------
 # CLIENT CONNECTION DEFAULTS
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index fb2b5f0..d4f4e2d 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -117,6 +117,7 @@ extern HeapScanDesc heap_beginscan_bm(Relation relation, Snapshot snapshot,
 extern void heap_setscanlimits(HeapScanDesc scan, BlockNumber startBlk,
 		   BlockNumber endBlk);
 extern void heap_rescan(HeapScanDesc scan, ScanKey key);
+extern void heap_parallel_rescan(ParallelHeapScanDesc pscan, HeapScanDesc scan);
 extern void heap_endscan(HeapScanDesc scan);
 extern HeapTuple heap_getnext(HeapScanDesc scan, ScanDirection direction);
 
diff --git a/src/include/access/shmmqam.h b/src/include/access/shmmqam.h
new file mode 100644
index 0000000..80d06ac
--- /dev/null
+++ b/src/include/access/shmmqam.h
@@ -0,0 +1,36 @@
+/*-------------------------------------------------------------------------
+ *
+ * shmmqam.h
+ *	  POSTGRES shared memory queue access method definitions.
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/shmmqam.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef SHMMQAM_H
+#define SHMMQAM_H
+
+#include "access/relscan.h"
+#include "executor/tqueue.h"
+#include "libpq/pqmq.h"
+
+
+/* Private state maintained across calls to shm_getnext. */
+typedef struct worker_result_state
+{
+	bool		all_workers_done;
+	bool		local_scan_done;
+} worker_result_state;
+
+typedef struct worker_result_state *worker_result;
+
+extern worker_result ExecInitWorkerResult(void);
+extern HeapTuple shm_getnext(HeapScanDesc scanDesc, worker_result resultState,
+							 TupleQueueFunnel *funnel, ScanDirection direction,
+							 bool *fromheap);
+
+#endif   /* SHMMQAM_H */
diff --git a/src/include/executor/execdesc.h b/src/include/executor/execdesc.h
index a2381cd..56b7c75 100644
--- a/src/include/executor/execdesc.h
+++ b/src/include/executor/execdesc.h
@@ -42,6 +42,7 @@ typedef struct QueryDesc
 	DestReceiver *dest;			/* the destination for tuple output */
 	ParamListInfo params;		/* param values being passed in */
 	int			instrument_options;		/* OR of InstrumentOption flags */
+	shm_toc		*toc;			/* to fetch the information from dsm */
 
 	/* These fields are set by ExecutorStart */
 	TupleDesc	tupDesc;		/* descriptor for result tuples */
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 1c3b2b0..e8522fe 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -69,5 +69,6 @@ extern Instrumentation *InstrAlloc(int n, int instrument_options);
 extern void InstrStartNode(Instrumentation *instr);
 extern void InstrStopNode(Instrumentation *instr, double nTuples);
 extern void InstrEndLoop(Instrumentation *instr);
+extern void InstrAggNode(Instrumentation *instr1, Instrumentation *instr2);
 
 #endif   /* INSTRUMENT_H */
diff --git a/src/include/executor/nodeFunnel.h b/src/include/executor/nodeFunnel.h
new file mode 100644
index 0000000..7c6d93f
--- /dev/null
+++ b/src/include/executor/nodeFunnel.h
@@ -0,0 +1,24 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodefunnel.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeFunnel.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEFUNNEL_H
+#define NODEFUNNEL_H
+
+#include "nodes/execnodes.h"
+
+extern FunnelState *ExecInitFunnel(Funnel *node, EState *estate, int eflags);
+extern TupleTableSlot *ExecFunnel(FunnelState *node);
+extern void ExecEndFunnel(FunnelState *node);
+extern void ExecReScanFunnel(FunnelState *node);
+
+#endif   /* NODEFUNNEL_H */
diff --git a/src/include/executor/nodePartialSeqscan.h b/src/include/executor/nodePartialSeqscan.h
new file mode 100644
index 0000000..f02bcca
--- /dev/null
+++ b/src/include/executor/nodePartialSeqscan.h
@@ -0,0 +1,23 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodePartialSeqscan.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodePartialSeqscan.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEPARTIALSEQSCAN_H
+#define NODEPARTIALSEQSCAN_H
+
+#include "nodes/execnodes.h"
+
+extern PartialSeqScanState *ExecInitPartialSeqScan(PartialSeqScan *node, EState *estate, int eflags);
+extern TupleTableSlot *ExecPartialSeqScan(PartialSeqScanState *node);
+extern void ExecEndPartialSeqScan(PartialSeqScanState *node);
+
+#endif   /* NODEPARTIALSEQSCAN_H */
diff --git a/src/include/executor/tqueue.h b/src/include/executor/tqueue.h
new file mode 100644
index 0000000..c979233
--- /dev/null
+++ b/src/include/executor/tqueue.h
@@ -0,0 +1,34 @@
+/*-------------------------------------------------------------------------
+ *
+ * tqueue.h
+ *	  Use shm_mq to send & receive tuples between parallel backends
+ *
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/tqueue.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef TQUEUE_H
+#define TQUEUE_H
+
+#include "storage/shm_mq.h"
+#include "tcop/dest.h"
+
+/* Use this to send tuples to a shm_mq. */
+extern DestReceiver *CreateTupleQueueDestReceiver(void);
+extern void SetTupleQueueDestReceiverParams(DestReceiver *self,
+						shm_mq_handle *handle);
+
+/* Use these to receive tuples from a shm_mq. */
+typedef struct TupleQueueFunnel TupleQueueFunnel;
+extern TupleQueueFunnel *CreateTupleQueueFunnel(void);
+extern void DestroyTupleQueueFunnel(TupleQueueFunnel *funnel);
+extern void RegisterTupleQueueOnFunnel(TupleQueueFunnel *, shm_mq_handle *);
+extern HeapTuple TupleQueueFunnelNext(TupleQueueFunnel *, bool nowait,
+					 bool *done);
+
+#endif   /* TQUEUE_H */
diff --git a/src/include/executor/tuptable.h b/src/include/executor/tuptable.h
index 48f84bf..e5dec1e 100644
--- a/src/include/executor/tuptable.h
+++ b/src/include/executor/tuptable.h
@@ -127,6 +127,8 @@ typedef struct TupleTableSlot
 	MinimalTuple tts_mintuple;	/* minimal tuple, or NULL if none */
 	HeapTupleData tts_minhdr;	/* workspace for minimal-tuple-only case */
 	long		tts_off;		/* saved state for slot_deform_tuple */
+	bool		tts_fromheap;	/* indicates whether the tuple is fetched from
+								   heap or shrared memory message queue */
 } TupleTableSlot;
 
 #define TTS_HAS_PHYSICAL_TUPLE(slot)  \
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 59b17f3..32e3baf 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -16,7 +16,10 @@
 
 #include "access/genam.h"
 #include "access/heapam.h"
+#include "access/parallel.h"
+#include "access/shmmqam.h"
 #include "executor/instrument.h"
+#include "executor/tqueue.h"
 #include "nodes/params.h"
 #include "nodes/plannodes.h"
 #include "utils/reltrigger.h"
@@ -389,6 +392,12 @@ typedef struct EState
 	List	   *es_auxmodifytables;		/* List of secondary ModifyTableStates */
 
 	/*
+	 * This is required for parallel plan execution to fetch the
+	 * information from dsm.
+	 */
+	shm_toc		*toc;
+
+	/*
 	 * this ExprContext is for per-output-tuple operations, such as constraint
 	 * checks and index-value computations.  It will be reset for each output
 	 * tuple.  Note that it will be created only if needed.
@@ -1213,6 +1222,29 @@ typedef struct ScanState
 typedef ScanState SeqScanState;
 
 /*
+ * PartialSeqScan uses a bare SeqScanState as its state node, since
+ * it needs no additional fields.
+ */
+typedef SeqScanState PartialSeqScanState;
+
+/*
+ * FunnelState extends ScanState by storing additional information
+ * related to parallel workers.
+ *		dsm_segment		dynamic shared memory segment to setup worker queues
+ *		responseq		shared memory queues to receive data from workers
+ */
+typedef struct FunnelState
+{
+	ScanState		ss;				/* its first field is NodeTag */
+	ParallelContext *pcxt;
+	shm_mq_handle	**responseq;
+	worker_result	pss_workerResult;
+	TupleQueueFunnel *funnel;
+	char			*inst_options_space;
+	bool			fs_workersReady;
+} FunnelState;
+
+/*
  * These structs store information about index quals that don't have simple
  * constant right-hand sides.  See comments for ExecIndexBuildScanKeys()
  * for discussion.
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 97ef0fc..6acbe67 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -51,6 +51,8 @@ typedef enum NodeTag
 	T_BitmapOr,
 	T_Scan,
 	T_SeqScan,
+	T_PartialSeqScan,
+	T_Funnel,
 	T_IndexScan,
 	T_IndexOnlyScan,
 	T_BitmapIndexScan,
@@ -97,6 +99,8 @@ typedef enum NodeTag
 	T_BitmapOrState,
 	T_ScanState,
 	T_SeqScanState,
+	T_PartialSeqScanState,
+	T_FunnelState,
 	T_IndexScanState,
 	T_IndexOnlyScanState,
 	T_BitmapIndexScanState,
@@ -217,6 +221,7 @@ typedef enum NodeTag
 	T_IndexOptInfo,
 	T_ParamPathInfo,
 	T_Path,
+	T_FunnelPath,
 	T_IndexPath,
 	T_BitmapHeapPath,
 	T_BitmapAndPath,
diff --git a/src/include/nodes/params.h b/src/include/nodes/params.h
index a0f7dd0..65b60a0 100644
--- a/src/include/nodes/params.h
+++ b/src/include/nodes/params.h
@@ -103,4 +103,9 @@ typedef struct ParamExecData
 /* Functions found in src/backend/nodes/params.c */
 extern ParamListInfo copyParamList(ParamListInfo from);
 
+extern Size
+EstimateBoundParametersSpace(ParamListInfo params);
+extern void
+SerializeBoundParams(ParamListInfo params, Size maxsize, char *start_address);
+extern ParamListInfo RestoreBoundParams(char *start_address);
 #endif   /* PARAMS_H */
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index ac13302..ea8e240 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -20,9 +20,16 @@
 #ifndef PARSENODES_H
 #define PARSENODES_H
 
+#include "executor/instrument.h"
 #include "nodes/bitmapset.h"
+#include "nodes/params.h"
+#include "nodes/plannodes.h"
 #include "nodes/primnodes.h"
 #include "nodes/value.h"
+#include "nodes/params.h"
+#include "storage/block.h"
+#include "storage/shm_toc.h"
+#include "storage/shm_mq.h"
 #include "utils/lockwaitpolicy.h"
 
 /* Possible sources of a Query */
@@ -156,6 +163,16 @@ typedef struct Query
 								 * depends on to be semantically valid */
 } Query;
 
+/* worker statement required for parallel execution. */
+typedef struct ParallelStmt
+{
+	PlannedStmt		*plannedstmt;
+	ParamListInfo	params;
+	shm_toc			*toc;
+	shm_mq_handle	*responseq;
+	int				inst_options;
+	char			*instrument;
+} ParallelStmt;
 
 /****************************************************************************
  *	Supporting data structures for Parse Trees
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index f6683f0..8099f78 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -18,6 +18,8 @@
 #include "lib/stringinfo.h"
 #include "nodes/bitmapset.h"
 #include "nodes/primnodes.h"
+#include "storage/block.h"
+#include "storage/shm_toc.h"
 #include "utils/lockwaitpolicy.h"
 
 
@@ -279,6 +281,22 @@ typedef struct Scan
 typedef Scan SeqScan;
 
 /* ----------------
+ *		partial sequential scan node
+ * ----------------
+ */
+typedef SeqScan PartialSeqScan;
+
+/* ----------------
+ *		parallel sequential scan node
+ * ----------------
+ */
+typedef struct Funnel
+{
+	Scan		scan;
+	int			num_workers;
+} Funnel;
+
+/* ----------------
  *		index scan node
  *
  * indexqualorig is an implicitly-ANDed list of index qual expressions, each
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 6845a40..df1ab5e 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -103,6 +103,8 @@ typedef struct PlannerGlobal
 
 	bool		hasRowSecurity;	/* row security applied? */
 
+	bool		parallelModeNeeded; /* parallel plans need parallelmode */
+
 } PlannerGlobal;
 
 /* macro for fetching the Plan associated with a SubPlan node */
@@ -737,6 +739,13 @@ typedef struct Path
 	/* pathkeys is a List of PathKey nodes; see above */
 } Path;
 
+typedef struct FunnelPath
+{
+	Path		path;
+	Path	    *subpath;	/* path for each worker */
+	int			num_workers;
+} FunnelPath;
+
 /* Macro for extracting a path's parameterization relids; beware double eval */
 #define PATH_REQ_OUTER(path)  \
 	((path)->param_info ? (path)->param_info->ppi_req_outer : (Relids) NULL)
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 9c2000b..11f0409 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -26,6 +26,14 @@
 #define DEFAULT_CPU_TUPLE_COST	0.01
 #define DEFAULT_CPU_INDEX_TUPLE_COST 0.005
 #define DEFAULT_CPU_OPERATOR_COST  0.0025
+#define DEFAULT_CPU_TUPLE_COMM_COST 0.1
+/*
+ * XXX - We need some experiments to know what could be
+ * appropriate default values for parallel setup and startup
+ * cost.
+ */
+#define	DEFAULT_PARALLEL_SETUP_COST  0.0
+#define	DEFAULT_PARALLEL_STARTUP_COST  0.0
 
 #define DEFAULT_EFFECTIVE_CACHE_SIZE  524288	/* measured in pages */
 
@@ -48,8 +56,12 @@ extern PGDLLIMPORT double random_page_cost;
 extern PGDLLIMPORT double cpu_tuple_cost;
 extern PGDLLIMPORT double cpu_index_tuple_cost;
 extern PGDLLIMPORT double cpu_operator_cost;
+extern PGDLLIMPORT double cpu_tuple_comm_cost;
+extern PGDLLIMPORT double parallel_setup_cost;
+extern PGDLLIMPORT double parallel_startup_cost;
 extern PGDLLIMPORT int effective_cache_size;
 extern Cost disable_cost;
+extern int	parallel_seqscan_degree;
 extern bool enable_seqscan;
 extern bool enable_indexscan;
 extern bool enable_indexonlyscan;
@@ -68,6 +80,8 @@ extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,
 					double index_pages, PlannerInfo *root);
 extern void cost_seqscan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 			 ParamPathInfo *param_info);
+extern void cost_funnel(FunnelPath *path, PlannerInfo *root,
+				RelOptInfo *baserel, ParamPathInfo *param_info, int nWorkers);
 extern void cost_index(IndexPath *path, PlannerInfo *root,
 		   double loop_count);
 extern void cost_bitmap_heap_scan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 9923f0e..7873565 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -32,6 +32,11 @@ extern bool add_path_precheck(RelOptInfo *parent_rel,
 
 extern Path *create_seqscan_path(PlannerInfo *root, RelOptInfo *rel,
 					Relids required_outer);
+extern Path *
+create_partialseqscan_path(PlannerInfo *root, RelOptInfo *rel,
+					Relids required_outer);
+extern FunnelPath *create_funnel_path(PlannerInfo *root,
+						RelOptInfo *rel, Path *subpath, int nWorkers);
 extern IndexPath *create_index_path(PlannerInfo *root,
 				  IndexOptInfo *index,
 				  List *indexclauses,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 6cad92e..391d519 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -46,6 +46,13 @@ extern void debug_print_rel(PlannerInfo *root, RelOptInfo *rel);
 #endif
 
 /*
+ * parallelpath.c
+ *	  routines to generate parallel scan paths
+ */
+
+extern void create_parallelscan_paths(PlannerInfo *root, RelOptInfo *rel);
+
+/*
  * indxpath.c
  *	  routines to generate index paths
  */
diff --git a/src/include/optimizer/planner.h b/src/include/optimizer/planner.h
index cd62aec..3b7ed92 100644
--- a/src/include/optimizer/planner.h
+++ b/src/include/optimizer/planner.h
@@ -14,6 +14,7 @@
 #ifndef PLANNER_H
 #define PLANNER_H
 
+#include "nodes/parsenodes.h"
 #include "nodes/plannodes.h"
 #include "nodes/relation.h"
 
@@ -29,6 +30,8 @@ extern PlannedStmt *planner(Query *parse, int cursorOptions,
 		ParamListInfo boundParams);
 extern PlannedStmt *standard_planner(Query *parse, int cursorOptions,
 				 ParamListInfo boundParams);
+extern PlannedStmt	*
+create_worker_scan_plannedstmt(PartialSeqScan *partialscan, List *rangetable);
 
 extern Plan *subquery_planner(PlannerGlobal *glob, Query *parse,
 				 PlannerInfo *parent_root,
diff --git a/src/include/postmaster/backendworker.h b/src/include/postmaster/backendworker.h
new file mode 100644
index 0000000..1d05d79
--- /dev/null
+++ b/src/include/postmaster/backendworker.h
@@ -0,0 +1,39 @@
+/*--------------------------------------------------------------------
+ * backendworker.h
+ *		POSTGRES backend workers interface
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/postmaster/backendworker.h
+ *--------------------------------------------------------------------
+ */
+#ifndef BACKENDWORKER_H
+#define BACKENDWORKER_H
+
+/*---------------------------------------------------------------------
+ * External module API.
+ *---------------------------------------------------------------------
+ */
+
+#include "libpq/pqmq.h"
+
+/* Table-of-contents constants for our dynamic shared memory segment. */
+#define	PARALLEL_KEY_PLANNEDSTMT	0
+#define	PARALLEL_KEY_PARAMS			1
+#define PARALLEL_KEY_INST_OPTIONS	2
+#define PARALLEL_KEY_INST_INFO		3
+#define PARALLEL_KEY_TUPLE_QUEUE	4
+#define PARALLEL_KEY_SCAN			5
+
+extern int	parallel_seqscan_degree;
+
+extern void InitializeParallelWorkers(Plan *plan, EState *estate,
+									  Relation rel, char **inst_options_space,
+									  shm_mq_handle ***responseqp,
+									  ParallelContext **pcxtp,
+									  ParallelHeapScanDesc *pscan,
+									  int nWorkers);
+
+#endif   /* BACKENDWORKER_H */
diff --git a/src/include/tcop/dest.h b/src/include/tcop/dest.h
index 5bcca3f..b560672 100644
--- a/src/include/tcop/dest.h
+++ b/src/include/tcop/dest.h
@@ -94,7 +94,8 @@ typedef enum
 	DestIntoRel,				/* results sent to relation (SELECT INTO) */
 	DestCopyOut,				/* results sent to COPY TO code */
 	DestSQLFunction,			/* results sent to SQL-language func mgr */
-	DestTransientRel			/* results sent to transient relation */
+	DestTransientRel,			/* results sent to transient relation */
+	DestTupleQueue				/* results sent to tuple queue */
 } CommandDest;
 
 /* ----------------
diff --git a/src/include/tcop/tcopprot.h b/src/include/tcop/tcopprot.h
index 3e17770..489af46 100644
--- a/src/include/tcop/tcopprot.h
+++ b/src/include/tcop/tcopprot.h
@@ -84,5 +84,6 @@ extern void set_debug_options(int debug_flag,
 extern bool set_plan_disabling_options(const char *arg,
 						   GucContext context, GucSource source);
 extern const char *get_stats_option_name(const char *arg);
+extern void exec_parallel_stmt(ParallelStmt *parallelscan);
 
 #endif   /* TCOPPROT_H */
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index cf319af..38855e5 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -85,6 +85,7 @@ enum config_group
 	STATS_MONITORING,
 	STATS_COLLECTOR,
 	AUTOVACUUM,
+	PARALLEL_QUERY,
 	CLIENT_CONN,
 	CLIENT_CONN_STATEMENT,
 	CLIENT_CONN_LOCALE,
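
A note on the new cost parameters: the hunks above introduce cpu_tuple_comm_cost, parallel_setup_cost and parallel_startup_cost and declare cost_funnel(), but the body of cost_funnel() is not part of these header changes. The following is only a minimal sketch of one way such parameters could combine. The function name toy_funnel_cost and the way the terms are summed are assumptions made for illustration, not the patch's formula.

```c
#include <assert.h>

/* Defaults copied from the cost.h hunk of the patch. */
#define DEFAULT_CPU_TUPLE_COMM_COST   0.1
#define DEFAULT_PARALLEL_SETUP_COST   0.0
#define DEFAULT_PARALLEL_STARTUP_COST 0.0

/*
 * Hypothetical cost model (an assumption, not the patch's cost_funnel()):
 *   - setup_cost is paid once per parallel scan,
 *   - startup_cost is paid once per worker,
 *   - the serial scan cost is split evenly across the workers,
 *   - comm_cost is paid for every tuple passed back to the master.
 */
double
toy_funnel_cost(double serial_scan_cost, double tuples_returned, int nworkers,
                double setup_cost, double startup_cost, double comm_cost)
{
    assert(nworkers > 0);
    assert(serial_scan_cost >= 0.0 && tuples_returned >= 0.0);

    return setup_cost
        + startup_cost * nworkers
        + serial_scan_cost / nworkers
        + comm_cost * tuples_returned;
}
```

Under these assumptions, a scan with a serial cost of 1000 and 4 workers comes out at roughly 260 when 100 tuples are returned, but at roughly 1250 when 10000 tuples are returned, so the per-tuple communication term would favour parallel scans mainly when the qual is selective.
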
#181Haribabu Kommi
kommi.haribabu@gmail.com
In reply to: Amit Kapila (#180)
Re: Parallel Seq Scan

On Tue, Mar 10, 2015 at 1:38 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Assuming previous patch is in right direction, I have enabled
join support for the patch and done some minor cleanup of
patch which leads to attached new version.

Does this patch handle the cases where the re-scan starts without
finishing the earlier scan?

Regards,
Hari Babu
Fujitsu Australia

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#182Amit Kapila
amit.kapila16@gmail.com
In reply to: Haribabu Kommi (#181)
Re: Parallel Seq Scan

On Tue, Mar 10, 2015 at 6:50 AM, Haribabu Kommi <kommi.haribabu@gmail.com>
wrote:

On Tue, Mar 10, 2015 at 1:38 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Assuming previous patch is in right direction, I have enabled
join support for the patch and done some minor cleanup of
patch which leads to attached new version.

Does this patch handle the cases where the re-scan starts without
finishing the earlier scan?

Do you mean to say cases like ANTI, SEMI Join (in nodeNestLoop.c)
where we scan the next outer tuple and rescan inner table without
completing the previous scan of inner table?

I have currently modelled it on the existing rescan for seqscan
(ExecReScanSeqScan()), which means it will begin the scan again.
Basically, if the workers have already been started/initialized by a
previous scan, it re-initializes them (refer to the function
ExecReScanFunnel() in the patch).

Can you elaborate more if you think current handling is not sufficient
for any case?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#183Haribabu Kommi
kommi.haribabu@gmail.com
In reply to: Amit Kapila (#182)
Re: Parallel Seq Scan

On Tue, Mar 10, 2015 at 3:09 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Mar 10, 2015 at 6:50 AM, Haribabu Kommi <kommi.haribabu@gmail.com>
wrote:

On Tue, Mar 10, 2015 at 1:38 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

Assuming previous patch is in right direction, I have enabled
join support for the patch and done some minor cleanup of
patch which leads to attached new version.

Does this patch handle the cases where the re-scan starts without
finishing the earlier scan?

Do you mean to say cases like ANTI, SEMI Join (in nodeNestLoop.c)
where we scan the next outer tuple and rescan inner table without
completing the previous scan of inner table?

Yes.

I have currently modelled it based on existing rescan for seqscan
(ExecReScanSeqScan()) which means it will begin the scan again.
Basically if the workers are already started/initialized by previous
scan, then re-initialize them (refer function ExecReScanFunnel() in
patch).

Can you elaborate more if you think current handling is not sufficient
for any case?

From the ExecReScanFunnel() function, it seems that the re-scan waits
until all the workers have finished before starting the next scan. Will
the workers stop their current ongoing task? Otherwise, I feel this may
decrease performance instead of improving it.

I am not sure whether this case is already handled: what happens when a
worker is waiting to pass its results while the backend is trying to
start the re-scan?

Regards,
Hari Babu
Fujitsu Australia

#184Amit Kapila
amit.kapila16@gmail.com
In reply to: Haribabu Kommi (#183)
Re: Parallel Seq Scan

On Tue, Mar 10, 2015 at 10:23 AM, Haribabu Kommi <kommi.haribabu@gmail.com>
wrote:

On Tue, Mar 10, 2015 at 3:09 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I have currently modelled it based on existing rescan for seqscan
(ExecReScanSeqScan()) which means it will begin the scan again.
Basically if the workers are already started/initialized by previous
scan, then re-initialize them (refer function ExecReScanFunnel() in
patch).

Can you elaborate more if you think current handling is not sufficient
for any case?

From the ExecReScanFunnel() function, it seems that the re-scan waits
until all the workers have finished before starting the next scan. Will
the workers stop their current ongoing task? Otherwise, I feel this may
decrease performance instead of improving it.

Okay, performance-wise it might affect such a case, but I think we can
handle it by not calling WaitForParallelWorkersToFinish(),
as DestroyParallelContext() will automatically terminate all the workers.

I am not sure whether this case is already handled: what happens when a
worker is waiting to pass its results while the backend is trying to
start the re-scan?

I think stopping/terminating workers should handle such a case.

Thanks for pointing out this case, I will change it in next update.
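
The change proposed above (terminate the workers on rescan instead of waiting for them) can be illustrated with a small toy model. This is a minimal sketch, not code from the patch: ToyFunnelState, toy_launch_workers() and toy_rescan() are hypothetical names, workers are modelled as plain state values, and no real PostgreSQL function such as WaitForParallelWorkersToFinish() or DestroyParallelContext() is called.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_WORKERS 8

/* Hypothetical worker states for this toy model. */
typedef enum
{
    TOY_NOT_STARTED,
    TOY_RUNNING,
    TOY_BLOCKED_ON_QUEUE,   /* waiting for the master to drain its queue */
    TOY_TERMINATED
} ToyWorkerState;

typedef struct ToyFunnelState
{
    ToyWorkerState workers[TOY_MAX_WORKERS];
    int         nworkers;
    bool        workers_ready;  /* plays the role of fs_workersReady */
    int         scans_started;  /* number of times workers were launched */
} ToyFunnelState;

/* Launch (or relaunch) all workers for a fresh scan. */
void
toy_launch_workers(ToyFunnelState *fs)
{
    assert(fs->nworkers >= 0 && fs->nworkers <= TOY_MAX_WORKERS);
    for (int i = 0; i < fs->nworkers; i++)
        fs->workers[i] = TOY_RUNNING;
    fs->workers_ready = true;
    fs->scans_started++;
}

/*
 * Rescan without waiting: every worker is terminated whatever it is
 * doing, including one blocked on a full queue, and the workers are
 * marked as not ready so that the next fetch launches them again.
 */
void
toy_rescan(ToyFunnelState *fs)
{
    if (!fs->workers_ready)
        return;                 /* nothing was launched, nothing to stop */
    for (int i = 0; i < fs->nworkers; i++)
        fs->workers[i] = TOY_TERMINATED;
    fs->workers_ready = false;
}
```

Because toy_rescan() never waits on a worker, a worker stuck in TOY_BLOCKED_ON_QUEUE cannot stall the master, which is the situation raised in the question above.
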

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#185Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#179)
Re: Parallel Seq Scan

On Tue, Mar 3, 2015 at 7:47 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I have modified the patch to introduce a Funnel node (and left child
as PartialSeqScan node). Apart from that, some other noticeable
changes based on feedback include:
a) Master backend forms and sends the planned stmt to each worker;
the earlier patch used to send individual elements and form the planned
stmt in each worker.
b) Passed tuples directly via tuple queue instead of going via
FE-BE protocol.
c) Removed restriction of expressions in target list.
d) Introduced a parallelmodeneeded flag in plannerglobal structure
and set it for Funnel plan.

There is still some work left like integrating with
access-parallel-safety patch (use parallelmodeok flag to decide
whether parallel path can be generated, Enter/Exit parallel mode is still
done during execution of funnel node).

I think these are minor points which can be fixed once we decide
on the other major parts of patch. Find modified patch attached with
this mail.

This is definitely progress. I do think you need to integrate it with
the access-parallel-safety patch. Other comments:

- There's not much code left in shmmqam.c. I think that the remaining
logic should be integrated directly into nodeFunnel.c, with the two
bools in worker_result_state becoming part of the FunnelState. It
doesn't make sense to have a separate structure for two booleans and
20 lines of code. If you were going to keep this file around, I'd
complain about its name and its location in the source tree, too, but
as it is I think we can just get rid of it altogether.

- Something is deeply wrong with the separation of concerns between
nodeFunnel.c and nodePartialSeqscan.c. nodeFunnel.c should work
correctly with *any arbitrary plan tree* as its left child, and that
is clearly not the case right now. shm_getnext() can't just do
heap_getnext(). Instead, it's got to call ExecProcNode() on its left
child and let the left child decide what to do about that. The logic
in InitFunnelRelation() belongs in the parallel seq scan node, not the
funnel. ExecReScanFunnel() cannot be calling heap_parallel_rescan();
it needs to *not know* that there is a parallel scan under it. The
comment in FunnelRecheck is a copy-and-paste from elsewhere that is
not applicable to a generic funnel mode.

- The comment in execAmi.c says "Backward scan is not
suppotted for parallel sequiantel scan". "Sequential" is mis-spelled
here, but I think you should just nuke the whole comment. The funnel
node is not, in the long run, just for parallel sequential scan, so
putting that comment above it is not right. If you want to keep the
comment, it's got to be more general than that somehow, like "parallel
nodes do not support backward scans", but I'd just drop it.

- Can we rename create_worker_scan_plannedstmt to
create_parallel_worker_plannedstmt?

- I *strongly* suggest that, for the first version of this, we remove
all of the tts_fromheap stuff. Let's make no special provision for
returning a tuple stored in a tuple queue; instead, just copy it and
store it in the slot as a pfree-able tuple. That may be slightly less
efficient, but I think it's totally worth it to avoid the complexity
of tinkering with the slot mechanism.

- InstrAggNode claims that we only need the master's information for
statistics other than buffer usage and tuple counts, but is that
really true? The parallel backends can be working on the parallel
part of the plan while the master is doing something else, so the
amount of time the *master* spent in a particular node may not be that
relevant. We might need to think carefully about what it makes sense
to display in the EXPLAIN output in parallel cases.

- The header comment on nodeFunnel.h capitalizes the filename incorrectly.
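
Two of the points above (folding the two booleans into the funnel's own state, and having the funnel pull from its child through a generic interface rather than calling heap_getnext() itself) can be sketched as a toy model. This is only an illustration under stated assumptions: ToyNode, ToyRangeScan, ToyQueue, ToyFunnel and toy_funnel_next() are hypothetical names, tuples are plain ints, and draining the worker queue before running the local child is a simplification chosen here, not something the patch is known to do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for a plan state node: the funnel sees only this interface. */
typedef struct ToyNode ToyNode;
struct ToyNode
{
    bool        (*next) (ToyNode *self, int *tuple);    /* false when done */
};

/* One possible child: yields the integers cur .. hi-1. */
typedef struct ToyRangeScan
{
    ToyNode     base;           /* first member, so ToyNode* casts back */
    int         cur;
    int         hi;
} ToyRangeScan;

bool
toy_range_next(ToyNode *self, int *tuple)
{
    ToyRangeScan *scan = (ToyRangeScan *) self;

    if (scan->cur >= scan->hi)
        return false;
    *tuple = scan->cur++;
    return true;
}

/* Stand-in for tuples that workers have already sent back. */
typedef struct ToyQueue
{
    const int  *items;
    size_t      len;
    size_t      pos;
} ToyQueue;

/* The two flags live in the funnel state, not in a separate struct. */
typedef struct ToyFunnel
{
    ToyNode    *child;
    ToyQueue   *queue;
    bool        all_workers_done;
    bool        local_scan_done;
} ToyFunnel;

/* Return the next tuple from the workers or, after that, from the child. */
bool
toy_funnel_next(ToyFunnel *f, int *tuple)
{
    assert(f != NULL && tuple != NULL);

    if (!f->all_workers_done)
    {
        if (f->queue != NULL && f->queue->pos < f->queue->len)
        {
            *tuple = f->queue->items[f->queue->pos++];
            return true;
        }
        f->all_workers_done = true;
    }
    if (!f->local_scan_done)
    {
        /* The funnel does not know what kind of node the child is. */
        if (f->child != NULL && f->child->next(f->child, tuple))
            return true;
        f->local_scan_done = true;
    }
    return false;
}
```

Since toy_funnel_next() only ever calls child->next(), any other node type that fills in the same function pointer could be placed under the funnel without changing it.
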

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#186Haribabu Kommi
kommi.haribabu@gmail.com
In reply to: Robert Haas (#185)
Re: Parallel Seq Scan

On Wed, Mar 11, 2015 at 6:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Mar 3, 2015 at 7:47 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I have modified the patch to introduce a Funnel node (and left child
as PartialSeqScan node). Apart from that, some other noticeable
changes based on feedback include:
a) Master backend forms and sends the planned stmt to each worker;
the earlier patch used to send individual elements and form the planned
stmt in each worker.
b) Passed tuples directly via tuple queue instead of going via
FE-BE protocol.
c) Removed restriction of expressions in target list.
d) Introduced a parallelmodeneeded flag in plannerglobal structure
and set it for Funnel plan.

There is still some work left like integrating with
access-parallel-safety patch (use parallelmodeok flag to decide
whether parallel path can be generated, Enter/Exit parallel mode is still
done during execution of funnel node).

I think these are minor points which can be fixed once we decide
on the other major parts of patch. Find modified patch attached with
this mail.

- Something is deeply wrong with the separation of concerns between
nodeFunnel.c and nodePartialSeqscan.c. nodeFunnel.c should work
correctly with *any arbitrary plan tree* as its left child, and that
is clearly not the case right now. shm_getnext() can't just do
heap_getnext(). Instead, it's got to call ExecProcNode() on its left
child and let the left child decide what to do about that. The logic
in InitFunnelRelation() belongs in the parallel seq scan node, not the
funnel. ExecReScanFunnel() cannot be calling heap_parallel_rescan();
it needs to *not know* that there is a parallel scan under it. The
comment in FunnelRecheck is a copy-and-paste from elsewhere that is
not applicable to a generic funnel mode.

In the create_parallelscan_paths() function, the funnel path is added
as soon as the partial seq scan path is generated. I feel the funnel
path could instead be added once, on top of the largest possible
parallel path in the entire query path.

Is this the right patch to add such support also?

Regards,
Hari Babu
Fujitsu Australia

#187Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Amit Kapila (#182)
Re: Parallel Seq Scan

On 10-03-2015 PM 01:09, Amit Kapila wrote:

On Tue, Mar 10, 2015 at 6:50 AM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:

Does this patch handle the cases where the re-scan starts without
finishing the earlier scan?

Do you mean to say cases like ANTI, SEMI Join (in nodeNestLoop.c)
where we scan the next outer tuple and rescan inner table without
completing the previous scan of inner table?

I have currently modelled it based on existing rescan for seqscan
(ExecReScanSeqScan()) which means it will begin the scan again.
Basically if the workers are already started/initialized by previous
scan, then re-initialize them (refer function ExecReScanFunnel() in
patch).

From Robert's description [1], it looked like the NestLoop with Funnel would
have Funnel as either outer plan or topmost plan node or NOT a parameterised
plan. In that case, would this case arise or am I missing something?

Thanks,
Amit

[1]: /messages/by-id/CA+TgmobM7X6jgre442638b+33h1EWa=vcZqnsvzEdX057ZHVuw@mail.gmail.com

#188Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#185)
1 attachment(s)
Re: Parallel Seq Scan

On Wed, Mar 11, 2015 at 1:01 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Mar 3, 2015 at 7:47 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

There is still some work left like integrating with
access-parallel-safety patch (use parallelmodeok flag to decide
whether parallel path can be generated, Enter/Exit parallel mode is still
done during execution of funnel node).

I think these are minor points which can be fixed once we decide
on the other major parts of patch. Find modified patch attached with
this mail.

This is definitely progress. I do think you need to integrate it with
the access-parallel-safety patch.

I have tried, but there are a couple of failures while applying the latest
access-parallel-safety patch, so I have left it as it is for now.

Other comments:

- There's not much code left in shmmqam.c. I think that the remaining
logic should be integrated directly into nodeFunnel.c, with the two
bools in worker_result_state becoming part of the FunnelState. It
doesn't make sense to have a separate structure for two booleans and
20 lines of code. If you were going to keep this file around, I'd
complain about its name and its location in the source tree, too, but
as it is I think we can just get rid of it altogether.

Agreed. Moved the code/logic to nodeFunnel.c

- Something is deeply wrong with the separation of concerns between
nodeFunnel.c and nodePartialSeqscan.c. nodeFunnel.c should work
correctly with *any arbitrary plan tree* as its left child, and that
is clearly not the case right now. shm_getnext() can't just do
heap_getnext(). Instead, it's got to call ExecProcNode() on its left
child and let the left child decide what to do about that.

Agreed and made the required changes.
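
The separation being asked for can be sketched in a few lines of standalone C. This is a toy model under my own assumptions, not the patch: ToyNode, ToyRange and ToyFunnel are hypothetical types, a "tuple" is just an int, and the worker queue is a plain array that is always ready (so it gets drained first, whereas the real funnel can interleave a non-blocking queue read with the local scan). What it shows is that the funnel only calls its child's "next" callback and never looks at what kind of node the child is.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical plan-state node: the parent only sees a "next" callback. */
typedef struct ToyNode ToyNode;
struct ToyNode
{
	int		(*next) (ToyNode *self, int *out);	/* 1 = produced a tuple */
	void   *state;								/* private to the node */
};

/* One possible child: yields the integers in [cur, end). */
typedef struct
{
	int		cur;
	int		end;
} ToyRange;

static int
toy_range_next(ToyNode *self, int *out)
{
	ToyRange   *r = (ToyRange *) self->state;

	if (r->cur >= r->end)
		return 0;
	*out = r->cur++;
	return 1;
}

/* Toy funnel: one "worker queue" (an array) plus an arbitrary local child. */
typedef struct
{
	const int  *queue;
	size_t		queue_len;
	size_t		queue_pos;
	ToyNode    *child;
	int			all_workers_done;
	int			local_scan_done;
} ToyFunnel;

static int
toy_funnel_next(ToyFunnel *f, int *out)
{
	assert(f != NULL && out != NULL);

	while (!f->all_workers_done || !f->local_scan_done)
	{
		if (!f->all_workers_done)
		{
			if (f->queue_pos < f->queue_len)
			{
				*out = f->queue[f->queue_pos++];
				return 1;
			}
			f->all_workers_done = 1;
		}
		if (!f->local_scan_done)
		{
			/* The funnel does not know what kind of node the child is. */
			if (f->child->next(f->child, out))
				return 1;
			f->local_scan_done = 1;
		}
	}
	return 0;					/* both sources are exhausted */
}
```

Swapping toy_range_next for any other callback with the same signature needs no change in toy_funnel_next, which is the property wanted from the funnel node here.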

The logic
in InitFunnelRelation() belongs in the parallel seq scan node, not the
funnel.

I think we should retain initialization of parallelcontext in InitFunnel().
Apart from that, I have moved other stuff to partial seq scan node.

ExecReScanFunnel() cannot be calling heap_parallel_rescan();
it needs to *not know* that there is a parallel scan under it.

Agreed. I think it is better to do that as part of the partial seq scan
node.

The
comment in FunnelRecheck is a copy-and-paste from elsewhere that is
not applicable to a generic funnel mode.

With new changes, this API is not required.

- The comment in execAmi.c says "Backward scan is not
suppotted for parallel sequiantel scan". "Sequential" is mis-spelled
here, but I think you should just nuke the whole comment. The funnel
node is not, in the long run, just for parallel sequential scan, so
putting that comment above it is not right. If you want to keep the
comment, it's got to be more general than that somehow, like "parallel
nodes do not support backward scans", but I'd just drop it.

- Can we rename create_worker_scan_plannedstmt to
create_parallel_worker_plannedstmt?

Agreed and changed as per suggestion.

- I *strongly* suggest that, for the first version of this, we remove
all of the tts_fromheap stuff. Let's make no special provision for
returning a tuple stored in a tuple queue; instead, just copy it and
store it in the slot as a pfree-able tuple. That may be slightly less
efficient, but I think it's totally worth it to avoid the complexity
of tinkering with the slot mechanism.

Sure, removed (tts_fromheap becomes redundant with new changes).
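
As an illustration of the "just copy it" approach, here is a small standalone sketch. It is a toy model with hypothetical names (ToySlot, toy_slot_store_copy, toy_slot_clear) and uses plain malloc/free rather than the backend's memory contexts, so it is not how the slot code actually looks. The idea it shows: the slot takes a private copy of the bytes read from the queue and owns that copy, so the queue buffer can be reused immediately and the slot frees the copy when it is cleared or overwritten.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical tuple slot that may own the memory it points to. */
typedef struct
{
	char	   *data;			/* tuple bytes, or NULL when empty */
	size_t		len;
	int			should_free;	/* nonzero when the slot owns data */
} ToySlot;

/* Empty the slot, releasing the tuple if the slot owns it. */
static void
toy_slot_clear(ToySlot *slot)
{
	assert(slot != NULL);
	if (slot->should_free)
		free(slot->data);
	slot->data = NULL;
	slot->len = 0;
	slot->should_free = 0;
}

/*
 * Copy len bytes out of the queue buffer and store the copy in the slot.
 * Returns 0 on success, -1 if the allocation fails (the slot is then
 * left unchanged).
 */
static int
toy_slot_store_copy(ToySlot *slot, const void *src, size_t len)
{
	char	   *copy;

	assert(slot != NULL && src != NULL && len > 0);

	copy = (char *) malloc(len);
	if (copy == NULL)
		return -1;
	memcpy(copy, src, len);

	toy_slot_clear(slot);		/* drop whatever was stored before */
	slot->data = copy;
	slot->len = len;
	slot->should_free = 1;
	return 0;
}
```

The cost is one extra copy per tuple, which is the "slightly less efficient" trade-off mentioned above.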

- InstrAggNode claims that we only need the master's information for
statistics other than buffer usage and tuple counts, but is that
really true? The parallel backends can be working on the parallel
part of the plan while the master is doing something else, so the
amount of time the *master* spent in a particular node may not be that
relevant.

Yes, but don't other nodes also work this way? For example, a join node
will display the accumulated stats for buffer usage, but for timing it will
just use the time for that node (which automatically includes some part of
the execution of child nodes, but is not a direct accumulation).
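
A small standalone sketch of that aggregation rule, for discussion only. ToyInstr and toy_instr_agg are hypothetical names and the struct has far fewer fields than the real instrumentation struct. It sums the tuple, filter and buffer counters of a worker into the master's entry and deliberately leaves the master's timing alone, which is exactly the behaviour being questioned above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, much reduced instrumentation record. */
typedef struct
{
	double		total_time_ms;		/* time spent in the node by this backend */
	long		tuplecount;			/* tuples emitted */
	long		nfiltered;			/* tuples removed by the filter */
	long		shared_blks_hit;	/* buffer hits */
	long		shared_blks_read;	/* buffer reads */
} ToyInstr;

/* master += worker for the counters; timing stays the master's own. */
static void
toy_instr_agg(ToyInstr *master, const ToyInstr *worker)
{
	assert(master != NULL && worker != NULL);

	master->tuplecount += worker->tuplecount;
	master->nfiltered += worker->nfiltered;
	master->shared_blks_hit += worker->shared_blks_hit;
	master->shared_blks_read += worker->shared_blks_read;
	/* total_time_ms is intentionally not accumulated */
}
```

With this rule the counters on the aggregated node cover all backends, while the time shown is still only what the master itself spent in the node.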

We might need to think carefully about what it makes sense
to display in the EXPLAIN output in parallel cases.

Currently the EXPLAIN output for a parallel scan on a relation displays the
Funnel node, which contains the aggregated stats of all workers and the
number of workers, and the Partial Seq Scan node, which contains the stats
for the scan done by the master backend. Do we want to display something
more?

Current result of Explain statement is as below:
postgres=# explain (analyze,buffers) select c1 from t1 where c1 > 90000;
                                  QUERY PLAN
-----------------------------------------------------------------------------
 Funnel on t1  (cost=0.00..43750.44 rows=9905 width=4) (actual time=1097.236..1530.416 rows=10000 loops=1)
   Filter: (c1 > 90000)
   Rows Removed by Filter: 65871
   Number of Workers: 2
   Buffers: shared hit=96 read=99905
   ->  Partial Seq Scan on t1  (cost=0.00..101251.01 rows=9905 width=4) (actual time=1096.188..1521.810 rows=2342 loops=1)
         Filter: (c1 > 90000)
         Rows Removed by Filter: 24130
         Buffers: shared hit=33 read=26439
 Planning time: 0.143 ms
 Execution time: 1533.438 ms
(11 rows)

- The header comment on nodeFunnel.h capitalizes the filename incorrectly.

Changed.

One additional change (we need to call SetLatch() in
HandleParallelMessageInterrupt) is done to handle the hang issue reported
on the parallel-mode thread. Without this change it is difficult to verify
the patch (I will remove this change once a new version of the
parallel-mode patch containing it is posted).
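
To show why setting the latch in the handler matters, here is a minimal standalone sketch using only ISO C signals. It is a toy model, not the backend's latch implementation: the toy_* names are hypothetical, the "latch" is just a flag, and the "wait" does not actually sleep. The handler saves and restores errno, records that a message is pending, and sets the latch; without the last step, a main loop that is waiting on the latch would have no reason to wake up and look at the pending flag.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>

static volatile sig_atomic_t toy_message_pending = 0;
static volatile sig_atomic_t toy_latch_is_set = 0;

/* Stand-in for setting the process latch: only flips a flag. */
static void
toy_set_latch(void)
{
	toy_latch_is_set = 1;
}

/* Signal handler: note the pending message, then wake the main loop. */
static void
toy_handle_message_interrupt(int signo)
{
	int			save_errno = errno;

	(void) signo;
	toy_message_pending = 1;
	toy_set_latch();			/* without this the waiter keeps sleeping */
	errno = save_errno;
}

/*
 * Main-loop side.  A real wait would sleep until the latch is set; this
 * toy version only polls.  Returns 1 if a pending message was consumed.
 */
static int
toy_wait_and_consume(void)
{
	if (!toy_latch_is_set)
		return 0;
	toy_latch_is_set = 0;
	/* In this toy, only the handler sets the latch, after the flag. */
	assert(toy_message_pending);
	toy_message_pending = 0;
	return 1;
}
```

The handler sets the pending flag before the latch, so whenever the main loop sees the latch set it can rely on the flag being visible as well.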

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

parallel_seqscan_v10.patch (application/octet-stream)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 383e15b..d384e8f 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1582,6 +1582,20 @@ heap_beginscan_parallel(Relation relation, ParallelHeapScanDesc parallel_scan)
 }
 
 /* ----------------
+ *		heap_parallel_rescan		- restart a parallel relation scan
+ * ----------------
+ */
+void
+heap_parallel_rescan(ParallelHeapScanDesc pscan,
+					 HeapScanDesc scan)
+{
+	if (pscan != NULL)
+		scan->rs_parallel = pscan;
+
+	heap_rescan(scan,			/* scan desc */
+				NULL);			/* new scan keys */
+}
+/* ----------------
  *		heap_getnext	- retrieve next tuple in scan
  *
  *		Fix to work with index relations.
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 8d39bf2..d2817b2 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -535,6 +535,8 @@ HandleParallelMessageInterrupt(void)
 	InterruptPending = true;
 	ParallelMessagePending = true;
 
+	SetLatch(MyLatch);
+
 	errno = save_errno;
 }
 
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index a951c55..f8acfc8 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -721,6 +721,8 @@ ExplainPreScanNode(PlanState *planstate, Bitmapset **rels_used)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
+		case T_Funnel:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -916,6 +918,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_SeqScan:
 			pname = sname = "Seq Scan";
 			break;
+		case T_PartialSeqScan:
+			pname = sname = "Partial Seq Scan";
+			break;
+		case T_Funnel:
+			pname = sname = "Funnel";
+			break;
 		case T_IndexScan:
 			pname = sname = "Index Scan";
 			break;
@@ -1065,6 +1073,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
+		case T_Funnel:
 		case T_BitmapHeapScan:
 		case T_TidScan:
 		case T_SubqueryScan:
@@ -1206,6 +1216,24 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	}
 
 	/*
+	 * Aggregate instrumentation information of all the backend
+	 * workers for parallel sequence scan.
+	 */
+	if (es->analyze && nodeTag(plan) == T_Funnel)
+	{
+		int i;
+		Instrumentation *instrument_worker;
+		int nworkers = ((FunnelState *)planstate)->pcxt->nworkers;
+		char *inst_info_workers = ((FunnelState *)planstate)->inst_options_space;
+
+		for (i = 0; i < nworkers; i++)
+		{
+			instrument_worker = (Instrumentation *)(inst_info_workers + (i * sizeof(Instrumentation)));
+			InstrAggNode(planstate->instrument, instrument_worker);
+		}
+	}
+
+	/*
 	 * We have to forcibly clean up the instrumentation state because we
 	 * haven't done ExecutorEnd yet.  This is pretty grotty ...
 	 *
@@ -1322,6 +1350,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 				show_tidbitmap_info((BitmapHeapScanState *) planstate, es);
 			break;
 		case T_SeqScan:
+		case T_PartialSeqScan:
 		case T_ValuesScan:
 		case T_CteScan:
 		case T_WorkTableScan:
@@ -1331,6 +1360,14 @@ ExplainNode(PlanState *planstate, List *ancestors,
 				show_instrumentation_count("Rows Removed by Filter", 1,
 										   planstate, es);
 			break;
+		case T_Funnel:
+			show_scan_qual(plan->qual, "Filter", planstate, ancestors, es);
+			if (plan->qual)
+				show_instrumentation_count("Rows Removed by Filter", 1,
+										   planstate, es);
+			ExplainPropertyInteger("Number of Workers",
+				((Funnel *) plan)->num_workers, es);
+			break;
 		case T_FunctionScan:
 			if (es->verbose)
 			{
@@ -2214,6 +2251,8 @@ ExplainTargetRel(Plan *plan, Index rti, ExplainState *es)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
+		case T_Funnel:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index af707b0..991ff51 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -16,14 +16,15 @@ OBJS = execAmi.o execCurrent.o execGrouping.o execJunk.o execMain.o \
        execProcnode.o execQual.o execScan.o execTuples.o \
        execUtils.o functions.o instrument.o nodeAppend.o nodeAgg.o \
        nodeBitmapAnd.o nodeBitmapOr.o \
-       nodeBitmapHeapscan.o nodeBitmapIndexscan.o nodeCustom.o nodeHash.o \
-       nodeHashjoin.o nodeIndexscan.o nodeIndexonlyscan.o \
+       nodeBitmapHeapscan.o nodeBitmapIndexscan.o nodeCustom.o nodeFunnel.o \
+       nodeHash.o nodeHashjoin.o nodeIndexscan.o nodeIndexonlyscan.o \
        nodeLimit.o nodeLockRows.o \
        nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
        nodeNestloop.o nodeFunctionscan.o nodeRecursiveunion.o nodeResult.o \
-       nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
-       nodeValuesscan.o nodeCtescan.o nodeWorktablescan.o \
+       nodeSeqscan.o nodePartialSeqscan.o nodeSetOp.o nodeSort.o \
+       nodeUnique.o nodeValuesscan.o nodeCtescan.o nodeWorktablescan.o \
        nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
-       nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o spi.o
+       nodeForeignscan.o nodeWindowAgg.o tqueue.o tstoreReceiver.o \
+       spi.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 6ebad2f..10dc319 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -24,6 +24,7 @@
 #include "executor/nodeCustom.h"
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeFunctionscan.h"
+#include "executor/nodeFunnel.h"
 #include "executor/nodeGroup.h"
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
@@ -37,6 +38,7 @@
 #include "executor/nodeMergejoin.h"
 #include "executor/nodeModifyTable.h"
 #include "executor/nodeNestloop.h"
+#include "executor/nodePartialSeqscan.h"
 #include "executor/nodeRecursiveunion.h"
 #include "executor/nodeResult.h"
 #include "executor/nodeSeqscan.h"
@@ -155,6 +157,14 @@ ExecReScan(PlanState *node)
 			ExecReScanSeqScan((SeqScanState *) node);
 			break;
 
+		case T_PartialSeqScanState:
+			ExecReScanPartialSeqScan((PartialSeqScanState *) node);
+			break;
+
+		case T_FunnelState:
+			ExecReScanFunnel((FunnelState *) node);
+			break;
+
 		case T_IndexScanState:
 			ExecReScanIndexScan((IndexScanState *) node);
 			break;
@@ -458,6 +468,10 @@ ExecSupportsBackwardScan(Plan *node)
 		case T_CteScan:
 			return TargetListSupportsBackwardScan(node->targetlist);
 
+		case T_Funnel:
+		case T_PartialSeqScan:
+			return false;
+
 		case T_IndexScan:
 			return IndexSupportsBackwardScan(((IndexScan *) node)->indexid) &&
 				TargetListSupportsBackwardScan(node->targetlist);
diff --git a/src/backend/executor/execCurrent.c b/src/backend/executor/execCurrent.c
index 1c8be25..f13b7bcb 100644
--- a/src/backend/executor/execCurrent.c
+++ b/src/backend/executor/execCurrent.c
@@ -261,6 +261,8 @@ search_plan_tree(PlanState *node, Oid table_oid)
 			 * Relation scan nodes can all be treated alike
 			 */
 		case T_SeqScanState:
+		case T_PartialSeqScanState:
+		case T_FunnelState:
 		case T_IndexScanState:
 		case T_IndexOnlyScanState:
 		case T_BitmapHeapScanState:
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 07526e8..9a3e285 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -181,6 +181,8 @@ standard_ExecutorStart(QueryDesc *queryDesc, int eflags)
 		estate->es_param_exec_vals = (ParamExecData *)
 			palloc0(queryDesc->plannedstmt->nParamExec * sizeof(ParamExecData));
 
+	estate->toc = queryDesc->toc;
+
 	/*
 	 * If non-read-only query, set the command ID to mark output tuples with
 	 */
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 9892499..1a1275c 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -100,6 +100,8 @@
 #include "executor/nodeMergejoin.h"
 #include "executor/nodeModifyTable.h"
 #include "executor/nodeNestloop.h"
+#include "executor/nodePartialSeqscan.h"
+#include "executor/nodeFunnel.h"
 #include "executor/nodeRecursiveunion.h"
 #include "executor/nodeResult.h"
 #include "executor/nodeSeqscan.h"
@@ -190,6 +192,16 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												   estate, eflags);
 			break;
 
+		case T_PartialSeqScan:
+			result = (PlanState *) ExecInitPartialSeqScan((PartialSeqScan *) node,
+														  estate, eflags);
+			break;
+
+		case T_Funnel:
+			result = (PlanState *) ExecInitFunnel((Funnel *) node,
+												  estate, eflags);
+			break;
+
 		case T_IndexScan:
 			result = (PlanState *) ExecInitIndexScan((IndexScan *) node,
 													 estate, eflags);
@@ -406,6 +418,14 @@ ExecProcNode(PlanState *node)
 			result = ExecSeqScan((SeqScanState *) node);
 			break;
 
+		case T_PartialSeqScanState:
+			result = ExecPartialSeqScan((PartialSeqScanState *) node);
+			break;
+
+		case T_FunnelState:
+			result = ExecFunnel((FunnelState *) node);
+			break;
+
 		case T_IndexScanState:
 			result = ExecIndexScan((IndexScanState *) node);
 			break;
@@ -644,6 +664,14 @@ ExecEndNode(PlanState *node)
 			ExecEndSeqScan((SeqScanState *) node);
 			break;
 
+		case T_PartialSeqScanState:
+			ExecEndPartialSeqScan((PartialSeqScanState *) node);
+			break;
+
+		case T_FunnelState:
+			ExecEndFunnel((FunnelState *) node);
+			break;
+
 		case T_IndexScanState:
 			ExecEndIndexScan((IndexScanState *) node);
 			break;
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index 022041b..79eeaee 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -145,6 +145,8 @@ CreateExecutorState(void)
 
 	estate->es_auxmodifytables = NIL;
 
+	estate->toc = NULL;
+
 	estate->es_per_tuple_exprcontext = NULL;
 
 	estate->es_epqTuple = NULL;
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index f5351eb..56e509d 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -21,6 +21,8 @@ BufferUsage pgBufferUsage;
 
 static void BufferUsageAccumDiff(BufferUsage *dst,
 					 const BufferUsage *add, const BufferUsage *sub);
+static void
+BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
 
 
 /* Allocate new instrumentation structure(s) */
@@ -127,6 +129,28 @@ InstrEndLoop(Instrumentation *instr)
 	instr->tuplecount = 0;
 }
 
+/*
+ * Aggregate the instrumentation information.  This is used
+ * to aggregate the information of worker backends.  We only
+ * need to sum the buffer usage and tuple count statistics as
+ * for other timing related statistics it is sufficient to
+ * have the master backend's information.
+ */
+void
+InstrAggNode(Instrumentation *instr1, Instrumentation *instr2)
+{
+	/* count the returned tuples */
+	instr1->tuplecount += instr2->tuplecount;
+
+	instr1->nfiltered1 += instr2->nfiltered1;
+	instr1->nfiltered2 += instr2->nfiltered2;
+
+	/* Add delta of buffer usage since entry to node's totals */
+	if (instr1->need_bufusage)
+		BufferUsageAdd(&instr1->bufusage, &instr2->bufusage);
+
+}
+
 /* dst += add - sub */
 static void
 BufferUsageAccumDiff(BufferUsage *dst,
@@ -148,3 +172,21 @@ BufferUsageAccumDiff(BufferUsage *dst,
 	INSTR_TIME_ACCUM_DIFF(dst->blk_write_time,
 						  add->blk_write_time, sub->blk_write_time);
 }
+
+/* dst += add */
+static void
+BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
+{
+	dst->shared_blks_hit += add->shared_blks_hit;
+	dst->shared_blks_read += add->shared_blks_read;
+	dst->shared_blks_dirtied += add->shared_blks_dirtied;
+	dst->shared_blks_written += add->shared_blks_written;
+	dst->local_blks_hit += add->local_blks_hit;
+	dst->local_blks_read += add->local_blks_read;
+	dst->local_blks_dirtied += add->local_blks_dirtied;
+	dst->local_blks_written += add->local_blks_written;
+	dst->temp_blks_read += add->temp_blks_read;
+	dst->temp_blks_written += add->temp_blks_written;
+	INSTR_TIME_ADD(dst->blk_read_time, add->blk_read_time);
+	INSTR_TIME_ADD(dst->blk_write_time, add->blk_write_time);
+}
diff --git a/src/backend/executor/nodeFunnel.c b/src/backend/executor/nodeFunnel.c
new file mode 100644
index 0000000..23f0245
--- /dev/null
+++ b/src/backend/executor/nodeFunnel.c
@@ -0,0 +1,376 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeFunnel.c
+ *	  Support routines for parallel sequential scans of relations.
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeFunnel.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * INTERFACE ROUTINES
+ *		ExecFunnel				scans a relation.
+ *		ExecInitFunnel			creates and initializes a funnel node.
+ *		ExecEndFunnel			releases any storage allocated.
+ *		ExecReScanFunnel		rescans a relation
+ */
+#include "postgres.h"
+
+#include "access/relscan.h"
+#include "access/xact.h"
+#include "commands/dbcommands.h"
+#include "executor/execdebug.h"
+#include "executor/nodeSeqscan.h"
+#include "executor/nodeFunnel.h"
+#include "postmaster/backendworker.h"
+#include "utils/rel.h"
+
+
+static TupleTableSlot *funnel_getnext(FunnelState *funnelstate);
+
+/* ----------------------------------------------------------------
+ *						Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		InitFunnel
+ *
+ *		Set up parallel state information
+ * ----------------------------------------------------------------
+ */
+static void
+InitFunnel(FunnelState *node, EState *estate, int eflags)
+{
+	Relation	currentRelation;
+
+	/*
+	 * get the relation object id from the relid'th entry in the range table,
+	 * open that relation and acquire appropriate lock on it.
+	 */
+	currentRelation = ExecOpenScanRelation(estate,
+										   ((SeqScan *) node->ss.ps.plan)->scanrelid,
+										   eflags);
+
+	/* Initialize the workers required to perform parallel scan. */
+	InitializeParallelWorkers(node->ss.ps.plan->lefttree,
+							  estate,
+							  currentRelation,
+							  &node->inst_options_space,
+							  &node->responseq,
+							  &node->pcxt,
+							  ((Funnel *)(node->ss.ps.plan))->num_workers);
+
+	estate->toc = node->pcxt->toc;
+
+	node->ss.ss_currentRelation = currentRelation;
+
+	/* and report the scan tuple slot's rowtype */
+	ExecAssignScanType(&node->ss, RelationGetDescr(currentRelation));
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitFunnel
+ * ----------------------------------------------------------------
+ */
+FunnelState *
+ExecInitFunnel(Funnel *node, EState *estate, int eflags)
+{
+	FunnelState *funnelstate;
+
+	/*
+	 * Once upon a time it was possible to have an outerPlan of a SeqScan, but
+	 * not any more.
+	 */
+	Assert(outerPlan(node) == NULL);
+	Assert(innerPlan(node) == NULL);
+
+	/*
+	 * create state structure
+	 */
+	funnelstate = makeNode(FunnelState);
+	funnelstate->ss.ps.plan = (Plan *) node;
+	funnelstate->ss.ps.state = estate;
+	funnelstate->fs_workersReady = false;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * create expression context for node
+	 */
+	ExecAssignExprContext(estate, &funnelstate->ss.ps);
+
+	/*
+	 * initialize child expressions
+	 */
+	funnelstate->ss.ps.targetlist = (List *)
+		ExecInitExpr((Expr *) node->scan.plan.targetlist,
+					 (PlanState *) funnelstate);
+	funnelstate->ss.ps.qual = (List *)
+		ExecInitExpr((Expr *) node->scan.plan.qual,
+					 (PlanState *) funnelstate);
+
+	/*
+	 * tuple table initialization
+	 */
+	ExecInitResultTupleSlot(estate, &funnelstate->ss.ps);
+	ExecInitScanTupleSlot(estate, &funnelstate->ss);
+
+	InitFunnel(funnelstate, estate, eflags);
+
+	/*
+	 * now initialize outer plan
+	 */
+	outerPlanState(funnelstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+
+	funnelstate->ss.ps.ps_TupFromTlist = false;
+
+	/*
+	 * Initialize result tuple type and projection info.
+	 */
+	ExecAssignResultTypeFromTL(&funnelstate->ss.ps);
+	ExecAssignScanProjectionInfo(&funnelstate->ss);
+
+	/* Initialize scan state of workers. */
+	funnelstate->all_workers_done = false;
+	funnelstate->local_scan_done = false;
+
+	return funnelstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecFunnel(node)
+ *
+ *		Scans the relation via multiple workers and returns
+ *		the next qualifying tuple.
+ * ----------------------------------------------------------------
+ */
+TupleTableSlot *
+ExecFunnel(FunnelState *node)
+{
+	int			i;
+
+	/*
+	 * if parallel context is set and workers are not
+	 * registered, register them now.
+	 */
+	if (node->pcxt && !node->fs_workersReady)
+	{
+		/* Register backend workers. */
+		LaunchParallelWorkers(node->pcxt);
+
+		node->funnel = CreateTupleQueueFunnel();
+
+		for (i = 0; i < node->pcxt->nworkers; ++i)
+		{
+			 shm_mq_set_handle((node->responseq)[i], node->pcxt->worker[i].bgwhandle);
+			 RegisterTupleQueueOnFunnel(node->funnel, (node->responseq)[i]);
+		}
+
+		node->fs_workersReady = true;
+	}
+	
+	return funnel_getnext(node);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndFunnel
+ *
+ *		frees any storage allocated through C routines.
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndFunnel(FunnelState *node)
+{
+	Relation	relation;
+
+	relation = node->ss.ss_currentRelation;
+
+	/*
+	 * Free the exprcontext
+	 */
+	ExecFreeExprContext(&node->ss.ps);
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+
+	/*
+	 * close the heap relation.
+	 */
+	ExecCloseScanRelation(relation);
+
+	ExecEndNode(outerPlanState(node));
+
+	if (node->pcxt && node->fs_workersReady)
+	{
+		/*
+		 * Ensure all workers have finished before destroying the parallel
+		 * context to ensure a clean exit.
+		 */
+		WaitForParallelWorkersToFinish(node->pcxt);
+
+		/* destroy the tuple queue */
+		DestroyTupleQueueFunnel(node->funnel);
+
+		/* destroy parallel context. */
+		DestroyParallelContext(node->pcxt);
+
+		ExitParallelMode();
+	}
+	else if (node->pcxt)
+	{
+		int i;
+
+		/*
+		 * We only need to free the memory allocated to initialize
+		 * parallel workers as workers are still not started.
+		 */
+		dlist_delete(&node->pcxt->node);
+
+		for (i = 0; i < node->pcxt->nworkers; ++i)
+		{
+			if (node->pcxt->worker[i].error_mqh != NULL)
+			{
+				pfree(node->pcxt->worker[i].error_mqh);
+				node->pcxt->worker[i].error_mqh = NULL;
+			}
+		}
+		
+		/*
+		 * If we have allocated a shared memory segment, detach it.  This will
+		 * implicitly detach the error queues, and any other shared memory
+		 * queues, stored there.
+		 */
+		if (node->pcxt->seg != NULL)
+			dsm_detach(node->pcxt->seg);
+
+		/* Free the worker array itself. */
+		pfree(node->pcxt->worker);
+		node->pcxt->worker = NULL;
+
+		/* Free memory. */
+		pfree(node->pcxt);
+
+		ExitParallelMode();
+	}
+}
+
+/* ----------------------------------------------------------------
+ *						Join Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecReScanFunnel
+ *
+ *		Rescans a relation.
+ * ----------------------------------------------------------------
+ */
+void
+ExecReScanFunnel(FunnelState *node)
+{
+	EState	   *estate = node->ss.ps.state;
+
+	/*
+	 * Re-initialize the parallel context and workers to perform
+	 * rescan of relation.
+	 */
+	if (node->fs_workersReady)
+	{
+		/*
+		 * Ensure all workers have finished before destroying the parallel
+		 * context to ensure a clean exit.
+		 */
+		WaitForParallelWorkersToFinish(node->pcxt);
+
+		/* destroy the tuple queue */
+		DestroyTupleQueueFunnel(node->funnel);
+
+		/* destroy parallel context. */
+		DestroyParallelContext(node->pcxt);
+
+		/* Initialize the workers required to perform parallel scan. */
+		InitializeParallelWorkers(node->ss.ps.plan->lefttree,
+								  estate,
+								  node->ss.ss_currentRelation,
+								  &node->inst_options_space,
+								  &node->responseq,
+								  &node->pcxt,
+								  ((Funnel *)(node->ss.ps.plan))->num_workers);
+
+		node->fs_workersReady = false;
+		node->all_workers_done = false;
+		node->local_scan_done = false;
+	}
+
+	estate->toc = node->pcxt->toc;
+
+	ExecReScan(node->ss.ps.lefttree);
+}
+
+/*
+ * funnel_getnext
+ *
+ *	Get the next tuple from shared memory queue.  This function
+ *	is responsible for fetching tuples from all the queues associated
+ *	with worker backends used in funnel scan and if there is no
+ *  data available from queues, it does fetch the data from local
+ *	node.
+ */
+TupleTableSlot *
+funnel_getnext(FunnelState *funnelstate)
+{
+	PlanState		*outerPlan;
+	TupleTableSlot	*outerTupleSlot;
+	TupleTableSlot	*slot;
+	HeapTuple		tup;
+
+	if (funnelstate->ss.ps.ps_ProjInfo)
+		slot = funnelstate->ss.ps.ps_ProjInfo->pi_slot;
+	else
+		slot = funnelstate->ss.ss_ScanTupleSlot;
+
+	while (!funnelstate->all_workers_done || !funnelstate->local_scan_done)
+	{
+		if (!funnelstate->all_workers_done)
+		{
+			/* wait only if local scan is done */
+			tup = TupleQueueFunnelNext(funnelstate->funnel,
+									   !funnelstate->local_scan_done,
+									   &funnelstate->all_workers_done);
+
+			if (HeapTupleIsValid(tup))
+			{
+				ExecStoreTuple(tup,		/* tuple to store */
+							   slot,	/* slot to store in */
+							   InvalidBuffer, /* buffer associated with this
+											   * tuple */
+							   true);	/* pfree this pointer if not from heap */
+
+				return slot;
+			}
+		}
+		if (!funnelstate->local_scan_done)
+		{
+			outerPlan = outerPlanState(funnelstate);
+
+			outerTupleSlot = ExecProcNode(outerPlan);
+
+			if (!TupIsNull(outerTupleSlot))
+				return outerTupleSlot;
+
+			funnelstate->local_scan_done = true;
+		}
+	}
+
+	return ExecClearTuple(slot);
+}
diff --git a/src/backend/executor/nodePartialSeqscan.c b/src/backend/executor/nodePartialSeqscan.c
new file mode 100644
index 0000000..55aa266
--- /dev/null
+++ b/src/backend/executor/nodePartialSeqscan.c
@@ -0,0 +1,288 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodePartialSeqscan.c
+ *	  Support routines for parallel sequential scans of relations.
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodePartialSeqscan.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * INTERFACE ROUTINES
+ *		ExecPartialSeqScan				scans a relation.
+ *		PartialSeqNext					retrieve next tuple from either heap.
+ *		ExecInitPartialSeqScan			creates and initializes a partial seqscan node.
+ *		ExecEndPartialSeqScan			releases any storage allocated.
+ */
+#include "postgres.h"
+
+#include "access/relscan.h"
+#include "access/xact.h"
+#include "commands/dbcommands.h"
+#include "executor/execdebug.h"
+#include "executor/nodeSeqscan.h"
+#include "executor/nodePartialSeqscan.h"
+#include "postmaster/backendworker.h"
+#include "utils/rel.h"
+
+
+
+/* ----------------------------------------------------------------
+ *						Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		PartialSeqNext
+ *
+ *		This is a workhorse for ExecPartialSeqScan
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+PartialSeqNext(PartialSeqScanState *node)
+{
+	HeapTuple	tuple;
+	HeapScanDesc scandesc;
+	EState	   *estate;
+	ScanDirection direction;
+	TupleTableSlot *slot;
+
+	/*
+	 * get information from the estate and scan state
+	 */
+	scandesc = node->ss_currentScanDesc;
+	estate = node->ps.state;
+	direction = estate->es_direction;
+	slot = node->ss_ScanTupleSlot;
+
+	/*
+	 * get the next tuple from the table
+	 */
+	tuple = heap_getnext(scandesc, direction);
+
+	/*
+	 * save the tuple and the buffer returned to us by the access methods in
+	 * our scan tuple slot and return the slot.  Note: we pass 'false' because
+	 * tuples returned by heap_getnext() are pointers onto disk pages and were
+	 * not created with palloc() and so should not be pfree()'d.  Note also
+	 * that ExecStoreTuple will increment the refcount of the buffer; the
+	 * refcount will not be dropped until the tuple table slot is cleared.
+	 */
+	if (tuple)
+		ExecStoreTuple(tuple,	/* tuple to store */
+					   slot,	/* slot to store in */
+					   scandesc->rs_cbuf,		/* buffer associated with this
+												 * tuple */
+					   false);	/* don't pfree this pointer */
+	else
+		ExecClearTuple(slot);
+
+	return slot;
+}
+
+/*
+ * PartialSeqRecheck -- access method routine to recheck a tuple in EvalPlanQual
+ */
+static bool
+PartialSeqRecheck(PartialSeqScanState *node, TupleTableSlot *slot)
+{
+	/*
+	 * Note that unlike IndexScan, PartialSeqScan never use keys in
+	 * heap_beginscan (and this is very bad) - so, here we do not
+	 * check are keys ok or not.
+	 */
+	return true;
+}
+
+/* ----------------------------------------------------------------
+ *		InitPartialScanRelation
+ *
+ *		Set up to access the scan relation.
+ * ----------------------------------------------------------------
+ */
+static void
+InitPartialScanRelation(PartialSeqScanState *node, EState *estate, int eflags)
+{
+	Relation	currentRelation;
+	HeapScanDesc currentScanDesc;
+	ParallelHeapScanDesc pscan;
+
+	/*
+	 * get the relation object id from the relid'th entry in the range table,
+	 * open that relation and acquire appropriate lock on it.
+	 */
+	currentRelation = ExecOpenScanRelation(estate,
+										   ((Scan *) node->ps.plan)->scanrelid,
+										   eflags);
+
+	/*
+	 * Parallel scan descriptor is initialized and stored in dynamic shared
+	 * memory segment by master backend and parallel workers retrieve it
+	 * from shared memory.
+	 */
+	Assert(estate->toc);
+	
+	pscan = shm_toc_lookup(estate->toc, PARALLEL_KEY_SCAN);
+
+	currentScanDesc = heap_beginscan_parallel(currentRelation, pscan);
+
+	node->ss_currentRelation = currentRelation;
+	node->ss_currentScanDesc = currentScanDesc;
+
+	/* and report the scan tuple slot's rowtype */
+	ExecAssignScanType(node, RelationGetDescr(currentRelation));
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitPartialSeqScan
+ * ----------------------------------------------------------------
+ */
+PartialSeqScanState *
+ExecInitPartialSeqScan(PartialSeqScan *node, EState *estate, int eflags)
+{
+	PartialSeqScanState *scanstate;
+
+	/*
+	 * Once upon a time it was possible to have an outerPlan of a SeqScan, but
+	 * not any more.
+	 */
+	Assert(outerPlan(node) == NULL);
+	Assert(innerPlan(node) == NULL);
+
+	/*
+	 * create state structure
+	 */
+	scanstate = makeNode(PartialSeqScanState);
+	scanstate->ps.plan = (Plan *) node;
+	scanstate->ps.state = estate;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * create expression context for node
+	 */
+	ExecAssignExprContext(estate, &scanstate->ps);
+
+	/*
+	 * initialize child expressions
+	 */
+	scanstate->ps.targetlist = (List *)
+		ExecInitExpr((Expr *) node->plan.targetlist,
+					 (PlanState *) scanstate);
+	scanstate->ps.qual = (List *)
+		ExecInitExpr((Expr *) node->plan.qual,
+					 (PlanState *) scanstate);
+
+	/*
+	 * tuple table initialization
+	 */
+	ExecInitResultTupleSlot(estate, &scanstate->ps);
+	ExecInitScanTupleSlot(estate, scanstate);
+
+	/*
+	 * initialize scan relation
+	 */
+	InitPartialScanRelation(scanstate, estate, eflags);
+
+	scanstate->ps.ps_TupFromTlist = false;
+
+	/*
+	 * Initialize result tuple type and projection info.
+	 */
+	ExecAssignResultTypeFromTL(&scanstate->ps);
+	ExecAssignScanProjectionInfo(scanstate);
+
+	return scanstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecPartialSeqScan(node)
+ *
+ *		Scans the relation via multiple workers and returns
+ *		the next qualifying tuple.
+ *		We call the ExecScan() routine and pass it the appropriate
+ *		access method functions.
+ * ----------------------------------------------------------------
+ */
+TupleTableSlot *
+ExecPartialSeqScan(PartialSeqScanState *node)
+{
+	return ExecScan((ScanState *) node,
+					(ExecScanAccessMtd) PartialSeqNext,
+					(ExecScanRecheckMtd) PartialSeqRecheck);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndPartialSeqScan
+ *
+ *		frees any storage allocated through C routines.
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndPartialSeqScan(PartialSeqScanState *node)
+{
+	Relation	relation;
+	HeapScanDesc scanDesc;
+
+	/*
+	 * get information from node
+	 */
+	relation = node->ss_currentRelation;
+	scanDesc = node->ss_currentScanDesc;
+
+	/*
+	 * Free the exprcontext
+	 */
+	ExecFreeExprContext(&node->ps);
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ps.ps_ResultTupleSlot);
+	ExecClearTuple(node->ss_ScanTupleSlot);
+
+	/*
+	 * close heap scan
+	 */
+	heap_endscan(scanDesc);
+
+	/*
+	 * close the heap relation.
+	 */
+	ExecCloseScanRelation(relation);
+}
+
+/* ----------------------------------------------------------------
+ *						Join Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecReScanPartialSeqScan
+ *
+ *		Rescans the relation.
+ * ----------------------------------------------------------------
+ */
+void
+ExecReScanPartialSeqScan(PartialSeqScanState *node)
+{
+	HeapScanDesc scan;
+	ParallelHeapScanDesc pscan;
+	EState	   *estate = node->ps.state;
+
+	Assert(estate->toc);
+	
+	pscan = shm_toc_lookup(estate->toc, PARALLEL_KEY_SCAN);
+
+	scan = node->ss_currentScanDesc;
+
+	heap_parallel_rescan(pscan,			/* parallel scan desc */
+						 scan);			/* heap scan desc */
+
+	ExecScanReScan((ScanState *) node);
+}
\ No newline at end of file
diff --git a/src/backend/executor/tqueue.c b/src/backend/executor/tqueue.c
new file mode 100644
index 0000000..ee4e03e
--- /dev/null
+++ b/src/backend/executor/tqueue.c
@@ -0,0 +1,272 @@
+/*-------------------------------------------------------------------------
+ *
+ * tqueue.c
+ *	  Use shm_mq to send & receive tuples between parallel backends
+ *
+ * A DestReceiver of type DestTupleQueue, which is a TQueueDestReceiver
+ * under the hood, writes tuples from the executor to a shm_mq.
+ *
+ * A TupleQueueFunnel helps manage the process of reading tuples from
+ * one or more shm_mq objects being used as tuple queues.
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/tqueue.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/tqueue.h"
+#include "miscadmin.h"
+
+typedef struct
+{
+	DestReceiver pub;
+	shm_mq_handle *handle;
+} TQueueDestReceiver;
+
+struct TupleQueueFunnel
+{
+	int		nqueues;
+	int		maxqueues;
+	int		nextqueue;
+	shm_mq_handle **queue;
+};
+
+/*
+ * Receive a tuple.
+ */
+static void
+tqueueReceiveSlot(TupleTableSlot *slot, DestReceiver *self)
+{
+	TQueueDestReceiver *tqueue = (TQueueDestReceiver *) self;
+	HeapTuple	tuple;
+	shm_mq_result	result;
+
+	tuple = ExecMaterializeSlot(slot);
+	result = shm_mq_send(tqueue->handle, tuple->t_len, tuple->t_data, false);
+
+	if (result != SHM_MQ_SUCCESS)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("unable to send tuples")));
+}
+
+/*
+ * Prepare to receive tuples from executor.
+ */
+static void
+tqueueStartupReceiver(DestReceiver *self, int operation, TupleDesc typeinfo)
+{
+	/* do nothing */
+}
+
+/*
+ * Clean up at end of an executor run
+ */
+static void
+tqueueShutdownReceiver(DestReceiver *self)
+{
+	/* do nothing */
+}
+
+/*
+ * Destroy receiver when done with it
+ */
+static void
+tqueueDestroyReceiver(DestReceiver *self)
+{
+	pfree(self);
+}
+
+/*
+ * Create a DestReceiver that writes tuples to a tuple queue.
+ */
+DestReceiver *
+CreateTupleQueueDestReceiver(void)
+{
+	TQueueDestReceiver *self;
+
+	self = (TQueueDestReceiver *) palloc0(sizeof(TQueueDestReceiver));
+
+	self->pub.receiveSlot = tqueueReceiveSlot;
+	self->pub.rStartup = tqueueStartupReceiver;
+	self->pub.rShutdown = tqueueShutdownReceiver;
+	self->pub.rDestroy = tqueueDestroyReceiver;
+	self->pub.mydest = DestTupleQueue;
+
+	/* private fields will be set by SetTupleQueueDestReceiverParams */
+
+	return (DestReceiver *) self;
+}
+
+/*
+ * Set parameters for a TupleQueueDestReceiver
+ */
+void
+SetTupleQueueDestReceiverParams(DestReceiver *self,
+								shm_mq_handle *handle)
+{
+	TQueueDestReceiver *myState = (TQueueDestReceiver *) self;
+
+	myState->handle = handle;
+}
+
+/*
+ * Create a tuple queue funnel.
+ */
+TupleQueueFunnel *
+CreateTupleQueueFunnel(void)
+{
+	TupleQueueFunnel *funnel = palloc0(sizeof(TupleQueueFunnel));
+
+	funnel->maxqueues = 8;
+	funnel->queue = palloc(funnel->maxqueues * sizeof(shm_mq_handle *));
+
+	return funnel;
+}
+
+/*
+ * Destroy a tuple queue funnel.
+ */
+void
+DestroyTupleQueueFunnel(TupleQueueFunnel *funnel)
+{
+	if (funnel)
+	{
+		pfree(funnel->queue);
+		pfree(funnel);
+	}
+}
+
+/*
+ * Remember the shared memory queue handle in funnel.
+ */
+void
+RegisterTupleQueueOnFunnel(TupleQueueFunnel *funnel, shm_mq_handle *handle)
+{
+	if (funnel->nqueues < funnel->maxqueues)
+	{
+		funnel->queue[funnel->nqueues++] = handle;
+		return;
+	}
+
+	if (funnel->nqueues >= funnel->maxqueues)
+	{
+		int newsize = funnel->nqueues * 2;
+
+		Assert(funnel->nqueues == funnel->maxqueues);
+
+		funnel->queue = repalloc(funnel->queue,
+								 newsize * sizeof(shm_mq_handle *));
+		funnel->maxqueues = newsize;
+	}
+
+	funnel->queue[funnel->nqueues++] = handle;
+}
+
+/*
+ * Fetch a tuple from a tuple queue funnel.
+ *
+ * We try to read from the queues in round-robin fashion so as to avoid
+ * the situation where some workers get their tuples read promptly while
+ * others are barely ever serviced.
+ *
+ * Even when nowait = false, we read from the individual queues in
+ * non-blocking mode.  Even when shm_mq_receive() returns SHM_MQ_WOULD_BLOCK,
+ * it can still accumulate bytes from a partially-read message, so doing it
+ * this way should outperform doing a blocking read on each queue in turn.
+ *
+ * The return value is NULL if there are no remaining queues or if
+ * nowait = true and no queue returned a tuple without blocking.  *done, if
+ * not NULL, is set to true when there are no remaining queues and false in
+ * any other case.
+ */
+HeapTuple
+TupleQueueFunnelNext(TupleQueueFunnel *funnel, bool nowait, bool *done)
+{
+	int	waitpos = funnel->nextqueue;
+
+	/* Corner case: called before adding any queues, or after all are gone. */
+	if (funnel->nqueues == 0)
+	{
+		if (done != NULL)
+			*done = true;
+		return NULL;
+	}
+
+	if (done != NULL)
+		*done = false;
+
+	for (;;)
+	{
+		shm_mq_handle *mqh = funnel->queue[funnel->nextqueue];
+		shm_mq_result result;
+		Size	nbytes;
+		void   *data;
+
+		/* Attempt to read a message. */
+		result = shm_mq_receive(mqh, &nbytes, &data, true);
+
+		/*
+		 * Normally, we advance funnel->nextqueue to the next queue at this
+		 * point, but if we're pointing to a queue that we've just discovered
+		 * is detached, then forget that queue and leave the pointer where it
+		 * is.
+		 */
+		if (result != SHM_MQ_DETACHED)
+			funnel->nextqueue = (funnel->nextqueue + 1) % funnel->nqueues;
+		else
+		{
+			--funnel->nqueues;
+			if (funnel->nqueues == 0)
+			{
+				if (done != NULL)
+					*done = true;
+				return NULL;
+			}
+			memmove(&funnel->queue[funnel->nextqueue],
+					&funnel->queue[funnel->nextqueue + 1],
+					sizeof(shm_mq_handle *)
+						* (funnel->nqueues - funnel->nextqueue));
+			if (funnel->nextqueue < waitpos)
+				--waitpos;
+		}
+
+		/* If we got a message, return it. */
+		if (result == SHM_MQ_SUCCESS)
+		{
+			HeapTupleData htup;
+
+			/*
+			 * The tuple data we just read from the queue is only valid
+			 * until we again attempt to read from it.  Copy the tuple into
+			 * a single palloc'd chunk as callers will expect.
+			 */
+			ItemPointerSetInvalid(&htup.t_self);
+			htup.t_tableOid = InvalidOid;
+			htup.t_len = nbytes;
+			htup.t_data = data;
+			return heap_copytuple(&htup);
+		}
+
+		/*
+		 * If we've visited all of the queues, then we should either give up
+		 * and return NULL (if we're in non-blocking mode) or wait for the
+		 * process latch to be set (otherwise).
+		 */
+		if (funnel->nextqueue == waitpos)
+		{
+			if (nowait)
+				return NULL;
+			WaitLatch(MyLatch, WL_LATCH_SET, 0);
+			CHECK_FOR_INTERRUPTS();
+			ResetLatch(MyLatch);
+		}
+	}
+}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index ebb6f3a..c1d77e0 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -354,6 +354,43 @@ _copySeqScan(const SeqScan *from)
 }
 
 /*
+ * _copyPartialSeqScan
+ */
+static PartialSeqScan *
+_copyPartialSeqScan(const PartialSeqScan *from)
+{
+	PartialSeqScan    *newnode = makeNode(PartialSeqScan);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopyScanFields((const Scan *) from, (Scan *) newnode);
+
+	return newnode;
+}
+
+/*
+ * _copyFunnel
+ */
+static Funnel *
+_copyFunnel(const Funnel *from)
+{
+	Funnel    *newnode = makeNode(Funnel);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopyScanFields((const Scan *) from, (Scan *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(num_workers);
+
+	return newnode;
+}
+
+/*
  * _copyIndexScan
  */
 static IndexScan *
@@ -4047,6 +4084,12 @@ copyObject(const void *from)
 		case T_SeqScan:
 			retval = _copySeqScan(from);
 			break;
+		case T_PartialSeqScan:
+			retval = _copyPartialSeqScan(from);
+			break;
+		case T_Funnel:
+			retval = _copyFunnel(from);
+			break;
 		case T_IndexScan:
 			retval = _copyIndexScan(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 775f482..3382ab2 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -439,6 +439,24 @@ _outSeqScan(StringInfo str, const SeqScan *node)
 }
 
 static void
+_outPartialSeqScan(StringInfo str, const PartialSeqScan *node)
+{
+	WRITE_NODE_TYPE("PARTIALSEQSCAN");
+
+	_outScanInfo(str, (const Scan *) node);
+}
+
+static void
+_outFunnel(StringInfo str, const Funnel *node)
+{
+	WRITE_NODE_TYPE("FUNNEL");
+
+	_outScanInfo(str, (const Scan *) node);
+
+	WRITE_UINT_FIELD(num_workers);
+}
+
+static void
 _outIndexScan(StringInfo str, const IndexScan *node)
 {
 	WRITE_NODE_TYPE("INDEXSCAN");
@@ -2886,6 +2904,12 @@ _outNode(StringInfo str, const void *obj)
 			case T_SeqScan:
 				_outSeqScan(str, obj);
 				break;
+			case T_PartialSeqScan:
+				_outPartialSeqScan(str, obj);
+				break;
+			case T_Funnel:
+				_outFunnel(str, obj);
+				break;
 			case T_IndexScan:
 				_outIndexScan(str, obj);
 				break;
diff --git a/src/backend/nodes/params.c b/src/backend/nodes/params.c
index fb803f8..aa278c5 100644
--- a/src/backend/nodes/params.c
+++ b/src/backend/nodes/params.c
@@ -16,9 +16,22 @@
 #include "postgres.h"
 
 #include "nodes/params.h"
+#include "storage/shmem.h"
 #include "utils/datum.h"
 #include "utils/lsyscache.h"
 
+/*
+ * For each bind parameter, we pass this structure, followed by the parameter
+ * value unless the parameter is pass-by-value or NULL.
+ */
+typedef struct SerializedParamExternData
+{
+	Datum		value;			/* pass-by-value datums are stored directly */
+	Size		length;			/* length of parameter value */
+	bool		isnull;			/* is it NULL? */
+	uint16		pflags;			/* flag bits, see params.h */
+	Oid			ptype;			/* parameter's datatype, or 0 */
+} SerializedParamExternData;
 
 /*
  * Copy a ParamListInfo structure.
@@ -73,3 +86,187 @@ copyParamList(ParamListInfo from)
 
 	return retval;
 }
+
+/*
+ * Estimate the amount of space required to serialize the bound
+ * parameters.
+ */
+Size
+EstimateBoundParametersSpace(ParamListInfo paramInfo)
+{
+	Size		size;
+	int			i;
+
+	/* Add space required for saving numParams */
+	size = sizeof(int);
+
+	if (paramInfo)
+	{
+		/* Add space required for saving the param data */
+		for (i = 0; i < paramInfo->numParams; i++)
+		{
+			/*
+			 * for each parameter, calculate the size of fixed part
+			 * of parameter (SerializedParamExternData) and length of
+			 * parameter value.
+			 */
+			ParamExternData *oprm;
+			int16		typLen;
+			bool		typByVal;
+			Size		length;
+
+			length = sizeof(SerializedParamExternData);
+
+			oprm = &paramInfo->params[i];
+
+			get_typlenbyval(oprm->ptype, &typLen, &typByVal);
+
+			/*
+			 * pass-by-value parameters are directly stored in
+			 * SerializedParamExternData, so no need of additional
+			 * space for them.
+			 */
+			if (!(typByVal || oprm->isnull))
+			{
+				length += datumGetSize(oprm->value, typByVal, typLen);
+				size = add_size(size, length);
+
+				/* Allow space for terminating zero-byte */
+				size = add_size(size, 1);
+			}
+			else
+				size = add_size(size, length);
+		}
+	}
+
+	return size;
+}
+
+/*
+ * Serialize the bind parameters into memory, beginning at start_address.
+ * maxsize should be at least as large as the value returned by
+ * EstimateBoundParametersSpace.
+ */
+void
+SerializeBoundParams(ParamListInfo paramInfo, Size maxsize, char *start_address)
+{
+	char	   *curptr;
+	SerializedParamExternData *retval;
+	int i;
+
+	/*
+	 * First, we store the number of bind parameters, if there is
+	 * no bind parameter then no need to store any more information.
+	 */
+	if (paramInfo && paramInfo->numParams > 0)
+		* (int *) start_address = paramInfo->numParams;
+	else
+	{
+		* (int *) start_address = 0;
+		return;
+	}
+	curptr = start_address + sizeof(int);
+
+
+	for (i = 0; i < paramInfo->numParams; i++)
+	{
+		ParamExternData *oprm;
+		int16		typLen;
+		bool		typByVal;
+		Size		datumlength, length;
+		const char	*s;
+
+		Assert (curptr <= start_address + maxsize);
+		retval = (SerializedParamExternData*) curptr;
+		oprm = &paramInfo->params[i];
+
+		retval->isnull = oprm->isnull;
+		retval->pflags = oprm->pflags;
+		retval->ptype = oprm->ptype;
+		retval->value = oprm->value;
+
+		curptr = curptr + sizeof(SerializedParamExternData);
+
+		if (retval->isnull)
+			continue;
+
+		get_typlenbyval(oprm->ptype, &typLen, &typByVal);
+
+		if (!typByVal)
+		{
+			datumlength = datumGetSize(oprm->value, typByVal, typLen);
+			s = (char *) DatumGetPointer(oprm->value);
+			memcpy(curptr, s, datumlength);
+			length = datumlength;
+			curptr[length] = '\0';
+			retval->length = length;
+			curptr += length + 1;
+		}
+	}
+}
+
+/*
+ * RestoreBoundParams
+ *		Restore bind parameters from the specified address.
+ *
+ * The params are palloc'd in CurrentMemoryContext.
+ */
+ParamListInfo
+RestoreBoundParams(char *start_address)
+{
+	ParamListInfo retval;
+	Size		size;
+	int			num_params,i;
+	char	   *curptr;
+
+	num_params = * (int *) start_address;
+
+	if (num_params <= 0)
+		return NULL;
+
+	/* sizeof(ParamListInfoData) includes the first array element */
+	size = sizeof(ParamListInfoData) +
+		(num_params - 1) * sizeof(ParamExternData);
+	retval = (ParamListInfo) palloc(size);
+	retval->paramFetch = NULL;
+	retval->paramFetchArg = NULL;
+	retval->parserSetup = NULL;
+	retval->parserSetupArg = NULL;
+	retval->numParams = num_params;
+
+	curptr = start_address + sizeof(int);
+
+	for (i = 0; i < num_params; i++)
+	{
+		SerializedParamExternData *nprm;
+		char	*s;
+		int16		typLen;
+		bool		typByVal;
+
+		nprm = (SerializedParamExternData *) curptr;
+
+		/* copy the parameter info */
+		retval->params[i].isnull = nprm->isnull;
+		retval->params[i].pflags = nprm->pflags;
+		retval->params[i].ptype = nprm->ptype;
+		retval->params[i].value = nprm->value;
+
+		curptr = curptr + sizeof(SerializedParamExternData);
+
+		if (nprm->isnull)
+			continue;
+
+		get_typlenbyval(nprm->ptype, &typLen, &typByVal);
+
+		if (!typByVal)
+		{
+			s = palloc(nprm->length + 1);
+			memcpy(s, curptr, nprm->length + 1);
+			retval->params[i].value = PointerGetDatum(s);
+
+			curptr += nprm->length + 1;
+		}
+	}
+
+	return retval;
+}
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 563209c..2bae475 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1280,6 +1280,91 @@ _readRangeTblFunction(void)
 	READ_DONE();
 }
 
+/*
+ * _readPlanInvalItem
+ */
+static PlanInvalItem *
+_readPlanInvalItem(void)
+{
+	READ_LOCALS(PlanInvalItem);
+
+	READ_INT_FIELD(cacheId);
+	READ_UINT_FIELD(hashValue);
+
+	READ_DONE();
+}
+
+/*
+ * _readPlannedStmt
+ */
+static PlannedStmt *
+_readPlannedStmt(void)
+{
+	READ_LOCALS(PlannedStmt);
+
+	READ_ENUM_FIELD(commandType, CmdType);
+	READ_UINT_FIELD(queryId);
+	READ_BOOL_FIELD(hasReturning);
+	READ_BOOL_FIELD(hasModifyingCTE);
+	READ_BOOL_FIELD(canSetTag);
+	READ_BOOL_FIELD(transientPlan);
+	READ_NODE_FIELD(planTree);
+	READ_NODE_FIELD(rtable);
+	READ_NODE_FIELD(resultRelations);
+	READ_NODE_FIELD(utilityStmt);
+	READ_NODE_FIELD(subplans);
+	READ_BITMAPSET_FIELD(rewindPlanIDs);
+	READ_NODE_FIELD(rowMarks);
+	READ_NODE_FIELD(relationOids);
+	READ_NODE_FIELD(invalItems);
+	READ_INT_FIELD(nParamExec);
+	READ_BOOL_FIELD(hasRowSecurity);
+
+	READ_DONE();
+}
+
+static Plan *
+_readPlan(void)
+{
+	READ_LOCALS(Plan);
+
+	READ_FLOAT_FIELD(startup_cost);
+	READ_FLOAT_FIELD(total_cost);
+	READ_FLOAT_FIELD(plan_rows);
+	READ_INT_FIELD(plan_width);
+	READ_NODE_FIELD(targetlist);
+	READ_NODE_FIELD(qual);
+	READ_NODE_FIELD(lefttree);
+	READ_NODE_FIELD(righttree);
+	READ_NODE_FIELD(initPlan);
+	READ_BITMAPSET_FIELD(extParam);
+	READ_BITMAPSET_FIELD(allParam);
+
+	READ_DONE();
+}
+
+static Scan *
+_readScan(void)
+{
+	Plan *local_plan;
+	READ_LOCALS(PartialSeqScan);
+
+	local_plan = _readPlan();
+	local_node->plan.startup_cost = local_plan->startup_cost;
+	local_node->plan.total_cost = local_plan->total_cost;
+	local_node->plan.plan_rows = local_plan->plan_rows;
+	local_node->plan.plan_width = local_plan->plan_width;
+	local_node->plan.targetlist = local_plan->targetlist;
+	local_node->plan.qual = local_plan->qual;
+	local_node->plan.lefttree = local_plan->lefttree;
+	local_node->plan.righttree = local_plan->righttree;
+	local_node->plan.initPlan = local_plan->initPlan;
+	local_node->plan.extParam = local_plan->extParam;
+	local_node->plan.allParam = local_plan->allParam;
+	READ_UINT_FIELD(scanrelid);
+
+	READ_DONE();
+}
 
 /*
  * parseNodeString
@@ -1409,6 +1494,12 @@ parseNodeString(void)
 		return_value = _readNotifyStmt();
 	else if (MATCH("DECLARECURSOR", 13))
 		return_value = _readDeclareCursorStmt();
+	else if (MATCH("PLANINVALITEM", 13))
+		return_value = _readPlanInvalItem();
+	else if (MATCH("PLANNEDSTMT", 11))
+		return_value = _readPlannedStmt();
+	else if (MATCH("PARTIALSEQSCAN", 14))
+		return_value = _readScan();
 	else
 	{
 		elog(ERROR, "badly formatted node string \"%.32s\"...", token);
diff --git a/src/backend/optimizer/path/Makefile b/src/backend/optimizer/path/Makefile
index 6864a62..6e462b1 100644
--- a/src/backend/optimizer/path/Makefile
+++ b/src/backend/optimizer/path/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = allpaths.o clausesel.o costsize.o equivclass.o indxpath.o \
-       joinpath.o joinrels.o pathkeys.o tidpath.o
+       joinpath.o joinrels.o pathkeys.o parallelpath.o tidpath.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 58d78e6..528727c 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -410,6 +410,9 @@ set_plain_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
 	/* Consider sequential scan */
 	add_path(rel, create_seqscan_path(root, rel, required_outer));
 
+	/* Consider parallel scans */
+	create_parallelscan_paths(root, rel);
+
 	/* Consider index scans */
 	create_index_paths(root, rel);
 
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 5a9daf0..282e5ff 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -11,6 +11,9 @@
  *	cpu_tuple_cost		Cost of typical CPU time to process a tuple
  *	cpu_index_tuple_cost  Cost of typical CPU time to process an index tuple
  *	cpu_operator_cost	Cost of CPU time to execute an operator or function
+ *	cpu_tuple_comm_cost	Cost of CPU time to pass a tuple from worker to master backend
+ *	parallel_setup_cost	Cost of setting up shared memory for parallelism
+ *	parallel_startup_cost  Cost of starting up parallel workers
  *
  * We expect that the kernel will typically do some amount of read-ahead
  * optimization; this in conjunction with seek costs means that seq_page_cost
@@ -101,11 +104,16 @@ double		random_page_cost = DEFAULT_RANDOM_PAGE_COST;
 double		cpu_tuple_cost = DEFAULT_CPU_TUPLE_COST;
 double		cpu_index_tuple_cost = DEFAULT_CPU_INDEX_TUPLE_COST;
 double		cpu_operator_cost = DEFAULT_CPU_OPERATOR_COST;
+double		cpu_tuple_comm_cost = DEFAULT_CPU_TUPLE_COMM_COST;
+double		parallel_setup_cost = DEFAULT_PARALLEL_SETUP_COST;
+double		parallel_startup_cost = DEFAULT_PARALLEL_STARTUP_COST;
 
 int			effective_cache_size = DEFAULT_EFFECTIVE_CACHE_SIZE;
 
 Cost		disable_cost = 1.0e10;
 
+int	parallel_seqscan_degree = 0;
+
 bool		enable_seqscan = true;
 bool		enable_indexscan = true;
 bool		enable_indexonlyscan = true;
@@ -220,6 +228,55 @@ cost_seqscan(Path *path, PlannerInfo *root,
 }
 
 /*
+ * cost_funnel
+ *	  Determines and returns the cost of scanning a relation in parallel.
+ *
+ * 'baserel' is the relation to be scanned
+ * 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
+ */
+void
+cost_funnel(FunnelPath *path, PlannerInfo *root,
+			RelOptInfo *baserel, ParamPathInfo *param_info,
+			int nWorkers)
+{
+	Cost		startup_cost = 0;
+	Cost		run_cost = 0;
+
+	/* Should only be applied to base relations */
+	Assert(baserel->relid > 0);
+	Assert(baserel->rtekind == RTE_RELATION);
+
+	/* Mark the path with the correct row estimate */
+	if (param_info)
+		path->path.rows = param_info->ppi_rows;
+	else
+		path->path.rows = baserel->rows;
+
+	startup_cost = path->subpath->startup_cost;
+
+	run_cost = path->subpath->total_cost - path->subpath->startup_cost;
+
+	/*
+	 * Runtime cost will be equally shared by all workers.
+	 * The assumption here is that disk access cost will also be
+	 * equally shared between workers, which is generally true
+	 * unless there are too many workers working on a relatively
+	 * small number of blocks.  If we come across any such case,
+	 * then we can think of changing the current cost model for
+	 * parallel sequential scan.
+	 */
+	run_cost = run_cost / (nWorkers + 1);
+
+	/* Parallel setup and communication cost. */
+	startup_cost += parallel_setup_cost;
+	startup_cost += parallel_startup_cost * nWorkers;
+	run_cost += cpu_tuple_comm_cost * baserel->tuples;
+
+	path->path.startup_cost = startup_cost;
+	path->path.total_cost = (startup_cost + run_cost);
+}
+
+/*
  * cost_index
  *	  Determines and returns the cost of scanning a relation using an index.
  *
diff --git a/src/backend/optimizer/path/parallelpath.c b/src/backend/optimizer/path/parallelpath.c
new file mode 100644
index 0000000..0b25b39
--- /dev/null
+++ b/src/backend/optimizer/path/parallelpath.c
@@ -0,0 +1,115 @@
+/*-------------------------------------------------------------------------
+ *
+ * parallelpath.c
+ *	  Routines to determine which conditions are usable for scanning
+ *	  a given relation, and create ParallelPaths accordingly.
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/optimizer/path/parallelpath.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heapam.h"
+#include "nodes/relation.h"
+#include "optimizer/cost.h"
+#include "optimizer/pathnode.h"
+#include "optimizer/paths.h"
+#include "optimizer/restrictinfo.h"
+#include "optimizer/clauses.h"
+#include "parser/parsetree.h"
+#include "utils/rel.h"
+
+
+/*
+ *	check_simple_qual -
+ *		Check if qual is made only of simple things we can
+ *		hand out directly to backend worker for execution.
+ *
+ *		XXX - Currently we don't allow pushing down an expression
+ *		if it contains a volatile function; eventually we will
+ *		need a mechanism (proisparallel) with which we can distinguish
+ *		the functions that can be pushed down for execution by a
+ *		parallel worker.
+ */
+static bool
+check_simple_qual(Node *node)
+{
+	if (node == NULL)
+		return TRUE;
+
+	if (contain_volatile_functions(node))
+		return FALSE;
+
+	return TRUE;
+}
+
+/*
+ * create_parallelscan_paths
+ *	  Create paths corresponding to parallel scans of the given rel.
+ *	  Currently we only support parallel sequential scan.
+ *
+ *	  Candidate paths are added to the rel's pathlist (using add_path).
+ */
+void
+create_parallelscan_paths(PlannerInfo *root, RelOptInfo *rel)
+{
+	int num_parallel_workers = 0;
+	Oid			reloid;
+	Relation	relation;
+	Path		*subpath;
+
+	/*
+	 * Parallel scan is possible only if the user has set
+	 * parallel_seqscan_degree to a value greater than 0.
+	 */
+	if (parallel_seqscan_degree <= 0)
+		return;
+
+	/* Parallel scan is supported only for SELECT statements. */
+	if (root->parse->commandType != CMD_SELECT)
+		return;
+
+	reloid = planner_rt_fetch(rel->relid, root)->relid;
+
+	relation = heap_open(reloid, NoLock);
+
+	/*
+	 * Temporary relations can't be scanned by parallel workers as
+	 * they are visible only to local sessions.
+	 */
+	if (RelationUsesLocalBuffers(relation))
+	{
+		heap_close(relation, NoLock);
+		return;
+	}
+
+	heap_close(relation, NoLock);
+
+	/*
+	 * parallel scan is not supported for quals containing volatile functions
+	 */
+	if (!check_simple_qual((Node*) extract_actual_clauses(rel->baserestrictinfo, false)))
+		return;
+
+	/*
+	 * There should be at least one page to scan for each worker.
+	 */
+	if (parallel_seqscan_degree <= rel->pages)
+		num_parallel_workers = parallel_seqscan_degree;
+	else
+		num_parallel_workers = rel->pages;
+
+	/* Create the partial scan path which each worker needs to execute. */
+	subpath = create_partialseqscan_path(root, rel, false);
+
+	/* Create the parallel scan path which master needs to execute. */
+	add_path(rel, (Path *) create_funnel_path(root, rel, subpath,
+											  num_parallel_workers));
+}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index cb69c03..32cefe6 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -58,6 +58,11 @@ static Material *create_material_plan(PlannerInfo *root, MaterialPath *best_path
 static Plan *create_unique_plan(PlannerInfo *root, UniquePath *best_path);
 static SeqScan *create_seqscan_plan(PlannerInfo *root, Path *best_path,
 					List *tlist, List *scan_clauses);
+static Scan *create_partialseqscan_plan(PlannerInfo *root, Path *best_path,
+							List *tlist, List *scan_clauses);
+static Scan *create_funnel_plan(PlannerInfo *root,
+								FunnelPath *best_path,
+								List *tlist, List *scan_clauses);
 static Scan *create_indexscan_plan(PlannerInfo *root, IndexPath *best_path,
 					  List *tlist, List *scan_clauses, bool indexonly);
 static BitmapHeapScan *create_bitmap_scan_plan(PlannerInfo *root,
@@ -100,6 +105,12 @@ static List *order_qual_clauses(PlannerInfo *root, List *clauses);
 static void copy_path_costsize(Plan *dest, Path *src);
 static void copy_plan_costsize(Plan *dest, Plan *src);
 static SeqScan *make_seqscan(List *qptlist, List *qpqual, Index scanrelid);
+static PartialSeqScan *make_partialseqscan(List *qptlist,
+										   List *qpqual,
+										   Index scanrelid);
+static Funnel *make_funnel(List *qptlist, List *qpqual,
+						   Index scanrelid, int nworkers,
+						   Plan *subplan);
 static IndexScan *make_indexscan(List *qptlist, List *qpqual, Index scanrelid,
 			   Oid indexid, List *indexqual, List *indexqualorig,
 			   List *indexorderby, List *indexorderbyorig,
@@ -228,6 +239,8 @@ create_plan_recurse(PlannerInfo *root, Path *best_path)
 	switch (best_path->pathtype)
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
+		case T_Funnel:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -343,6 +356,20 @@ create_scan_plan(PlannerInfo *root, Path *best_path)
 												scan_clauses);
 			break;
 
+		case T_PartialSeqScan:
+			plan = (Plan *) create_partialseqscan_plan(root,
+													   best_path,
+													   tlist,
+													   scan_clauses);
+			break;
+
+		case T_Funnel:
+			plan = (Plan *) create_funnel_plan(root,
+											   (FunnelPath *) best_path,
+											   tlist,
+											   scan_clauses);
+			break;
+
 		case T_IndexScan:
 			plan = (Plan *) create_indexscan_plan(root,
 												  (IndexPath *) best_path,
@@ -546,6 +573,8 @@ disuse_physical_tlist(PlannerInfo *root, Plan *plan, Path *path)
 	switch (path->pathtype)
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
+		case T_Funnel:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -1133,6 +1162,84 @@ create_seqscan_plan(PlannerInfo *root, Path *best_path,
 }
 
 /*
+ * create_partialseqscan_plan
+ *
+ * Returns a partial seqscan plan for the base relation scanned by
+ * 'best_path' with restriction clauses 'scan_clauses' and targetlist
+ * 'tlist'.
+ */
+static Scan *
+create_partialseqscan_plan(PlannerInfo *root, Path *best_path,
+						   List *tlist, List *scan_clauses)
+{
+	Scan    *scan_plan;
+	Index		scan_relid = best_path->parent->relid;
+
+	/* it should be a base rel... */
+	Assert(scan_relid > 0);
+	Assert(best_path->parent->rtekind == RTE_RELATION);
+
+	/* Sort clauses into best execution order */
+	scan_clauses = order_qual_clauses(root, scan_clauses);
+
+	/* Reduce RestrictInfo list to bare expressions; ignore pseudoconstants */
+	scan_clauses = extract_actual_clauses(scan_clauses, false);
+
+	/* Replace any outer-relation variables with nestloop params */
+	if (best_path->param_info)
+	{
+		scan_clauses = (List *)
+			replace_nestloop_params(root, (Node *) scan_clauses);
+	}
+
+	scan_plan = (Scan *) make_partialseqscan(tlist,
+											 scan_clauses,
+											 scan_relid);
+
+	copy_path_costsize(&scan_plan->plan, best_path);
+
+	return scan_plan;
+}
+
+/*
+ * create_funnel_plan
+ *
+ * Returns a funnel plan for the base relation scanned by
+ * 'best_path' with restriction clauses 'scan_clauses' and targetlist
+ * 'tlist'.
+ */
+static Scan *
+create_funnel_plan(PlannerInfo *root, FunnelPath *best_path,
+				   List *tlist, List *scan_clauses)
+{
+	Scan    *scan_plan;
+	Plan	   *subplan;
+	Index		scan_relid = best_path->path.parent->relid;
+
+	/* it should be a base rel... */
+	Assert(scan_relid > 0);
+	Assert(best_path->path.parent->rtekind == RTE_RELATION);
+
+	subplan = create_plan_recurse(root, best_path->subpath);
+
+	/*
+	 * The quals for the subplan and the top-level plan are the same:
+	 * either all the quals are pushed down to the subplan
+	 * (the partialseqscan plan), or the parallel plan won't be
+	 * chosen.
+	 */
+	scan_plan = (Scan *) make_funnel(tlist,
+									 subplan->qual,
+									 scan_relid,
+									 best_path->num_workers,
+									 subplan);
+
+	copy_path_costsize(&scan_plan->plan, &best_path->path);
+
+	return scan_plan;
+}
+
+/*
  * create_indexscan_plan
  *	  Returns an indexscan plan for the base relation scanned by 'best_path'
  *	  with restriction clauses 'scan_clauses' and targetlist 'tlist'.
@@ -3321,6 +3428,45 @@ make_seqscan(List *qptlist,
 	return node;
 }
 
+static PartialSeqScan *
+make_partialseqscan(List *qptlist,
+					List *qpqual,
+					Index scanrelid)
+{
+	PartialSeqScan *node = makeNode(PartialSeqScan);
+	Plan	   *plan = &node->plan;
+
+	/* cost should be inserted by caller */
+	plan->targetlist = qptlist;
+	plan->qual = qpqual;
+	plan->lefttree = NULL;
+	plan->righttree = NULL;
+	node->scanrelid = scanrelid;
+
+	return node;
+}
+
+static Funnel *
+make_funnel(List *qptlist,
+			List *qpqual,
+			Index scanrelid,
+			int nworkers,
+			Plan *subplan)
+{
+	Funnel *node = makeNode(Funnel);
+	Plan	   *plan = &node->scan.plan;
+
+	/* cost should be inserted by caller */
+	plan->targetlist = qptlist;
+	plan->qual = qpqual;
+	plan->lefttree = subplan;
+	plan->righttree = NULL;
+	node->scan.scanrelid = scanrelid;
+	node->num_workers = nworkers;
+
+	return node;
+}
+
 static IndexScan *
 make_indexscan(List *qptlist,
 			   List *qpqual,
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index b02a107..590d0df 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -260,6 +260,51 @@ standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
 	return result;
 }
 
+PlannedStmt	*
+create_parallel_worker_plannedstmt(PartialSeqScan *partialscan,
+								   List *rangetable)
+{
+	PlannedStmt	*result;
+	ListCell   *tlist;
+
+	/*
+	 * Avoid removing junk entries in worker as those are
+	 * required by upper nodes in master backend.
+	 */
+	foreach(tlist, partialscan->plan.targetlist)
+	{
+		TargetEntry *tle = (TargetEntry *) lfirst(tlist);
+
+		tle->resjunk = false;
+	}
+
+	/* build the PlannedStmt result */
+	result = makeNode(PlannedStmt);
+
+	result->commandType = CMD_SELECT;
+	result->queryId = 0;
+	result->hasReturning = false;
+	result->hasModifyingCTE = false;
+	result->canSetTag = true;
+	result->transientPlan = false;
+	result->planTree = (Plan*) partialscan;
+	result->rtable = rangetable;
+	result->resultRelations = NIL;
+	result->utilityStmt = NULL;
+	result->subplans = NIL;
+	result->rewindPlanIDs = NULL;
+	result->rowMarks = NIL;
+	result->nParamExec = 0;
+	/*
+	 * Don't bother to set parameters used for invalidation as
+	 * worker backend plans are not saved, so can't be invalidated.
+	 */
+	result->relationOids = NIL;
+	result->invalItems = NIL;
+	result->hasRowSecurity = false;
+
+	return result;
+}
 
 /*--------------------
  * subquery_planner
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index ec828cd..ef8c317 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -435,6 +435,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
 			{
 				SeqScan    *splan = (SeqScan *) plan;
 
@@ -445,6 +446,24 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 					fix_scan_list(root, splan->plan.qual, rtoffset);
 			}
 			break;
+		case T_Funnel:
+			{
+				Funnel    *splan = (Funnel *) plan;
+
+				splan->scan.scanrelid += rtoffset;
+				splan->scan.plan.targetlist =
+					fix_scan_list(root, splan->scan.plan.targetlist, rtoffset);
+				splan->scan.plan.qual =
+					fix_scan_list(root, splan->scan.plan.qual, rtoffset);
+
+				/*
+				 * The target list for the partial sequence scan (lefttree of the
+				 * funnel plan) should be the same as for the funnel scan, as both
+				 * nodes need to produce the same projection.
+				 */
+				splan->scan.plan.lefttree->targetlist = splan->scan.plan.targetlist;
+			}
+			break;
 		case T_IndexScan:
 			{
 				IndexScan  *splan = (IndexScan *) plan;
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 5a1d539..8ea91ec 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2163,6 +2163,8 @@ finalize_plan(PlannerInfo *root, Plan *plan, Bitmapset *valid_params,
 			break;
 
 		case T_SeqScan:
+		case T_PartialSeqScan:
+		case T_Funnel:
 			context.paramids = bms_add_members(context.paramids, scan_params);
 			break;
 
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 1395a21..c1ffe78 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -706,6 +706,53 @@ create_seqscan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer)
 }
 
 /*
+ * create_partialseqscan_path
+ *	  Creates a path corresponding to a partial sequential scan, returning the
+ *	  pathnode.
+ */
+Path *
+create_partialseqscan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer)
+{
+	Path	   *pathnode = makeNode(Path);
+
+	pathnode->pathtype = T_PartialSeqScan;
+	pathnode->parent = rel;
+	pathnode->param_info = get_baserel_parampathinfo(root, rel,
+													 required_outer);
+	pathnode->pathkeys = NIL;	/* seqscan has unordered result */
+
+	cost_seqscan(pathnode, root, rel, pathnode->param_info);
+
+	return pathnode;
+}
+
+/*
+ * create_funnel_path
+ *
+ *	  Creates a path corresponding to a funnel scan, returning the
+ *	  pathnode.
+ */
+FunnelPath *
+create_funnel_path(PlannerInfo *root, RelOptInfo *rel,
+							Path* subpath, int nWorkers)
+{
+	FunnelPath	   *pathnode = makeNode(FunnelPath);
+
+	pathnode->path.pathtype = T_Funnel;
+	pathnode->path.parent = rel;
+	pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
+													 NULL);
+	pathnode->path.pathkeys = NIL;	/* funnel has unordered result */
+
+	pathnode->subpath = subpath;
+	pathnode->num_workers = nWorkers;
+
+	cost_funnel(pathnode, root, rel, pathnode->path.param_info, nWorkers);
+
+	return pathnode;
+}
+
+/*
  * create_index_path
  *	  Creates a path node for an index scan.
  *
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 71c2321..f056bd5 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -12,7 +12,8 @@ subdir = src/backend/postmaster
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = autovacuum.o bgworker.o bgwriter.o checkpointer.o fork_process.o \
-	pgarch.o pgstat.o postmaster.o startup.o syslogger.o walwriter.o
+OBJS = autovacuum.o backendworker.o bgworker.o bgwriter.o checkpointer.o \
+	fork_process.o pgarch.o pgstat.o postmaster.o startup.o syslogger.o \
+	walwriter.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/backendworker.c b/src/backend/postmaster/backendworker.c
new file mode 100644
index 0000000..b3eb876
--- /dev/null
+++ b/src/backend/postmaster/backendworker.c
@@ -0,0 +1,401 @@
+/*-------------------------------------------------------------------------
+ *
+ * backendworker.c
+ *	  Support routines for setting up backend workers.
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/postmaster/backendworker.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * INTERFACE ROUTINES
+ *		InitializeParallelWorkers				Setup dynamic shared memory and parallel backend workers.
+ */
+#include "postgres.h"
+
+#include "access/xact.h"
+#include "commands/dbcommands.h"
+#include "executor/nodeFunnel.h"
+#include "miscadmin.h"
+#include "nodes/parsenodes.h"
+#include "optimizer/planmain.h"
+#include "optimizer/planner.h"
+#include "postmaster/backendworker.h"
+#include "tcop/tcopprot.h"
+
+
+#define PARALLEL_TUPLE_QUEUE_SIZE					65536
+
+static void ParallelQueryMain(dsm_segment *seg, shm_toc *toc);
+static void
+EstimateParallelSupportInfoSpace(ParallelContext *pcxt, ParamListInfo params,
+								 int instOptions, Size *params_size);
+static void
+StoreParallelSupportInfo(ParallelContext *pcxt, ParamListInfo params,
+						 int instOptions, int params_size,
+						 char **inst_options_space);
+static void
+EstimatePartialSeqScanSpace(ParallelContext *pcxt, EState *estate,
+							char *plannedstmt_str, Size *plannedstmt_len,
+							Size *pscan_size);
+static void
+StorePartialSeqScan(ParallelContext *pcxt, EState *estate, Relation rel,
+					char *plannedstmt_str, Size plannedstmt_size,
+					Size pscan_size);
+static void EstimateResponseQueueSpace(ParallelContext *pcxt);
+static void
+StoreResponseQueue(ParallelContext *pcxt,
+				   shm_mq_handle ***responseqp);
+static void
+GetPlannedStmt(shm_toc *toc, PlannedStmt **plannedstmt);
+static void
+GetParallelSupportInfo(shm_toc *toc, ParamListInfo *params,
+					   int *inst_options, char **instrument);
+static void
+SetupResponseQueue(dsm_segment *seg, shm_toc *toc, shm_mq **mq,
+				   shm_mq_handle **responseq);
+
+
+/*
+ * EstimateParallelSupportInfoSpace
+ *
+ * Estimate the amount of space required to record information of
+ * bind parameters and instrumentation information that need to be
+ * retrieved from parallel workers.
+ */
+void
+EstimateParallelSupportInfoSpace(ParallelContext *pcxt, ParamListInfo params,
+								 int instOptions, Size *params_size)
+{
+	*params_size = EstimateBoundParametersSpace(params);
+	shm_toc_estimate_chunk(&pcxt->estimator, *params_size);
+
+	/* account for instrumentation options. */
+	shm_toc_estimate_chunk(&pcxt->estimator, sizeof(int));
+
+	/*
+	 * We expect each worker to populate the instrumentation structure
+	 * allocated by master backend and then master backend will aggregate
+	 * all the information, so account it for each worker.
+	 */
+	if (instOptions)
+	{
+		shm_toc_estimate_chunk(&pcxt->estimator,
+							   sizeof(Instrumentation) * pcxt->nworkers);
+		/* key for the per-worker instrumentation information. */
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+	}
+
+	/* keys for the bind parameters and the instrumentation options. */
+	shm_toc_estimate_keys(&pcxt->estimator, 2);
+}
+
+/*
+ * StoreParallelSupportInfo
+ * 
+ * Sets up the bind parameters and instrumentation information
+ * required for parallel execution.
+ */
+void
+StoreParallelSupportInfo(ParallelContext *pcxt, ParamListInfo params,
+						 int instOptions, int params_size,
+						 char **inst_options_space)
+{
+	char	*paramsdata;
+	int		*inst_options;
+
+	/*
+	 * Store the list of bind parameters in dynamic shared memory.  This is
+	 * used for the parameters of a prepared query.
+	 */
+	paramsdata = shm_toc_allocate(pcxt->toc, params_size);
+	SerializeBoundParams(params, params_size, paramsdata);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_PARAMS, paramsdata);
+
+	/* Store instrument options in dynamic shared memory. */
+	inst_options = shm_toc_allocate(pcxt->toc, sizeof(int));
+	*inst_options = instOptions;
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_INST_OPTIONS, inst_options);
+
+	/*
+	 * Allocate space for instrumentation information to be filled by
+	 * each worker.
+	 */
+	if (instOptions)
+	{
+		*inst_options_space =
+			shm_toc_allocate(pcxt->toc, sizeof(Instrumentation) * pcxt->nworkers);
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_INST_INFO, *inst_options_space);
+	}
+}
+
+/*
+ * EstimatePartialSeqScanSpace
+ *
+ * Estimate the amount of space required to record information of
+ * planned statement and parallel heap scan descriptor that need
+ * to be copied to parallel workers.
+ */
+void
+EstimatePartialSeqScanSpace(ParallelContext *pcxt, EState *estate,
+							char *plannedstmt_str, Size *plannedstmt_len,
+							Size *pscan_size)
+{
+	/* Estimate space for partial seq. scan specific contents. */
+	*plannedstmt_len = strlen(plannedstmt_str) + 1;
+	shm_toc_estimate_chunk(&pcxt->estimator, *plannedstmt_len);
+
+	*pscan_size = heap_parallelscan_estimate(estate->es_snapshot);
+	shm_toc_estimate_chunk(&pcxt->estimator, *pscan_size);
+
+	/* keys for parallel support information. */
+	shm_toc_estimate_keys(&pcxt->estimator, 2);
+}
+
+/*
+ * StorePartialSeqScan
+ *
+ * Sets up the planned statement and the parallel heap scan
+ * descriptor for a parallel sequence scan.
+ */
+void
+StorePartialSeqScan(ParallelContext *pcxt, EState *estate, Relation rel,
+					char *plannedstmt_str, Size plannedstmt_size,
+					Size pscan_size)
+{
+	char		*plannedstmtdata;
+	ParallelHeapScanDesc pscan;
+
+	/* Store the serialized planned statement in dynamic shared memory. */
+	plannedstmtdata = shm_toc_allocate(pcxt->toc, plannedstmt_size);
+	memcpy(plannedstmtdata, plannedstmt_str, plannedstmt_size);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_PLANNEDSTMT, plannedstmtdata);
+
+	/* Store parallel heap scan descriptor in dynamic shared memory. */
+	pscan = shm_toc_allocate(pcxt->toc, pscan_size);
+	heap_parallelscan_initialize(pscan, rel, estate->es_snapshot);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_SCAN, pscan);
+}
+
+/*
+ * EstimateResponseQueueSpace
+ *
+ * Estimate the amount of space required to record information of
+ * tuple queues that need to be established between parallel workers
+ * and master backend.
+ */
+void
+EstimateResponseQueueSpace(ParallelContext *pcxt)
+{
+	/* Estimate space for parallel seq. scan specific contents. */
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   (Size) PARALLEL_TUPLE_QUEUE_SIZE * pcxt->nworkers);
+
+	/* keys for response queue. */
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/*
+ * StoreResponseQueue
+ *
+ * Sets up the response queues that backend workers use to
+ * return tuples to the main backend.
+ */
+void
+StoreResponseQueue(ParallelContext *pcxt,
+				   shm_mq_handle ***responseqp)
+{
+	shm_mq		*mq;
+	char		*tuple_queue_space;
+	int			i;
+
+	/* Allocate memory for shared memory queue handles. */
+	*responseqp = (shm_mq_handle**) palloc(pcxt->nworkers * sizeof(shm_mq_handle*));
+
+	/*
+	 * Establish one message queue per worker in dynamic shared memory.
+	 * These queues should be used to transmit tuple data.
+	 */
+	tuple_queue_space = shm_toc_allocate(pcxt->toc,
+						(Size) PARALLEL_TUPLE_QUEUE_SIZE * pcxt->nworkers);
+	for (i = 0; i < pcxt->nworkers; ++i)
+	{
+		mq = shm_mq_create(tuple_queue_space + (Size) i * PARALLEL_TUPLE_QUEUE_SIZE,
+						   (Size) PARALLEL_TUPLE_QUEUE_SIZE);
+
+		shm_mq_set_receiver(mq, MyProc);
+
+		/*
+		 * Attach the queue before launching a worker, so that we'll automatically
+		 * detach the queue if we error out.  (Otherwise, the worker might sit
+		 * there trying to write the queue long after we've gone away.)
+		 */
+		(*responseqp)[i] = shm_mq_attach(mq, pcxt->seg, NULL);
+	}
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_TUPLE_QUEUE, tuple_queue_space);
+}
+
+/*
+ * InitializeParallelWorkers
+ *
+ *	Sets up the required infrastructure for backend workers to
+ *	perform execution and return results to the main backend.
+ */
+void
+InitializeParallelWorkers(Plan *plan, EState *estate, Relation rel,
+						  char **inst_options_space,
+						  shm_mq_handle ***responseqp, ParallelContext **pcxtp,
+						  int nWorkers)
+{
+	bool		already_in_parallel_mode = IsInParallelMode();
+	Size		params_size, pscan_size, plannedstmt_size;
+	char	   *plannedstmt_str;
+	PlannedStmt	*plannedstmt;
+	ParallelContext *pcxt;
+
+	if (!already_in_parallel_mode)
+		EnterParallelMode();
+
+	pcxt = CreateParallelContext(ParallelQueryMain, nWorkers);
+
+	plannedstmt = create_parallel_worker_plannedstmt((PartialSeqScan *)plan,
+													 estate->es_range_table);
+	plannedstmt_str = nodeToString(plannedstmt);
+
+	EstimatePartialSeqScanSpace(pcxt, estate, plannedstmt_str,
+								&plannedstmt_size, &pscan_size);
+	EstimateParallelSupportInfoSpace(pcxt, estate->es_param_list_info,
+									 estate->es_instrument, &params_size);
+	EstimateResponseQueueSpace(pcxt);
+
+	InitializeParallelDSM(pcxt);
+	
+	StorePartialSeqScan(pcxt, estate, rel, plannedstmt_str,
+						plannedstmt_size, pscan_size);
+
+	StoreParallelSupportInfo(pcxt, estate->es_param_list_info,
+							 estate->es_instrument,
+							 params_size, inst_options_space);
+	StoreResponseQueue(pcxt, responseqp);
+
+	/* Return results to caller. */
+	*pcxtp = pcxt;
+}
+
+/*
+ * GetParallelSupportInfo
+ *
+ * Look up based on keys in dynamic shared memory segment
+ * and get the bind parameters and instrumentation information
+ * required to perform parallel operation.
+ */
+void
+GetParallelSupportInfo(shm_toc *toc, ParamListInfo *params,
+					   int *inst_options, char **instrument)
+{
+	char		*paramsdata;
+	char		*inst_options_space;
+	int			*instoptions;
+
+	paramsdata = shm_toc_lookup(toc, PARALLEL_KEY_PARAMS);
+	instoptions	= shm_toc_lookup(toc, PARALLEL_KEY_INST_OPTIONS);
+
+	*params = RestoreBoundParams(paramsdata);
+
+	*inst_options = *instoptions;
+	if (*inst_options)
+	{
+		inst_options_space = shm_toc_lookup(toc, PARALLEL_KEY_INST_INFO);
+		*instrument = (inst_options_space +
+			ParallelWorkerNumber * sizeof(Instrumentation));
+	}
+}
+
+/*
+ * GetPlannedStmt
+ *
+ * Look up based on keys in dynamic shared memory segment
+ * and get the planned statement required to perform
+ * parallel operation.
+ */
+void
+GetPlannedStmt(shm_toc *toc, PlannedStmt **plannedstmt)
+{
+	char		*plannedstmtdata;
+
+	plannedstmtdata = shm_toc_lookup(toc, PARALLEL_KEY_PLANNEDSTMT);
+
+	*plannedstmt = (PlannedStmt *) stringToNode(plannedstmtdata);
+
+	/* Fill in opfuncid values if missing */
+	fix_opfuncids((Node*) (*plannedstmt)->planTree->qual);
+	fix_opfuncids((Node*) (*plannedstmt)->planTree->targetlist);
+}
+
+/*
+ * SetupResponseQueue
+ *
+ * Look up based on keys in dynamic shared memory segment
+ * and get the tuple queue information for a particular worker,
+ * attach to the queue and redirect all further responses from
+ * worker backend via that queue.
+ */
+void
+SetupResponseQueue(dsm_segment *seg, shm_toc *toc, shm_mq **mq,
+				   shm_mq_handle **responseq)
+{
+	char		*tuple_queue_space;
+
+	tuple_queue_space = shm_toc_lookup(toc, PARALLEL_KEY_TUPLE_QUEUE);
+	*mq = (shm_mq *) (tuple_queue_space +
+		(Size) ParallelWorkerNumber * PARALLEL_TUPLE_QUEUE_SIZE);
+
+	shm_mq_set_sender(*mq, MyProc);
+	*responseq = shm_mq_attach(*mq, seg, NULL);
+}
+
+/*
+ * ParallelQueryMain
+ *
+ * Execute the operation to return the tuples or other information
+ * to parallelism driving node.
+ */
+void
+ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
+{
+	shm_mq			*mq;
+	shm_mq_handle	*responseq;
+	PlannedStmt		*plannedstmt;
+	ParamListInfo	params;
+	int				inst_options;
+	char			*instrument = NULL;
+	ParallelStmt	*parallelstmt;
+
+	SetupResponseQueue(seg, toc, &mq, &responseq);
+
+	GetPlannedStmt(toc, &plannedstmt);
+	GetParallelSupportInfo(toc, &params, &inst_options, &instrument);
+
+	parallelstmt = palloc(sizeof(ParallelStmt));
+
+	parallelstmt->plannedstmt = plannedstmt;
+	parallelstmt->params	= params;
+	parallelstmt->inst_options = inst_options;
+	parallelstmt->instrument = instrument;
+	parallelstmt->toc = toc;
+	parallelstmt->responseq = responseq;
+
+	/* Execute the worker command. */
+	exec_parallel_stmt(parallelstmt);
+
+	/*
+	 * Once we are done with sending tuples, detach from
+	 * shared memory message queue used to send tuples.
+	 */
+	shm_mq_detach(mq);
+}
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index ac431e5..4c303dd 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -103,6 +103,7 @@
 #include "miscadmin.h"
 #include "pg_getopt.h"
 #include "pgstat.h"
+#include "optimizer/cost.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/bgworker_internals.h"
 #include "postmaster/fork_process.h"
@@ -835,6 +836,12 @@ PostmasterMain(int argc, char *argv[])
 		ereport(ERROR,
 				(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"archive\", \"hot_standby\", or \"logical\"")));
 
+	if (parallel_seqscan_degree >= MaxConnections)
+	{
+		write_stderr("%s: parallel_seqscan_degree must be less than max_connections\n", progname);
+		ExitPostmaster(1);
+	}
+
 	/*
 	 * Other one-time internal sanity checks can go here, if they are fast.
 	 * (Put any slow processing further down, after postmaster.pid creation.)
diff --git a/src/backend/tcop/dest.c b/src/backend/tcop/dest.c
index bcf3895..7a9ce3e 100644
--- a/src/backend/tcop/dest.c
+++ b/src/backend/tcop/dest.c
@@ -34,6 +34,7 @@
 #include "commands/createas.h"
 #include "commands/matview.h"
 #include "executor/functions.h"
+#include "executor/tqueue.h"
 #include "executor/tstoreReceiver.h"
 #include "libpq/libpq.h"
 #include "libpq/pqformat.h"
@@ -129,6 +130,9 @@ CreateDestReceiver(CommandDest dest)
 
 		case DestTransientRel:
 			return CreateTransientRelDestReceiver(InvalidOid);
+
+		case DestTupleQueue:
+			return CreateTupleQueueDestReceiver();
 	}
 
 	/* should never get here */
@@ -162,6 +166,7 @@ EndCommand(const char *commandTag, CommandDest dest)
 		case DestCopyOut:
 		case DestSQLFunction:
 		case DestTransientRel:
+		case DestTupleQueue:
 			break;
 	}
 }
@@ -204,6 +209,7 @@ NullCommand(CommandDest dest)
 		case DestCopyOut:
 		case DestSQLFunction:
 		case DestTransientRel:
+		case DestTupleQueue:
 			break;
 	}
 }
@@ -248,6 +254,7 @@ ReadyForQuery(CommandDest dest)
 		case DestCopyOut:
 		case DestSQLFunction:
 		case DestTransientRel:
+		case DestTupleQueue:
 			break;
 	}
 }
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index ea2a432..17f322f 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -42,6 +42,7 @@
 #include "catalog/pg_type.h"
 #include "commands/async.h"
 #include "commands/prepare.h"
+#include "executor/tqueue.h"
 #include "libpq/libpq.h"
 #include "libpq/pqformat.h"
 #include "libpq/pqsignal.h"
@@ -55,6 +56,7 @@
 #include "pg_getopt.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/postmaster.h"
+#include "postmaster/backendworker.h"
 #include "replication/slot.h"
 #include "replication/walsender.h"
 #include "rewrite/rewriteHandler.h"
@@ -1191,6 +1193,80 @@ exec_simple_query(const char *query_string)
 }
 
 /*
+ * exec_parallel_stmt
+ *
+ * Execute the plan for backend worker.
+ */
+void
+exec_parallel_stmt(ParallelStmt *parallelstmt)
+{
+	DestReceiver *receiver;
+	QueryDesc	*queryDesc;
+	MemoryContext oldcontext;
+	MemoryContext	plancontext;
+
+	set_ps_display("SELECT", false);
+
+	/*
+	 * Unlike exec_simple_query(), in backend worker we won't allow
+	 * transaction control statements, so we can allow plancontext
+	 * to be created in TopTransaction context.
+	 */
+	plancontext = AllocSetContextCreate(CurrentMemoryContext,
+										 "worker plan",
+										 ALLOCSET_DEFAULT_MINSIZE,
+										 ALLOCSET_DEFAULT_INITSIZE,
+										 ALLOCSET_DEFAULT_MAXSIZE);
+
+	oldcontext = MemoryContextSwitchTo(plancontext);
+
+	if (parallelstmt->inst_options)
+		receiver = None_Receiver;
+	else
+	{
+		receiver = CreateDestReceiver(DestTupleQueue);
+		SetTupleQueueDestReceiverParams(receiver, parallelstmt->responseq);
+	}
+
+	/* Create a QueryDesc for the query */
+	queryDesc = CreateQueryDesc(parallelstmt->plannedstmt, "",
+								GetActiveSnapshot(), InvalidSnapshot,
+								receiver, parallelstmt->params,
+								parallelstmt->inst_options);
+
+	queryDesc->toc = parallelstmt->toc;
+
+	PushActiveSnapshot(queryDesc->snapshot);
+
+	/* call ExecutorStart to prepare the plan for execution */
+	ExecutorStart(queryDesc, 0);
+
+	/* run the plan */
+	ExecutorRun(queryDesc, ForwardScanDirection, 0L);
+
+	/* run cleanup too */
+	ExecutorFinish(queryDesc);
+
+	/*
+	 * Copy instrumentation information into shared memory if requested
+	 * by master backend.
+	 */
+	if (parallelstmt->inst_options)
+		memcpy(parallelstmt->instrument,
+			   queryDesc->planstate->instrument,
+			   sizeof(Instrumentation));
+
+	ExecutorEnd(queryDesc);
+
+	PopActiveSnapshot();
+
+	FreeQueryDesc(queryDesc);
+
+	if (!parallelstmt->inst_options)
+		(*receiver->rDestroy) (receiver);
+}
+
+/*
  * exec_parse_message
  *
  * Execute a "Parse" protocol message.
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 9c14e8a..0bbc67b 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -80,6 +80,7 @@ CreateQueryDesc(PlannedStmt *plannedstmt,
 	qd->params = params;		/* parameter values passed into query */
 	qd->instrument_options = instrument_options;		/* instrumentation
 														 * wanted? */
+	qd->toc = NULL;		/* needs to be set by the caller before ExecutorStart */
 
 	/* null these fields until set by ExecutorStart */
 	qd->tupDesc = NULL;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 791543e..abc2b8f 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -608,6 +608,8 @@ const char *const config_group_names[] =
 	gettext_noop("Statistics / Query and Index Statistics Collector"),
 	/* AUTOVACUUM */
 	gettext_noop("Autovacuum"),
+	/* PARALLEL_QUERY */
+	gettext_noop("Parallel Query"),
 	/* CLIENT_CONN */
 	gettext_noop("Client Connection Defaults"),
 	/* CLIENT_CONN_STATEMENT */
@@ -2537,6 +2539,16 @@ static struct config_int ConfigureNamesInt[] =
 	},
 
 	{
+		{"parallel_seqscan_degree", PGC_SUSET, PARALLEL_QUERY,
+			gettext_noop("Sets the maximum number of simultaneously running backend worker processes."),
+			NULL
+		},
+		&parallel_seqscan_degree,
+		0, 0, MAX_BACKENDS,
+		NULL, NULL, NULL
+	},
+
+	{
 		{"autovacuum_work_mem", PGC_SIGHUP, RESOURCES_MEM,
 			gettext_noop("Sets the maximum memory to be used by each autovacuum worker process."),
 			NULL,
@@ -2724,6 +2736,36 @@ static struct config_real ConfigureNamesReal[] =
 		DEFAULT_CPU_OPERATOR_COST, 0, DBL_MAX,
 		NULL, NULL, NULL
 	},
+	{
+		{"cpu_tuple_comm_cost", PGC_USERSET, QUERY_TUNING_COST,
+			gettext_noop("Sets the planner's estimate of the cost of "
+						 "passing each tuple (row) from worker to master backend."),
+			NULL
+		},
+		&cpu_tuple_comm_cost,
+		DEFAULT_CPU_TUPLE_COMM_COST, 0, DBL_MAX,
+		NULL, NULL, NULL
+	},
+	{
+		{"parallel_setup_cost", PGC_USERSET, QUERY_TUNING_COST,
+			gettext_noop("Sets the planner's estimate of the cost of "
+						 "setting up environment (shared memory) for parallelism."),
+			NULL
+		},
+		&parallel_setup_cost,
+		DEFAULT_PARALLEL_SETUP_COST, 0, DBL_MAX,
+		NULL, NULL, NULL
+	},
+	{
+		{"parallel_startup_cost", PGC_USERSET, QUERY_TUNING_COST,
+			gettext_noop("Sets the planner's estimate of the cost of "
+						 "starting parallel workers."),
+			NULL
+		},
+		&parallel_startup_cost,
+		DEFAULT_PARALLEL_STARTUP_COST, 0, DBL_MAX,
+		NULL, NULL, NULL
+	},
 
 	{
 		{"cursor_tuple_fraction", PGC_USERSET, QUERY_TUNING_OTHER,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index f8f9ce1..fbe6042 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -290,6 +290,9 @@
 #cpu_tuple_cost = 0.01			# same scale as above
 #cpu_index_tuple_cost = 0.005		# same scale as above
 #cpu_operator_cost = 0.0025		# same scale as above
+#cpu_tuple_comm_cost = 0.1		# same scale as above
+#parallel_setup_cost = 0.0		# same scale as above
+#parallel_startup_cost = 0.0		# same scale as above
 #effective_cache_size = 4GB
 
 # - Genetic Query Optimizer -
@@ -500,6 +503,11 @@
 					# autovacuum, -1 means use
 					# vacuum_cost_limit
 
+#------------------------------------------------------------------------------
+# PARALLEL_QUERY PARAMETERS
+#------------------------------------------------------------------------------
+
+#parallel_seqscan_degree = 0		# max number of worker backend subprocesses
 
 #------------------------------------------------------------------------------
 # CLIENT CONNECTION DEFAULTS
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index fb2b5f0..d4f4e2d 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -117,6 +117,7 @@ extern HeapScanDesc heap_beginscan_bm(Relation relation, Snapshot snapshot,
 extern void heap_setscanlimits(HeapScanDesc scan, BlockNumber startBlk,
 		   BlockNumber endBlk);
 extern void heap_rescan(HeapScanDesc scan, ScanKey key);
+extern void heap_parallel_rescan(ParallelHeapScanDesc pscan, HeapScanDesc scan);
 extern void heap_endscan(HeapScanDesc scan);
 extern HeapTuple heap_getnext(HeapScanDesc scan, ScanDirection direction);
 
diff --git a/src/include/executor/execdesc.h b/src/include/executor/execdesc.h
index a2381cd..56b7c75 100644
--- a/src/include/executor/execdesc.h
+++ b/src/include/executor/execdesc.h
@@ -42,6 +42,7 @@ typedef struct QueryDesc
 	DestReceiver *dest;			/* the destination for tuple output */
 	ParamListInfo params;		/* param values being passed in */
 	int			instrument_options;		/* OR of InstrumentOption flags */
+	shm_toc		*toc;			/* to fetch the information from dsm */
 
 	/* These fields are set by ExecutorStart */
 	TupleDesc	tupDesc;		/* descriptor for result tuples */
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 1c3b2b0..e8522fe 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -69,5 +69,6 @@ extern Instrumentation *InstrAlloc(int n, int instrument_options);
 extern void InstrStartNode(Instrumentation *instr);
 extern void InstrStopNode(Instrumentation *instr, double nTuples);
 extern void InstrEndLoop(Instrumentation *instr);
+extern void InstrAggNode(Instrumentation *instr1, Instrumentation *instr2);
 
 #endif   /* INSTRUMENT_H */
diff --git a/src/include/executor/nodeFunnel.h b/src/include/executor/nodeFunnel.h
new file mode 100644
index 0000000..3af3a0e
--- /dev/null
+++ b/src/include/executor/nodeFunnel.h
@@ -0,0 +1,24 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeFunnel.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeFunnel.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEFUNNEL_H
+#define NODEFUNNEL_H
+
+#include "nodes/execnodes.h"
+
+extern FunnelState *ExecInitFunnel(Funnel *node, EState *estate, int eflags);
+extern TupleTableSlot *ExecFunnel(FunnelState *node);
+extern void ExecEndFunnel(FunnelState *node);
+extern void ExecReScanFunnel(FunnelState *node);
+
+#endif   /* NODEFUNNEL_H */
diff --git a/src/include/executor/nodePartialSeqscan.h b/src/include/executor/nodePartialSeqscan.h
new file mode 100644
index 0000000..cb05be7
--- /dev/null
+++ b/src/include/executor/nodePartialSeqscan.h
@@ -0,0 +1,24 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodePartialSeqscan.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodePartialSeqscan.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEPARTIALSEQSCAN_H
+#define NODEPARTIALSEQSCAN_H
+
+#include "nodes/execnodes.h"
+
+extern PartialSeqScanState *ExecInitPartialSeqScan(PartialSeqScan *node, EState *estate, int eflags);
+extern TupleTableSlot *ExecPartialSeqScan(PartialSeqScanState *node);
+extern void ExecEndPartialSeqScan(PartialSeqScanState *node);
+extern void ExecReScanPartialSeqScan(PartialSeqScanState *node);
+
+#endif   /* NODEPARTIALSEQSCAN_H */
diff --git a/src/include/executor/tqueue.h b/src/include/executor/tqueue.h
new file mode 100644
index 0000000..c979233
--- /dev/null
+++ b/src/include/executor/tqueue.h
@@ -0,0 +1,34 @@
+/*-------------------------------------------------------------------------
+ *
+ * tqueue.h
+ *	  Use shm_mq to send & receive tuples between parallel backends
+ *
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/tqueue.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef TQUEUE_H
+#define TQUEUE_H
+
+#include "storage/shm_mq.h"
+#include "tcop/dest.h"
+
+/* Use this to send tuples to a shm_mq. */
+extern DestReceiver *CreateTupleQueueDestReceiver(void);
+extern void SetTupleQueueDestReceiverParams(DestReceiver *self,
+						shm_mq_handle *handle);
+
+/* Use these to receive tuples from a shm_mq. */
+typedef struct TupleQueueFunnel TupleQueueFunnel;
+extern TupleQueueFunnel *CreateTupleQueueFunnel(void);
+extern void DestroyTupleQueueFunnel(TupleQueueFunnel *funnel);
+extern void RegisterTupleQueueOnFunnel(TupleQueueFunnel *, shm_mq_handle *);
+extern HeapTuple TupleQueueFunnelNext(TupleQueueFunnel *, bool nowait,
+					 bool *done);
+
+#endif   /* TQUEUE_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 59b17f3..93eab5d 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -16,7 +16,9 @@
 
 #include "access/genam.h"
 #include "access/heapam.h"
+#include "access/parallel.h"
 #include "executor/instrument.h"
+#include "executor/tqueue.h"
 #include "nodes/params.h"
 #include "nodes/plannodes.h"
 #include "utils/reltrigger.h"
@@ -389,6 +391,12 @@ typedef struct EState
 	List	   *es_auxmodifytables;		/* List of secondary ModifyTableStates */
 
 	/*
+	 * This is required for parallel plan execution to fetch the
+	 * information from dsm.
+	 */
+	shm_toc		*toc;
+
+	/*
 	 * this ExprContext is for per-output-tuple operations, such as constraint
 	 * checks and index-value computations.  It will be reset for each output
 	 * tuple.  Note that it will be created only if needed.
@@ -1213,6 +1221,37 @@ typedef struct ScanState
 typedef ScanState SeqScanState;
 
 /*
+ * PartialSeqScan uses a bare SeqScanState as its state node, since
+ * it needs no additional fields.
+ */
+typedef SeqScanState PartialSeqScanState;
+
+/*
+ * FunnelState extends ScanState by storing additional information
+ * related to parallel workers.
+ *		pcxt				parallel context for managing generic state information
+ *							required for parallelism.
+ *		responseq			shared memory queues to receive data from workers.
+ *		funnel				maintains the runtime information about queues used to
+ *							receive data from parallel workers.
+ *		inst_options_space	to retrieve instrumentation information.
+ *		fs_workersReady		indicates that workers are launched.
+ *		all_workers_done	indicates that all the data from workers has been received.
+ *		local_scan_done		indicates that local scan is completed.
+ */
+typedef struct FunnelState
+{
+	ScanState		ss;				/* its first field is NodeTag */
+	ParallelContext *pcxt;
+	shm_mq_handle	**responseq;
+	TupleQueueFunnel *funnel;
+	char			*inst_options_space;
+	bool			fs_workersReady;
+	bool			all_workers_done;
+	bool			local_scan_done;
+} FunnelState;
+
+/*
  * These structs store information about index quals that don't have simple
  * constant right-hand sides.  See comments for ExecIndexBuildScanKeys()
  * for discussion.
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 38469ef..3f3d572 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -51,6 +51,8 @@ typedef enum NodeTag
 	T_BitmapOr,
 	T_Scan,
 	T_SeqScan,
+	T_PartialSeqScan,
+	T_Funnel,
 	T_IndexScan,
 	T_IndexOnlyScan,
 	T_BitmapIndexScan,
@@ -97,6 +99,8 @@ typedef enum NodeTag
 	T_BitmapOrState,
 	T_ScanState,
 	T_SeqScanState,
+	T_PartialSeqScanState,
+	T_FunnelState,
 	T_IndexScanState,
 	T_IndexOnlyScanState,
 	T_BitmapIndexScanState,
@@ -217,6 +221,7 @@ typedef enum NodeTag
 	T_IndexOptInfo,
 	T_ParamPathInfo,
 	T_Path,
+	T_FunnelPath,
 	T_IndexPath,
 	T_BitmapHeapPath,
 	T_BitmapAndPath,
diff --git a/src/include/nodes/params.h b/src/include/nodes/params.h
index a0f7dd0..65b60a0 100644
--- a/src/include/nodes/params.h
+++ b/src/include/nodes/params.h
@@ -103,4 +103,9 @@ typedef struct ParamExecData
 /* Functions found in src/backend/nodes/params.c */
 extern ParamListInfo copyParamList(ParamListInfo from);
 
+extern Size
+EstimateBoundParametersSpace(ParamListInfo params);
+extern void
+SerializeBoundParams(ParamListInfo params, Size maxsize, char *start_address);
+extern ParamListInfo RestoreBoundParams(char *start_address);
 #endif   /* PARAMS_H */
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 497559d..3f113b1 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -20,9 +20,16 @@
 #ifndef PARSENODES_H
 #define PARSENODES_H
 
+#include "executor/instrument.h"
 #include "nodes/bitmapset.h"
+#include "nodes/params.h"
+#include "nodes/plannodes.h"
 #include "nodes/primnodes.h"
 #include "nodes/value.h"
+#include "nodes/params.h"
+#include "storage/block.h"
+#include "storage/shm_toc.h"
+#include "storage/shm_mq.h"
 #include "utils/lockwaitpolicy.h"
 
 /* Possible sources of a Query */
@@ -156,6 +163,16 @@ typedef struct Query
 								 * depends on to be semantically valid */
 } Query;
 
+/* worker statement required for parallel execution. */
+typedef struct ParallelStmt
+{
+	PlannedStmt		*plannedstmt;
+	ParamListInfo	params;
+	shm_toc			*toc;
+	shm_mq_handle	*responseq;
+	int				inst_options;
+	char			*instrument;
+} ParallelStmt;
 
 /****************************************************************************
  *	Supporting data structures for Parse Trees
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index f6683f0..8099f78 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -18,6 +18,8 @@
 #include "lib/stringinfo.h"
 #include "nodes/bitmapset.h"
 #include "nodes/primnodes.h"
+#include "storage/block.h"
+#include "storage/shm_toc.h"
 #include "utils/lockwaitpolicy.h"
 
 
@@ -279,6 +281,22 @@ typedef struct Scan
 typedef Scan SeqScan;
 
 /* ----------------
+ *		partial sequential scan node
+ * ----------------
+ */
+typedef SeqScan PartialSeqScan;
+
+/* ----------------
+ *		parallel sequential scan node
+ * ----------------
+ */
+typedef struct Funnel
+{
+	Scan		scan;
+	int			num_workers;
+} Funnel;
+
+/* ----------------
  *		index scan node
  *
  * indexqualorig is an implicitly-ANDed list of index qual expressions, each
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 6845a40..21357be 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -737,6 +737,13 @@ typedef struct Path
 	/* pathkeys is a List of PathKey nodes; see above */
 } Path;
 
+typedef struct FunnelPath
+{
+	Path		path;
+	Path	    *subpath;	/* path for each worker */
+	int			num_workers;
+} FunnelPath;
+
 /* Macro for extracting a path's parameterization relids; beware double eval */
 #define PATH_REQ_OUTER(path)  \
 	((path)->param_info ? (path)->param_info->ppi_req_outer : (Relids) NULL)
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 9c2000b..11f0409 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -26,6 +26,14 @@
 #define DEFAULT_CPU_TUPLE_COST	0.01
 #define DEFAULT_CPU_INDEX_TUPLE_COST 0.005
 #define DEFAULT_CPU_OPERATOR_COST  0.0025
+#define DEFAULT_CPU_TUPLE_COMM_COST 0.1
+/*
+ * XXX - We need some experiments to know what could be
+ * appropriate default values for parallel setup and startup
+ * cost.
+ */
+#define	DEFAULT_PARALLEL_SETUP_COST  0.0
+#define	DEFAULT_PARALLEL_STARTUP_COST  0.0
 
 #define DEFAULT_EFFECTIVE_CACHE_SIZE  524288	/* measured in pages */
 
@@ -48,8 +56,12 @@ extern PGDLLIMPORT double random_page_cost;
 extern PGDLLIMPORT double cpu_tuple_cost;
 extern PGDLLIMPORT double cpu_index_tuple_cost;
 extern PGDLLIMPORT double cpu_operator_cost;
+extern PGDLLIMPORT double cpu_tuple_comm_cost;
+extern PGDLLIMPORT double parallel_setup_cost;
+extern PGDLLIMPORT double parallel_startup_cost;
 extern PGDLLIMPORT int effective_cache_size;
 extern Cost disable_cost;
+extern int	parallel_seqscan_degree;
 extern bool enable_seqscan;
 extern bool enable_indexscan;
 extern bool enable_indexonlyscan;
@@ -68,6 +80,8 @@ extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,
 					double index_pages, PlannerInfo *root);
 extern void cost_seqscan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 			 ParamPathInfo *param_info);
+extern void cost_funnel(FunnelPath *path, PlannerInfo *root,
+				RelOptInfo *baserel, ParamPathInfo *param_info, int nWorkers);
 extern void cost_index(IndexPath *path, PlannerInfo *root,
 		   double loop_count);
 extern void cost_bitmap_heap_scan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 9923f0e..7873565 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -32,6 +32,11 @@ extern bool add_path_precheck(RelOptInfo *parent_rel,
 
 extern Path *create_seqscan_path(PlannerInfo *root, RelOptInfo *rel,
 					Relids required_outer);
+extern Path *
+create_partialseqscan_path(PlannerInfo *root, RelOptInfo *rel,
+					Relids required_outer);
+extern FunnelPath *create_funnel_path(PlannerInfo *root,
+						RelOptInfo *rel, Path *subpath, int nWorkers);
 extern IndexPath *create_index_path(PlannerInfo *root,
 				  IndexOptInfo *index,
 				  List *indexclauses,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 6cad92e..391d519 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -46,6 +46,13 @@ extern void debug_print_rel(PlannerInfo *root, RelOptInfo *rel);
 #endif
 
 /*
+ * parallelpath.c
+ *	  routines to generate parallel scan paths
+ */
+
+extern void create_parallelscan_paths(PlannerInfo *root, RelOptInfo *rel);
+
+/*
  * indxpath.c
  *	  routines to generate index paths
  */
diff --git a/src/include/optimizer/planner.h b/src/include/optimizer/planner.h
index cd62aec..7bc7d7e 100644
--- a/src/include/optimizer/planner.h
+++ b/src/include/optimizer/planner.h
@@ -14,6 +14,7 @@
 #ifndef PLANNER_H
 #define PLANNER_H
 
+#include "nodes/parsenodes.h"
 #include "nodes/plannodes.h"
 #include "nodes/relation.h"
 
@@ -29,6 +30,8 @@ extern PlannedStmt *planner(Query *parse, int cursorOptions,
 		ParamListInfo boundParams);
 extern PlannedStmt *standard_planner(Query *parse, int cursorOptions,
 				 ParamListInfo boundParams);
+extern PlannedStmt	*create_parallel_worker_plannedstmt(PartialSeqScan *partialscan,
+											List *rangetable);
 
 extern Plan *subquery_planner(PlannerGlobal *glob, Query *parse,
 				 PlannerInfo *parent_root,
diff --git a/src/include/postmaster/backendworker.h b/src/include/postmaster/backendworker.h
new file mode 100644
index 0000000..fe428eb
--- /dev/null
+++ b/src/include/postmaster/backendworker.h
@@ -0,0 +1,38 @@
+/*--------------------------------------------------------------------
+ * backendworker.h
+ *		POSTGRES backend workers interface
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/postmaster/backendworker.h
+ *--------------------------------------------------------------------
+ */
+#ifndef BACKENDWORKER_H
+#define BACKENDWORKER_H
+
+/*---------------------------------------------------------------------
+ * External module API.
+ *---------------------------------------------------------------------
+ */
+
+#include "libpq/pqmq.h"
+
+/* Table-of-contents constants for our dynamic shared memory segment. */
+#define	PARALLEL_KEY_PLANNEDSTMT	0
+#define	PARALLEL_KEY_PARAMS			1
+#define PARALLEL_KEY_INST_OPTIONS	2
+#define PARALLEL_KEY_INST_INFO		3
+#define PARALLEL_KEY_TUPLE_QUEUE	4
+#define PARALLEL_KEY_SCAN			5
+
+extern int	parallel_seqscan_degree;
+
+extern void InitializeParallelWorkers(Plan *plan, EState *estate,
+									  Relation rel, char **inst_options_space,
+									  shm_mq_handle ***responseqp,
+									  ParallelContext **pcxtp,
+									  int nWorkers);
+
+#endif   /* BACKENDWORKER_H */
diff --git a/src/include/tcop/dest.h b/src/include/tcop/dest.h
index 5bcca3f..b560672 100644
--- a/src/include/tcop/dest.h
+++ b/src/include/tcop/dest.h
@@ -94,7 +94,8 @@ typedef enum
 	DestIntoRel,				/* results sent to relation (SELECT INTO) */
 	DestCopyOut,				/* results sent to COPY TO code */
 	DestSQLFunction,			/* results sent to SQL-language func mgr */
-	DestTransientRel			/* results sent to transient relation */
+	DestTransientRel,			/* results sent to transient relation */
+	DestTupleQueue				/* results sent to tuple queue */
 } CommandDest;
 
 /* ----------------
diff --git a/src/include/tcop/tcopprot.h b/src/include/tcop/tcopprot.h
index 3e17770..489af46 100644
--- a/src/include/tcop/tcopprot.h
+++ b/src/include/tcop/tcopprot.h
@@ -84,5 +84,6 @@ extern void set_debug_options(int debug_flag,
 extern bool set_plan_disabling_options(const char *arg,
 						   GucContext context, GucSource source);
 extern const char *get_stats_option_name(const char *arg);
+extern void exec_parallel_stmt(ParallelStmt *parallelscan);
 
 #endif   /* TCOPPROT_H */
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index cf319af..38855e5 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -85,6 +85,7 @@ enum config_group
 	STATS_MONITORING,
 	STATS_COLLECTOR,
 	AUTOVACUUM,
+	PARALLEL_QUERY,
 	CLIENT_CONN,
 	CLIENT_CONN_STATEMENT,
 	CLIENT_CONN_LOCALE,
#189Thom Brown
thom@linux.com
In reply to: Amit Kapila (#188)
Re: Parallel Seq Scan

On 12 March 2015 at 14:46, Amit Kapila <amit.kapila16@gmail.com> wrote:

One additional change (we need to SetLatch() in HandleParallelMessageInterrupt)
is done to handle the hang issue reported on parallel-mode thread.
Without this change it is difficult to verify the patch (will remove this
change once new version of parallel-mode patch containing this change will be
posted).

Applied parallel-mode-v7.patch and parallel_seqscan_v10.patch, but
getting this error when building:

gcc -Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Wendif-labels
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing
-fwrapv -fexcess-precision=standard -O2 -I../../../../src/include
-D_GNU_SOURCE -c -o brin.o brin.c -MMD -MP -MF .deps/brin.Po
In file included from ../../../../src/include/nodes/execnodes.h:18:0,
from ../../../../src/include/access/brin.h:14,
from brin.c:18:
../../../../src/include/access/heapam.h:119:34: error: unknown type
name ‘ParallelHeapScanDesc’
extern void heap_parallel_rescan(ParallelHeapScanDesc pscan,
HeapScanDesc scan);
^

Am I missing another patch here?

--
Thom

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#190Amit Kapila
amit.kapila16@gmail.com
In reply to: Thom Brown (#189)
Re: Parallel Seq Scan

On Thu, Mar 12, 2015 at 8:33 PM, Thom Brown <thom@linux.com> wrote:

On 12 March 2015 at 14:46, Amit Kapila <amit.kapila16@gmail.com> wrote:

One additional change (we need to SetLatch() in HandleParallelMessageInterrupt)
is done to handle the hang issue reported on parallel-mode thread.
Without this change it is difficult to verify the patch (will remove this
change once new version of parallel-mode patch containing this change will be
posted).

Applied parallel-mode-v7.patch and parallel_seqscan_v10.patch, but
getting this error when building:

gcc -Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Wendif-labels
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing
-fwrapv -fexcess-precision=standard -O2 -I../../../../src/include
-D_GNU_SOURCE -c -o brin.o brin.c -MMD -MP -MF .deps/brin.Po
In file included from ../../../../src/include/nodes/execnodes.h:18:0,
from ../../../../src/include/access/brin.h:14,
from brin.c:18:
../../../../src/include/access/heapam.h:119:34: error: unknown type
name ‘ParallelHeapScanDesc’
extern void heap_parallel_rescan(ParallelHeapScanDesc pscan,
HeapScanDesc scan);
^

Am I missing another patch here?

Yes, the below parallel-heap-scan patch.
/messages/by-id/CA+TgmoYJETgeAXUsZROnA7BdtWzPtqExPJNTV1GKcaVMgSdhug@mail.gmail.com

Please note that parallel_setup_cost and parallel_startup_cost are
still set to zero by default, so you need to set them to higher values
if you don't want parallel plans once parallel_seqscan_degree
is set. I have yet to come up with default values for them; that needs
some tests.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#191Thom Brown
thom@linux.com
In reply to: Amit Kapila (#190)
Re: Parallel Seq Scan

On 12 March 2015 at 15:29, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Mar 12, 2015 at 8:33 PM, Thom Brown <thom@linux.com> wrote:

On 12 March 2015 at 14:46, Amit Kapila <amit.kapila16@gmail.com> wrote:

One additional change (we need to SetLatch() in HandleParallelMessageInterrupt)
is done to handle the hang issue reported on parallel-mode thread.
Without this change it is difficult to verify the patch (will remove this
change once new version of parallel-mode patch containing this change will be
posted).

Applied parallel-mode-v7.patch and parallel_seqscan_v10.patch, but
getting this error when building:

gcc -Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Wendif-labels
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing
-fwrapv -fexcess-precision=standard -O2 -I../../../../src/include
-D_GNU_SOURCE -c -o brin.o brin.c -MMD -MP -MF .deps/brin.Po
In file included from ../../../../src/include/nodes/execnodes.h:18:0,
from ../../../../src/include/access/brin.h:14,
from brin.c:18:
../../../../src/include/access/heapam.h:119:34: error: unknown type
name ‘ParallelHeapScanDesc’
extern void heap_parallel_rescan(ParallelHeapScanDesc pscan,
HeapScanDesc scan);
^

Am I missing another patch here?

Yes, the below parallel-heap-scan patch.
/messages/by-id/CA+TgmoYJETgeAXUsZROnA7BdtWzPtqExPJNTV1GKcaVMgSdhug@mail.gmail.com

Please note that parallel_setup_cost and parallel_startup_cost are
still set to zero by default, so you need to set them to higher values
if you don't want parallel plans once parallel_seqscan_degree
is set. I have yet to come up with default values for them; that needs
some tests.

Thanks. Getting a problem:

createdb pgbench
pgbench -i -s 200 pgbench

CREATE TABLE pgbench_accounts_1 (CHECK (bid = 1)) INHERITS (pgbench_accounts);
...
CREATE TABLE pgbench_accounts_200 (CHECK (bid = 200)) INHERITS
(pgbench_accounts);

WITH del AS (DELETE FROM pgbench_accounts WHERE bid = 1 RETURNING *)
INSERT INTO pgbench_accounts_1 SELECT * FROM del;
...
WITH del AS (DELETE FROM pgbench_accounts WHERE bid = 200 RETURNING *)
INSERT INTO pgbench_accounts_200 SELECT * FROM del;

VACUUM ANALYSE;

# SELECT name, setting FROM pg_settings WHERE name IN
('parallel_seqscan_degree','max_worker_processes','seq_page_cost');
          name           | setting
-------------------------+---------
 max_worker_processes    | 20
 parallel_seqscan_degree | 8
 seq_page_cost           | 1000
(3 rows)

# EXPLAIN SELECT DISTINCT bid FROM pgbench_accounts;
ERROR: too many dynamic shared memory segments

And separately, I've seen this in the logs:

2015-03-12 16:09:30 GMT [7880]: [4-1] user=,db=,client= LOG:
registering background worker "parallel worker for PID 7889"
2015-03-12 16:09:30 GMT [7880]: [5-1] user=,db=,client= LOG:
registering background worker "parallel worker for PID 7889"
2015-03-12 16:09:30 GMT [7880]: [6-1] user=,db=,client= LOG:
registering background worker "parallel worker for PID 7889"
2015-03-12 16:09:30 GMT [7880]: [7-1] user=,db=,client= LOG:
registering background worker "parallel worker for PID 7889"
2015-03-12 16:09:30 GMT [7880]: [8-1] user=,db=,client= LOG:
registering background worker "parallel worker for PID 7889"
2015-03-12 16:09:30 GMT [7880]: [9-1] user=,db=,client= LOG:
registering background worker "parallel worker for PID 7889"
2015-03-12 16:09:30 GMT [7880]: [10-1] user=,db=,client= LOG:
registering background worker "parallel worker for PID 7889"
2015-03-12 16:09:30 GMT [7880]: [11-1] user=,db=,client= LOG:
registering background worker "parallel worker for PID 7889"
2015-03-12 16:09:30 GMT [7880]: [12-1] user=,db=,client= LOG:
starting background worker process "parallel worker for PID 7889"
2015-03-12 16:09:30 GMT [7880]: [13-1] user=,db=,client= LOG:
starting background worker process "parallel worker for PID 7889"
2015-03-12 16:09:30 GMT [7880]: [14-1] user=,db=,client= LOG:
starting background worker process "parallel worker for PID 7889"
2015-03-12 16:09:30 GMT [7880]: [15-1] user=,db=,client= LOG:
starting background worker process "parallel worker for PID 7889"
2015-03-12 16:09:30 GMT [7880]: [16-1] user=,db=,client= LOG:
starting background worker process "parallel worker for PID 7889"
2015-03-12 16:09:30 GMT [7880]: [17-1] user=,db=,client= LOG:
starting background worker process "parallel worker for PID 7889"
2015-03-12 16:09:30 GMT [7880]: [18-1] user=,db=,client= LOG:
starting background worker process "parallel worker for PID 7889"
2015-03-12 16:09:30 GMT [7880]: [19-1] user=,db=,client= LOG:
starting background worker process "parallel worker for PID 7889"
2015-03-12 16:09:30 GMT [7880]: [20-1] user=,db=,client= LOG: worker
process: parallel worker for PID 7889 (PID 7913) exited with exit code
0
2015-03-12 16:09:30 GMT [7880]: [21-1] user=,db=,client= LOG:
unregistering background worker "parallel worker for PID 7889"
2015-03-12 16:09:30 GMT [7880]: [22-1] user=,db=,client= LOG: worker
process: parallel worker for PID 7889 (PID 7919) exited with exit code
0
2015-03-12 16:09:30 GMT [7880]: [23-1] user=,db=,client= LOG:
unregistering background worker "parallel worker for PID 7889"
2015-03-12 16:09:30 GMT [7880]: [24-1] user=,db=,client= LOG: worker
process: parallel worker for PID 7889 (PID 7916) exited with exit code
0
2015-03-12 16:09:30 GMT [7880]: [25-1] user=,db=,client= LOG:
unregistering background worker "parallel worker for PID 7889"
2015-03-12 16:09:30 GMT [7880]: [26-1] user=,db=,client= LOG: worker
process: parallel worker for PID 7889 (PID 7918) exited with exit code
0
2015-03-12 16:09:30 GMT [7880]: [27-1] user=,db=,client= LOG:
unregistering background worker "parallel worker for PID 7889"
2015-03-12 16:09:30 GMT [7880]: [28-1] user=,db=,client= LOG: worker
process: parallel worker for PID 7889 (PID 7917) exited with exit code
0
2015-03-12 16:09:30 GMT [7880]: [29-1] user=,db=,client= LOG:
unregistering background worker "parallel worker for PID 7889"
2015-03-12 16:09:30 GMT [7880]: [30-1] user=,db=,client= LOG: worker
process: parallel worker for PID 7889 (PID 7914) exited with exit code
0
2015-03-12 16:09:30 GMT [7880]: [31-1] user=,db=,client= LOG:
unregistering background worker "parallel worker for PID 7889"
2015-03-12 16:09:30 GMT [7880]: [32-1] user=,db=,client= LOG: worker
process: parallel worker for PID 7889 (PID 7915) exited with exit code
0
2015-03-12 16:09:30 GMT [7880]: [33-1] user=,db=,client= LOG:
unregistering background worker "parallel worker for PID 7889"
2015-03-12 16:09:30 GMT [7880]: [34-1] user=,db=,client= LOG: worker
process: parallel worker for PID 7889 (PID 7912) exited with exit code
0
2015-03-12 16:09:30 GMT [7880]: [35-1] user=,db=,client= LOG:
unregistering background worker "parallel worker for PID 7889"
2015-03-12 16:09:30 GMT [7880]: [36-1] user=,db=,client= LOG: server
process (PID 7889) was terminated by signal 11: Segmentation fault
2015-03-12 16:09:30 GMT [7880]: [37-1] user=,db=,client= DETAIL:
Failed process was running: SELECT pg_catalog.quote_ident(c.relname)
FROM pg_catalog.pg_class c WHERE c.relkind IN ('r', 'S', 'v', 'm',
'f') AND substring(pg_catalog.quote_ident(c.relname),1,10)='pgbench_br'
AND pg_catalog.pg_table_is_visible(c.oid) AND c.relnamespace <>
(SELECT oid FROM pg_catalog.pg_namespace WHERE nspname = 'pg_catalog')
UNION
SELECT pg_catalog.quote_ident(n.nspname) || '.' FROM
pg_catalog.pg_namespace n WHERE
substring(pg_catalog.quote_ident(n.nspname) || '.',1,10)='pgbench_br'
AND (SELECT pg_catalog.count(*) FROM pg_catalog.pg_namespace WHERE
substring(pg_catalog.quote_ident(nspname) || '.',1,10) =
substring('pgbench_br',1,pg_catalog.length(pg_catalog.quote_ident(nspname))+1))
> 1
UNION
SELECT pg_catalog.quote_ident(n.nspname) || '.' ||
pg_catalog.quote_ident(c.relname) FROM pg_catalog.pg_class c,
pg_catalog.pg_namespace n WHERE c.relnamespace = n.oid AND c.relkind
IN ('r', 'S', 'v', 'm', 'f') AND
substring(pg_catalog.quote_ident(n.nspname) || '.' ||
pg_catalog.quote_ident(c.relname),1,10)='pgbench_br' AND substri
2015-03-12 16:09:30 GMT [7880]: [38-1] user=,db=,client= LOG:
terminating any other active server processes
2015-03-12 16:09:30 GMT [7886]: [2-1] user=,db=,client= WARNING:
terminating connection because of crash of another server process
2015-03-12 16:09:30 GMT [7886]: [3-1] user=,db=,client= DETAIL: The
postmaster has commanded this server process to roll back the current
transaction and exit, because another server process exited abnormally
and possibly corrupted shared memory.
2015-03-12 16:09:30 GMT [7886]: [4-1] user=,db=,client= HINT: In a
moment you should be able to reconnect to the database and repeat your
command.
2015-03-12 16:09:30 GMT [7880]: [39-1] user=,db=,client= LOG: all
server processes terminated; reinitializing
2015-03-12 16:09:30 GMT [7920]: [1-1] user=,db=,client= LOG: database
system was interrupted; last known up at 2015-03-12 16:07:26 GMT
2015-03-12 16:09:30 GMT [7920]: [2-1] user=,db=,client= LOG: database
system was not properly shut down; automatic recovery in progress
2015-03-12 16:09:30 GMT [7920]: [3-1] user=,db=,client= LOG: invalid
record length at 2/7E269A0
2015-03-12 16:09:30 GMT [7920]: [4-1] user=,db=,client= LOG: redo is
not required
2015-03-12 16:09:30 GMT [7880]: [40-1] user=,db=,client= LOG:
database system is ready to accept connections
2015-03-12 16:09:30 GMT [7924]: [1-1] user=,db=,client= LOG:
autovacuum launcher started

I can recreate this by typing:

EXPLAIN SELECT DISTINCT bid FROM pgbench_<tab>

This happens with seq_page_cost = 1000, but not when it's set to 1.

--
Thom

#192Thom Brown
thom@linux.com
In reply to: Thom Brown (#191)
Re: Parallel Seq Scan

On 12 March 2015 at 16:20, Thom Brown <thom@linux.com> wrote:

On 12 March 2015 at 15:29, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Mar 12, 2015 at 8:33 PM, Thom Brown <thom@linux.com> wrote:

On 12 March 2015 at 14:46, Amit Kapila <amit.kapila16@gmail.com> wrote:

One additional change (we need to SetLatch() in HandleParallelMessageInterrupt)
is done to handle the hang issue reported on parallel-mode thread.
Without this change it is difficult to verify the patch (will remove this
change once new version of parallel-mode patch containing this change will be
posted).

Applied parallel-mode-v7.patch and parallel_seqscan_v10.patch, but
getting this error when building:

gcc -Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Wendif-labels
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing
-fwrapv -fexcess-precision=standard -O2 -I../../../../src/include
-D_GNU_SOURCE -c -o brin.o brin.c -MMD -MP -MF .deps/brin.Po
In file included from ../../../../src/include/nodes/execnodes.h:18:0,
from ../../../../src/include/access/brin.h:14,
from brin.c:18:
../../../../src/include/access/heapam.h:119:34: error: unknown type
name ‘ParallelHeapScanDesc’
extern void heap_parallel_rescan(ParallelHeapScanDesc pscan,
HeapScanDesc scan);
^

Am I missing another patch here?

Yes, the below parallel-heap-scan patch.
/messages/by-id/CA+TgmoYJETgeAXUsZROnA7BdtWzPtqExPJNTV1GKcaVMgSdhug@mail.gmail.com

Please note that parallel_setup_cost and parallel_startup_cost are
still set to zero by default, so you need to set them to higher values
if you don't want parallel plans once parallel_seqscan_degree
is set. I have yet to come up with default values for them; that needs
some tests.

Thanks. Getting a problem:

createdb pgbench
pgbench -i -s 200 pgbench

CREATE TABLE pgbench_accounts_1 (CHECK (bid = 1)) INHERITS (pgbench_accounts);
...
CREATE TABLE pgbench_accounts_200 (CHECK (bid = 200)) INHERITS
(pgbench_accounts);

WITH del AS (DELETE FROM pgbench_accounts WHERE bid = 1 RETURNING *)
INSERT INTO pgbench_accounts_1 SELECT * FROM del;
...
WITH del AS (DELETE FROM pgbench_accounts WHERE bid = 200 RETURNING *)
INSERT INTO pgbench_accounts_200 SELECT * FROM del;

VACUUM ANALYSE;

# SELECT name, setting FROM pg_settings WHERE name IN
('parallel_seqscan_degree','max_worker_processes','seq_page_cost');
          name           | setting
-------------------------+---------
 max_worker_processes    | 20
 parallel_seqscan_degree | 8
 seq_page_cost           | 1000
(3 rows)

# EXPLAIN SELECT DISTINCT bid FROM pgbench_accounts;
ERROR: too many dynamic shared memory segments

And separately, I've seen this in the logs:

2015-03-12 16:09:30 GMT [7880]: [4-1] user=,db=,client= LOG:
registering background worker "parallel worker for PID 7889"
2015-03-12 16:09:30 GMT [7880]: [5-1] user=,db=,client= LOG:
registering background worker "parallel worker for PID 7889"
2015-03-12 16:09:30 GMT [7880]: [6-1] user=,db=,client= LOG:
registering background worker "parallel worker for PID 7889"
2015-03-12 16:09:30 GMT [7880]: [7-1] user=,db=,client= LOG:
registering background worker "parallel worker for PID 7889"
2015-03-12 16:09:30 GMT [7880]: [8-1] user=,db=,client= LOG:
registering background worker "parallel worker for PID 7889"
2015-03-12 16:09:30 GMT [7880]: [9-1] user=,db=,client= LOG:
registering background worker "parallel worker for PID 7889"
2015-03-12 16:09:30 GMT [7880]: [10-1] user=,db=,client= LOG:
registering background worker "parallel worker for PID 7889"
2015-03-12 16:09:30 GMT [7880]: [11-1] user=,db=,client= LOG:
registering background worker "parallel worker for PID 7889"
2015-03-12 16:09:30 GMT [7880]: [12-1] user=,db=,client= LOG:
starting background worker process "parallel worker for PID 7889"
2015-03-12 16:09:30 GMT [7880]: [13-1] user=,db=,client= LOG:
starting background worker process "parallel worker for PID 7889"
2015-03-12 16:09:30 GMT [7880]: [14-1] user=,db=,client= LOG:
starting background worker process "parallel worker for PID 7889"
2015-03-12 16:09:30 GMT [7880]: [15-1] user=,db=,client= LOG:
starting background worker process "parallel worker for PID 7889"
2015-03-12 16:09:30 GMT [7880]: [16-1] user=,db=,client= LOG:
starting background worker process "parallel worker for PID 7889"
2015-03-12 16:09:30 GMT [7880]: [17-1] user=,db=,client= LOG:
starting background worker process "parallel worker for PID 7889"
2015-03-12 16:09:30 GMT [7880]: [18-1] user=,db=,client= LOG:
starting background worker process "parallel worker for PID 7889"
2015-03-12 16:09:30 GMT [7880]: [19-1] user=,db=,client= LOG:
starting background worker process "parallel worker for PID 7889"
2015-03-12 16:09:30 GMT [7880]: [20-1] user=,db=,client= LOG: worker
process: parallel worker for PID 7889 (PID 7913) exited with exit code
0
2015-03-12 16:09:30 GMT [7880]: [21-1] user=,db=,client= LOG:
unregistering background worker "parallel worker for PID 7889"
2015-03-12 16:09:30 GMT [7880]: [22-1] user=,db=,client= LOG: worker
process: parallel worker for PID 7889 (PID 7919) exited with exit code
0
2015-03-12 16:09:30 GMT [7880]: [23-1] user=,db=,client= LOG:
unregistering background worker "parallel worker for PID 7889"
2015-03-12 16:09:30 GMT [7880]: [24-1] user=,db=,client= LOG: worker
process: parallel worker for PID 7889 (PID 7916) exited with exit code
0
2015-03-12 16:09:30 GMT [7880]: [25-1] user=,db=,client= LOG:
unregistering background worker "parallel worker for PID 7889"
2015-03-12 16:09:30 GMT [7880]: [26-1] user=,db=,client= LOG: worker
process: parallel worker for PID 7889 (PID 7918) exited with exit code
0
2015-03-12 16:09:30 GMT [7880]: [27-1] user=,db=,client= LOG:
unregistering background worker "parallel worker for PID 7889"
2015-03-12 16:09:30 GMT [7880]: [28-1] user=,db=,client= LOG: worker
process: parallel worker for PID 7889 (PID 7917) exited with exit code
0
2015-03-12 16:09:30 GMT [7880]: [29-1] user=,db=,client= LOG:
unregistering background worker "parallel worker for PID 7889"
2015-03-12 16:09:30 GMT [7880]: [30-1] user=,db=,client= LOG: worker
process: parallel worker for PID 7889 (PID 7914) exited with exit code
0
2015-03-12 16:09:30 GMT [7880]: [31-1] user=,db=,client= LOG:
unregistering background worker "parallel worker for PID 7889"
2015-03-12 16:09:30 GMT [7880]: [32-1] user=,db=,client= LOG: worker
process: parallel worker for PID 7889 (PID 7915) exited with exit code
0
2015-03-12 16:09:30 GMT [7880]: [33-1] user=,db=,client= LOG:
unregistering background worker "parallel worker for PID 7889"
2015-03-12 16:09:30 GMT [7880]: [34-1] user=,db=,client= LOG: worker
process: parallel worker for PID 7889 (PID 7912) exited with exit code
0
2015-03-12 16:09:30 GMT [7880]: [35-1] user=,db=,client= LOG:
unregistering background worker "parallel worker for PID 7889"
2015-03-12 16:09:30 GMT [7880]: [36-1] user=,db=,client= LOG: server
process (PID 7889) was terminated by signal 11: Segmentation fault
2015-03-12 16:09:30 GMT [7880]: [37-1] user=,db=,client= DETAIL:
Failed process was running: SELECT pg_catalog.quote_ident(c.relname)
FROM pg_catalog.pg_class c WHERE c.relkind IN ('r', 'S', 'v', 'm',
'f') AND substring(pg_catalog.quote_ident(c.relname),1,10)='pgbench_br'
AND pg_catalog.pg_table_is_visible(c.oid) AND c.relnamespace <>
(SELECT oid FROM pg_catalog.pg_namespace WHERE nspname = 'pg_catalog')
UNION
SELECT pg_catalog.quote_ident(n.nspname) || '.' FROM
pg_catalog.pg_namespace n WHERE
substring(pg_catalog.quote_ident(n.nspname) || '.',1,10)='pgbench_br'
AND (SELECT pg_catalog.count(*) FROM pg_catalog.pg_namespace WHERE
substring(pg_catalog.quote_ident(nspname) || '.',1,10) =
substring('pgbench_br',1,pg_catalog.length(pg_catalog.quote_ident(nspname))+1))

1

UNION
SELECT pg_catalog.quote_ident(n.nspname) || '.' ||
pg_catalog.quote_ident(c.relname) FROM pg_catalog.pg_class c,
pg_catalog.pg_namespace n WHERE c.relnamespace = n.oid AND c.relkind
IN ('r', 'S', 'v', 'm', 'f') AND
substring(pg_catalog.quote_ident(n.nspname) || '.' ||
pg_catalog.quote_ident(c.relname),1,10)='pgbench_br' AND substri
2015-03-12 16:09:30 GMT [7880]: [38-1] user=,db=,client= LOG:
terminating any other active server processes
2015-03-12 16:09:30 GMT [7886]: [2-1] user=,db=,client= WARNING:
terminating connection because of crash of another server process
2015-03-12 16:09:30 GMT [7886]: [3-1] user=,db=,client= DETAIL: The
postmaster has commanded this server process to roll back the current
transaction and exit, because another server process exited abnormally
and possibly corrupted shared memory.
2015-03-12 16:09:30 GMT [7886]: [4-1] user=,db=,client= HINT: In a
moment you should be able to reconnect to the database and repeat your
command.
2015-03-12 16:09:30 GMT [7880]: [39-1] user=,db=,client= LOG: all
server processes terminated; reinitializing
2015-03-12 16:09:30 GMT [7920]: [1-1] user=,db=,client= LOG: database
system was interrupted; last known up at 2015-03-12 16:07:26 GMT
2015-03-12 16:09:30 GMT [7920]: [2-1] user=,db=,client= LOG: database
system was not properly shut down; automatic recovery in progress
2015-03-12 16:09:30 GMT [7920]: [3-1] user=,db=,client= LOG: invalid
record length at 2/7E269A0
2015-03-12 16:09:30 GMT [7920]: [4-1] user=,db=,client= LOG: redo is
not required
2015-03-12 16:09:30 GMT [7880]: [40-1] user=,db=,client= LOG:
database system is ready to accept connections
2015-03-12 16:09:30 GMT [7924]: [1-1] user=,db=,client= LOG:
autovacuum launcher started

I can recreate this by typing:

EXPLAIN SELECT DISTINCT bid FROM pgbench_<tab>

This happens with seq_page_cost = 1000, but not when it's set to 1.

Another problem. I restarted the instance (just in case), and get this error:

# \df+ *.*
ERROR: cannot retain locks acquired while in parallel mode

I get this even with seq_page_cost = 1, parallel_seqscan_degree = 1
and max_worker_processes = 1.
--
Thom

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#193Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Langote (#187)
Re: Parallel Seq Scan

On Thu, Mar 12, 2015 at 4:22 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp>
wrote:

On 10-03-2015 PM 01:09, Amit Kapila wrote:

On Tue, Mar 10, 2015 at 6:50 AM, Haribabu Kommi <kommi.haribabu@gmail.com>

Does this patch handle the cases where the re-scan starts without
finishing the earlier scan?

Do you mean to say cases like ANTI, SEMI Join (in nodeNestLoop.c)
where we scan the next outer tuple and rescan inner table without
completing the previous scan of inner table?

I have currently modelled it based on existing rescan for seqscan
(ExecReScanSeqScan()) which means it will begin the scan again.
Basically if the workers are already started/initialized by previous
scan, then re-initialize them (refer function ExecReScanFunnel() in
patch).

From Robert's description[1], it looked like the NestLoop with Funnel would
have Funnel as either outer plan or topmost plan node or NOT a parameterised
plan. In that case, would this case arise or am I missing something?

Probably not, if the costing is right and the user doesn't manually disable
plans (for example, by setting enable_* = off). However, we should have rescan
code in case the planner chooses a plan in which Funnel is the inner node, and
I think rescan is required in a few other cases as well.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#194Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Amit Kapila (#193)
Re: Parallel Seq Scan

On 13-03-2015 AM 10:24, Amit Kapila wrote:

On Thu, Mar 12, 2015 at 4:22 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp>

From Robert's description[1], it looked like the NestLoop with Funnel would
have Funnel as either outer plan or topmost plan node or NOT a parameterised
plan. In that case, would this case arise or am I missing something?

Probably not, if the costing is right and the user doesn't manually disable
plans (for example, by setting enable_* = off). However, we should have rescan
code in case the planner chooses a plan in which Funnel is the inner node, and
I think rescan is required in a few other cases as well.

I see, thanks.

By the way, is it right that TupleQueueFunnel.queue has one shm_mq_handle per
initialized parallel worker? If so, how does TupleQueueFunnel.maxqueues relate
to ParallelContext.nworkers (of the corresponding parallel context)?

Why I asked this is because in CreateTupleQueueFunnel():

funnel->maxqueues = 8;
funnel->queue = palloc(funnel->maxqueues * sizeof(shm_mq_handle *));

So, is the hardcoded "8" intentional or an oversight?

Thanks,
Amit


#195Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Amit Langote (#194)
Re: Parallel Seq Scan

On 13-03-2015 PM 01:37, Amit Langote wrote:

By the way, is it right that TupleQueueFunnel.queue has one shm_mq_handle per
initialized parallel worker? If so, how does TupleQueueFunnel.maxqueues relate
to ParallelContext.nworkers (of the corresponding parallel context)?

Why I asked this is because in CreateTupleQueueFunnel():

funnel->maxqueues = 8;
funnel->queue = palloc(funnel->maxqueues * sizeof(shm_mq_handle *));

So, is the hardcoded "8" intentional or an oversight?

Oh, I see that in RegisterTupleQueueOnFunnel(), the TupleQueueFunnel.queue is
expanded (repalloc'd) if needed as per corresponding pcxt->nworkers.
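
The relationship between the initial maxqueues = 8 and later registration can be pictured with a small standalone sketch. This is not the patch's code: ToyHandleArray, toy_array_init() and toy_array_register() are hypothetical names, malloc/realloc stand in for palloc/repalloc, and the doubling policy is only an assumption about how such an array might grow.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the queue array inside TupleQueueFunnel. */
typedef struct
{
    int     nqueues;    /* handles currently registered */
    int     maxqueues;  /* allocated capacity */
    void  **queue;      /* stand-in for shm_mq_handle pointers */
} ToyHandleArray;

/* Start with room for 8 handles, like the hardcoded initial size. */
int
toy_array_init(ToyHandleArray *a)
{
    a->nqueues = 0;
    a->maxqueues = 8;
    a->queue = malloc(a->maxqueues * sizeof(void *));
    return (a->queue != NULL) ? 0 : -1;
}

/* Append one handle, doubling the array first if it is full (assumed policy). */
int
toy_array_register(ToyHandleArray *a, void *handle)
{
    if (a->nqueues == a->maxqueues)
    {
        int     newmax = a->maxqueues * 2;
        void  **newqueue = realloc(a->queue, newmax * sizeof(void *));

        if (newqueue == NULL)
            return -1;
        a->queue = newqueue;
        a->maxqueues = newmax;
    }
    a->queue[a->nqueues++] = handle;
    return 0;
}
```

Under this sketch the "8" is only a starting capacity, so the number of workers is not capped by it; registering a ninth handle grows the array to 16 slots.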

Thanks,
Amit


#196Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Amit Kapila (#188)
Re: Parallel Seq Scan

On 12-03-2015 PM 11:46, Amit Kapila wrote:

[parallel_seqscan_v10.patch]

There may be a bug in TupleQueueFunnelNext().

1) I observed a hang with stack looking like:

#0 0x00000039696df098 in poll () from /lib64/libc.so.6
#1 0x00000000006f1c6a in WaitLatchOrSocket (latch=0x7f29dc3c73b4,
wakeEvents=1, sock=-1, timeout=0) at pg_latch.c:333
#2 0x00000000006f1aca in WaitLatch (latch=0x7f29dc3c73b4, wakeEvents=1,
timeout=0) at pg_latch.c:197
#3 0x000000000065088b in TupleQueueFunnelNext (funnel=0x17b4a20, nowait=0
'\000', done=0x17ad481 "") at tqueue.c:269
#4 0x0000000000636cab in funnel_getnext (funnelstate=0x17ad3d0) at
nodeFunnel.c:347
...
<snip>

2) In some cases, there can be a segmentation fault with stack looking like:

#0 0x000000396968990a in memcpy () from /lib64/libc.so.6
#1 0x00000000006507e7 in TupleQueueFunnelNext (funnel=0x263c800, nowait=0
'\000', done=0x2633461 "") at tqueue.c:233
#2 0x0000000000636cab in funnel_getnext (funnelstate=0x26333b0) at
nodeFunnel.c:347
#3 0x0000000000636901 in ExecFunnel (node=0x26333b0) at nodeFunnel.c:179
...
<snip>

I could get rid of (1) and (2) with the attached fix.


#197Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Amit Langote (#196)
1 attachment(s)
Re: Parallel Seq Scan

On 13-03-2015 PM 05:32, Amit Langote wrote:

On 12-03-2015 PM 11:46, Amit Kapila wrote:

[parallel_seqscan_v10.patch]

There may be a bug in TupleQueueFunnelNext().

1) I observed a hang with stack looking like:

#0 0x00000039696df098 in poll () from /lib64/libc.so.6
#1 0x00000000006f1c6a in WaitLatchOrSocket (latch=0x7f29dc3c73b4,
wakeEvents=1, sock=-1, timeout=0) at pg_latch.c:333
#2 0x00000000006f1aca in WaitLatch (latch=0x7f29dc3c73b4, wakeEvents=1,
timeout=0) at pg_latch.c:197
#3 0x000000000065088b in TupleQueueFunnelNext (funnel=0x17b4a20, nowait=0
'\000', done=0x17ad481 "") at tqueue.c:269
#4 0x0000000000636cab in funnel_getnext (funnelstate=0x17ad3d0) at
nodeFunnel.c:347
...
<snip>

2) In some cases, there can be a segmentation fault with stack looking like:

#0 0x000000396968990a in memcpy () from /lib64/libc.so.6
#1 0x00000000006507e7 in TupleQueueFunnelNext (funnel=0x263c800, nowait=0
'\000', done=0x2633461 "") at tqueue.c:233
#2 0x0000000000636cab in funnel_getnext (funnelstate=0x26333b0) at
nodeFunnel.c:347
#3 0x0000000000636901 in ExecFunnel (node=0x26333b0) at nodeFunnel.c:179
...
<snip>

I could get rid of (1) and (2) with the attached fix.

Hit send too soon!

By the way, the bug seems to be exposed only with a certain pattern/sequence
of workers being detached (perhaps in immediate succession) whereby the
funnel->nextqueue remains incorrectly set.

The patch attached this time.

By the way, when I have asserts enabled, I hit this compilation error:

createplan.c: In function ‘create_partialseqscan_plan’:
createplan.c:1180: error: ‘Path’ has no member named ‘path’

I see following line there:

Assert(best_path->path.parent->rtekind == RTE_RELATION);

Thanks,
Amit

Attachments:

TupleQueueFunnelNext-bugfix.patch (text/x-diff)
diff --git a/src/backend/executor/tqueue.c b/src/backend/executor/tqueue.c
index ee4e03e..8a6c6f3 100644
--- a/src/backend/executor/tqueue.c
+++ b/src/backend/executor/tqueue.c
@@ -234,6 +234,8 @@ TupleQueueFunnelNext(TupleQueueFunnel *funnel, bool nowait, bool *done)
 				   &funnel->queue[funnel->nextqueue + 1],
 				   sizeof(shm_mq_handle *)
 						* (funnel->nqueues - funnel->nextqueue));
+
+			funnel->nextqueue = (funnel->nextqueue + 1) % funnel->nqueues;
 			if (funnel->nextqueue < waitpos)
 				--waitpos;
 		}
@@ -260,7 +262,7 @@ TupleQueueFunnelNext(TupleQueueFunnel *funnel, bool nowait, bool *done)
 		 * and return NULL (if we're in non-blocking mode) or wait for the
 		 * process latch to be set (otherwise).
 		 */
-		if (funnel->nextqueue == waitpos)
+		if (result != SHM_MQ_DETACHED && funnel->nextqueue == waitpos)
 		{
 			if (nowait)
 				return NULL;
#198Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Langote (#197)
1 attachment(s)
Re: Parallel Seq Scan

On Fri, Mar 13, 2015 at 2:12 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp>
wrote:

On 13-03-2015 PM 05:32, Amit Langote wrote:

On 12-03-2015 PM 11:46, Amit Kapila wrote:

[parallel_seqscan_v10.patch]

There may be a bug in TupleQueueFunnelNext().

1) I observed a hang with stack looking like:

#0 0x00000039696df098 in poll () from /lib64/libc.so.6
#1 0x00000000006f1c6a in WaitLatchOrSocket (latch=0x7f29dc3c73b4,
wakeEvents=1, sock=-1, timeout=0) at pg_latch.c:333
#2 0x00000000006f1aca in WaitLatch (latch=0x7f29dc3c73b4, wakeEvents=1,
timeout=0) at pg_latch.c:197
#3 0x000000000065088b in TupleQueueFunnelNext (funnel=0x17b4a20, nowait=0
'\000', done=0x17ad481 "") at tqueue.c:269
#4 0x0000000000636cab in funnel_getnext (funnelstate=0x17ad3d0) at
nodeFunnel.c:347
...
<snip>

2) In some cases, there can be a segmentation fault with stack looking like:

#0 0x000000396968990a in memcpy () from /lib64/libc.so.6
#1 0x00000000006507e7 in TupleQueueFunnelNext (funnel=0x263c800, nowait=0
'\000', done=0x2633461 "") at tqueue.c:233
#2 0x0000000000636cab in funnel_getnext (funnelstate=0x26333b0) at
nodeFunnel.c:347
#3 0x0000000000636901 in ExecFunnel (node=0x26333b0) at nodeFunnel.c:179

...
<snip>

I could get rid of (1) and (2) with the attached fix.

Hit send too soon!

By the way, the bug seems to be exposed only with a certain pattern/sequence
of workers being detached (perhaps in immediate succession) whereby the
funnel->nextqueue remains incorrectly set.

I think this can happen if funnel->nextqueue is greater
than funnel->nqueues.
Please see whether the attached patch fixes the issue; if not, could you
describe in more detail the scenario in which you hit it?

The patch attached this time.

By the way, when I have asserts enabled, I hit this compilation error:

createplan.c: In function ‘create_partialseqscan_plan’:
createplan.c:1180: error: ‘Path’ has no member named ‘path’

I see following line there:

Assert(best_path->path.parent->rtekind == RTE_RELATION);

Okay, will take care of this.

Thanks.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

fix_tupqueue_issue_v1.patch (application/octet-stream)
diff --git a/src/backend/executor/tqueue.c b/src/backend/executor/tqueue.c
index ee4e03e..8e7f35e 100644
--- a/src/backend/executor/tqueue.c
+++ b/src/backend/executor/tqueue.c
@@ -230,10 +230,13 @@ TupleQueueFunnelNext(TupleQueueFunnel *funnel, bool nowait, bool *done)
 					*done = true;
 				return NULL;
 			}
-			memcpy(&funnel->queue[funnel->nextqueue],
-				   &funnel->queue[funnel->nextqueue + 1],
-				   sizeof(shm_mq_handle *)
-						* (funnel->nqueues - funnel->nextqueue));
+			if (funnel->nextqueue <= funnel->nqueues)
+				memcpy(&funnel->queue[funnel->nextqueue],
+					   &funnel->queue[funnel->nextqueue + 1],
+					   sizeof(shm_mq_handle *)
+							* (funnel->nqueues - funnel->nextqueue));
+			else
+				funnel->nextqueue = (funnel->nextqueue + 1) % funnel->nqueues;
 			if (funnel->nextqueue < waitpos)
 				--waitpos;
 		}
#199Amit Kapila
amit.kapila16@gmail.com
In reply to: Thom Brown (#192)
Re: Parallel Seq Scan

On Thu, Mar 12, 2015 at 10:35 PM, Thom Brown <thom@linux.com> wrote:

Another problem. I restarted the instance (just in case), and get this error:

# \df+ *.*
ERROR: cannot retain locks acquired while in parallel mode

This problem occurs because the above statement is trying to
execute a parallel-unsafe function (obj_description) in parallel mode.
This will be resolved once the parallel_seqscan patch is integrated
with the access-parallel-safety patch [1].

[1]: https://commitfest.postgresql.org/4/155/

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#200Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#184)
Re: Parallel Seq Scan

On Tue, Mar 10, 2015 at 12:26 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Tue, Mar 10, 2015 at 10:23 AM, Haribabu Kommi <kommi.haribabu@gmail.com>
wrote:

On Tue, Mar 10, 2015 at 3:09 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

I have currently modelled it based on existing rescan for seqscan
(ExecReScanSeqScan()) which means it will begin the scan again.
Basically if the workers are already started/initialized by previous
scan, then re-initialize them (refer function ExecReScanFunnel() in
patch).

Can you elaborate more if you think current handling is not sufficient
for any case?

From the ExecReScanFunnel function, it seems that the re-scan waits until
all the workers have finished before starting the next scan. Will the
workers stop their current ongoing task? Otherwise I feel this may decrease
performance instead of improving it.

Okay, performance-wise it might affect such a case, but I think we can
handle it by not calling WaitForParallelWorkersToFinish(),
as DestroyParallelContext() will automatically terminate all the workers.

We can't directly call DestroyParallelContext() to terminate the workers,
because some of the workers might not have started yet by that time, and
that can lead to problems. I think what we need here is a way to know
whether all workers have started (basically a new function,
WaitForParallelWorkersToStart()). This API needs to be provided by the
parallel-mode patch.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#201Amit Kapila
amit.kapila16@gmail.com
In reply to: Haribabu Kommi (#186)
Re: Parallel Seq Scan

On Thu, Mar 12, 2015 at 3:44 AM, Haribabu Kommi <kommi.haribabu@gmail.com>
wrote:

In create_parallelscan_paths() function the funnel path is added once
the partial seq scan path is generated. I feel the funnel path can be
added once on top of the total possible parallel path in the entire
query path.

Is this the right patch to add such support also?

This seems to be an optimization for parallel paths which can be
done later as well.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#202Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#201)
Re: Parallel Seq Scan

On Fri, Mar 13, 2015 at 9:01 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Mar 12, 2015 at 3:44 AM, Haribabu Kommi <kommi.haribabu@gmail.com>
wrote:

In create_parallelscan_paths() function the funnel path is added once
the partial seq scan path is generated. I feel the funnel path can be
added once on top of the total possible parallel path in the entire
query path.

Is this the right patch to add such support also?

This seems to be an optimization for parallel paths which can be
done later as well.

+1. Let's keep it simple for now.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#203Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#200)
Re: Parallel Seq Scan

On Fri, Mar 13, 2015 at 8:59 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

We can't directly call DestroyParallelContext() to terminate workers as
it can so happen that by that time some of the workers are still not
started.

That shouldn't be a problem. TerminateBackgroundWorker() not only
kills an existing worker if there is one, but also tells the
postmaster that if it hasn't started the worker yet, it should not
bother. So at the conclusion of the first loop inside
DestroyParallelContext(), every running worker will have received
SIGTERM and no more workers will be started.

So that can lead to problem. I think what we need here is a way to know
whether all workers are started. (basically need a new function
WaitForParallelWorkersToStart()). This API needs to be provided by
parallel-mode patch.

I don't think so. DestroyParallelContext() is intended to be good
enough for this purpose; if it's not, we should fix that instead of
adding a new function.

No matter what, re-scanning a parallel node is not going to be very
efficient. But the way to deal with that is to make sure that such
nodes have a substantial startup cost, so that the planner won't pick
them in the case where it isn't going to work out well.
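
A toy calculation illustrates this point. The model below is a deliberate simplification, not the planner's actual rescan costing: it assumes the run cost is divided evenly among the workers (as in the simple cost model described at the start of this thread) and that every rescan pays the full startup cost again. The function names and the numbers in the note that follows are made up for illustration.

```c
/* Total cost of rescanning a plain sequential scan 'loops' times. */
double
toy_seqscan_rescan_cost(double run_cost, double loops)
{
    return loops * run_cost;
}

/*
 * Total cost of rescanning a parallel scan 'loops' times, assuming each
 * rescan pays startup_cost again and run_cost is split across nworkers.
 */
double
toy_parallel_rescan_cost(double startup_cost, double run_cost,
                         int nworkers, double loops)
{
    return loops * (startup_cost + run_cost / nworkers);
}
```

With run_cost = 100, 4 workers and 1000 rescans, the sequential total is 100000. With a startup cost of zero the parallel total is 25000, so the parallel plan always looks cheaper; with a startup cost of 1000 it becomes 1025000, and a planner using this model would choose the sequential scan instead.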

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#204Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#198)
Re: Parallel Seq Scan

On Fri, Mar 13, 2015 at 7:01 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think this can happen if funnel->nextqueue is greater than
funnel->nqueues.
Please see if attached patch fixes the issue, else could you share the
scenario in more detail where you hit this issue.

Speaking as the guy who wrote the first version of that code...

I don't think this is the right fix; the point of that code is to
remove a tuple queue from the funnel when it gets detached, which is a
correct thing to want to do. funnel->nextqueue should always be less
than funnel->nqueues; how is that failing to be the case here?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#205Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#204)
Re: Parallel Seq Scan

On Fri, Mar 13, 2015 at 7:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Mar 13, 2015 at 7:01 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

I think this can happen if funnel->nextqueue is greater than
funnel->nqueues.
Please see if attached patch fixes the issue, else could you share the
scenario in more detail where you hit this issue.

Speaking as the guy who wrote the first version of that code...

I don't think this is the right fix; the point of that code is to
remove a tuple queue from the funnel when it gets detached, which is a
correct thing to want to do. funnel->nextqueue should always be less
than funnel->nqueues; how is that failing to be the case here?

I could not reproduce the issue, and the exact scenario is not
mentioned in the mail. However, what I think can lead to funnel->nextqueue
being greater than funnel->nqueues is something like the following:

Assume 5 queues, so funnel->nqueues is 5, and assume funnel->nextqueue
is 2. Now say 4 workers get detached one by one. In that case the code
always takes the else branch and never changes funnel->nextqueue, whereas
funnel->nqueues drops to 1.

Am I missing something?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#206Amit Langote
amitlangote09@gmail.com
In reply to: Amit Kapila (#205)
Re: Parallel Seq Scan

On Fri, Mar 13, 2015 at 11:03 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Mar 13, 2015 at 7:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Mar 13, 2015 at 7:01 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

I think this can happen if funnel->nextqueue is greater than
funnel->nqueues.
Please see if attached patch fixes the issue, else could you share the
scenario in more detail where you hit this issue.

Speaking as the guy who wrote the first version of that code...

I don't think this is the right fix; the point of that code is to
remove a tuple queue from the funnel when it gets detached, which is a
correct thing to want to do. funnel->nextqueue should always be less
than funnel->nqueues; how is that failing to be the case here?

I could not reproduce the issue, neither the exact scenario is
mentioned in mail. However what I think can lead to funnel->nextqueue
greater than funnel->nqueues is something like below:

Assume 5 queues, so value of funnel->nqueues will be 5 and
assume value of funnel->nextqueue is 2, so now let us say 4 workers
got detached one-by-one, so for such a case it will always go in else loop
and will never change funnel->nextqueue whereas value of funnel->nqueues
will become 1.

Am I missing something?

Sorry, I did not mention the exact example I'd used but I thought it
was just any arbitrary example:

CREATE TABLE t1(c1, c2) AS SELECT g, repeat('x', 5) FROM
generate_series(1, 10000000) g;

CREATE TABLE t2(c1, c2) AS SELECT g, repeat('x', 5) FROM
generate_series(1, 1000000) g;

SELECT count(*) FROM t1 JOIN t2 ON t1.c1 = t2.c1 AND t1.c1 BETWEEN 100 AND 200;

The observed behavior included a hang or segfault arbitrarily (that's
why I guessed it may be arbitrariness of sequence of detachment of
workers).

Changed parameters to cause plan to include a Funnel:

parallel_seqscan_degree = 8
cpu_tuple_communication_cost = 0.01/0.001

Thanks,
Amit


#207Amit Kapila
amit.kapila16@gmail.com
In reply to: Thom Brown (#191)
Re: Parallel Seq Scan

On Thu, Mar 12, 2015 at 9:50 PM, Thom Brown <thom@linux.com> wrote:

On 12 March 2015 at 15:29, Amit Kapila <amit.kapila16@gmail.com> wrote:

Please note that parallel_setup_cost and parallel_startup_cost are
still set to zero by default, so you need to set them to higher values
if you don't want parallel plans once parallel_seqscan_degree
is set. I have yet to come up with default values for them; that needs
some tests.

Thanks. Getting a problem:

Thanks for looking into patch.

So as per this report, I am seeing 3 different problems in it.

Problem-1:
---------------------

# SELECT name, setting FROM pg_settings WHERE name IN
('parallel_seqscan_degree','max_worker_processes','seq_page_cost');
          name           | setting
-------------------------+---------
 max_worker_processes    | 20
 parallel_seqscan_degree | 8
 seq_page_cost           | 1000
(3 rows)

# EXPLAIN SELECT DISTINCT bid FROM pgbench_accounts;
ERROR: too many dynamic shared memory segments

This happens because there is a maximum limit on the number of
dynamic shared memory segments in the system.

In function dsm_postmaster_startup(), it is defined as follows:

maxitems = PG_DYNSHMEM_FIXED_SLOTS
+ PG_DYNSHMEM_SLOTS_PER_BACKEND * MaxBackends;

In the above case, it is choosing parallel plan for each of the
AppendRelation,
(because of seq_page_cost = 1000) and that causes the test to
cross max limit of dsm segments.
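
To get a feel for the size of that limit, the formula can be evaluated by hand. The sketch below is only an illustration: the two constants are assumed values for a tree of this era and should be checked against dsm.c in the source being tested, and toy_dsm_maxitems() is a hypothetical helper, not a PostgreSQL function. MaxBackends is taken as an input here; in the server it is derived from settings such as max_connections.

```c
/* Assumed values; verify against dsm.c in the tree under test. */
#define PG_DYNSHMEM_FIXED_SLOTS         64
#define PG_DYNSHMEM_SLOTS_PER_BACKEND   2

/* Evaluate the slot limit from dsm_postmaster_startup() for a given MaxBackends. */
int
toy_dsm_maxitems(int max_backends)
{
    return PG_DYNSHMEM_FIXED_SLOTS
        + PG_DYNSHMEM_SLOTS_PER_BACKEND * max_backends;
}
```

With these assumed constants, MaxBackends = 100 gives 264 slots for the whole system, and each additional backend adds only 2 more.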

One way to fix this could be to increase the number of dsm segments
that can be created in a system/backend, but it seems to me that in
reality there will not be many plans that need so many dsm segments
unless the user tinkers too much with costing, and even then the user
can increase max_connections to avoid the problem.

I would like to see opinion of other people on this matter.

Problem-2:
--------------------
2015-03-12 16:09:30 GMT [7880]: [36-1] user=,db=,client= LOG: server
process (PID 7889) was terminated by signal 11: Segmentation fault
2015-03-12 16:09:30 GMT [7880]: [37-1] user=,db=,client= DETAIL:
Failed process was running: SELECT pg_catalog.quote_ident(c.relname)
FROM pg_catalog.pg_class c WHERE c.relkind IN ('r', 'S', 'v', 'm',
'f') AND substring(pg_catalog.quote_ident(c.relname),1,10)='pgbench_br'
AND pg_catalog.pg_table_is_visible(c.oid) AND c.relnamespace <>
(SELECT oid FROM pg_catalog.pg_namespace WHERE nspname = 'pg_catalog')
UNION
SELECT pg_catalog.quote_ident(n.nspname) || '.' FROM
pg_catalog.pg_namespace n WHERE
substring(pg_catalog.quote_ident(n.nspname) || '.',1,10)='pgbench_br'
AND (SELECT pg_catalog.count(*) FROM pg_catalog.pg_namespace WHERE
substring(pg_catalog.quote_ident(nspname) || '.',1,10) =
substring('pgbench_br',1,pg_catalog.length(pg_catalog.quote_ident(nspname))+1))

1

UNION
SELECT pg_catalog.quote_ident(n.nspname) || '.' ||
pg_catalog.quote_ident(c.relname) FROM pg_catalog.pg_class c,
pg_catalog.pg_namespace n WHERE c.relnamespace = n.oid AND c.relkind
IN ('r', 'S', 'v', 'm', 'f') AND
substring(pg_catalog.quote_ident(n.nspname) || '.' ||
pg_catalog.quote_ident(c.relname),1,10)='pgbench_br' AND substri

This seems to be unrelated to the first issue (the statement in the log has
nothing to do with Problem-1), and it could be the same issue that
Amit Langote has reported, so we can test this once with the fix for that
issue. However, I think it is important to isolate the test that triggered
this problem.

Problem-3
----------------
I am seeing an Assertion failure (in ExitParallelMode()) with this test,
but that seems to be an issue due to the lack of integration with the
access-parallel-safety patch. I will test this after integrating with the
access-parallel-safety patch.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#208Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Amit Kapila (#205)
Re: Parallel Seq Scan

On 13-03-2015 PM 11:03, Amit Kapila wrote:

On Fri, Mar 13, 2015 at 7:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I don't think this is the right fix; the point of that code is to
remove a tuple queue from the funnel when it gets detached, which is a
correct thing to want to do. funnel->nextqueue should always be less
than funnel->nqueues; how is that failing to be the case here?

I could not reproduce the issue, neither the exact scenario is
mentioned in mail. However what I think can lead to funnel->nextqueue
greater than funnel->nqueues is something like below:

Assume 5 queues, so value of funnel->nqueues will be 5 and
assume value of funnel->nextqueue is 2, so now let us say 4 workers
got detached one-by-one, so for such a case it will always go in else loop
and will never change funnel->nextqueue whereas value of funnel->nqueues
will become 1.

Or if the just-detached queue happens to be the last one, we'll make
shm_mq_receive() to read from a potentially already-detached queue in the
immediately next iteration. That seems to be caused by not having updated the
funnel->nextqueue. With the returned value being SHM_MQ_DETACHED, we'll again
try to remove it from the queue. In this case, it causes the third argument to
memcpy be negative and hence the segfault.
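
The failure mode described above can be traced with a toy array. The sketch below is neither the posted patch nor the eventual fix: ToyFunnel and toy_remove_current() are hypothetical, ints stand in for shm_mq_handle pointers, and resetting nextqueue to 0 is just one possible way to keep nextqueue below nqueues, which is the invariant Robert describes. It uses memmove() because the source and destination ranges overlap.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for TupleQueueFunnel; ints replace queue handles. */
typedef struct
{
    int     queue[8];
    int     nqueues;    /* number of live entries in queue[] */
    int     nextqueue;  /* index of the entry to read next */
} ToyFunnel;

/*
 * Remove the entry at funnel->nextqueue (the queue just found detached)
 * and keep nextqueue pointing at a live entry.
 */
void
toy_remove_current(ToyFunnel *funnel)
{
    int     tail;

    /* Precondition: there is a current entry to remove. */
    assert(funnel->nqueues > 0 && funnel->nextqueue < funnel->nqueues);

    --funnel->nqueues;

    /* Entries that sat after the removed slot; never negative here. */
    tail = funnel->nqueues - funnel->nextqueue;
    if (tail > 0)
        memmove(&funnel->queue[funnel->nextqueue],
                &funnel->queue[funnel->nextqueue + 1],
                sizeof(int) * tail);

    /* If the removed entry was the last one, wrap to the first queue. */
    if (funnel->nextqueue >= funnel->nqueues)
        funnel->nextqueue = 0;
}
```

Starting from 5 queues with nextqueue = 2 and removing the current queue four times leaves nqueues = 1 and nextqueue = 0. Without the wrap-around step, nextqueue would still be 2 after the third removal, and the fourth removal would compute a copy length of (1 - 2) elements, which becomes a huge value once converted to size_t.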

I can't seem to really figure out the other problem of waiting forever in
WaitLatch() but I had managed to make it go away with:

-        if (funnel->nextqueue == waitpos)
+        if (result != SHM_MQ_DETACHED && funnel->nextqueue == waitpos)

By the way, you can try reproducing this with the example I posted on Friday.

Thanks,
Amit


#209Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Langote (#208)
Re: Parallel Seq Scan

On Mon, Mar 16, 2015 at 9:40 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp>
wrote:

On 13-03-2015 PM 11:03, Amit Kapila wrote:

On Fri, Mar 13, 2015 at 7:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I don't think this is the right fix; the point of that code is to
remove a tuple queue from the funnel when it gets detached, which is a
correct thing to want to do. funnel->nextqueue should always be less
than funnel->nqueues; how is that failing to be the case here?

I could not reproduce the issue, neither the exact scenario is
mentioned in mail. However what I think can lead to funnel->nextqueue
greater than funnel->nqueues is something like below:

Assume 5 queues, so value of funnel->nqueues will be 5 and
assume value of funnel->nextqueue is 2, so now let us say 4 workers
got detached one-by-one, so for such a case it will always go in else loop
and will never change funnel->nextqueue whereas value of funnel->nqueues
will become 1.

Or if the just-detached queue happens to be the last one, we'll make
shm_mq_receive() read from a potentially already-detached queue in the
very next iteration.

Won't the last-queue case already be handled by the code below:
else
{
    --funnel->nqueues;
    if (funnel->nqueues == 0)
    {
        if (done != NULL)
            *done = true;
        return NULL;
    }

That seems to be caused by not having updated
funnel->nextqueue. With the returned value being SHM_MQ_DETACHED, we'll again
try to remove it from the queue. In this case, the third argument to
memcpy() becomes negative, hence the segfault.

In any case, I think we need some handling for such cases.

I can't seem to really figure out the other problem of waiting forever in
WaitLatch()

The reason seems to be that, in certain scenarios, the way we set the latch
before exiting needs some more thought. Currently we set the latch in
HandleParallelMessageInterrupt(), which doesn't seem to be sufficient.

By the way, you can try reproducing this with the example I posted on Friday.

Sure.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#210Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#168)
Re: Parallel Seq Scan

On Fri, Mar 13, 2015 at 7:06 PM, Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:

Amit Kapila wrote:

I think this can happen if funnel->nextqueue is greater
than funnel->nqueues.
Please see if attached patch fixes the issue, else could you share the
scenario in more detail where you hit this issue.

Uh, isn't this copying an overlapping memory region? If so you should
be using memmove instead.

Agreed, will update this in next version of patch.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#211Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Amit Kapila (#209)
Re: Parallel Seq Scan

On 16-03-2015 PM 04:14, Amit Kapila wrote:

On Mon, Mar 16, 2015 at 9:40 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp>
wrote:

Or if the just-detached queue happens to be the last one, we'll make
shm_mq_receive() read from a potentially already-detached queue in the
very next iteration.

Won't the last-queue case already be handled by the code below:
else
{
    --funnel->nqueues;
    if (funnel->nqueues == 0)
    {
        if (done != NULL)
            *done = true;
        return NULL;
    }

Actually I meant "currently the last", that is:

funnel->nextqueue == funnel->nqueues - 1

So the code you quote would only take care of a subset of the cases.

Imagine funnel->nqueues going down from 5 to 3 in successive iterations while
funnel->nextqueue remains set to 4 (which would have been the "currently last"
when funnel->nqueues was 5).
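
To make the arithmetic in that scenario concrete, here is a minimal C sketch. The tail-count expression is an assumption about how the removal code sizes its copy, not a quote from the patch, and the names `toy_tail_count` and `toy_tail_bytes` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of queue handles stored after position `nextqueue`.
 * Hypothetical helper: the real patch may compute this differently.
 */
int
toy_tail_count(int nqueues, int nextqueue)
{
    return nqueues - nextqueue - 1;
}

/*
 * The same count expressed as the byte length that would be passed
 * to memcpy()/memmove().  A negative count converted to size_t wraps
 * around to a huge unsigned value, which is how a stale nextqueue
 * can turn into an out-of-bounds copy.
 */
size_t
toy_tail_bytes(int nqueues, int nextqueue)
{
    return sizeof(void *) * (size_t) toy_tail_count(nqueues, nextqueue);
}
```

With nqueues = 3 and a stale nextqueue = 4 the count is -2, so the byte length wraps to a value far larger than the array. Any fix therefore has to bring nextqueue back into the range [0, nqueues) before the next removal.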

I can't seem to really figure out the other problem of waiting forever in
WaitLatch()

The reason seems to be that, in certain scenarios, the way we set the latch
before exiting needs some more thought. Currently we set the latch in
HandleParallelMessageInterrupt(), which doesn't seem to be sufficient.

How about shm_mq_detach() called from ParallelQueryMain() right after
exec_parallel_stmt() returns? Doesn't that do the SetLatch() that needs to be
done by a worker?

Thanks,
Amit


#212Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#203)
Re: Parallel Seq Scan

On Fri, Mar 13, 2015 at 7:00 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Mar 13, 2015 at 8:59 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

We can't directly call DestroyParallelContext() to terminate workers as
it can so happen that by that time some of the workers are still not
started.

That shouldn't be a problem. TerminateBackgroundWorker() not only
kills an existing worker if there is one, but also tells the
postmaster that if it hasn't started the worker yet, it should not
bother. So at the conclusion of the first loop inside
DestroyParallelContext(), every running worker will have received
SIGTERM and no more workers will be started.

The problem occurs in the second loop inside DestroyParallelContext(),
where it calls WaitForBackgroundWorkerShutdown(). Basically,
WaitForBackgroundWorkerShutdown() only checks for the BGWH_STOPPED
status; see the code below from the parallel-mode patch:

+ status = GetBackgroundWorkerPid(handle, &pid);
+ if (status == BGWH_STOPPED)
+ return status;

So if the status returned here is BGWH_NOT_YET_STARTED, it
will go into WaitLatch() and wait there forever.

I think the fix is to check whether the status is BGWH_STOPPED or
BGWH_NOT_YET_STARTED, and if so just return the status.

What do you say?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#213Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#212)
Re: Parallel Seq Scan

On Tue, Mar 17, 2015 at 1:42 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

The problem occurs in the second loop inside DestroyParallelContext(),
where it calls WaitForBackgroundWorkerShutdown(). Basically,
WaitForBackgroundWorkerShutdown() only checks for the BGWH_STOPPED
status; see the code below from the parallel-mode patch:

+ status = GetBackgroundWorkerPid(handle, &pid);
+ if (status == BGWH_STOPPED)
+ return status;

So if the status returned here is BGWH_NOT_YET_STARTED, it
will go into WaitLatch() and wait there forever.

I think the fix is to check whether the status is BGWH_STOPPED or
BGWH_NOT_YET_STARTED, and if so just return the status.

What do you say?

No, that's not right. If we return when the status is
BGWH_NOT_YET_STARTED, then the postmaster could subsequently start the
worker.

Can you try this:

diff --git a/src/backend/postmaster/bgworker.c
b/src/backend/postmaster/bgworker.c
index f80141a..39b919f 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -244,6 +244,8 @@ BackgroundWorkerStateChange(void)
                                rw->rw_terminate = true;
                                if (rw->rw_pid != 0)
                                        kill(rw->rw_pid, SIGTERM);
+                               else
+                                       ReportBackgroundWorkerPID(rw);
                        }
                        continue;
                }

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#214Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#213)
Re: Parallel Seq Scan

On Tue, Mar 17, 2015 at 7:54 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Mar 17, 2015 at 1:42 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

The problem occurs in the second loop inside DestroyParallelContext(),
where it calls WaitForBackgroundWorkerShutdown(). Basically,
WaitForBackgroundWorkerShutdown() only checks for the BGWH_STOPPED
status; see the code below from the parallel-mode patch:

+ status = GetBackgroundWorkerPid(handle, &pid);
+ if (status == BGWH_STOPPED)
+ return status;

So if the status returned here is BGWH_NOT_YET_STARTED, it
will go into WaitLatch() and wait there forever.

I think the fix is to check whether the status is BGWH_STOPPED or
BGWH_NOT_YET_STARTED, and if so just return the status.

What do you say?

No, that's not right. If we return when the status is
BGWH_NOT_YET_STARTED, then the postmaster could subsequently start the
worker.

Can you try this:

diff --git a/src/backend/postmaster/bgworker.c
b/src/backend/postmaster/bgworker.c
index f80141a..39b919f 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -244,6 +244,8 @@ BackgroundWorkerStateChange(void)
                                rw->rw_terminate = true;
                                if (rw->rw_pid != 0)
                                        kill(rw->rw_pid, SIGTERM);
+                               else
+                                       ReportBackgroundWorkerPID(rw);
                        }
                        continue;
                }

It didn't fix the problem. IIUC, you have done this to ensure that
if the worker has not already started, its pid is updated, so that we
can get the required status in WaitForBackgroundWorkerShutdown().
As this is a timing issue, it can happen that before the postmaster
gets a chance to report the pid, the backend has already started waiting
on WaitLatch().

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#215Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#214)
1 attachment(s)
Re: Parallel Seq Scan

On Wed, Mar 18, 2015 at 2:22 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Can you try this:

diff --git a/src/backend/postmaster/bgworker.c
b/src/backend/postmaster/bgworker.c
index f80141a..39b919f 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -244,6 +244,8 @@ BackgroundWorkerStateChange(void)
                                rw->rw_terminate = true;
                                if (rw->rw_pid != 0)
                                        kill(rw->rw_pid, SIGTERM);
+                               else
+                                       ReportBackgroundWorkerPID(rw);
                        }
                        continue;
                }

It didn't fix the problem. IIUC, you have done this to ensure that
if the worker has not already started, its pid is updated, so that we
can get the required status in WaitForBackgroundWorkerShutdown().
As this is a timing issue, it can happen that before the postmaster
gets a chance to report the pid, the backend has already started waiting
on WaitLatch().

I think I figured out the problem. That fix only helps in the case
where the postmaster noticed the new registration previously but
didn't start the worker, and then later notices the termination.
What's much more likely to happen is that the worker is started and
terminated so quickly that both happen before we create a
RegisteredBgWorker for it. The attached patch fixes that case, too.

Assuming this actually fixes the problem, I think we should back-patch
it into 9.4. To recap, the problem is that, at present, if you
register a worker and then terminate it before it's launched,
GetBackgroundWorkerPid() will still return BGWH_NOT_YET_STARTED, which
makes it seem like we're still waiting for it to start. But when
or if the slot is reused for an unrelated registration, then
GetBackgroundWorkerPid() will switch to returning BGWH_STOPPED. It's
hard to believe that's the behavior anyone wants. With this patch,
the return value will always be BGWH_STOPPED in this situation. That
has the virtue of being consistent, and practically speaking I think
it's the behavior that everyone will want, because the case where this
matters is when you are waiting for workers to start or waiting for
workers to stop, and in either case you will want to treat a worker
that was marked for termination before the postmaster actually started
it as already-stopped rather than not-yet-started.
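
The behaviour described in that recap can be illustrated with a small toy model in C. This is only a sketch under stated assumptions: `ToySlot`, `toy_get_status` and `toy_release_terminated_slot` are hypothetical names, the status rules are simplified, and the memory barrier and the SIGUSR1 delivery from the attached patch are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_INVALID_PID (-1)    /* assumed sentinel for "never started" */

typedef enum ToyStatus
{
    TOY_STARTED,
    TOY_NOT_YET_STARTED,
    TOY_STOPPED
} ToyStatus;

/* Simplified stand-in for a background worker slot. */
typedef struct ToySlot
{
    bool        in_use;
    bool        terminate;
    int         pid;            /* TOY_INVALID_PID, 0 (dead), or a live pid */
    int         notify_pid;     /* process to wake up, or 0 for none */
    unsigned    generation;
} ToySlot;

/* What a waiting backend would conclude from the slot contents. */
ToyStatus
toy_get_status(const ToySlot *slot, unsigned handle_generation)
{
    if (!slot->in_use || slot->generation != handle_generation ||
        slot->pid == 0)
        return TOY_STOPPED;
    if (slot->pid == TOY_INVALID_PID)
        return TOY_NOT_YET_STARTED;
    return TOY_STARTED;
}

/*
 * Release a slot that was marked for termination before the worker was
 * ever launched.  Returns the pid that should be signalled (0 if none),
 * so the caller can wake the waiting process.
 */
int
toy_release_terminated_slot(ToySlot *slot)
{
    int         notify_pid;

    assert(slot->terminate);
    notify_pid = slot->notify_pid;  /* read before giving up the slot */
    slot->pid = 0;
    slot->in_use = false;
    return notify_pid;
}
```

In this model a registered-then-terminated worker reports "not yet started" until the slot is released, and "stopped" afterwards. Returning the notify pid stands in for the wakeup that lets a backend blocked in a wait loop re-check the status.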

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

stop-notify-fix-v2.patch (binary/octet-stream)
From 95579c930ee14e2baa60149e7084e4a97b02bc80 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 17 Mar 2015 11:09:29 -0400
Subject: [PATCH 5/5] Immediately notify about a dead, never-started worker.

---
 src/backend/postmaster/bgworker.c |   25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index f80141a..fe94c8d 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -244,14 +244,37 @@ BackgroundWorkerStateChange(void)
 				rw->rw_terminate = true;
 				if (rw->rw_pid != 0)
 					kill(rw->rw_pid, SIGTERM);
+				else
+				{
+					/* Report never-started, now-terminated worker as dead. */
+					ReportBackgroundWorkerPID(rw);
+				}
 			}
 			continue;
 		}
 
-		/* If it's already flagged as do not restart, just release the slot. */
+		/*
+		 * If the worker is marked for termination, we don't need to add it
+		 * to the registered workers list; we can just free the slot.
+		 * However, if bgw_notify_pid is set, the process that registered the
+		 * worker may need to know that we've processed the terminate request,
+		 * so be sure to signal it.
+		 */
 		if (slot->terminate)
 		{
+			int	notify_pid;
+
+			/*
+			 * We need a memory barrier here to make sure that the load of
+			 * bgw_notify_pid completes before the store to in_use.
+			 */
+			notify_pid = slot->worker.bgw_notify_pid;
+			pg_memory_barrier();
+			slot->pid = 0;
 			slot->in_use = false;
+			if (notify_pid != 0)
+				kill(notify_pid, SIGUSR1);
+
 			continue;
 		}
 
-- 
1.7.9.6 (Apple Git-31.1)

#216Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#207)
Re: Parallel Seq Scan

On Sat, Mar 14, 2015 at 1:04 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

# EXPLAIN SELECT DISTINCT bid FROM pgbench_accounts;
ERROR: too many dynamic shared memory segments

This happens because we have maximum limit on the number of
dynamic shared memory segments in the system.

In function dsm_postmaster_startup(), it is defined as follows:

maxitems = PG_DYNSHMEM_FIXED_SLOTS
+ PG_DYNSHMEM_SLOTS_PER_BACKEND * MaxBackends;

In the above case, it is choosing a parallel plan for each
AppendRelation (because of seq_page_cost = 1000), and that causes the test to
cross the max limit of dsm segments.
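
As a rough worked example of that limit, the formula can be written as a small C helper. The function names are hypothetical, and the constant values used in the note below are illustrative assumptions rather than values taken from this thread; the actual definitions live in dsm.c.

```c
#include <assert.h>
#include <stdbool.h>

/* maxitems = fixed slots + per-backend slots * MaxBackends */
int
toy_dsm_maxitems(int fixed_slots, int slots_per_backend, int max_backends)
{
    assert(fixed_slots >= 0 && slots_per_backend >= 0 && max_backends >= 0);
    return fixed_slots + slots_per_backend * max_backends;
}

/*
 * True when a plan that wants one segment per parallel scan node
 * would run past the system-wide limit.
 */
bool
toy_exceeds_dsm_limit(int parallel_nodes, int segments_in_use, int maxitems)
{
    return segments_in_use + parallel_nodes > maxitems;
}
```

For instance, assuming 64 fixed slots, 2 slots per backend and 100 backends, the limit comes to 264 segments, so an inheritance hierarchy with 300 parallel-scanned children would exceed it even on an otherwise idle system.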

The problem here is, of course, that each parallel sequential scan is
trying to create an entirely separate group of workers. Eventually, I
think we should fix this by rejiggering things so that when there are
multiple parallel nodes in a plan, they all share a pool of workers.
So each worker would actually get a list of plan nodes instead of a
single plan node. Maybe it works on the first node in the list until
that's done, and then moves onto the next, or maybe it round-robins
among all the nodes and works on the ones where the output tuple
queues aren't currently full, or maybe the master somehow notifies the
workers which nodes are most useful to work on at the present time.
But I think trying to figure this out is far too ambitious for 9.5,
and I think we can have a useful feature without implementing any of
it.

But, we can't just ignore the issue right now, because erroring out on
a large inheritance hierarchy is no good. Instead, we should fall
back to non-parallel operation in this case. By the time we discover
the problem, it's too late to change the plan, because it's already
execution time. So we are going to be stuck executing the parallel
node - just with no workers to help. However, what I think we can do
is use a slab of backend-private memory instead of a dynamic shared
memory segment, and in that way avoid this error. We do something
similar when starting the postmaster in stand-alone mode: the main
shared memory segment is replaced by a backend-private allocation with
the same contents that the shared memory segment would normally have.
The same fix will work here.
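
A minimal sketch of that fallback idea, in C, might look as follows. Everything here is hypothetical: `ToyPool` merely counts available segment slots, both kinds of memory are simulated with calloc(), and none of the names correspond to the real dynamic shared memory API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for the system-wide pool of shared segment slots. */
typedef struct ToyPool
{
    int         slots_left;
} ToyPool;

typedef struct ToySegment
{
    void       *base;
    size_t      size;
    bool        is_private;     /* true when we fell back to local memory */
} ToySegment;

/*
 * Try to take a shared slot; if none is left, fall back to a private
 * allocation with the same layout instead of raising an error.
 */
bool
toy_segment_create(ToyPool *pool, ToySegment *seg, size_t size)
{
    assert(size > 0);
    seg->base = calloc(1, size);
    if (seg->base == NULL)
        return false;
    seg->size = size;
    if (pool->slots_left > 0)
    {
        pool->slots_left--;
        seg->is_private = false;
    }
    else
        seg->is_private = true;
    return true;
}

/* Private memory is invisible to other processes, so no workers can help. */
int
toy_usable_workers(const ToySegment *seg, int planned_workers)
{
    return seg->is_private ? 0 : planned_workers;
}

void
toy_segment_destroy(ToyPool *pool, ToySegment *seg)
{
    if (!seg->is_private)
        pool->slots_left++;
    free(seg->base);
    seg->base = NULL;
    seg->size = 0;
}
```

The point of the design is that the executor code above the segment does not need two code paths: it sees the same layout either way and simply runs with zero workers when the segment is private.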

Even once we make the planner and executor smarter, so that they don't
create lots of shared memory segments and lots of separate worker
pools in this type of case, it's probably still useful to have this as
a fallback approach, because there's always the possibility that some
other client of the dynamic shared memory system could gobble up all
the segments. So, I'm going to go try to figure out the best way to
implement this.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#217Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#216)
Re: Parallel Seq Scan

On Wed, Mar 18, 2015 at 10:45 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Sat, Mar 14, 2015 at 1:04 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

# EXPLAIN SELECT DISTINCT bid FROM pgbench_accounts;
ERROR: too many dynamic shared memory segments

This happens because we have maximum limit on the number of
dynamic shared memory segments in the system.

In function dsm_postmaster_startup(), it is defined as follows:

maxitems = PG_DYNSHMEM_FIXED_SLOTS
+ PG_DYNSHMEM_SLOTS_PER_BACKEND * MaxBackends;

In the above case, it is choosing a parallel plan for each
AppendRelation (because of seq_page_cost = 1000), and that causes the test to
cross the max limit of dsm segments.

The problem here is, of course, that each parallel sequential scan is
trying to create an entirely separate group of workers. Eventually, I
think we should fix this by rejiggering things so that when there are
multiple parallel nodes in a plan, they all share a pool of workers.
So each worker would actually get a list of plan nodes instead of a
single plan node. Maybe it works on the first node in the list until
that's done, and then moves onto the next, or maybe it round-robins
among all the nodes and works on the ones where the output tuple
queues aren't currently full, or maybe the master somehow notifies the
workers which nodes are most useful to work on at the present time.

Good idea. I think for this particular case, we might want to optimize
the work distribution such that each worker gets one independent relation
segment to scan.

But I think trying to figure this out is far too ambitious for 9.5,
and I think we can have a useful feature without implementing any of
it.

Agreed.

But, we can't just ignore the issue right now, because erroring out on
a large inheritance hierarchy is no good. Instead, we should fall
back to non-parallel operation in this case. By the time we discover
the problem, it's too late to change the plan, because it's already
execution time. So we are going to be stuck executing the parallel
node - just with no workers to help. However, what I think we can do
is use a slab of backend-private memory instead of a dynamic shared
memory segment, and in that way avoid this error. We do something
similar when starting the postmaster in stand-alone mode: the main
shared memory segment is replaced by a backend-private allocation with
the same contents that the shared memory segment would normally have.
The same fix will work here.

Even once we make the planner and executor smarter, so that they don't
create lots of shared memory segments and lots of separate worker
pools in this type of case, it's probably still useful to have this as
a fallback approach, because there's always the possibility that some
other client of the dynamic shared memory system could gobble up all
the segments. So, I'm going to go try to figure out the best way to
implement this.

Thanks.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#218Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#215)
Re: Parallel Seq Scan

On Wed, Mar 18, 2015 at 9:14 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Mar 18, 2015 at 2:22 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

It didn't fix the problem. IIUC, you have done this to ensure that
if the worker has not already started, its pid is updated, so that we
can get the required status in WaitForBackgroundWorkerShutdown().
As this is a timing issue, it can happen that before the postmaster
gets a chance to report the pid, the backend has already started waiting
on WaitLatch().

I think I figured out the problem. That fix only helps in the case
where the postmaster noticed the new registration previously but
didn't start the worker, and then later notices the termination.
What's much more likely to happen is that the worker is started and
terminated so quickly that both happen before we create a
RegisteredBgWorker for it. The attached patch fixes that case, too.

The patch fixes the problem, and now for rescan we don't need to wait
for workers to finish.

Assuming this actually fixes the problem, I think we should back-patch
it into 9.4.

+1

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#219Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#218)
Re: Parallel Seq Scan

On Wed, Mar 18, 2015 at 11:43 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

The patch fixes the problem, and now for rescan we don't need to wait
for workers to finish.

Assuming this actually fixes the problem, I think we should back-patch
it into 9.4.

+1

OK, done.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#220Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Langote (#211)
1 attachment(s)
Re: Parallel Seq Scan

On Mon, Mar 16, 2015 at 12:58 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 16-03-2015 PM 04:14, Amit Kapila wrote:

On Mon, Mar 16, 2015 at 9:40 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

Or if the just-detached queue happens to be the last one, we'll make
shm_mq_receive() read from a potentially already-detached queue in the
very next iteration.

Won't the last-queue case already be handled by the code below:
else
{
    --funnel->nqueues;
    if (funnel->nqueues == 0)
    {
        if (done != NULL)
            *done = true;
        return NULL;
    }

Actually I meant "currently the last", that is:

funnel->nextqueue == funnel->nqueues - 1

So the code you quote would only take care of a subset of the cases.

Fixed this issue by resetting funnel->nextqueue to zero (as per off-list
discussion with Robert), so that it restarts from the first queue in such
a case.
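
The fix described above can be sketched in isolation as follows. This is a toy model, not the code from the attached patch: `ToyFunnel` and `toy_funnel_remove_current` are hypothetical, queue handles are plain ints, and memmove() is used because the source and destination ranges overlap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_QUEUES 8

/* Hypothetical stand-in for the tuple queue funnel. */
typedef struct ToyFunnel
{
    int         queue[TOY_MAX_QUEUES];
    int         nqueues;
    int         nextqueue;
} ToyFunnel;

/*
 * Drop the queue at funnel->nextqueue after it reported "detached".
 * Returns true when no queues remain, i.e. the scan is done.
 */
bool
toy_funnel_remove_current(ToyFunnel *funnel)
{
    int         pos = funnel->nextqueue;
    int         ntail;

    assert(funnel->nqueues > 0);
    assert(pos >= 0 && pos < funnel->nqueues);

    /* Close the gap by shifting the handles that follow pos down by one. */
    ntail = funnel->nqueues - pos - 1;
    if (ntail > 0)
        memmove(&funnel->queue[pos], &funnel->queue[pos + 1],
                sizeof(int) * (size_t) ntail);
    funnel->nqueues--;

    if (funnel->nqueues == 0)
        return true;

    /* If the removed queue was the last one, wrap around to the first. */
    if (funnel->nextqueue >= funnel->nqueues)
        funnel->nextqueue = 0;
    return false;
}
```

Keeping nextqueue strictly below nqueues after every removal means the tail count can never go negative, whichever order the workers detach in.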

I can't seem to really figure out the other problem of waiting forever in
WaitLatch()

The reason seems to be that, in certain scenarios, the way we set the latch
before exiting needs some more thought. Currently we set the latch in
HandleParallelMessageInterrupt(), which doesn't seem to be sufficient.

How about shm_mq_detach() called from ParallelQueryMain() right after
exec_parallel_stmt() returns? Doesn't that do the SetLatch() that needs to be
done by a worker?

Fixed this issue by not going into the wait in case of detached queues.

Apart from these fixes, the latest patch contains the changes below:

1. Integrated with assess-parallel-safety-v4.patch [1]. To test
with this patch, please remember to comment out the line below
in this patch, else it will always enter parallel mode.
+ glob->parallelModeNeeded = glob->parallelModeOK; /* XXX JUST FOR TESTING */

2. Handle the case where enough workers are not available for
execution of the Funnel node. In such a case it will run the plan
with the available number of workers, and in case no worker is available,
it will just run the local partial seq scan node. I think we can
invent a more advanced solution to this problem if there
is a strong need after the first version goes in.

3. Support for pg_stat_statements (it will show the stats for parallel
statements). To handle this case, we need to share buffer usage
stats from all the workers. Currently the patch collects
buffer usage stats by default (even when pg_stat_statements is
not enabled), as that is quite cheap, and we can make it conditional
if required in future.
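
As an illustration of item 3, the master side could fold per-worker counters into its own totals roughly like this. The sketch assumes the workers write fixed-size records back to back into a shared area; `ToyBufferUsage` and `toy_accumulate_usage` are hypothetical and carry only two counters.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, cut-down buffer usage record. */
typedef struct ToyBufferUsage
{
    long        blks_hit;
    long        blks_read;
} ToyBufferUsage;

/*
 * Add the counters of `nworkers` records stored contiguously in `area`
 * to `total`.  Each record is copied out with memcpy() so the area does
 * not need to be aligned for ToyBufferUsage.
 */
void
toy_accumulate_usage(ToyBufferUsage *total, const char *area, int nworkers)
{
    int         i;

    assert(nworkers >= 0);
    for (i = 0; i < nworkers; i++)
    {
        ToyBufferUsage w;

        memcpy(&w, area + (size_t) i * sizeof(ToyBufferUsage), sizeof(w));
        total->blks_hit += w.blks_hit;
        total->blks_read += w.blks_read;
    }
}
```

The master would call this once after the workers have finished, starting from its own local counters, so that the reported totals cover the whole parallel statement.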

So the patches have to be applied in the sequence below:
HEAD Commit-id : 8d1f2390
parallel-mode-v8.1.patch [2]
assess-parallel-safety-v4.patch [1]
parallel-heap-scan.patch [3]
parallel_seqscan_v11.patch (attached with this mail)

The reason for not using the latest commit in HEAD is that the latest
version of the assess-parallel-safety patch did not apply there,
so I generated the patch at a commit-id where I could apply that
patch successfully.

[1]: /messages/by-id/CA+TgmobJSuefiPOk6+i9WERUgeAB3ggJv7JxLX+r6S5SYydBRQ@mail.gmail.com
[2]: /messages/by-id/CA+TgmoZJjzYnpXChL3gr7NwRUzkAzPMPVKAtDt5sHvC5Cd7RKw@mail.gmail.com
[3]: /messages/by-id/CA+TgmoYJETgeAXUsZROnA7BdtWzPtqExPJNTV1GKcaVMgSdhug@mail.gmail.com

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

parallel_seqscan_v11.patch (application/octet-stream)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 383e15b..d384e8f 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1582,6 +1582,20 @@ heap_beginscan_parallel(Relation relation, ParallelHeapScanDesc parallel_scan)
 }
 
 /* ----------------
+ *		heap_parallel_rescan		- restart a parallel relation scan
+ * ----------------
+ */
+void
+heap_parallel_rescan(ParallelHeapScanDesc pscan,
+					 HeapScanDesc scan)
+{
+	if (pscan != NULL)
+		scan->rs_parallel = pscan;
+
+	heap_rescan(scan,			/* scan desc */
+				NULL);			/* new scan keys */
+}
+/* ----------------
  *		heap_getnext	- retrieve next tuple in scan
  *
  *		Fix to work with index relations.
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index c639552..8b2cc26 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -721,6 +721,8 @@ ExplainPreScanNode(PlanState *planstate, Bitmapset **rels_used)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
+		case T_Funnel:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -916,6 +918,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_SeqScan:
 			pname = sname = "Seq Scan";
 			break;
+		case T_PartialSeqScan:
+			pname = sname = "Partial Seq Scan";
+			break;
+		case T_Funnel:
+			pname = sname = "Funnel";
+			break;
 		case T_IndexScan:
 			pname = sname = "Index Scan";
 			break;
@@ -1065,6 +1073,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
+		case T_Funnel:
 		case T_BitmapHeapScan:
 		case T_TidScan:
 		case T_SubqueryScan:
@@ -1206,6 +1216,24 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	}
 
 	/*
+	 * Aggregate instrumentation information of all the backend
+	 * workers for parallel sequence scan.
+	 */
+	if (es->analyze && nodeTag(plan) == T_Funnel)
+	{
+		int i;
+		Instrumentation *instrument_worker;
+		int nworkers = ((FunnelState *)planstate)->pcxt->nworkers;
+		char *inst_info_workers = ((FunnelState *)planstate)->inst_options_space;
+
+		for (i = 0; i < nworkers; i++)
+		{
+			instrument_worker = (Instrumentation *)(inst_info_workers + (i * sizeof(Instrumentation)));
+			InstrAggNode(planstate->instrument, instrument_worker);
+		}
+	}
+
+	/*
 	 * We have to forcibly clean up the instrumentation state because we
 	 * haven't done ExecutorEnd yet.  This is pretty grotty ...
 	 *
@@ -1322,6 +1350,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 				show_tidbitmap_info((BitmapHeapScanState *) planstate, es);
 			break;
 		case T_SeqScan:
+		case T_PartialSeqScan:
 		case T_ValuesScan:
 		case T_CteScan:
 		case T_WorkTableScan:
@@ -1331,6 +1360,14 @@ ExplainNode(PlanState *planstate, List *ancestors,
 				show_instrumentation_count("Rows Removed by Filter", 1,
 										   planstate, es);
 			break;
+		case T_Funnel:
+			show_scan_qual(plan->qual, "Filter", planstate, ancestors, es);
+			if (plan->qual)
+				show_instrumentation_count("Rows Removed by Filter", 1,
+										   planstate, es);
+			ExplainPropertyInteger("Number of Workers",
+				((Funnel *) plan)->num_workers, es);
+			break;
 		case T_FunctionScan:
 			if (es->verbose)
 			{
@@ -2214,6 +2251,8 @@ ExplainTargetRel(Plan *plan, Index rti, ExplainState *es)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
+		case T_Funnel:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index af707b0..991ff51 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -16,14 +16,15 @@ OBJS = execAmi.o execCurrent.o execGrouping.o execJunk.o execMain.o \
        execProcnode.o execQual.o execScan.o execTuples.o \
        execUtils.o functions.o instrument.o nodeAppend.o nodeAgg.o \
        nodeBitmapAnd.o nodeBitmapOr.o \
-       nodeBitmapHeapscan.o nodeBitmapIndexscan.o nodeCustom.o nodeHash.o \
-       nodeHashjoin.o nodeIndexscan.o nodeIndexonlyscan.o \
+       nodeBitmapHeapscan.o nodeBitmapIndexscan.o nodeCustom.o nodeFunnel.o \
+       nodeHash.o nodeHashjoin.o nodeIndexscan.o nodeIndexonlyscan.o \
        nodeLimit.o nodeLockRows.o \
        nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
        nodeNestloop.o nodeFunctionscan.o nodeRecursiveunion.o nodeResult.o \
-       nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
-       nodeValuesscan.o nodeCtescan.o nodeWorktablescan.o \
+       nodeSeqscan.o nodePartialSeqscan.o nodeSetOp.o nodeSort.o \
+       nodeUnique.o nodeValuesscan.o nodeCtescan.o nodeWorktablescan.o \
        nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
-       nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o spi.o
+       nodeForeignscan.o nodeWindowAgg.o tqueue.o tstoreReceiver.o \
+       spi.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 6ebad2f..10dc319 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -24,6 +24,7 @@
 #include "executor/nodeCustom.h"
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeFunctionscan.h"
+#include "executor/nodeFunnel.h"
 #include "executor/nodeGroup.h"
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
@@ -37,6 +38,7 @@
 #include "executor/nodeMergejoin.h"
 #include "executor/nodeModifyTable.h"
 #include "executor/nodeNestloop.h"
+#include "executor/nodePartialSeqscan.h"
 #include "executor/nodeRecursiveunion.h"
 #include "executor/nodeResult.h"
 #include "executor/nodeSeqscan.h"
@@ -155,6 +157,14 @@ ExecReScan(PlanState *node)
 			ExecReScanSeqScan((SeqScanState *) node);
 			break;
 
+		case T_PartialSeqScanState:
+			ExecReScanPartialSeqScan((PartialSeqScanState *) node);
+			break;
+
+		case T_FunnelState:
+			ExecReScanFunnel((FunnelState *) node);
+			break;
+
 		case T_IndexScanState:
 			ExecReScanIndexScan((IndexScanState *) node);
 			break;
@@ -458,6 +468,10 @@ ExecSupportsBackwardScan(Plan *node)
 		case T_CteScan:
 			return TargetListSupportsBackwardScan(node->targetlist);
 
+		case T_Funnel:
+		case T_PartialSeqScan:
+			return false;
+
 		case T_IndexScan:
 			return IndexSupportsBackwardScan(((IndexScan *) node)->indexid) &&
 				TargetListSupportsBackwardScan(node->targetlist);
diff --git a/src/backend/executor/execCurrent.c b/src/backend/executor/execCurrent.c
index 1c8be25..f13b7bcb 100644
--- a/src/backend/executor/execCurrent.c
+++ b/src/backend/executor/execCurrent.c
@@ -261,6 +261,8 @@ search_plan_tree(PlanState *node, Oid table_oid)
 			 * Relation scan nodes can all be treated alike
 			 */
 		case T_SeqScanState:
+		case T_PartialSeqScanState:
+		case T_FunnelState:
 		case T_IndexScanState:
 		case T_IndexOnlyScanState:
 		case T_BitmapHeapScanState:
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 5f361d2..5bc4da2 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -181,6 +181,8 @@ standard_ExecutorStart(QueryDesc *queryDesc, int eflags)
 		estate->es_param_exec_vals = (ParamExecData *)
 			palloc0(queryDesc->plannedstmt->nParamExec * sizeof(ParamExecData));
 
+	estate->toc = queryDesc->toc;
+
 	/*
 	 * If non-read-only query, set the command ID to mark output tuples with
 	 */
@@ -318,6 +320,9 @@ standard_ExecutorRun(QueryDesc *queryDesc,
 	operation = queryDesc->operation;
 	dest = queryDesc->dest;
 
+	/* inform executor to collect buffer usage stats from parallel workers. */
+	estate->total_time = queryDesc->totaltime ? 1 : 0;
+
 	/*
 	 * startup tuple receiver, if we will be emitting tuples
 	 */
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 9892499..1a1275c 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -100,6 +100,8 @@
 #include "executor/nodeMergejoin.h"
 #include "executor/nodeModifyTable.h"
 #include "executor/nodeNestloop.h"
+#include "executor/nodePartialSeqscan.h"
+#include "executor/nodeFunnel.h"
 #include "executor/nodeRecursiveunion.h"
 #include "executor/nodeResult.h"
 #include "executor/nodeSeqscan.h"
@@ -190,6 +192,16 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												   estate, eflags);
 			break;
 
+		case T_PartialSeqScan:
+			result = (PlanState *) ExecInitPartialSeqScan((PartialSeqScan *) node,
+														  estate, eflags);
+			break;
+
+		case T_Funnel:
+			result = (PlanState *) ExecInitFunnel((Funnel *) node,
+												  estate, eflags);
+			break;
+
 		case T_IndexScan:
 			result = (PlanState *) ExecInitIndexScan((IndexScan *) node,
 													 estate, eflags);
@@ -406,6 +418,14 @@ ExecProcNode(PlanState *node)
 			result = ExecSeqScan((SeqScanState *) node);
 			break;
 
+		case T_PartialSeqScanState:
+			result = ExecPartialSeqScan((PartialSeqScanState *) node);
+			break;
+
+		case T_FunnelState:
+			result = ExecFunnel((FunnelState *) node);
+			break;
+
 		case T_IndexScanState:
 			result = ExecIndexScan((IndexScanState *) node);
 			break;
@@ -644,6 +664,14 @@ ExecEndNode(PlanState *node)
 			ExecEndSeqScan((SeqScanState *) node);
 			break;
 
+		case T_PartialSeqScanState:
+			ExecEndPartialSeqScan((PartialSeqScanState *) node);
+			break;
+
+		case T_FunnelState:
+			ExecEndFunnel((FunnelState *) node);
+			break;
+
 		case T_IndexScanState:
 			ExecEndIndexScan((IndexScanState *) node);
 			break;
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index 022041b..79eeaee 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -145,6 +145,8 @@ CreateExecutorState(void)
 
 	estate->es_auxmodifytables = NIL;
 
+	estate->toc = NULL;
+
 	estate->es_per_tuple_exprcontext = NULL;
 
 	estate->es_epqTuple = NULL;
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index f5351eb..283a136 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -19,9 +19,6 @@
 
 BufferUsage pgBufferUsage;
 
-static void BufferUsageAccumDiff(BufferUsage *dst,
-					 const BufferUsage *add, const BufferUsage *sub);
-
 
 /* Allocate new instrumentation structure(s) */
 Instrumentation *
@@ -127,8 +124,30 @@ InstrEndLoop(Instrumentation *instr)
 	instr->tuplecount = 0;
 }
 
+/*
+ * Aggregate instrumentation information.  This is used to fold the
+ * statistics gathered by worker backends into the master backend's
+ * node.  Only the buffer usage and tuple count statistics need to be
+ * summed; for the timing-related statistics the master backend's own
+ * values are sufficient.
+ */
+void
+InstrAggNode(Instrumentation *instr1, Instrumentation *instr2)
+{
+	/* count the returned tuples */
+	instr1->tuplecount += instr2->tuplecount;
+
+	instr1->nfiltered1 += instr2->nfiltered1;
+	instr1->nfiltered2 += instr2->nfiltered2;
+
+	/* Add the worker's buffer usage to this node's totals */
+	if (instr1->need_bufusage)
+		BufferUsageAdd(&instr1->bufusage, &instr2->bufusage);
+
+}
+
 /* dst += add - sub */
-static void
+void
 BufferUsageAccumDiff(BufferUsage *dst,
 					 const BufferUsage *add,
 					 const BufferUsage *sub)
@@ -148,3 +167,21 @@ BufferUsageAccumDiff(BufferUsage *dst,
 	INSTR_TIME_ACCUM_DIFF(dst->blk_write_time,
 						  add->blk_write_time, sub->blk_write_time);
 }
+
+/* dst += add */
+void
+BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
+{
+	dst->shared_blks_hit += add->shared_blks_hit;
+	dst->shared_blks_read += add->shared_blks_read;
+	dst->shared_blks_dirtied += add->shared_blks_dirtied;
+	dst->shared_blks_written += add->shared_blks_written;
+	dst->local_blks_hit += add->local_blks_hit;
+	dst->local_blks_read += add->local_blks_read;
+	dst->local_blks_dirtied += add->local_blks_dirtied;
+	dst->local_blks_written += add->local_blks_written;
+	dst->temp_blks_read += add->temp_blks_read;
+	dst->temp_blks_written += add->temp_blks_written;
+	INSTR_TIME_ADD(dst->blk_read_time, add->blk_read_time);
+	INSTR_TIME_ADD(dst->blk_write_time, add->blk_write_time);
+}
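As a side note on the buffer usage aggregation above: the leader walks a flat array of per-worker counter slots and adds each slot into its own totals. The following is a minimal, self-contained sketch of that pattern, not the patch's code. DemoBufferUsage, demo_buffer_usage_add and demo_aggregate_workers are hypothetical names, and the struct is a cut-down stand-in for BufferUsage with only three counters.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-in for PostgreSQL's BufferUsage. */
typedef struct DemoBufferUsage
{
	long		shared_blks_hit;
	long		shared_blks_read;
	long		temp_blks_written;
} DemoBufferUsage;

/* dst += add, field by field (same shape as BufferUsageAdd in the patch). */
static void
demo_buffer_usage_add(DemoBufferUsage *dst, const DemoBufferUsage *add)
{
	dst->shared_blks_hit += add->shared_blks_hit;
	dst->shared_blks_read += add->shared_blks_read;
	dst->temp_blks_written += add->temp_blks_written;
}

/*
 * Fold the per-worker slots, laid out back to back in one buffer,
 * into the leader's totals.  "space" is assumed to be suitably aligned
 * for DemoBufferUsage, as a shared memory chunk would be.
 */
static void
demo_aggregate_workers(DemoBufferUsage *total, const char *space, int nworkers)
{
	int			i;

	assert(total != NULL && (space != NULL || nworkers == 0));

	for (i = 0; i < nworkers; i++)
	{
		const DemoBufferUsage *worker =
			(const DemoBufferUsage *) (space + (size_t) i * sizeof(DemoBufferUsage));

		demo_buffer_usage_add(total, worker);
	}
}
```

Calling demo_aggregate_workers with an array of two slots adds both slots to whatever the leader had already counted, which is the behaviour ExecFunnel relies on once the scan is exhausted.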
diff --git a/src/backend/executor/nodeFunnel.c b/src/backend/executor/nodeFunnel.c
new file mode 100644
index 0000000..8af19a4
--- /dev/null
+++ b/src/backend/executor/nodeFunnel.c
@@ -0,0 +1,408 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeFunnel.c
+ *	  Support routines for parallel sequential scans of relations.
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeFunnel.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * INTERFACE ROUTINES
+ *		ExecFunnel				scans a relation.
+ *		ExecInitFunnel			creates and initializes a funnel node.
+ *		ExecEndFunnel			releases any storage allocated.
+ *		ExecReScanFunnel		rescans a relation
+ */
+#include "postgres.h"
+
+#include "access/relscan.h"
+#include "access/xact.h"
+#include "commands/dbcommands.h"
+#include "executor/execdebug.h"
+#include "executor/nodeSeqscan.h"
+#include "executor/nodeFunnel.h"
+#include "postmaster/backendworker.h"
+#include "utils/rel.h"
+
+
+static TupleTableSlot *funnel_getnext(FunnelState *funnelstate);
+
+/* ----------------------------------------------------------------
+ *						Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		InitFunnel
+ *
+ *		Set up parallel state information
+ * ----------------------------------------------------------------
+ */
+static void
+InitFunnel(FunnelState *node, EState *estate, int eflags)
+{
+	Relation	currentRelation;
+
+	/*
+	 * get the relation object id from the relid'th entry in the range table,
+	 * open that relation and acquire appropriate lock on it.
+	 */
+	currentRelation = ExecOpenScanRelation(estate,
+										   ((SeqScan *) node->ss.ps.plan)->scanrelid,
+										   eflags);
+
+	/* Initialize the workers required to perform parallel scan. */
+	InitializeParallelWorkers(node->ss.ps.plan->lefttree,
+							  estate,
+							  currentRelation,
+							  &node->inst_options_space,
+							  &node->buffer_usage_space,
+							  &node->responseq,
+							  &node->pcxt,
+							  ((Funnel *)(node->ss.ps.plan))->num_workers);
+
+	estate->toc = node->pcxt->toc;
+
+	node->ss.ss_currentRelation = currentRelation;
+
+	/* and report the scan tuple slot's rowtype */
+	ExecAssignScanType(&node->ss, RelationGetDescr(currentRelation));
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitFunnel
+ * ----------------------------------------------------------------
+ */
+FunnelState *
+ExecInitFunnel(Funnel *node, EState *estate, int eflags)
+{
+	FunnelState *funnelstate;
+
+	/* Funnel node doesn't have innerPlan node. */
+	Assert(innerPlan(node) == NULL);
+
+	/*
+	 * create state structure
+	 */
+	funnelstate = makeNode(FunnelState);
+	funnelstate->ss.ps.plan = (Plan *) node;
+	funnelstate->ss.ps.state = estate;
+	funnelstate->fs_workersReady = false;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * create expression context for node
+	 */
+	ExecAssignExprContext(estate, &funnelstate->ss.ps);
+
+	/*
+	 * initialize child expressions
+	 */
+	funnelstate->ss.ps.targetlist = (List *)
+		ExecInitExpr((Expr *) node->scan.plan.targetlist,
+					 (PlanState *) funnelstate);
+	funnelstate->ss.ps.qual = (List *)
+		ExecInitExpr((Expr *) node->scan.plan.qual,
+					 (PlanState *) funnelstate);
+
+	/*
+	 * tuple table initialization
+	 */
+	ExecInitResultTupleSlot(estate, &funnelstate->ss.ps);
+	ExecInitScanTupleSlot(estate, &funnelstate->ss);
+
+	InitFunnel(funnelstate, estate, eflags);
+
+	/*
+	 * now initialize outer plan
+	 */
+	outerPlanState(funnelstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+
+	funnelstate->ss.ps.ps_TupFromTlist = false;
+
+	/*
+	 * Initialize result tuple type and projection info.
+	 */
+	ExecAssignResultTypeFromTL(&funnelstate->ss.ps);
+	ExecAssignScanProjectionInfo(&funnelstate->ss);
+
+	/* Initialize scan state of workers. */
+	funnelstate->all_workers_done = false;
+	funnelstate->local_scan_done = false;
+
+	return funnelstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecFunnel(node)
+ *
+ *		Scans the relation via multiple workers and returns
+ *		the next qualifying tuple.
+ * ----------------------------------------------------------------
+ */
+TupleTableSlot *
+ExecFunnel(FunnelState *node)
+{
+	int			i;
+	TupleTableSlot *slot;
+
+	/*
+	 * If a parallel context is set and workers are not yet registered,
+	 * register them now.  If no workers are available, the funnel node
+	 * just scans locally; however, we retry launching the workers on
+	 * each pass, so that a worker that becomes available later can
+	 * still be used.
+	 */
+	if (node->pcxt && !node->fs_workersReady)
+	{
+		bool any_worker_launched = false;
+
+		/* Register backend workers. */
+		LaunchParallelWorkers(node->pcxt);
+
+		node->funnel = CreateTupleQueueFunnel();
+
+		for (i = 0; i < node->pcxt->nworkers; ++i)
+		{
+			if (node->pcxt->worker[i].bgwhandle)
+			{
+				shm_mq_set_handle((node->responseq)[i], node->pcxt->worker[i].bgwhandle);
+				RegisterTupleQueueOnFunnel(node->funnel, (node->responseq)[i]);
+				any_worker_launched = true;
+			}
+		}
+
+		if (any_worker_launched)
+			node->fs_workersReady = true;
+	}
+	
+	slot = funnel_getnext(node);
+	
+	/*
+	 * if required by plugin, aggregate the buffer usage stats
+	 * from all workers.
+	 */
+	if (TupIsNull(slot))
+	{
+		int i;
+		int nworkers;
+		BufferUsage *buffer_usage_worker;
+		char *buffer_usage;
+
+		if (node->ss.ps.state->total_time && node->pcxt)
+		{
+			nworkers = node->pcxt->nworkers;
+			buffer_usage = node->buffer_usage_space;
+
+			for (i = 0; i < nworkers; i++)
+			{
+				buffer_usage_worker = (BufferUsage *)(buffer_usage + (i * sizeof(BufferUsage)));
+				BufferUsageAdd(&pgBufferUsage, buffer_usage_worker);
+			}
+		}
+	}
+	return slot;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndFunnel
+ *
+ *		frees any storage allocated through C routines.
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndFunnel(FunnelState *node)
+{
+	Relation	relation;
+
+	relation = node->ss.ss_currentRelation;
+
+	/*
+	 * Free the exprcontext
+	 */
+	ExecFreeExprContext(&node->ss.ps);
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+
+	/*
+	 * close the heap relation.
+	 */
+	ExecCloseScanRelation(relation);
+
+	ExecEndNode(outerPlanState(node));
+
+	if (node->pcxt && node->fs_workersReady)
+	{
+		/*
+		 * Ensure all workers have finished before destroying the parallel
+		 * context to ensure a clean exit.
+		 */
+		WaitForParallelWorkersToFinish(node->pcxt);
+
+		/* destroy the tuple queue */
+		DestroyTupleQueueFunnel(node->funnel);
+
+		/* destroy parallel context. */
+		DestroyParallelContext(node->pcxt);
+	}
+	else if (node->pcxt)
+	{
+		int i;
+
+		/*
+		 * The workers were never started, so we only need to free the
+		 * memory that was allocated to initialize them.
+		 */
+		dlist_delete(&node->pcxt->node);
+
+		for (i = 0; i < node->pcxt->nworkers; ++i)
+		{
+			if (node->pcxt->worker[i].error_mqh != NULL)
+			{
+				pfree(node->pcxt->worker[i].error_mqh);
+				node->pcxt->worker[i].error_mqh = NULL;
+			}
+		}
+		
+		/*
+		 * If we have allocated a shared memory segment, detach it.  This will
+		 * implicitly detach the error queues, and any other shared memory
+		 * queues, stored there.
+		 */
+		if (node->pcxt->seg != NULL)
+			dsm_detach(node->pcxt->seg);
+
+		/* Free the worker array itself. */
+		pfree(node->pcxt->worker);
+		node->pcxt->worker = NULL;
+
+		/* Free memory. */
+		pfree(node->pcxt);
+	}
+}
+
+/*
+ * funnel_getnext
+ *
+ *	Get the next tuple from the shared memory queues.  This function
+ *	is responsible for fetching tuples from all the queues associated
+ *	with the worker backends used in the funnel scan; if no data is
+ *	available from the queues, or no worker is available, it fetches
+ *	the data from the local node instead.
+ */
+static TupleTableSlot *
+funnel_getnext(FunnelState *funnelstate)
+{
+	PlanState		*outerPlan;
+	TupleTableSlot	*outerTupleSlot;
+	TupleTableSlot	*slot;
+	HeapTuple		tup;
+
+	if (funnelstate->ss.ps.ps_ProjInfo)
+		slot = funnelstate->ss.ps.ps_ProjInfo->pi_slot;
+	else
+		slot = funnelstate->ss.ss_ScanTupleSlot;
+
+	while ((!funnelstate->all_workers_done  && funnelstate->fs_workersReady) ||
+			!funnelstate->local_scan_done)
+	{
+		if (!funnelstate->all_workers_done && funnelstate->fs_workersReady)
+		{
+			/* wait only if local scan is done */
+			tup = TupleQueueFunnelNext(funnelstate->funnel,
+									   !funnelstate->local_scan_done,
+									   &funnelstate->all_workers_done);
+
+			if (HeapTupleIsValid(tup))
+			{
+				ExecStoreTuple(tup,		/* tuple to store */
+							   slot,	/* slot to store in */
+							   InvalidBuffer, /* buffer associated with this
+											   * tuple */
+							   true);	/* pfree this pointer if not from heap */
+
+				return slot;
+			}
+		}
+		if (!funnelstate->local_scan_done)
+		{
+			outerPlan = outerPlanState(funnelstate);
+
+			outerTupleSlot = ExecProcNode(outerPlan);
+
+			if (!TupIsNull(outerTupleSlot))
+				return outerTupleSlot;
+
+			funnelstate->local_scan_done = true;
+		}
+	}
+
+	return ExecClearTuple(slot);
+}
+
+/* ----------------------------------------------------------------
+ *						Join Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecReScanFunnel
+ *
+ *		Rescans a relation.
+ * ----------------------------------------------------------------
+ */
+void
+ExecReScanFunnel(FunnelState *node)
+{
+	EState	   *estate = node->ss.ps.state;
+
+	/*
+	 * Re-initialize the parallel context and workers to perform
+	 * rescan of relation.
+	 */
+	if (node->fs_workersReady)
+	{
+		/*
+		 * Ensure all workers have finished before destroying the parallel
+		 * context to ensure a clean exit.
+		 */
+		WaitForParallelWorkersToFinish(node->pcxt);
+
+		/* destroy the tuple queue */
+		DestroyTupleQueueFunnel(node->funnel);
+
+		/* destroy parallel context. */
+		DestroyParallelContext(node->pcxt);
+
+		/* Initialize the workers required to perform parallel scan. */
+		InitializeParallelWorkers(node->ss.ps.plan->lefttree,
+								  estate,
+								  node->ss.ss_currentRelation,
+								  &node->inst_options_space,
+								  &node->buffer_usage_space,
+								  &node->responseq,
+								  &node->pcxt,
+								  ((Funnel *)(node->ss.ps.plan))->num_workers);
+
+		node->fs_workersReady = false;
+		node->all_workers_done = false;
+		node->local_scan_done = false;
+	}
+
+	estate->toc = node->pcxt->toc;
+
+	ExecReScan(node->ss.ps.lefttree);
+}
+
diff --git a/src/backend/executor/nodePartialSeqscan.c b/src/backend/executor/nodePartialSeqscan.c
new file mode 100644
index 0000000..55aa266
--- /dev/null
+++ b/src/backend/executor/nodePartialSeqscan.c
@@ -0,0 +1,288 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodePartialSeqscan.c
+ *	  Support routines for parallel sequential scans of relations.
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodePartialSeqscan.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * INTERFACE ROUTINES
+ *		ExecPartialSeqScan				scans a relation.
+ *		PartialSeqNext					retrieves the next tuple from the heap.
+ *		ExecInitPartialSeqScan			creates and initializes a partial seqscan node.
+ *		ExecEndPartialSeqScan			releases any storage allocated.
+ */
+#include "postgres.h"
+
+#include "access/relscan.h"
+#include "access/xact.h"
+#include "commands/dbcommands.h"
+#include "executor/execdebug.h"
+#include "executor/nodeSeqscan.h"
+#include "executor/nodePartialSeqscan.h"
+#include "postmaster/backendworker.h"
+#include "utils/rel.h"
+
+
+
+/* ----------------------------------------------------------------
+ *						Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		PartialSeqNext
+ *
+ *		This is a workhorse for ExecPartialSeqScan
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+PartialSeqNext(PartialSeqScanState *node)
+{
+	HeapTuple	tuple;
+	HeapScanDesc scandesc;
+	EState	   *estate;
+	ScanDirection direction;
+	TupleTableSlot *slot;
+
+	/*
+	 * get information from the estate and scan state
+	 */
+	scandesc = node->ss_currentScanDesc;
+	estate = node->ps.state;
+	direction = estate->es_direction;
+	slot = node->ss_ScanTupleSlot;
+
+	/*
+	 * get the next tuple from the table
+	 */
+	tuple = heap_getnext(scandesc, direction);
+
+	/*
+	 * save the tuple and the buffer returned to us by the access methods in
+	 * our scan tuple slot and return the slot.  Note: we pass 'false' because
+	 * tuples returned by heap_getnext() are pointers onto disk pages and were
+	 * not created with palloc() and so should not be pfree()'d.  Note also
+	 * that ExecStoreTuple will increment the refcount of the buffer; the
+	 * refcount will not be dropped until the tuple table slot is cleared.
+	 */
+	if (tuple)
+		ExecStoreTuple(tuple,	/* tuple to store */
+					   slot,	/* slot to store in */
+					   scandesc->rs_cbuf,		/* buffer associated with this
+												 * tuple */
+					   false);	/* don't pfree this pointer */
+	else
+		ExecClearTuple(slot);
+
+	return slot;
+}
+
+/*
+ * PartialSeqRecheck -- access method routine to recheck a tuple in EvalPlanQual
+ */
+static bool
+PartialSeqRecheck(PartialSeqScanState *node, TupleTableSlot *slot)
+{
+	/*
+	 * Note that unlike IndexScan, PartialSeqScan never uses keys in
+	 * heap_beginscan (and this is very bad), so here we do not check
+	 * whether the keys are OK.
+	 */
+	return true;
+}
+
+/* ----------------------------------------------------------------
+ *		InitPartialScanRelation
+ *
+ *		Set up to access the scan relation.
+ * ----------------------------------------------------------------
+ */
+static void
+InitPartialScanRelation(PartialSeqScanState *node, EState *estate, int eflags)
+{
+	Relation	currentRelation;
+	HeapScanDesc currentScanDesc;
+	ParallelHeapScanDesc pscan;
+
+	/*
+	 * get the relation object id from the relid'th entry in the range table,
+	 * open that relation and acquire appropriate lock on it.
+	 */
+	currentRelation = ExecOpenScanRelation(estate,
+										   ((Scan *) node->ps.plan)->scanrelid,
+										   eflags);
+
+	/*
+	 * Parallel scan descriptor is initialized and stored in dynamic shared
+	 * memory segment by master backend and parallel workers retrieve it
+	 * from shared memory.
+	 */
+	Assert(estate->toc);
+	
+	pscan = shm_toc_lookup(estate->toc, PARALLEL_KEY_SCAN);
+
+	currentScanDesc = heap_beginscan_parallel(currentRelation, pscan);
+
+	node->ss_currentRelation = currentRelation;
+	node->ss_currentScanDesc = currentScanDesc;
+
+	/* and report the scan tuple slot's rowtype */
+	ExecAssignScanType(node, RelationGetDescr(currentRelation));
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitPartialSeqScan
+ * ----------------------------------------------------------------
+ */
+PartialSeqScanState *
+ExecInitPartialSeqScan(PartialSeqScan *node, EState *estate, int eflags)
+{
+	PartialSeqScanState *scanstate;
+
+	/*
+	 * Once upon a time it was possible to have an outerPlan of a SeqScan, but
+	 * not any more.
+	 */
+	Assert(outerPlan(node) == NULL);
+	Assert(innerPlan(node) == NULL);
+
+	/*
+	 * create state structure
+	 */
+	scanstate = makeNode(PartialSeqScanState);
+	scanstate->ps.plan = (Plan *) node;
+	scanstate->ps.state = estate;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * create expression context for node
+	 */
+	ExecAssignExprContext(estate, &scanstate->ps);
+
+	/*
+	 * initialize child expressions
+	 */
+	scanstate->ps.targetlist = (List *)
+		ExecInitExpr((Expr *) node->plan.targetlist,
+					 (PlanState *) scanstate);
+	scanstate->ps.qual = (List *)
+		ExecInitExpr((Expr *) node->plan.qual,
+					 (PlanState *) scanstate);
+
+	/*
+	 * tuple table initialization
+	 */
+	ExecInitResultTupleSlot(estate, &scanstate->ps);
+	ExecInitScanTupleSlot(estate, scanstate);
+
+	/*
+	 * initialize scan relation
+	 */
+	InitPartialScanRelation(scanstate, estate, eflags);
+
+	scanstate->ps.ps_TupFromTlist = false;
+
+	/*
+	 * Initialize result tuple type and projection info.
+	 */
+	ExecAssignResultTypeFromTL(&scanstate->ps);
+	ExecAssignScanProjectionInfo(scanstate);
+
+	return scanstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecPartialSeqScan(node)
+ *
+ *		Scans the relation via multiple workers and returns
+ *		the next qualifying tuple.
+ *		We call the ExecScan() routine and pass it the appropriate
+ *		access method functions.
+ * ----------------------------------------------------------------
+ */
+TupleTableSlot *
+ExecPartialSeqScan(PartialSeqScanState *node)
+{
+	return ExecScan((ScanState *) node,
+					(ExecScanAccessMtd) PartialSeqNext,
+					(ExecScanRecheckMtd) PartialSeqRecheck);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndPartialSeqScan
+ *
+ *		frees any storage allocated through C routines.
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndPartialSeqScan(PartialSeqScanState *node)
+{
+	Relation	relation;
+	HeapScanDesc scanDesc;
+
+	/*
+	 * get information from node
+	 */
+	relation = node->ss_currentRelation;
+	scanDesc = node->ss_currentScanDesc;
+
+	/*
+	 * Free the exprcontext
+	 */
+	ExecFreeExprContext(&node->ps);
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ps.ps_ResultTupleSlot);
+	ExecClearTuple(node->ss_ScanTupleSlot);
+
+	/*
+	 * close heap scan
+	 */
+	heap_endscan(scanDesc);
+
+	/*
+	 * close the heap relation.
+	 */
+	ExecCloseScanRelation(relation);
+}
+
+/* ----------------------------------------------------------------
+ *						Join Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecReScanPartialSeqScan
+ *
+ *		Rescans the relation.
+ * ----------------------------------------------------------------
+ */
+void
+ExecReScanPartialSeqScan(PartialSeqScanState *node)
+{
+	HeapScanDesc scan;
+	ParallelHeapScanDesc pscan;
+	EState	   *estate = node->ps.state;
+
+	Assert(estate->toc);
+	
+	pscan = shm_toc_lookup(estate->toc, PARALLEL_KEY_SCAN);
+
+	scan = node->ss_currentScanDesc;
+
+	heap_parallel_rescan(pscan,			/* parallel scan desc */
+						 scan);			/* heap scan desc */
+
+	ExecScanReScan((ScanState *) node);
+}
\ No newline at end of file
diff --git a/src/backend/executor/tqueue.c b/src/backend/executor/tqueue.c
new file mode 100644
index 0000000..e4933e6
--- /dev/null
+++ b/src/backend/executor/tqueue.c
@@ -0,0 +1,280 @@
+/*-------------------------------------------------------------------------
+ *
+ * tqueue.c
+ *	  Use shm_mq to send & receive tuples between parallel backends
+ *
+ * A DestReceiver of type DestTupleQueue, which is a TQueueDestReceiver
+ * under the hood, writes tuples from the executor to a shm_mq.
+ *
+ * A TupleQueueFunnel helps manage the process of reading tuples from
+ * one or more shm_mq objects being used as tuple queues.
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/tqueue.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/tqueue.h"
+#include "miscadmin.h"
+
+typedef struct
+{
+	DestReceiver pub;
+	shm_mq_handle *handle;
+} TQueueDestReceiver;
+
+struct TupleQueueFunnel
+{
+	int		nqueues;
+	int		maxqueues;
+	int		nextqueue;
+	shm_mq_handle **queue;
+};
+
+/*
+ * Receive a tuple.
+ */
+static void
+tqueueReceiveSlot(TupleTableSlot *slot, DestReceiver *self)
+{
+	TQueueDestReceiver *tqueue = (TQueueDestReceiver *) self;
+	HeapTuple	tuple;
+	shm_mq_result	result;
+
+	tuple = ExecMaterializeSlot(slot);
+	result = shm_mq_send(tqueue->handle, tuple->t_len, tuple->t_data, false);
+
+	if (result != SHM_MQ_SUCCESS)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("unable to send tuples")));
+}
+
+/*
+ * Prepare to receive tuples from executor.
+ */
+static void
+tqueueStartupReceiver(DestReceiver *self, int operation, TupleDesc typeinfo)
+{
+	/* do nothing */
+}
+
+/*
+ * Clean up at end of an executor run
+ */
+static void
+tqueueShutdownReceiver(DestReceiver *self)
+{
+	/* do nothing */
+}
+
+/*
+ * Destroy receiver when done with it
+ */
+static void
+tqueueDestroyReceiver(DestReceiver *self)
+{
+	pfree(self);
+}
+
+/*
+ * Create a DestReceiver that writes tuples to a tuple queue.
+ */
+DestReceiver *
+CreateTupleQueueDestReceiver(void)
+{
+	TQueueDestReceiver *self;
+
+	self = (TQueueDestReceiver *) palloc0(sizeof(TQueueDestReceiver));
+
+	self->pub.receiveSlot = tqueueReceiveSlot;
+	self->pub.rStartup = tqueueStartupReceiver;
+	self->pub.rShutdown = tqueueShutdownReceiver;
+	self->pub.rDestroy = tqueueDestroyReceiver;
+	self->pub.mydest = DestTupleQueue;
+
+	/* private fields will be set by SetTupleQueueDestReceiverParams */
+
+	return (DestReceiver *) self;
+}
+
+/*
+ * Set parameters for a TupleQueueDestReceiver
+ */
+void
+SetTupleQueueDestReceiverParams(DestReceiver *self,
+								shm_mq_handle *handle)
+{
+	TQueueDestReceiver *myState = (TQueueDestReceiver *) self;
+
+	myState->handle = handle;
+}
+
+/*
+ * Create a tuple queue funnel.
+ */
+TupleQueueFunnel *
+CreateTupleQueueFunnel(void)
+{
+	TupleQueueFunnel *funnel = palloc0(sizeof(TupleQueueFunnel));
+
+	funnel->maxqueues = 8;
+	funnel->queue = palloc(funnel->maxqueues * sizeof(shm_mq_handle *));
+
+	return funnel;
+}
+
+/*
+ * Destroy a tuple queue funnel.
+ */
+void
+DestroyTupleQueueFunnel(TupleQueueFunnel *funnel)
+{
+	if (funnel)
+	{
+		pfree(funnel->queue);
+		pfree(funnel);
+	}
+}
+
+/*
+ * Remember the shared memory queue handle in funnel.
+ */
+void
+RegisterTupleQueueOnFunnel(TupleQueueFunnel *funnel, shm_mq_handle *handle)
+{
+	if (funnel->nqueues < funnel->maxqueues)
+	{
+		funnel->queue[funnel->nqueues++] = handle;
+		return;
+	}
+
+	if (funnel->nqueues >= funnel->maxqueues)
+	{
+		int newsize = funnel->nqueues * 2;
+
+		Assert(funnel->nqueues == funnel->maxqueues);
+
+		funnel->queue = repalloc(funnel->queue,
+								 newsize * sizeof(shm_mq_handle *));
+		funnel->maxqueues = newsize;
+	}
+
+	funnel->queue[funnel->nqueues++] = handle;
+}
+
+/*
+ * Fetch a tuple from a tuple queue funnel.
+ *
+ * We try to read from the queues in round-robin fashion so as to avoid
+ * the situation where some workers get their tuples read promptly while
+ * others are barely ever serviced.
+ *
+ * Even when nowait = false, we read from the individual queues in
+ * non-blocking mode.  Even when shm_mq_receive() returns SHM_MQ_WOULD_BLOCK,
+ * it can still accumulate bytes from a partially-read message, so doing it
+ * this way should outperform doing a blocking read on each queue in turn.
+ *
+ * The return value is NULL if there are no remaining queues or if
+ * nowait = true and no queue returned a tuple without blocking.  *done, if
+ * not NULL, is set to true when there are no remaining queues and false in
+ * any other case.
+ */
+HeapTuple
+TupleQueueFunnelNext(TupleQueueFunnel *funnel, bool nowait, bool *done)
+{
+	int	waitpos = funnel->nextqueue;
+
+	/* Corner case: called before adding any queues, or after all are gone. */
+	if (funnel->nqueues == 0)
+	{
+		if (done != NULL)
+			*done = true;
+		return NULL;
+	}
+
+	if (done != NULL)
+		*done = false;
+
+	for (;;)
+	{
+		shm_mq_handle *mqh = funnel->queue[funnel->nextqueue];
+		shm_mq_result result;
+		Size	nbytes;
+		void   *data;
+
+		/* Attempt to read a message. */
+		result = shm_mq_receive(mqh, &nbytes, &data, true);
+
+		/*
+		 * Normally, we advance funnel->nextqueue to the next queue at this
+		 * point, but if we're pointing to a queue that we've just discovered
+		 * is detached, then forget that queue and leave the pointer where it
+		 * is until the number of remaining queues falls below that pointer and
+		 * at that point make the pointer point to the first queue.
+		 */
+		if (result != SHM_MQ_DETACHED)
+			funnel->nextqueue = (funnel->nextqueue + 1) % funnel->nqueues;
+		else
+		{
+			--funnel->nqueues;
+			if (funnel->nqueues == 0)
+			{
+				if (done != NULL)
+					*done = true;
+				return NULL;
+			}
+
+			memmove(&funnel->queue[funnel->nextqueue],
+					&funnel->queue[funnel->nextqueue + 1],
+					sizeof(shm_mq_handle *)
+						* (funnel->nqueues - funnel->nextqueue));
+
+			if (funnel->nextqueue >= funnel->nqueues)
+				funnel->nextqueue = 0;
+
+			if (funnel->nextqueue < waitpos)
+				--waitpos;
+
+			continue;
+		}
+
+		/* If we got a message, return it. */
+		if (result == SHM_MQ_SUCCESS)
+		{
+			HeapTupleData htup;
+
+			/*
+			 * The tuple data we just read from the queue is only valid
+			 * until we again attempt to read from it.  Copy the tuple into
+			 * a single palloc'd chunk as callers will expect.
+			 */
+			ItemPointerSetInvalid(&htup.t_self);
+			htup.t_tableOid = InvalidOid;
+			htup.t_len = nbytes;
+			htup.t_data = data;
+			return heap_copytuple(&htup);
+		}
+
+		/*
+		 * If we've visited all of the queues, then we should either give up
+		 * and return NULL (if we're in non-blocking mode) or wait for the
+		 * process latch to be set (otherwise).
+		 */
+		if (funnel->nextqueue == waitpos)
+		{
+			if (nowait)
+				return NULL;
+			WaitLatch(MyLatch, WL_LATCH_SET, 0);
+			CHECK_FOR_INTERRUPTS();
+			ResetLatch(MyLatch);
+		}
+	}
+}
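The round-robin policy in TupleQueueFunnelNext, including the way a detached queue is squeezed out of the array with memmove, can be tried in isolation. The sketch below is an illustration under stated assumptions rather than the patch's code: DemoQueue, DemoRR and demo_rr_next are hypothetical names, queues are plain integer arrays, an exhausted array plays the role of SHM_MQ_DETACHED, and there is no blocking or latch handling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define DEMO_MAXQ 8

/* Hypothetical queue: an integer array with a read position. */
typedef struct DemoQueue
{
	const int  *items;
	int			nitems;
	int			pos;
} DemoQueue;

/* Hypothetical stand-in for TupleQueueFunnel. */
typedef struct DemoRR
{
	DemoQueue  *queue[DEMO_MAXQ];
	int			nqueues;
	int			nextqueue;
} DemoRR;

/* Returns true and sets *out until every queue has been drained. */
static bool
demo_rr_next(DemoRR *rr, int *out)
{
	assert(rr != NULL && out != NULL);

	while (rr->nqueues > 0)
	{
		DemoQueue  *q = rr->queue[rr->nextqueue];

		if (q->pos < q->nitems)
		{
			*out = q->items[q->pos++];
			/* Got a value: move on so the next call serves another queue. */
			rr->nextqueue = (rr->nextqueue + 1) % rr->nqueues;
			return true;
		}

		/* Exhausted, treated as detached: close the gap in the array. */
		--rr->nqueues;
		memmove(&rr->queue[rr->nextqueue],
				&rr->queue[rr->nextqueue + 1],
				sizeof(DemoQueue *) * (size_t) (rr->nqueues - rr->nextqueue));

		/* If the removed queue was the last slot, wrap to the first one. */
		if (rr->nextqueue >= rr->nqueues)
			rr->nextqueue = 0;
	}
	return false;
}
```

Three queues holding {1, 2, 3}, {10} and {20, 21} are drained in the order 1, 10, 20, 2, 21, 3: each queue gets one turn per round, and a queue that runs dry is dropped without disturbing the order of the others.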
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 5994433..01e951b 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -355,6 +355,43 @@ _copySeqScan(const SeqScan *from)
 }
 
 /*
+ * _copyPartialSeqScan
+ */
+static PartialSeqScan *
+_copyPartialSeqScan(const PartialSeqScan *from)
+{
+	PartialSeqScan    *newnode = makeNode(PartialSeqScan);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopyScanFields((const Scan *) from, (Scan *) newnode);
+
+	return newnode;
+}
+
+/*
+ * _copyFunnel
+ */
+static Funnel *
+_copyFunnel(const Funnel *from)
+{
+	Funnel    *newnode = makeNode(Funnel);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopyScanFields((const Scan *) from, (Scan *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(num_workers);
+
+	return newnode;
+}
+
+/*
  * _copyIndexScan
  */
 static IndexScan *
@@ -4053,6 +4090,12 @@ copyObject(const void *from)
 		case T_SeqScan:
 			retval = _copySeqScan(from);
 			break;
+		case T_PartialSeqScan:
+			retval = _copyPartialSeqScan(from);
+			break;
+		case T_Funnel:
+			retval = _copyFunnel(from);
+			break;
 		case T_IndexScan:
 			retval = _copyIndexScan(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 1aa1f55..05d4b3c 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -440,6 +440,24 @@ _outSeqScan(StringInfo str, const SeqScan *node)
 }
 
 static void
+_outPartialSeqScan(StringInfo str, const PartialSeqScan *node)
+{
+	WRITE_NODE_TYPE("PARTIALSEQSCAN");
+
+	_outScanInfo(str, (const Scan *) node);
+}
+
+static void
+_outFunnel(StringInfo str, const Funnel *node)
+{
+	WRITE_NODE_TYPE("FUNNEL");
+
+	_outScanInfo(str, (const Scan *) node);
+
+	WRITE_UINT_FIELD(num_workers);
+}
+
+static void
 _outIndexScan(StringInfo str, const IndexScan *node)
 {
 	WRITE_NODE_TYPE("INDEXSCAN");
@@ -2898,6 +2916,12 @@ _outNode(StringInfo str, const void *obj)
 			case T_SeqScan:
 				_outSeqScan(str, obj);
 				break;
+			case T_PartialSeqScan:
+				_outPartialSeqScan(str, obj);
+				break;
+			case T_Funnel:
+				_outFunnel(str, obj);
+				break;
 			case T_IndexScan:
 				_outIndexScan(str, obj);
 				break;
diff --git a/src/backend/nodes/params.c b/src/backend/nodes/params.c
index fb803f8..6b633d6 100644
--- a/src/backend/nodes/params.c
+++ b/src/backend/nodes/params.c
@@ -16,9 +16,22 @@
 #include "postgres.h"
 
 #include "nodes/params.h"
+#include "storage/shmem.h"
 #include "utils/datum.h"
 #include "utils/lsyscache.h"
 
+/*
+ * Each bind parameter is serialized as this fixed-size structure.  For
+ * non-null pass-by-reference parameters, the value bytes follow it.
+ */
+typedef struct SerializedParamExternData
+{
+	Datum		value;			/* pass-by-value datum is stored directly */
+	Size		length;			/* length of by-reference value */
+	bool		isnull;			/* is it NULL? */
+	uint16		pflags;			/* flag bits, see nodes/params.h */
+	Oid			ptype;			/* parameter's datatype, or 0 */
+} SerializedParamExternData;
 
 /*
  * Copy a ParamListInfo structure.
@@ -73,3 +86,186 @@ copyParamList(ParamListInfo from)
 
 	return retval;
 }
+
+/*
+ * Estimate the amount of space required to serialize the bound
+ * parameters.
+ */
+Size
+EstimateBoundParametersSpace(ParamListInfo paramInfo)
+{
+	Size		size;
+	int			i;
+
+	/* Add space required for saving numParams */
+	size = sizeof(int);
+
+	if (paramInfo)
+	{
+		/* Add space required for saving the param data */
+		for (i = 0; i < paramInfo->numParams; i++)
+		{
+			/*
+			 * for each parameter, calculate the size of fixed part
+			 * of parameter (SerializedParamExternData) and length of
+			 * parameter value.
+			 */
+			ParamExternData *oprm;
+			int16		typLen;
+			bool		typByVal;
+			Size		length;
+
+			length = sizeof(SerializedParamExternData);
+
+			oprm = &paramInfo->params[i];
+
+			get_typlenbyval(oprm->ptype, &typLen, &typByVal);
+
+			/*
+			 * pass-by-value parameters are directly stored in
+			 * SerializedParamExternData, so no need of additional
+			 * space for them.
+			 */
+			if (!(typByVal || oprm->isnull))
+			{
+				length += datumGetSize(oprm->value, typByVal, typLen);
+				size = add_size(size, length);
+
+				/* Allow space for terminating zero-byte */
+				size = add_size(size, 1);
+			}
+			else
+				size = add_size(size, length);
+		}
+	}
+
+	return size;
+}
+
+/*
+ * Serialize the bind parameters into the memory, beginning at start_address.
+ * maxsize should be at least as large as the value returned by
+ * EstimateBoundParametersSpace.
+ */
+void
+SerializeBoundParams(ParamListInfo paramInfo, Size maxsize, char *start_address)
+{
+	char	   *curptr;
+	SerializedParamExternData *retval;
+	int i;
+
+	/*
+	 * First, store the number of bind parameters.  If there are none,
+	 * nothing more needs to be stored.
+	 */
+	if (paramInfo && paramInfo->numParams > 0)
+		*(int *) start_address = paramInfo->numParams;
+	else
+	{
+		*(int *) start_address = 0;
+		return;
+	}
+	curptr = start_address + sizeof(int);
+
+
+	for (i = 0; i < paramInfo->numParams; i++)
+	{
+		ParamExternData *oprm;
+		int16		typLen;
+		bool		typByVal;
+		Size		datumlength, length;
+		const char	*s;
+
+		Assert(curptr <= start_address + maxsize);
+		retval = (SerializedParamExternData *) curptr;
+		oprm = &paramInfo->params[i];
+
+		retval->isnull = oprm->isnull;
+		retval->pflags = oprm->pflags;
+		retval->ptype = oprm->ptype;
+		retval->value = oprm->value;
+
+		curptr = curptr + sizeof(SerializedParamExternData);
+
+		if (retval->isnull)
+			continue;
+
+		get_typlenbyval(oprm->ptype, &typLen, &typByVal);
+
+		if (!typByVal)
+		{
+			datumlength = datumGetSize(oprm->value, typByVal, typLen);
+			s = (char *) DatumGetPointer(oprm->value);
+			memcpy(curptr, s, datumlength);
+			length = datumlength;
+			curptr[length] = '\0';
+			retval->length = length;
+			curptr += length + 1;
+		}
+	}
+}
+
+/*
+ * RestoreBoundParams
+ *		Restore bind parameters from the specified address.
+ *
+ * The params are palloc'd in CurrentMemoryContext.
+ */
+ParamListInfo
+RestoreBoundParams(char *start_address)
+{
+	ParamListInfo retval;
+	Size		size;
+	int			num_params,i;
+	char	   *curptr;
+
+	num_params = * (int *) start_address;
+
+	if (num_params <= 0)
+		return NULL;
+
+	size = offsetof(ParamListInfoData, params) +
+						num_params * sizeof(ParamExternData);
+	retval = (ParamListInfo) palloc(size);
+	retval->paramFetch = NULL;
+	retval->paramFetchArg = NULL;
+	retval->parserSetup = NULL;
+	retval->parserSetupArg = NULL;
+	retval->numParams = num_params;
+
+	curptr = start_address + sizeof(int);
+
+	for (i = 0; i < num_params; i++)
+	{
+		SerializedParamExternData *nprm;
+		char	*s;
+		int16		typLen;
+		bool		typByVal;
+
+		nprm = (SerializedParamExternData *) curptr;
+
+		/* copy the parameter info */
+		retval->params[i].isnull = nprm->isnull;
+		retval->params[i].pflags = nprm->pflags;
+		retval->params[i].ptype = nprm->ptype;
+		retval->params[i].value = nprm->value;
+
+		curptr = curptr + sizeof(SerializedParamExternData);
+
+		if (nprm->isnull)
+			continue;
+
+		get_typlenbyval(nprm->ptype, &typLen, &typByVal);
+
+		if (!typByVal)
+		{
+			s = palloc(nprm->length + 1);
+			memcpy(s, curptr, nprm->length + 1);
+			retval->params[i].value = PointerGetDatum(s);
+
+			curptr += nprm->length + 1;
+		}
+	}
+
+	return retval;
+}
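
The estimate/serialize/restore trio above follows a common layout: a count, then one fixed-size header per parameter, with the value bytes appended only for non-null by-reference values. The sketch below illustrates that layout outside PostgreSQL. Everything in it (DemoParam, DemoHeader, the demo_* functions) is hypothetical and not part of the patch. It also differs from the patch in two ways: it copies headers with memcpy so that no unaligned pointer is dereferenced, and the restore step points into the buffer rather than allocating copies.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory parameter: either a by-value long or a byte string. */
typedef struct DemoParam
{
	int			isnull;
	int			byval;			/* 1: value is in ival; 0: bytes at ptr */
	long		ival;
	const char *ptr;
	size_t		len;
} DemoParam;

/* Fixed-size part written for every parameter. */
typedef struct DemoHeader
{
	int			isnull;
	int			byval;
	long		ival;
	size_t		len;			/* payload length, 0 if there is none */
} DemoHeader;

/* Space needed: the count, one header each, plus by-reference payloads. */
size_t
demo_estimate(const DemoParam *p, int n)
{
	size_t		size = sizeof(int);
	int			i;

	for (i = 0; i < n; i++)
	{
		size += sizeof(DemoHeader);
		if (!p[i].isnull && !p[i].byval)
			size += p[i].len;
	}
	return size;
}

/* Write the parameters into buf; returns the number of bytes written. */
size_t
demo_serialize(const DemoParam *p, int n, char *buf)
{
	char	   *cur = buf;
	int			i;

	memcpy(cur, &n, sizeof(int));
	cur += sizeof(int);

	for (i = 0; i < n; i++)
	{
		DemoHeader	h;

		memset(&h, 0, sizeof(h));
		h.isnull = p[i].isnull;
		h.byval = p[i].byval;
		h.ival = p[i].ival;
		if (!p[i].isnull && !p[i].byval)
			h.len = p[i].len;

		memcpy(cur, &h, sizeof(h));
		cur += sizeof(h);
		if (h.len > 0)
		{
			memcpy(cur, p[i].ptr, h.len);
			cur += h.len;
		}
	}
	return (size_t) (cur - buf);
}

/*
 * Read the parameters back.  By-reference values point into buf rather than
 * being copied.  Returns the parameter count, or -1 if out is too small.
 */
int
demo_restore(const char *buf, DemoParam *out, int maxn)
{
	const char *cur = buf;
	int			n;
	int			i;

	memcpy(&n, cur, sizeof(int));
	cur += sizeof(int);
	if (n > maxn)
		return -1;

	for (i = 0; i < n; i++)
	{
		DemoHeader	h;

		memcpy(&h, cur, sizeof(h));
		cur += sizeof(h);

		out[i].isnull = h.isnull;
		out[i].byval = h.byval;
		out[i].ival = h.ival;
		out[i].len = h.len;
		out[i].ptr = (h.len > 0) ? cur : NULL;
		cur += h.len;
	}
	return n;
}
```

The useful invariant to check in a layout like this is that the serializer writes exactly as many bytes as the estimator reported, since the shared memory chunk is sized from the estimate.
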
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 563209c..d4570f2 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1280,6 +1280,92 @@ _readRangeTblFunction(void)
 	READ_DONE();
 }
 
+/*
+ * _readPlanInvalItem
+ */
+static PlanInvalItem *
+_readPlanInvalItem(void)
+{
+	READ_LOCALS(PlanInvalItem);
+
+	READ_INT_FIELD(cacheId);
+	READ_UINT_FIELD(hashValue);
+
+	READ_DONE();
+}
+
+/*
+ * _readPlannedStmt
+ */
+static PlannedStmt *
+_readPlannedStmt(void)
+{
+	READ_LOCALS(PlannedStmt);
+
+	READ_ENUM_FIELD(commandType, CmdType);
+	READ_UINT_FIELD(queryId);
+	READ_BOOL_FIELD(hasReturning);
+	READ_BOOL_FIELD(hasModifyingCTE);
+	READ_BOOL_FIELD(canSetTag);
+	READ_BOOL_FIELD(transientPlan);
+	READ_NODE_FIELD(planTree);
+	READ_NODE_FIELD(rtable);
+	READ_NODE_FIELD(resultRelations);
+	READ_NODE_FIELD(utilityStmt);
+	READ_NODE_FIELD(subplans);
+	READ_BITMAPSET_FIELD(rewindPlanIDs);
+	READ_NODE_FIELD(rowMarks);
+	READ_NODE_FIELD(relationOids);
+	READ_NODE_FIELD(invalItems);
+	READ_INT_FIELD(nParamExec);
+	READ_BOOL_FIELD(hasRowSecurity);
+	READ_BOOL_FIELD(parallelModeNeeded);
+
+	READ_DONE();
+}
+
+static Plan *
+_readPlan(void)
+{
+	READ_LOCALS(Plan);
+
+	READ_FLOAT_FIELD(startup_cost);
+	READ_FLOAT_FIELD(total_cost);
+	READ_FLOAT_FIELD(plan_rows);
+	READ_INT_FIELD(plan_width);
+	READ_NODE_FIELD(targetlist);
+	READ_NODE_FIELD(qual);
+	READ_NODE_FIELD(lefttree);
+	READ_NODE_FIELD(righttree);
+	READ_NODE_FIELD(initPlan);
+	READ_BITMAPSET_FIELD(extParam);
+	READ_BITMAPSET_FIELD(allParam);
+
+	READ_DONE();
+}
+
+static Scan *
+_readScan(void)
+{
+	Plan *local_plan;
+	READ_LOCALS(PartialSeqScan);
+
+	local_plan = _readPlan();
+	local_node->plan.startup_cost = local_plan->startup_cost;
+	local_node->plan.total_cost = local_plan->total_cost;
+	local_node->plan.plan_rows = local_plan->plan_rows;
+	local_node->plan.plan_width = local_plan->plan_width;
+	local_node->plan.targetlist = local_plan->targetlist;
+	local_node->plan.qual = local_plan->qual;
+	local_node->plan.lefttree = local_plan->lefttree;
+	local_node->plan.righttree = local_plan->righttree;
+	local_node->plan.initPlan = local_plan->initPlan;
+	local_node->plan.extParam = local_plan->extParam;
+	local_node->plan.allParam = local_plan->allParam;
+	READ_UINT_FIELD(scanrelid);
+
+	READ_DONE();
+}
 
 /*
  * parseNodeString
@@ -1409,6 +1495,12 @@ parseNodeString(void)
 		return_value = _readNotifyStmt();
 	else if (MATCH("DECLARECURSOR", 13))
 		return_value = _readDeclareCursorStmt();
+	else if (MATCH("PLANINVALITEM", 13))
+		return_value = _readPlanInvalItem();
+	else if (MATCH("PLANNEDSTMT", 11))
+		return_value = _readPlannedStmt();
+	else if (MATCH("PARTIALSEQSCAN", 14))
+		return_value = _readScan();
 	else
 	{
 		elog(ERROR, "badly formatted node string \"%.32s\"...", token);
diff --git a/src/backend/optimizer/path/Makefile b/src/backend/optimizer/path/Makefile
index 6864a62..6e462b1 100644
--- a/src/backend/optimizer/path/Makefile
+++ b/src/backend/optimizer/path/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = allpaths.o clausesel.o costsize.o equivclass.o indxpath.o \
-       joinpath.o joinrels.o pathkeys.o tidpath.o
+       joinpath.o joinrels.o pathkeys.o parallelpath.o tidpath.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 58d78e6..528727c 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -410,6 +410,9 @@ set_plain_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
 	/* Consider sequential scan */
 	add_path(rel, create_seqscan_path(root, rel, required_outer));
 
+	/* Consider parallel scans */
+	create_parallelscan_paths(root, rel);
+
 	/* Consider index scans */
 	create_index_paths(root, rel);
 
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 1a0d358..874c272 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -11,6 +11,9 @@
  *	cpu_tuple_cost		Cost of typical CPU time to process a tuple
  *	cpu_index_tuple_cost  Cost of typical CPU time to process an index tuple
  *	cpu_operator_cost	Cost of CPU time to execute an operator or function
+ *	cpu_tuple_comm_cost	Cost of CPU time to pass a tuple from worker to master backend
+ *	parallel_setup_cost	Cost of setting up shared memory for parallelism
+ *	parallel_startup_cost  Cost of starting up parallel workers
  *
  * We expect that the kernel will typically do some amount of read-ahead
  * optimization; this in conjunction with seek costs means that seq_page_cost
@@ -101,11 +104,16 @@ double		random_page_cost = DEFAULT_RANDOM_PAGE_COST;
 double		cpu_tuple_cost = DEFAULT_CPU_TUPLE_COST;
 double		cpu_index_tuple_cost = DEFAULT_CPU_INDEX_TUPLE_COST;
 double		cpu_operator_cost = DEFAULT_CPU_OPERATOR_COST;
+double		cpu_tuple_comm_cost = DEFAULT_CPU_TUPLE_COMM_COST;
+double		parallel_setup_cost = DEFAULT_PARALLEL_SETUP_COST;
+double		parallel_startup_cost = DEFAULT_PARALLEL_STARTUP_COST;
 
 int			effective_cache_size = DEFAULT_EFFECTIVE_CACHE_SIZE;
 
 Cost		disable_cost = 1.0e10;
 
+int			parallel_seqscan_degree = 0;
+
 bool		enable_seqscan = true;
 bool		enable_indexscan = true;
 bool		enable_indexonlyscan = true;
@@ -220,6 +228,55 @@ cost_seqscan(Path *path, PlannerInfo *root,
 }
 
 /*
+ * cost_funnel
+ *	  Determines and returns the cost of scanning a relation in parallel.
+ *
+ * 'baserel' is the relation to be scanned
+ * 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
+ */
+void
+cost_funnel(FunnelPath *path, PlannerInfo *root,
+			RelOptInfo *baserel, ParamPathInfo *param_info,
+			int nWorkers)
+{
+	Cost		startup_cost = 0;
+	Cost		run_cost = 0;
+
+	/* Should only be applied to base relations */
+	Assert(baserel->relid > 0);
+	Assert(baserel->rtekind == RTE_RELATION);
+
+	/* Mark the path with the correct row estimate */
+	if (param_info)
+		path->path.rows = param_info->ppi_rows;
+	else
+		path->path.rows = baserel->rows;
+
+	startup_cost = path->subpath->startup_cost;
+
+	run_cost = path->subpath->total_cost - path->subpath->startup_cost;
+
+	/*
+	 * Runtime cost is shared equally among the master backend and the
+	 * workers, hence the division by (nWorkers + 1).  The assumption
+	 * here is that disk access cost is also shared equally, which is
+	 * generally true unless there are too many workers working on a
+	 * relatively small number of blocks.  If we come across any such
+	 * case, then we can think of changing the current cost model for
+	 * parallel sequential scan.
+	 */
+	run_cost = run_cost / (nWorkers + 1);
+
+	/* Parallel setup and communication cost. */
+	startup_cost += parallel_setup_cost;
+	startup_cost += parallel_startup_cost * nWorkers;
+	run_cost += cpu_tuple_comm_cost * baserel->tuples;
+
+	path->path.startup_cost = startup_cost;
+	path->path.total_cost = (startup_cost + run_cost);
+}
+
+/*
  * cost_index
  *	  Determines and returns the cost of scanning a relation using an index.
  *
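
To make the arithmetic in cost_funnel concrete, the following standalone sketch reproduces the same formula with plain doubles. DemoCost and demo_cost_funnel are hypothetical names, and the numbers in the note after the code are illustrative assumptions, not the patch's DEFAULT_* values (which are not shown in this part of the diff).

```c
#include <assert.h>

/* Hypothetical result type: startup and total cost of the funnel path. */
typedef struct DemoCost
{
	double		startup;
	double		total;
} DemoCost;

/*
 * Same shape as cost_funnel: the subpath's run cost is divided by
 * (nworkers + 1), then the fixed setup cost, the per-worker startup cost
 * and the per-tuple communication cost are added on top.
 */
DemoCost
demo_cost_funnel(double sub_startup, double sub_total, double tuples,
				 int nworkers, double setup_cost,
				 double worker_startup_cost, double tuple_comm_cost)
{
	DemoCost	result;
	double		startup = sub_startup;
	double		run = sub_total - sub_startup;

	assert(nworkers >= 0);

	run = run / (nworkers + 1);

	startup += setup_cost;
	startup += worker_startup_cost * nworkers;
	run += tuple_comm_cost * tuples;

	result.startup = startup;
	result.total = startup + run;
	return result;
}
```

With assumed costs of 100 for setup, 50 per worker and 0.25 per tuple, a subpath costing 1000 over 400 tuples with 3 workers comes out at startup 250 and total 600, which beats the plain scan. The same settings applied to a subpath costing 100 over 40 tuples give a total of 285, so the plain scan wins. In this model the overhead terms are what keep small relations on the ordinary sequential scan.
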
diff --git a/src/backend/optimizer/path/parallelpath.c b/src/backend/optimizer/path/parallelpath.c
new file mode 100644
index 0000000..d152d73
--- /dev/null
+++ b/src/backend/optimizer/path/parallelpath.c
@@ -0,0 +1,83 @@
+/*-------------------------------------------------------------------------
+ *
+ * parallelpath.c
+ *	  Routines to determine which conditions are usable for scanning
+ *	  a given relation, and create ParallelPaths accordingly.
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/optimizer/path/parallelpath.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heapam.h"
+#include "nodes/relation.h"
+#include "optimizer/cost.h"
+#include "optimizer/pathnode.h"
+#include "optimizer/paths.h"
+#include "optimizer/restrictinfo.h"
+#include "optimizer/clauses.h"
+#include "parser/parsetree.h"
+#include "utils/rel.h"
+
+
+/*
+ * create_parallelscan_paths
+ *	  Create paths corresponding to parallel scans of the given rel.
+ *	  Currently we only support parallel sequential scan.
+ *
+ *	  Candidate paths are added to the rel's pathlist (using add_path).
+ */
+void
+create_parallelscan_paths(PlannerInfo *root, RelOptInfo *rel)
+{
+	int			num_parallel_workers = 0;
+	Oid			reloid;
+	Relation	relation;
+	Path	   *subpath;
+
+	/*
+	 * parallel scan is possible only if user has set
+	 * parallel_seqscan_degree to value greater than 0
+	 * and the query is parallel-safe.
+	 */
+	if (parallel_seqscan_degree <= 0 || !root->glob->parallelModeOK)
+		return;
+
+	reloid = planner_rt_fetch(rel->relid, root)->relid;
+
+	relation = heap_open(reloid, NoLock);
+
+	/*
+	 * Temporary relations can't be scanned by parallel workers, as their
+	 * local buffers are accessible only to the backend that owns them.
+	 */
+	if (RelationUsesLocalBuffers(relation))
+	{
+		heap_close(relation, NoLock);
+		return;
+	}
+
+	heap_close(relation, NoLock);
+
+	/*
+	 * There should be at least one page to scan for each worker.
+	 */
+	if (parallel_seqscan_degree <= rel->pages)
+		num_parallel_workers = parallel_seqscan_degree;
+	else
+		num_parallel_workers = rel->pages;
+
+	/* Create the partial scan path which each worker needs to execute. */
+	subpath = create_partialseqscan_path(root, rel, NULL);
+
+	/* Create the parallel scan path which master needs to execute. */
+	add_path(rel, (Path *) create_funnel_path(root, rel, subpath,
+											  num_parallel_workers));
+}
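
The gating and clamping done by create_parallelscan_paths can be summarised in a few lines. This is only a sketch of the decision as it appears in the function above: demo_choose_workers and its flag parameters are hypothetical and stand in for the GUC, root->glob->parallelModeOK and the RelationUsesLocalBuffers check.

```c
#include <assert.h>

/*
 * Returns the number of workers to plan for, or -1 if no parallel path
 * should be generated at all.  The worker count never exceeds the number
 * of pages, so that each worker has at least one page to scan.
 */
int
demo_choose_workers(int degree, unsigned int pages,
					int parallel_ok, int is_temp_rel)
{
	if (degree <= 0 || !parallel_ok || is_temp_rel)
		return -1;

	if ((unsigned int) degree <= pages)
		return degree;

	assert(pages < (unsigned int) degree);
	return (int) pages;
}
```

For example, a degree of 4 on a 100-page relation plans 4 workers, while the same degree on a 2-page relation plans only 2.
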
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index cb69c03..c8422c9 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -58,6 +58,11 @@ static Material *create_material_plan(PlannerInfo *root, MaterialPath *best_path
 static Plan *create_unique_plan(PlannerInfo *root, UniquePath *best_path);
 static SeqScan *create_seqscan_plan(PlannerInfo *root, Path *best_path,
 					List *tlist, List *scan_clauses);
+static Scan *create_partialseqscan_plan(PlannerInfo *root, Path *best_path,
+							List *tlist, List *scan_clauses);
+static Scan *create_funnel_plan(PlannerInfo *root,
+								FunnelPath *best_path,
+								List *tlist, List *scan_clauses);
 static Scan *create_indexscan_plan(PlannerInfo *root, IndexPath *best_path,
 					  List *tlist, List *scan_clauses, bool indexonly);
 static BitmapHeapScan *create_bitmap_scan_plan(PlannerInfo *root,
@@ -100,6 +105,12 @@ static List *order_qual_clauses(PlannerInfo *root, List *clauses);
 static void copy_path_costsize(Plan *dest, Path *src);
 static void copy_plan_costsize(Plan *dest, Plan *src);
 static SeqScan *make_seqscan(List *qptlist, List *qpqual, Index scanrelid);
+static PartialSeqScan *make_partialseqscan(List *qptlist,
+										   List *qpqual,
+										   Index scanrelid);
+static Funnel *make_funnel(List *qptlist, List *qpqual,
+						   Index scanrelid, int nworkers,
+						   Plan *subplan);
 static IndexScan *make_indexscan(List *qptlist, List *qpqual, Index scanrelid,
 			   Oid indexid, List *indexqual, List *indexqualorig,
 			   List *indexorderby, List *indexorderbyorig,
@@ -228,6 +239,8 @@ create_plan_recurse(PlannerInfo *root, Path *best_path)
 	switch (best_path->pathtype)
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
+		case T_Funnel:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -343,6 +356,20 @@ create_scan_plan(PlannerInfo *root, Path *best_path)
 												scan_clauses);
 			break;
 
+		case T_PartialSeqScan:
+			plan = (Plan *) create_partialseqscan_plan(root,
+													   best_path,
+													   tlist,
+													   scan_clauses);
+			break;
+
+		case T_Funnel:
+			plan = (Plan *) create_funnel_plan(root,
+											   (FunnelPath *) best_path,
+											   tlist,
+											   scan_clauses);
+			break;
+
 		case T_IndexScan:
 			plan = (Plan *) create_indexscan_plan(root,
 												  (IndexPath *) best_path,
@@ -546,6 +573,8 @@ disuse_physical_tlist(PlannerInfo *root, Plan *plan, Path *path)
 	switch (path->pathtype)
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
+		case T_Funnel:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -1133,6 +1162,87 @@ create_seqscan_plan(PlannerInfo *root, Path *best_path,
 }
 
 /*
+ * create_partialseqscan_plan
+ *
+ * Returns a partial seqscan plan for the base relation scanned by
+ * 'best_path' with restriction clauses 'scan_clauses' and targetlist
+ * 'tlist'.
+ */
+static Scan *
+create_partialseqscan_plan(PlannerInfo *root, Path *best_path,
+						   List *tlist, List *scan_clauses)
+{
+	Scan    *scan_plan;
+	Index		scan_relid = best_path->parent->relid;
+
+	/* it should be a base rel... */
+	Assert(scan_relid > 0);
+	Assert(best_path->parent->rtekind == RTE_RELATION);
+
+	/* Sort clauses into best execution order */
+	scan_clauses = order_qual_clauses(root, scan_clauses);
+
+	/* Reduce RestrictInfo list to bare expressions; ignore pseudoconstants */
+	scan_clauses = extract_actual_clauses(scan_clauses, false);
+
+	/* Replace any outer-relation variables with nestloop params */
+	if (best_path->param_info)
+	{
+		scan_clauses = (List *)
+			replace_nestloop_params(root, (Node *) scan_clauses);
+	}
+
+	scan_plan = (Scan *) make_partialseqscan(tlist,
+											 scan_clauses,
+											 scan_relid);
+
+	copy_path_costsize(&scan_plan->plan, best_path);
+
+	return scan_plan;
+}
+
+/*
+ * create_funnel_plan
+ *
+ * Returns a funnel plan for the base relation scanned by
+ * 'best_path' with restriction clauses 'scan_clauses' and targetlist
+ * 'tlist'.
+ */
+static Scan *
+create_funnel_plan(PlannerInfo *root, FunnelPath *best_path,
+				   List *tlist, List *scan_clauses)
+{
+	Scan    *scan_plan;
+	Plan	   *subplan;
+	Index		scan_relid = best_path->path.parent->relid;
+
+	/* it should be a base rel... */
+	Assert(scan_relid > 0);
+	Assert(best_path->path.parent->rtekind == RTE_RELATION);
+
+	subplan = create_plan_recurse(root, best_path->subpath);
+
+	/*
+	 * The quals for the subplan and the top-level plan are the same,
+	 * because either all the quals are pushed down to the subplan
+	 * (the partial seqscan plan) or the parallel plan is not chosen
+	 * at all.
+	 */
+	scan_plan = (Scan *) make_funnel(tlist,
+									 subplan->qual,
+									 scan_relid,
+									 best_path->num_workers,
+									 subplan);
+
+	copy_path_costsize(&scan_plan->plan, &best_path->path);
+
+	/* use parallel mode for parallel plans. */
+	root->glob->parallelModeNeeded = true;
+
+	return scan_plan;
+}
+
+/*
  * create_indexscan_plan
  *	  Returns an indexscan plan for the base relation scanned by 'best_path'
  *	  with restriction clauses 'scan_clauses' and targetlist 'tlist'.
@@ -3321,6 +3431,45 @@ make_seqscan(List *qptlist,
 	return node;
 }
 
+static PartialSeqScan *
+make_partialseqscan(List *qptlist,
+					List *qpqual,
+					Index scanrelid)
+{
+	PartialSeqScan *node = makeNode(PartialSeqScan);
+	Plan	   *plan = &node->plan;
+
+	/* cost should be inserted by caller */
+	plan->targetlist = qptlist;
+	plan->qual = qpqual;
+	plan->lefttree = NULL;
+	plan->righttree = NULL;
+	node->scanrelid = scanrelid;
+
+	return node;
+}
+
+static Funnel *
+make_funnel(List *qptlist,
+			List *qpqual,
+			Index scanrelid,
+			int nworkers,
+			Plan *subplan)
+{
+	Funnel *node = makeNode(Funnel);
+	Plan	   *plan = &node->scan.plan;
+
+	/* cost should be inserted by caller */
+	plan->targetlist = qptlist;
+	plan->qual = qpqual;
+	plan->lefttree = subplan;
+	plan->righttree = NULL;
+	node->scan.scanrelid = scanrelid;
+	node->num_workers = nworkers;
+
+	return node;
+}
+
 static IndexScan *
 make_indexscan(List *qptlist,
 			   List *qpqual,
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 84560bc..83576c4 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -275,6 +275,51 @@ standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
 	return result;
 }
 
+PlannedStmt	*
+create_parallel_worker_plannedstmt(PartialSeqScan *partialscan,
+								   List *rangetable)
+{
+	PlannedStmt	*result;
+	ListCell   *tlist;
+
+	/*
+	 * Avoid removing junk entries in worker as those are
+	 * required by upper nodes in master backend.
+	 */
+	foreach(tlist, partialscan->plan.targetlist)
+	{
+		TargetEntry *tle = (TargetEntry *) lfirst(tlist);
+
+		tle->resjunk = false;
+	}
+
+	/* build the PlannedStmt result */
+	result = makeNode(PlannedStmt);
+
+	result->commandType = CMD_SELECT;
+	result->queryId = 0;
+	result->hasReturning = false;
+	result->hasModifyingCTE = false;
+	result->canSetTag = true;
+	result->transientPlan = false;
+	result->planTree = (Plan *) partialscan;
+	result->rtable = rangetable;
+	result->resultRelations = NIL;
+	result->utilityStmt = NULL;
+	result->subplans = NIL;
+	result->rewindPlanIDs = NULL;
+	result->rowMarks = NIL;
+	result->nParamExec = 0;
+	/*
+	 * Don't bother to set parameters used for invalidation as
+	 * worker backend plans are not saved, so can't be invalidated.
+	 */
+	result->relationOids = NIL;
+	result->invalItems = NIL;
+	result->hasRowSecurity = false;
+
+	return result;
+}
 
 /*--------------------
  * subquery_planner
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index ec828cd..ef8c317 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -435,6 +435,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
 			{
 				SeqScan    *splan = (SeqScan *) plan;
 
@@ -445,6 +446,24 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 					fix_scan_list(root, splan->plan.qual, rtoffset);
 			}
 			break;
+		case T_Funnel:
+			{
+				Funnel    *splan = (Funnel *) plan;
+
+				splan->scan.scanrelid += rtoffset;
+				splan->scan.plan.targetlist =
+					fix_scan_list(root, splan->scan.plan.targetlist, rtoffset);
+				splan->scan.plan.qual =
+					fix_scan_list(root, splan->scan.plan.qual, rtoffset);
+
+				/*
+				 * The target list for the partial sequential scan (lefttree of
+				 * the funnel plan) should be the same as for the funnel scan, as
+				 * both nodes need to produce the same projection.
+				 */
+				splan->scan.plan.lefttree->targetlist = splan->scan.plan.targetlist;
+			}
+			break;
 		case T_IndexScan:
 			{
 				IndexScan  *splan = (IndexScan *) plan;
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index acfd0bc..f649639 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2167,6 +2167,8 @@ finalize_plan(PlannerInfo *root, Plan *plan, Bitmapset *valid_params,
 			break;
 
 		case T_SeqScan:
+		case T_PartialSeqScan:
+		case T_Funnel:
 			context.paramids = bms_add_members(context.paramids, scan_params);
 			break;
 
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index faca30b..0e5fd3a 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -706,6 +706,53 @@ create_seqscan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer)
 }
 
 /*
+ * create_partialseqscan_path
+ *	  Creates a path corresponding to a partial sequential scan, returning the
+ *	  pathnode.
+ */
+Path *
+create_partialseqscan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer)
+{
+	Path	   *pathnode = makeNode(Path);
+
+	pathnode->pathtype = T_PartialSeqScan;
+	pathnode->parent = rel;
+	pathnode->param_info = get_baserel_parampathinfo(root, rel,
+													 required_outer);
+	pathnode->pathkeys = NIL;	/* seqscan has unordered result */
+
+	cost_seqscan(pathnode, root, rel, pathnode->param_info);
+
+	return pathnode;
+}
+
+/*
+ * create_funnel_path
+ *
+ *	  Creates a path corresponding to a funnel scan, returning the
+ *	  pathnode.
+ */
+FunnelPath *
+create_funnel_path(PlannerInfo *root, RelOptInfo *rel,
+							Path *subpath, int nWorkers)
+{
+	FunnelPath	   *pathnode = makeNode(FunnelPath);
+
+	pathnode->path.pathtype = T_Funnel;
+	pathnode->path.parent = rel;
+	pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
+													 NULL);
+	pathnode->path.pathkeys = NIL;	/* seqscan has unordered result */
+
+	pathnode->subpath = subpath;
+	pathnode->num_workers = nWorkers;
+
+	cost_funnel(pathnode, root, rel, pathnode->path.param_info, nWorkers);
+
+	return pathnode;
+}
+
+/*
  * create_index_path
  *	  Creates a path node for an index scan.
  *
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 71c2321..f056bd5 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -12,7 +12,8 @@ subdir = src/backend/postmaster
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = autovacuum.o bgworker.o bgwriter.o checkpointer.o fork_process.o \
-	pgarch.o pgstat.o postmaster.o startup.o syslogger.o walwriter.o
+OBJS = autovacuum.o backendworker.o bgworker.o bgwriter.o checkpointer.o \
+	fork_process.o pgarch.o pgstat.o postmaster.o startup.o syslogger.o \
+	walwriter.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/backendworker.c b/src/backend/postmaster/backendworker.c
new file mode 100644
index 0000000..a06c38f
--- /dev/null
+++ b/src/backend/postmaster/backendworker.c
@@ -0,0 +1,425 @@
+/*-------------------------------------------------------------------------
+ *
+ * backendworker.c
+ *	  Support routines for setting up backend workers.
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/postmaster/backendworker.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * INTERFACE ROUTINES
+ *		InitializeParallelWorkers				Set up dynamic shared memory and parallel backend workers.
+ */
+#include "postgres.h"
+
+#include "access/xact.h"
+#include "commands/dbcommands.h"
+#include "executor/nodeFunnel.h"
+#include "miscadmin.h"
+#include "nodes/parsenodes.h"
+#include "optimizer/planmain.h"
+#include "optimizer/planner.h"
+#include "postmaster/backendworker.h"
+#include "tcop/tcopprot.h"
+
+
+#define PARALLEL_TUPLE_QUEUE_SIZE					65536
+
+static void ParallelQueryMain(dsm_segment *seg, shm_toc *toc);
+static void
+EstimateParallelSupportInfoSpace(ParallelContext *pcxt, ParamListInfo params,
+								 int instOptions, Size *params_size);
+static void
+StoreParallelSupportInfo(ParallelContext *pcxt, ParamListInfo params,
+						 int instOptions, int params_size,
+						 char **inst_options_space,
+						 char **buffer_usage_space);
+static void
+EstimatePartialSeqScanSpace(ParallelContext *pcxt, EState *estate,
+							char *plannedstmt_str, Size *plannedstmt_len,
+							Size *pscan_size);
+static void
+StorePartialSeqScan(ParallelContext *pcxt, EState *estate, Relation rel,
+					char *plannedstmt_str, Size plannedstmt_size,
+					Size pscan_size);
+static void EstimateResponseQueueSpace(ParallelContext *pcxt);
+static void
+StoreResponseQueue(ParallelContext *pcxt,
+				   shm_mq_handle ***responseqp);
+static void
+GetPlannedStmt(shm_toc *toc, PlannedStmt **plannedstmt);
+static void
+GetParallelSupportInfo(shm_toc *toc, ParamListInfo *params,
+					   int *inst_options, char **instrument,
+					   char **buffer_usage);
+static void
+SetupResponseQueue(dsm_segment *seg, shm_toc *toc, shm_mq **mq,
+				   shm_mq_handle **responseq);
+
+
+/*
+ * EstimateParallelSupportInfoSpace
+ *
+ * Estimate the amount of space required to record information of
+ * bind parameters and instrumentation information that need to be
+ * retrieved from parallel workers.
+ */
+void
+EstimateParallelSupportInfoSpace(ParallelContext *pcxt, ParamListInfo params,
+								 int instOptions, Size *params_size)
+{
+	*params_size = EstimateBoundParametersSpace(params);
+	shm_toc_estimate_chunk(&pcxt->estimator, *params_size);
+
+	/*
+	 * We expect each worker to populate the BufferUsage structure
+	 * allocated by master backend and then master backend will aggregate
+	 * all the usage along with its own, so account for it per worker.
+	 */
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   sizeof(BufferUsage) * pcxt->nworkers);
+
+	/* account for instrumentation options. */
+	shm_toc_estimate_chunk(&pcxt->estimator, sizeof(int));
+
+	/*
+	 * We expect each worker to populate the instrumentation structure
+	 * allocated by master backend and then master backend will aggregate
+	 * all the information, so account it for each worker.
+	 */
+	if (instOptions)
+	{
+		shm_toc_estimate_chunk(&pcxt->estimator,
+							   sizeof(Instrumentation) * pcxt->nworkers);
+		/* key for instrumentation information. */
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+	}
+
+	/* keys for parallel support information. */
+	shm_toc_estimate_keys(&pcxt->estimator, 3);
+}
+
+/*
+ * StoreParallelSupportInfo
+ * 
+ * Sets up the bind parameters and instrumentation information
+ * required for parallel execution.
+ */
+void
+StoreParallelSupportInfo(ParallelContext *pcxt, ParamListInfo params,
+						 int instOptions, int params_size,
+						 char **inst_options_space,
+						 char **buffer_usage_space)
+{
+	char	*paramsdata;
+	int		*inst_options;
+
+	/*
+	 * Store the bind parameter list in dynamic shared memory.  This is
+	 * used for parameters in prepared queries.
+	 */
+	paramsdata = shm_toc_allocate(pcxt->toc, params_size);
+	SerializeBoundParams(params, params_size, paramsdata);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_PARAMS, paramsdata);
+
+	/*
+	 * Allocate space for BufferUsage information to be filled by
+	 * each worker.
+	 */
+	*buffer_usage_space =
+			shm_toc_allocate(pcxt->toc, sizeof(BufferUsage) * pcxt->nworkers);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFF_USAGE, *buffer_usage_space);
+
+	/* Store instrument options in dynamic shared memory. */
+	inst_options = shm_toc_allocate(pcxt->toc, sizeof(int));
+	*inst_options = instOptions;
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_INST_OPTIONS, inst_options);
+
+	/*
+	 * Allocate space for instrumentation information to be filled by
+	 * each worker.
+	 */
+	if (instOptions)
+	{
+		*inst_options_space =
+			shm_toc_allocate(pcxt->toc, sizeof(Instrumentation) * pcxt->nworkers);
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_INST_INFO, *inst_options_space);
+	}
+}
+
+/*
+ * EstimatePartialSeqScanSpace
+ *
+ * Estimate the amount of space required to record information of
+ * planned statement and parallel heap scan descriptor that need
+ * to be copied to parallel workers.
+ */
+void
+EstimatePartialSeqScanSpace(ParallelContext *pcxt, EState *estate,
+							char *plannedstmt_str, Size *plannedstmt_len,
+							Size *pscan_size)
+{
+	/* Estimate space for partial seq. scan specific contents. */
+	*plannedstmt_len = strlen(plannedstmt_str) + 1;
+	shm_toc_estimate_chunk(&pcxt->estimator, *plannedstmt_len);
+
+	*pscan_size = heap_parallelscan_estimate(estate->es_snapshot);
+	shm_toc_estimate_chunk(&pcxt->estimator, *pscan_size);
+
+	/* keys for planned statement and parallel heap scan descriptor. */
+	shm_toc_estimate_keys(&pcxt->estimator, 2);
+}
+
+/*
+ * StorePartialSeqScan
+ * 
+ * Sets up the planned statement and block range for parallel
+ * sequence scan.
+ */
+void
+StorePartialSeqScan(ParallelContext *pcxt, EState *estate, Relation rel,
+					char *plannedstmt_str, Size plannedstmt_size,
+					Size pscan_size)
+{
+	char		*plannedstmtdata;
+	ParallelHeapScanDesc pscan;
+
+	/* Store serialized planned statement in dynamic shared memory. */
+	plannedstmtdata = shm_toc_allocate(pcxt->toc, plannedstmt_size);
+	memcpy(plannedstmtdata, plannedstmt_str, plannedstmt_size);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_PLANNEDSTMT, plannedstmtdata);
+
+	/* Store parallel heap scan descriptor in dynamic shared memory. */
+	pscan = shm_toc_allocate(pcxt->toc, pscan_size);
+	heap_parallelscan_initialize(pscan, rel, estate->es_snapshot);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_SCAN, pscan);
+}
+
+/*
+ * EstimateResponseQueueSpace
+ *
+ * Estimate the amount of space required to record information of
+ * tuple queues that need to be established between parallel workers
+ * and master backend.
+ */
+void
+EstimateResponseQueueSpace(ParallelContext *pcxt)
+{
+	/* Estimate space for one tuple queue per worker. */
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   (Size) PARALLEL_TUPLE_QUEUE_SIZE * pcxt->nworkers);
+
+	/* keys for response queue. */
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/*
+ * StoreResponseQueue
+ * 
+ * Sets up the response queues for backend workers to
+ * return tuples to the main backend.
+ */
+void
+StoreResponseQueue(ParallelContext *pcxt,
+				   shm_mq_handle ***responseqp)
+{
+	shm_mq		*mq;
+	char		*tuple_queue_space;
+	int			i;
+
+	/* Allocate memory for shared memory queue handles. */
+	*responseqp = (shm_mq_handle**) palloc(pcxt->nworkers * sizeof(shm_mq_handle*));
+
+	/*
+	 * Establish one message queue per worker in dynamic shared memory.
+	 * These queues should be used to transmit tuple data.
+	 */
+	tuple_queue_space =
+	   shm_toc_allocate(pcxt->toc, PARALLEL_TUPLE_QUEUE_SIZE * pcxt->nworkers);
+	for (i = 0; i < pcxt->nworkers; ++i)
+	{
+		mq = shm_mq_create(tuple_queue_space + i * PARALLEL_TUPLE_QUEUE_SIZE,
+						   (Size) PARALLEL_TUPLE_QUEUE_SIZE);
+		
+		shm_mq_set_receiver(mq, MyProc);
+
+		/*
+		 * Attach the queue before launching a worker, so that we'll automatically
+		 * detach the queue if we error out.  (Otherwise, the worker might sit
+		 * there trying to write the queue long after we've gone away.)
+		 */
+		(*responseqp)[i] = shm_mq_attach(mq, pcxt->seg, NULL);
+	}
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_TUPLE_QUEUE, tuple_queue_space);
+}
+
+/*
+ * InitializeParallelWorkers
+ *
+ *	Sets up the required infrastructure for backend workers to
+ *	perform execution and return results to the main backend.
+ */
+void
+InitializeParallelWorkers(Plan *plan, EState *estate, Relation rel,
+						  char **inst_options_space, char **buffer_usage_space,
+						  shm_mq_handle ***responseqp, ParallelContext **pcxtp,
+						  int nWorkers)
+{
+	Size		params_size, pscan_size, plannedstmt_size;
+	char	   *plannedstmt_str;
+	PlannedStmt	*plannedstmt;
+	ParallelContext *pcxt;
+
+	pcxt = CreateParallelContext(ParallelQueryMain, nWorkers);
+
+	plannedstmt = create_parallel_worker_plannedstmt((PartialSeqScan *)plan,
+													 estate->es_range_table);
+	plannedstmt_str = nodeToString(plannedstmt);
+
+	EstimatePartialSeqScanSpace(pcxt, estate, plannedstmt_str,
+								&plannedstmt_size, &pscan_size);
+	EstimateParallelSupportInfoSpace(pcxt, estate->es_param_list_info,
+									 estate->es_instrument, &params_size);
+	EstimateResponseQueueSpace(pcxt);
+
+	InitializeParallelDSM(pcxt);
+	
+	StorePartialSeqScan(pcxt, estate, rel, plannedstmt_str,
+						plannedstmt_size, pscan_size);
+	StoreParallelSupportInfo(pcxt, estate->es_param_list_info,
+							 estate->es_instrument,
+							 params_size, inst_options_space,
+							 buffer_usage_space);
+	StoreResponseQueue(pcxt, responseqp);
+
+	/* Return results to caller. */
+	*pcxtp = pcxt;
+}
+
+/*
+ * GetParallelSupportInfo
+ *
+ * Look up based on keys in dynamic shared memory segment
+ * and get the bind parameters and instrumentation information
+ * required to perform parallel operation.
+ */
+void
+GetParallelSupportInfo(shm_toc *toc, ParamListInfo *params,
+					   int *inst_options, char **instrument,
+					   char **buffer_usage)
+{
+	char		*paramsdata;
+	char		*inst_options_space;
+	char		*buffer_usage_space;
+	int			*instoptions;
+
+	paramsdata = shm_toc_lookup(toc, PARALLEL_KEY_PARAMS);
+	instoptions	= shm_toc_lookup(toc, PARALLEL_KEY_INST_OPTIONS);
+
+	*params = RestoreBoundParams(paramsdata);
+
+	*inst_options = *instoptions;
+	if (*inst_options)
+	{
+		inst_options_space = shm_toc_lookup(toc, PARALLEL_KEY_INST_INFO);
+		*instrument = (inst_options_space +
+			ParallelWorkerNumber * sizeof(Instrumentation));
+	}
+
+	buffer_usage_space = shm_toc_lookup(toc, PARALLEL_KEY_BUFF_USAGE);
+	*buffer_usage = (buffer_usage_space +
+					 ParallelWorkerNumber * sizeof(BufferUsage));
+}
+
+/*
+ * GetPlannedStmt
+ *
+ * Look up based on keys in dynamic shared memory segment
+ * and get the planned statement required to perform
+ * parallel operation.
+ */
+void
+GetPlannedStmt(shm_toc *toc, PlannedStmt **plannedstmt)
+{
+	char		*plannedstmtdata;
+
+	plannedstmtdata = shm_toc_lookup(toc, PARALLEL_KEY_PLANNEDSTMT);
+
+	*plannedstmt = (PlannedStmt *) stringToNode(plannedstmtdata);
+
+	/* Fill in opfuncid values if missing */
+	fix_opfuncids((Node*) (*plannedstmt)->planTree->qual);
+	fix_opfuncids((Node*) (*plannedstmt)->planTree->targetlist);
+}
+
+/*
+ * SetupResponseQueue
+ *
+ * Look up based on keys in dynamic shared memory segment
+ * and get the tuple queue information for a particular worker,
+ * attach to the queue and redirect all further responses from
+ * worker backend via that queue.
+ */
+void
+SetupResponseQueue(dsm_segment *seg, shm_toc *toc, shm_mq **mq,
+				   shm_mq_handle **responseq)
+{
+	char		*tuple_queue_space;
+
+	tuple_queue_space = shm_toc_lookup(toc, PARALLEL_KEY_TUPLE_QUEUE);
+	*mq = (shm_mq *) (tuple_queue_space +
+		ParallelWorkerNumber * PARALLEL_TUPLE_QUEUE_SIZE);
+
+	shm_mq_set_sender(*mq, MyProc);
+	*responseq = shm_mq_attach(*mq, seg, NULL);
+}
+
+/*
+ * ParallelQueryMain
+ *
+ * Execute the operation to return the tuples or other information
+ * to parallelism driving node.
+ */
+void
+ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
+{
+	shm_mq			*mq;
+	shm_mq_handle	*responseq;
+	PlannedStmt		*plannedstmt;
+	ParamListInfo	params;
+	int				inst_options;
+	char			*instrument = NULL;
+	char			*buffer_usage = NULL;
+	ParallelStmt	*parallelstmt;
+
+	SetupResponseQueue(seg, toc, &mq, &responseq);
+
+	GetPlannedStmt(toc, &plannedstmt);
+	GetParallelSupportInfo(toc, &params, &inst_options,
+						   &instrument, &buffer_usage);
+
+	parallelstmt = palloc(sizeof(ParallelStmt));
+
+	parallelstmt->plannedstmt = plannedstmt;
+	parallelstmt->params	= params;
+	parallelstmt->inst_options = inst_options;
+	parallelstmt->instrument = instrument;
+	parallelstmt->buffer_usage = buffer_usage;
+	parallelstmt->toc = toc;
+	parallelstmt->responseq = responseq;
+
+	/* Execute the worker command. */
+	exec_parallel_stmt(parallelstmt);
+
+	/*
+	 * Once we are done with sending tuples, detach from
+	 * shared memory message queue used to send tuples.
+	 */
+	shm_mq_detach(mq);
+}
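Reviewer note, not part of the patch: the BufferUsage and Instrumentation chunks above are allocated as one contiguous block with one slot per worker, and each worker finds its own slot by offsetting from the chunk base with its worker number. The standalone C sketch below illustrates only that layout. ToyUsage and every toy_* name are hypothetical, malloc stands in for dynamic shared memory, and the "workers" run sequentially in one process.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for BufferUsage; the real struct has more counters. */
typedef struct ToyUsage
{
	long		blks_hit;
	long		blks_read;
} ToyUsage;

/* Estimate phase: size of one chunk holding a slot for every worker. */
static size_t
toy_estimate_usage_space(int nworkers)
{
	return sizeof(ToyUsage) * (size_t) nworkers;
}

/* Worker side: locate this worker's slot from the chunk base and its number. */
static ToyUsage *
toy_worker_slot(char *base, int worker_number)
{
	return (ToyUsage *) (base + (size_t) worker_number * sizeof(ToyUsage));
}

/* Master side: sum what every worker wrote into its own slot. */
static ToyUsage
toy_aggregate(char *base, int nworkers)
{
	ToyUsage	total = {0, 0};
	int			i;

	for (i = 0; i < nworkers; i++)
	{
		ToyUsage   *slot = toy_worker_slot(base, i);

		total.blks_hit += slot->blks_hit;
		total.blks_read += slot->blks_read;
	}
	return total;
}

/* Simulate nworkers (> 0) workers; worker i reports i+1 hits, 2*(i+1) reads. */
static ToyUsage
toy_demo(int nworkers)
{
	size_t		size = toy_estimate_usage_space(nworkers);
	char	   *base = malloc(size);
	ToyUsage	total;
	int			i;

	assert(base != NULL);
	memset(base, 0, size);
	for (i = 0; i < nworkers; i++)
	{
		ToyUsage	mine = {i + 1, 2 * (i + 1)};

		memcpy(toy_worker_slot(base, i), &mine, sizeof(ToyUsage));
	}
	total = toy_aggregate(base, nworkers);
	free(base);
	return total;
}
```

With three simulated workers, toy_demo(3) aggregates to 6 hits and 12 reads. Because each worker writes only to its own slot, the sketch needs no locking; whether the patch relies on the same property is not stated in this hunk.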
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 9b2e7f3..0c6b481 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -103,6 +103,7 @@
 #include "miscadmin.h"
 #include "pg_getopt.h"
 #include "pgstat.h"
+#include "optimizer/cost.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/bgworker_internals.h"
 #include "postmaster/fork_process.h"
@@ -835,6 +836,12 @@ PostmasterMain(int argc, char *argv[])
 		ereport(ERROR,
 				(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"archive\", \"hot_standby\", or \"logical\"")));
 
+	if (parallel_seqscan_degree >= MaxConnections)
+	{
+		write_stderr("%s: parallel_seqscan_degree must be less than max_connections\n", progname);
+		ExitPostmaster(1);
+	}
+
 	/*
 	 * Other one-time internal sanity checks can go here, if they are fast.
 	 * (Put any slow processing further down, after postmaster.pid creation.)
diff --git a/src/backend/tcop/dest.c b/src/backend/tcop/dest.c
index bcf3895..7a9ce3e 100644
--- a/src/backend/tcop/dest.c
+++ b/src/backend/tcop/dest.c
@@ -34,6 +34,7 @@
 #include "commands/createas.h"
 #include "commands/matview.h"
 #include "executor/functions.h"
+#include "executor/tqueue.h"
 #include "executor/tstoreReceiver.h"
 #include "libpq/libpq.h"
 #include "libpq/pqformat.h"
@@ -129,6 +130,9 @@ CreateDestReceiver(CommandDest dest)
 
 		case DestTransientRel:
 			return CreateTransientRelDestReceiver(InvalidOid);
+
+		case DestTupleQueue:
+			return CreateTupleQueueDestReceiver();
 	}
 
 	/* should never get here */
@@ -162,6 +166,7 @@ EndCommand(const char *commandTag, CommandDest dest)
 		case DestCopyOut:
 		case DestSQLFunction:
 		case DestTransientRel:
+		case DestTupleQueue:
 			break;
 	}
 }
@@ -204,6 +209,7 @@ NullCommand(CommandDest dest)
 		case DestCopyOut:
 		case DestSQLFunction:
 		case DestTransientRel:
+		case DestTupleQueue:
 			break;
 	}
 }
@@ -248,6 +254,7 @@ ReadyForQuery(CommandDest dest)
 		case DestCopyOut:
 		case DestSQLFunction:
 		case DestTransientRel:
+		case DestTupleQueue:
 			break;
 	}
 }
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 7c18298..bc967ee 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -42,6 +42,7 @@
 #include "catalog/pg_type.h"
 #include "commands/async.h"
 #include "commands/prepare.h"
+#include "executor/tqueue.h"
 #include "libpq/libpq.h"
 #include "libpq/pqformat.h"
 #include "libpq/pqsignal.h"
@@ -55,6 +56,7 @@
 #include "pg_getopt.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/postmaster.h"
+#include "postmaster/backendworker.h"
 #include "replication/slot.h"
 #include "replication/walsender.h"
 #include "rewrite/rewriteHandler.h"
@@ -1192,6 +1194,96 @@ exec_simple_query(const char *query_string)
 }
 
 /*
+ * exec_parallel_stmt
+ *
+ * Execute the plan for backend worker.
+ */
+void
+exec_parallel_stmt(ParallelStmt *parallelstmt)
+{
+	DestReceiver *receiver;
+	QueryDesc	*queryDesc;
+	MemoryContext oldcontext;
+	MemoryContext	plancontext;
+	BufferUsage bufusage_start;
+	BufferUsage bufusage_end = {0};
+
+	set_ps_display("SELECT", false);
+
+	/*
+	 * Unlike exec_simple_query(), in backend worker we won't allow
+	 * transaction control statements, so we can allow plancontext
+	 * to be created in TopTransaction context.
+	 */
+	plancontext = AllocSetContextCreate(CurrentMemoryContext,
+										 "worker plan",
+										 ALLOCSET_DEFAULT_MINSIZE,
+										 ALLOCSET_DEFAULT_INITSIZE,
+										 ALLOCSET_DEFAULT_MAXSIZE);
+
+	oldcontext = MemoryContextSwitchTo(plancontext);
+
+	if (parallelstmt->inst_options)
+		receiver = None_Receiver;
+	else
+	{
+		receiver = CreateDestReceiver(DestTupleQueue);
+		SetTupleQueueDestReceiverParams(receiver, parallelstmt->responseq);
+	}
+
+	/* Create a QueryDesc for the query */
+	queryDesc = CreateQueryDesc(parallelstmt->plannedstmt, "",
+								GetActiveSnapshot(), InvalidSnapshot,
+								receiver, parallelstmt->params,
+								parallelstmt->inst_options);
+
+	queryDesc->toc = parallelstmt->toc;
+
+	PushActiveSnapshot(queryDesc->snapshot);
+
+	/* call ExecutorStart to prepare the plan for execution */
+	ExecutorStart(queryDesc, 0);
+
+	/*
+	 * Calculate the buffer usage for this statement run; it is required
+	 * by plugins to report the total usage for statement execution.
+	 */
+	bufusage_start = pgBufferUsage;
+
+	/* run the plan */
+	ExecutorRun(queryDesc, ForwardScanDirection, 0L);
+
+	BufferUsageAccumDiff(&bufusage_end,
+						 &pgBufferUsage, &bufusage_start);
+
+	/* run cleanup too */
+	ExecutorFinish(queryDesc);
+
+	/* copy buffer usage into shared memory. */
+	memcpy(parallelstmt->buffer_usage,
+		   &bufusage_end,
+		   sizeof(BufferUsage));
+
+	/*
+	 * copy instrumentation information into shared memory if requested
+	 * by master backend.
+	 */
+	if (parallelstmt->inst_options)
+		memcpy(parallelstmt->instrument,
+			   queryDesc->planstate->instrument,
+			   sizeof(Instrumentation));
+
+	ExecutorEnd(queryDesc);
+
+	PopActiveSnapshot();
+
+	FreeQueryDesc(queryDesc);
+
+	if (!parallelstmt->inst_options)
+		(*receiver->rDestroy) (receiver);
+}
+
+/*
  * exec_parse_message
  *
  * Execute a "Parse" protocol message.
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 9c14e8a..0bbc67b 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -80,6 +80,7 @@ CreateQueryDesc(PlannedStmt *plannedstmt,
 	qd->params = params;		/* parameter values passed into query */
 	qd->instrument_options = instrument_options;		/* instrumentation
 														 * wanted? */
+	qd->toc = NULL;		/* needs to be set by the caller before ExecutorStart */
 
 	/* null these fields until set by ExecutorStart */
 	qd->tupDesc = NULL;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 9c74ed3..fc1d639 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -608,6 +608,8 @@ const char *const config_group_names[] =
 	gettext_noop("Statistics / Query and Index Statistics Collector"),
 	/* AUTOVACUUM */
 	gettext_noop("Autovacuum"),
+	/* PARALLEL_QUERY */
+	gettext_noop("Parallel Query"),
 	/* CLIENT_CONN */
 	gettext_noop("Client Connection Defaults"),
 	/* CLIENT_CONN_STATEMENT */
@@ -2557,6 +2559,16 @@ static struct config_int ConfigureNamesInt[] =
 	},
 
 	{
+		{"parallel_seqscan_degree", PGC_SUSET, PARALLEL_QUERY,
+			gettext_noop("Sets the maximum number of simultaneously running backend worker processes."),
+			NULL
+		},
+		&parallel_seqscan_degree,
+		0, 0, MAX_BACKENDS,
+		NULL, NULL, NULL
+	},
+
+	{
 		{"autovacuum_work_mem", PGC_SIGHUP, RESOURCES_MEM,
 			gettext_noop("Sets the maximum memory to be used by each autovacuum worker process."),
 			NULL,
@@ -2744,6 +2756,36 @@ static struct config_real ConfigureNamesReal[] =
 		DEFAULT_CPU_OPERATOR_COST, 0, DBL_MAX,
 		NULL, NULL, NULL
 	},
+	{
+		{"cpu_tuple_comm_cost", PGC_USERSET, QUERY_TUNING_COST,
+			gettext_noop("Sets the planner's estimate of the cost of "
+						 "passing each tuple (row) from worker to master backend."),
+			NULL
+		},
+		&cpu_tuple_comm_cost,
+		DEFAULT_CPU_TUPLE_COMM_COST, 0, DBL_MAX,
+		NULL, NULL, NULL
+	},
+	{
+		{"parallel_setup_cost", PGC_USERSET, QUERY_TUNING_COST,
+			gettext_noop("Sets the planner's estimate of the cost of "
+						 "setting up environment (shared memory) for parallelism."),
+			NULL
+		},
+		&parallel_setup_cost,
+		DEFAULT_PARALLEL_SETUP_COST, 0, DBL_MAX,
+		NULL, NULL, NULL
+	},
+	{
+		{"parallel_startup_cost", PGC_USERSET, QUERY_TUNING_COST,
+			gettext_noop("Sets the planner's estimate of the cost of "
+						 "starting parallel workers."),
+			NULL
+		},
+		&parallel_startup_cost,
+		DEFAULT_PARALLEL_STARTUP_COST, 0, DBL_MAX,
+		NULL, NULL, NULL
+	},
 
 	{
 		{"cursor_tuple_fraction", PGC_USERSET, QUERY_TUNING_OTHER,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 110983f..06c5969 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -291,6 +291,9 @@
 #cpu_tuple_cost = 0.01			# same scale as above
 #cpu_index_tuple_cost = 0.005		# same scale as above
 #cpu_operator_cost = 0.0025		# same scale as above
+#cpu_tuple_comm_cost = 0.1		# same scale as above
+#parallel_setup_cost = 0.0	# same scale as above
+#parallel_startup_cost = 0.0	# same scale as above
 #effective_cache_size = 4GB
 
 # - Genetic Query Optimizer -
@@ -501,6 +504,11 @@
 					# autovacuum, -1 means use
 					# vacuum_cost_limit
 
+#------------------------------------------------------------------------------
+# PARALLEL_QUERY PARAMETERS
+#------------------------------------------------------------------------------
+
+#parallel_seqscan_degree = 0		# max number of worker backend subprocesses
 
 #------------------------------------------------------------------------------
 # CLIENT CONNECTION DEFAULTS
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index d36e738..0a34b48 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -117,6 +117,7 @@ extern HeapScanDesc heap_beginscan_bm(Relation relation, Snapshot snapshot,
 extern void heap_setscanlimits(HeapScanDesc scan, BlockNumber startBlk,
 		   BlockNumber endBlk);
 extern void heap_rescan(HeapScanDesc scan, ScanKey key);
+extern void heap_parallel_rescan(ParallelHeapScanDesc pscan, HeapScanDesc scan);
 extern void heap_endscan(HeapScanDesc scan);
 extern HeapTuple heap_getnext(HeapScanDesc scan, ScanDirection direction);
 
diff --git a/src/include/executor/execdesc.h b/src/include/executor/execdesc.h
index a2381cd..56b7c75 100644
--- a/src/include/executor/execdesc.h
+++ b/src/include/executor/execdesc.h
@@ -42,6 +42,7 @@ typedef struct QueryDesc
 	DestReceiver *dest;			/* the destination for tuple output */
 	ParamListInfo params;		/* param values being passed in */
 	int			instrument_options;		/* OR of InstrumentOption flags */
+	shm_toc		*toc;			/* to fetch the information from dsm */
 
 	/* These fields are set by ExecutorStart */
 	TupleDesc	tupDesc;		/* descriptor for result tuples */
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 1c3b2b0..0d28606 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -69,5 +69,12 @@ extern Instrumentation *InstrAlloc(int n, int instrument_options);
 extern void InstrStartNode(Instrumentation *instr);
 extern void InstrStopNode(Instrumentation *instr, double nTuples);
 extern void InstrEndLoop(Instrumentation *instr);
+extern void InstrAggNode(Instrumentation *instr1, Instrumentation *instr2);
+extern void
+	InstrAggBufferUsage(BufferUsage *buffer_usage_dst, BufferUsage *buffer_usage_add);
+extern void BufferUsageAccumDiff(BufferUsage *dst,
+					 const BufferUsage *add,
+					 const BufferUsage *sub);
+extern void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
 
 #endif   /* INSTRUMENT_H */
diff --git a/src/include/executor/nodeFunnel.h b/src/include/executor/nodeFunnel.h
new file mode 100644
index 0000000..3af3a0e
--- /dev/null
+++ b/src/include/executor/nodeFunnel.h
@@ -0,0 +1,24 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeFunnel.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeFunnel.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEFUNNEL_H
+#define NODEFUNNEL_H
+
+#include "nodes/execnodes.h"
+
+extern FunnelState *ExecInitFunnel(Funnel *node, EState *estate, int eflags);
+extern TupleTableSlot *ExecFunnel(FunnelState *node);
+extern void ExecEndFunnel(FunnelState *node);
+extern void ExecReScanFunnel(FunnelState *node);
+
+#endif   /* NODEFUNNEL_H */
diff --git a/src/include/executor/nodePartialSeqscan.h b/src/include/executor/nodePartialSeqscan.h
new file mode 100644
index 0000000..cb05be7
--- /dev/null
+++ b/src/include/executor/nodePartialSeqscan.h
@@ -0,0 +1,24 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodePartialSeqscan.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodePartialSeqscan.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEPARTIALSEQSCAN_H
+#define NODEPARTIALSEQSCAN_H
+
+#include "nodes/execnodes.h"
+
+extern PartialSeqScanState *ExecInitPartialSeqScan(PartialSeqScan *node, EState *estate, int eflags);
+extern TupleTableSlot *ExecPartialSeqScan(PartialSeqScanState *node);
+extern void ExecEndPartialSeqScan(PartialSeqScanState *node);
+extern void ExecReScanPartialSeqScan(PartialSeqScanState *node);
+
+#endif   /* NODEPARTIALSEQSCAN_H */
diff --git a/src/include/executor/tqueue.h b/src/include/executor/tqueue.h
new file mode 100644
index 0000000..c979233
--- /dev/null
+++ b/src/include/executor/tqueue.h
@@ -0,0 +1,34 @@
+/*-------------------------------------------------------------------------
+ *
+ * tqueue.h
+ *	  Use shm_mq to send & receive tuples between parallel backends
+ *
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/tqueue.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef TQUEUE_H
+#define TQUEUE_H
+
+#include "storage/shm_mq.h"
+#include "tcop/dest.h"
+
+/* Use this to send tuples to a shm_mq. */
+extern DestReceiver *CreateTupleQueueDestReceiver(void);
+extern void SetTupleQueueDestReceiverParams(DestReceiver *self,
+						shm_mq_handle *handle);
+
+/* Use these to receive tuples from a shm_mq. */
+typedef struct TupleQueueFunnel TupleQueueFunnel;
+extern TupleQueueFunnel *CreateTupleQueueFunnel(void);
+extern void DestroyTupleQueueFunnel(TupleQueueFunnel *funnel);
+extern void RegisterTupleQueueOnFunnel(TupleQueueFunnel *, shm_mq_handle *);
+extern HeapTuple TupleQueueFunnelNext(TupleQueueFunnel *, bool nowait,
+					 bool *done);
+
+#endif   /* TQUEUE_H */
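Reviewer note, not part of the patch: the funnel API declared above registers several queues and hands back one tuple at a time plus a "done" flag. One plausible way to consume several per-worker queues is to visit them round-robin, skipping drained ones. The standalone C sketch below shows that idea only. ToyQueue, ToyFunnel and the toy_* functions are hypothetical, integers stand in for tuples, arrays stand in for shm_mq, and the nowait behaviour implied by the real signature is not modelled. It is not taken from tqueue.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_QUEUES 8

/* Hypothetical in-memory queue standing in for one worker's shm_mq. */
typedef struct ToyQueue
{
	const int  *items;			/* values the "worker" produced */
	int			nitems;
	int			next;			/* index of the next unread item */
} ToyQueue;

typedef struct ToyFunnel
{
	ToyQueue   *queues[TOY_MAX_QUEUES];
	int			nqueues;
	int			cursor;			/* queue to try first on the next call */
} ToyFunnel;

static void
toy_funnel_init(ToyFunnel *f)
{
	f->nqueues = 0;
	f->cursor = 0;
}

/* Register a queue; returns false if the funnel is already full. */
static bool
toy_funnel_register(ToyFunnel *f, ToyQueue *q)
{
	if (f->nqueues >= TOY_MAX_QUEUES)
		return false;
	f->queues[f->nqueues++] = q;
	return true;
}

/*
 * Return the next value from any queue, visiting queues round-robin.
 * Sets *done to true (and returns -1) once every queue is drained.
 */
static int
toy_funnel_next(ToyFunnel *f, bool *done)
{
	int			tried;

	*done = false;
	for (tried = 0; tried < f->nqueues; tried++)
	{
		ToyQueue   *q = f->queues[f->cursor];

		f->cursor = (f->cursor + 1) % f->nqueues;
		if (q->next < q->nitems)
			return q->items[q->next++];
	}
	*done = true;
	return -1;
}
```

Round-robin keeps one fast worker from starving the others' queues, at the cost of giving no ordering guarantee across workers; that trade-off is an assumption of this sketch, not something the header states.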
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 59b17f3..f829175 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -16,7 +16,9 @@
 
 #include "access/genam.h"
 #include "access/heapam.h"
+#include "access/parallel.h"
 #include "executor/instrument.h"
+#include "executor/tqueue.h"
 #include "nodes/params.h"
 #include "nodes/plannodes.h"
 #include "utils/reltrigger.h"
@@ -389,6 +391,18 @@ typedef struct EState
 	List	   *es_auxmodifytables;		/* List of secondary ModifyTableStates */
 
 	/*
+	 * This is required for parallel plan execution to fetch the
+	 * information from dsm.
+	 */
+	shm_toc		*toc;
+
+	/*
+	 * This is required to collect buffer usage stats from parallel
+	 * workers when requested by plugins.
+	 */
+	bool		total_time;	/* total time spent in ExecutorRun */
+
+	/*
 	 * this ExprContext is for per-output-tuple operations, such as constraint
 	 * checks and index-value computations.  It will be reset for each output
 	 * tuple.  Note that it will be created only if needed.
@@ -1213,6 +1227,41 @@ typedef struct ScanState
 typedef ScanState SeqScanState;
 
 /*
+ * PartialSeqScan uses a bare SeqScanState as its state node, since
+ * it needs no additional fields.
+ */
+typedef SeqScanState PartialSeqScanState;
+
+/*
+ * FunnelState extends ScanState by storing additional information
+ * related to parallel workers.
+ *		pcxt				parallel context for managing generic state information
+ *							required for parallelism.
+ *		responseq			shared memory queues to receive data from workers.
+ *		funnel				maintains the runtime information about queues used to
+ *							receive data from parallel workers.
+ *		inst_options_space	to accumulate instrumentation information from all
+ *							parallel workers.
+ *		buffer_usage_space	to accumulate buffer usage information from all
+ *							parallel workers.
+ *		fs_workersReady		indicates that workers are launched.
+ *		all_workers_done	indicates that all the data from workers has been received.
+ *		local_scan_done		indicates that local scan is completed.
+ */
+typedef struct FunnelState
+{
+	ScanState		ss;				/* its first field is NodeTag */
+	ParallelContext *pcxt;
+	shm_mq_handle	**responseq;
+	TupleQueueFunnel *funnel;
+	char			*inst_options_space;
+	char			*buffer_usage_space;
+	bool			fs_workersReady;
+	bool			all_workers_done;
+	bool			local_scan_done;
+} FunnelState;
+
+/*
  * These structs store information about index quals that don't have simple
  * constant right-hand sides.  See comments for ExecIndexBuildScanKeys()
  * for discussion.
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 38469ef..3f3d572 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -51,6 +51,8 @@ typedef enum NodeTag
 	T_BitmapOr,
 	T_Scan,
 	T_SeqScan,
+	T_PartialSeqScan,
+	T_Funnel,
 	T_IndexScan,
 	T_IndexOnlyScan,
 	T_BitmapIndexScan,
@@ -97,6 +99,8 @@ typedef enum NodeTag
 	T_BitmapOrState,
 	T_ScanState,
 	T_SeqScanState,
+	T_PartialSeqScanState,
+	T_FunnelState,
 	T_IndexScanState,
 	T_IndexOnlyScanState,
 	T_BitmapIndexScanState,
@@ -217,6 +221,7 @@ typedef enum NodeTag
 	T_IndexOptInfo,
 	T_ParamPathInfo,
 	T_Path,
+	T_FunnelPath,
 	T_IndexPath,
 	T_BitmapHeapPath,
 	T_BitmapAndPath,
diff --git a/src/include/nodes/params.h b/src/include/nodes/params.h
index a0f7dd0..65b60a0 100644
--- a/src/include/nodes/params.h
+++ b/src/include/nodes/params.h
@@ -103,4 +103,9 @@ typedef struct ParamExecData
 /* Functions found in src/backend/nodes/params.c */
 extern ParamListInfo copyParamList(ParamListInfo from);
 
+extern Size
+EstimateBoundParametersSpace(ParamListInfo params);
+extern void
+SerializeBoundParams(ParamListInfo params, Size maxsize, char *start_address);
+extern ParamListInfo RestoreBoundParams(char *start_address);
 #endif   /* PARAMS_H */
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index fefddb5..b17021f 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -20,10 +20,15 @@
 #ifndef PARSENODES_H
 #define PARSENODES_H
 
+#include "executor/instrument.h"
 #include "nodes/bitmapset.h"
 #include "nodes/lockoptions.h"
+#include "nodes/params.h"
+#include "nodes/plannodes.h"
 #include "nodes/primnodes.h"
 #include "nodes/value.h"
+#include "storage/shm_toc.h"
+#include "storage/shm_mq.h"
 
 /* Possible sources of a Query */
 typedef enum QuerySource
@@ -156,6 +161,17 @@ typedef struct Query
 								 * depends on to be semantically valid */
 } Query;
 
+/* worker statement required for parallel execution. */
+typedef struct ParallelStmt
+{
+	PlannedStmt		*plannedstmt;
+	ParamListInfo	params;
+	shm_toc			*toc;
+	shm_mq_handle	*responseq;
+	int				inst_options;
+	char			*instrument;
+	char			*buffer_usage;
+} ParallelStmt;
 
 /****************************************************************************
  *	Supporting data structures for Parse Trees
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 5f0ea1c..7cdf632 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -281,6 +281,22 @@ typedef struct Scan
 typedef Scan SeqScan;
 
 /* ----------------
+ *		partial sequential scan node
+ * ----------------
+ */
+typedef SeqScan PartialSeqScan;
+
+/* ----------------
+ *		parallel sequential scan node
+ * ----------------
+ */
+typedef struct Funnel
+{
+	Scan		scan;
+	int			num_workers;
+} Funnel;
+
+/* ----------------
  *		index scan node
  *
  * indexqualorig is an implicitly-ANDed list of index qual expressions, each
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 72eb49b..c3e1f6a 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -741,6 +741,13 @@ typedef struct Path
 	/* pathkeys is a List of PathKey nodes; see above */
 } Path;
 
+typedef struct FunnelPath
+{
+	Path		path;
+	Path	    *subpath;	/* path for each worker */
+	int			num_workers;
+} FunnelPath;
+
 /* Macro for extracting a path's parameterization relids; beware double eval */
 #define PATH_REQ_OUTER(path)  \
 	((path)->param_info ? (path)->param_info->ppi_req_outer : (Relids) NULL)
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 9c2000b..11f0409 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -26,6 +26,14 @@
 #define DEFAULT_CPU_TUPLE_COST	0.01
 #define DEFAULT_CPU_INDEX_TUPLE_COST 0.005
 #define DEFAULT_CPU_OPERATOR_COST  0.0025
+#define DEFAULT_CPU_TUPLE_COMM_COST 0.1
+/*
+ * XXX - We need some experiments to know what could be
+ * appropriate default values for parallel setup and startup
+ * cost.
+ */
+#define	DEFAULT_PARALLEL_SETUP_COST  0.0
+#define	DEFAULT_PARALLEL_STARTUP_COST  0.0
 
 #define DEFAULT_EFFECTIVE_CACHE_SIZE  524288	/* measured in pages */
 
@@ -48,8 +56,12 @@ extern PGDLLIMPORT double random_page_cost;
 extern PGDLLIMPORT double cpu_tuple_cost;
 extern PGDLLIMPORT double cpu_index_tuple_cost;
 extern PGDLLIMPORT double cpu_operator_cost;
+extern PGDLLIMPORT double cpu_tuple_comm_cost;
+extern PGDLLIMPORT double parallel_setup_cost;
+extern PGDLLIMPORT double parallel_startup_cost;
 extern PGDLLIMPORT int effective_cache_size;
 extern Cost disable_cost;
+extern int	parallel_seqscan_degree;
 extern bool enable_seqscan;
 extern bool enable_indexscan;
 extern bool enable_indexonlyscan;
@@ -68,6 +80,8 @@ extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,
 					double index_pages, PlannerInfo *root);
 extern void cost_seqscan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 			 ParamPathInfo *param_info);
+extern void cost_funnel(FunnelPath *path, PlannerInfo *root,
+				RelOptInfo *baserel, ParamPathInfo *param_info, int nWorkers);
 extern void cost_index(IndexPath *path, PlannerInfo *root,
 		   double loop_count);
 extern void cost_bitmap_heap_scan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 9923f0e..7873565 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -32,6 +32,11 @@ extern bool add_path_precheck(RelOptInfo *parent_rel,
 
 extern Path *create_seqscan_path(PlannerInfo *root, RelOptInfo *rel,
 					Relids required_outer);
+extern Path *
+create_partialseqscan_path(PlannerInfo *root, RelOptInfo *rel,
+					Relids required_outer);
+extern FunnelPath *create_funnel_path(PlannerInfo *root,
+						RelOptInfo *rel, Path *subpath, int nWorkers);
 extern IndexPath *create_index_path(PlannerInfo *root,
 				  IndexOptInfo *index,
 				  List *indexclauses,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 6cad92e..391d519 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -46,6 +46,13 @@ extern void debug_print_rel(PlannerInfo *root, RelOptInfo *rel);
 #endif
 
 /*
+ * parallelpath.c
+ *	  routines to generate parallel scan paths
+ */
+
+extern void create_parallelscan_paths(PlannerInfo *root, RelOptInfo *rel);
+
+/*
  * indxpath.c
  *	  routines to generate index paths
  */
diff --git a/src/include/optimizer/planner.h b/src/include/optimizer/planner.h
index cd62aec..7bc7d7e 100644
--- a/src/include/optimizer/planner.h
+++ b/src/include/optimizer/planner.h
@@ -14,6 +14,7 @@
 #ifndef PLANNER_H
 #define PLANNER_H
 
+#include "nodes/parsenodes.h"
 #include "nodes/plannodes.h"
 #include "nodes/relation.h"
 
@@ -29,6 +30,8 @@ extern PlannedStmt *planner(Query *parse, int cursorOptions,
 		ParamListInfo boundParams);
 extern PlannedStmt *standard_planner(Query *parse, int cursorOptions,
 				 ParamListInfo boundParams);
+extern PlannedStmt	*create_parallel_worker_plannedstmt(PartialSeqScan *partialscan,
+											List *rangetable);
 
 extern Plan *subquery_planner(PlannerGlobal *glob, Query *parse,
 				 PlannerInfo *parent_root,
diff --git a/src/include/postmaster/backendworker.h b/src/include/postmaster/backendworker.h
new file mode 100644
index 0000000..bf91824
--- /dev/null
+++ b/src/include/postmaster/backendworker.h
@@ -0,0 +1,40 @@
+/*--------------------------------------------------------------------
+ * backendworker.h
+ *		POSTGRES backend workers interface
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/postmaster/backendworker.h
+ *--------------------------------------------------------------------
+ */
+#ifndef BACKENDWORKER_H
+#define BACKENDWORKER_H
+
+/*---------------------------------------------------------------------
+ * External module API.
+ *---------------------------------------------------------------------
+ */
+
+#include "libpq/pqmq.h"
+
+/* Table-of-contents constants for our dynamic shared memory segment. */
+#define	PARALLEL_KEY_PLANNEDSTMT	0
+#define	PARALLEL_KEY_PARAMS			1
+#define PARALLEL_KEY_BUFF_USAGE		2
+#define PARALLEL_KEY_INST_OPTIONS	3
+#define PARALLEL_KEY_INST_INFO		4
+#define PARALLEL_KEY_TUPLE_QUEUE	5
+#define PARALLEL_KEY_SCAN			6
+
+extern int	parallel_seqscan_degree;
+
+extern void InitializeParallelWorkers(Plan *plan, EState *estate,
+									  Relation rel, char **inst_options_space,
+									  char **buffer_usage_space,
+									  shm_mq_handle ***responseqp,
+									  ParallelContext **pcxtp,
+									  int nWorkers);
+
+#endif   /* BACKENDWORKER_H */
diff --git a/src/include/tcop/dest.h b/src/include/tcop/dest.h
index 5bcca3f..b560672 100644
--- a/src/include/tcop/dest.h
+++ b/src/include/tcop/dest.h
@@ -94,7 +94,8 @@ typedef enum
 	DestIntoRel,				/* results sent to relation (SELECT INTO) */
 	DestCopyOut,				/* results sent to COPY TO code */
 	DestSQLFunction,			/* results sent to SQL-language func mgr */
-	DestTransientRel			/* results sent to transient relation */
+	DestTransientRel,			/* results sent to transient relation */
+	DestTupleQueue				/* results sent to tuple queue */
 } CommandDest;
 
 /* ----------------
diff --git a/src/include/tcop/tcopprot.h b/src/include/tcop/tcopprot.h
index b3c705f..5c25627 100644
--- a/src/include/tcop/tcopprot.h
+++ b/src/include/tcop/tcopprot.h
@@ -84,5 +84,6 @@ extern void set_debug_options(int debug_flag,
 extern bool set_plan_disabling_options(const char *arg,
 						   GucContext context, GucSource source);
 extern const char *get_stats_option_name(const char *arg);
+extern void exec_parallel_stmt(ParallelStmt *parallelscan);
 
 #endif   /* TCOPPROT_H */
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index cf319af..38855e5 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -85,6 +85,7 @@ enum config_group
 	STATS_MONITORING,
 	STATS_COLLECTOR,
 	AUTOVACUUM,
+	PARALLEL_QUERY,
 	CLIENT_CONN,
 	CLIENT_CONN_STATEMENT,
 	CLIENT_CONN_LOCALE,
#221Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Amit Kapila (#220)
Re: Parallel Seq Scan

On 20-03-2015 PM 09:06, Amit Kapila wrote:

On Mon, Mar 16, 2015 at 12:58 PM, Amit Langote <
Langote_Amit_f8@lab.ntt.co.jp> wrote:

Actually I meant "currently the last" or:

funnel->nextqueue == funnel->nqueue - 1

So the code you quote would only take care of subset of the cases.

Fixed this issue by resetting funnel->nextqueue to zero (as per offlist
discussion with Robert), so that it restarts from the first queue in such
a case.

How about shm_mq_detach() called from ParallelQueryMain() right after
exec_parallel_stmt() returns? Doesn't that do the SetLatch() that needs
to be done by a worker?

Fixed this issue by not waiting in the case of detached queues.
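
The two fixes quoted above (wrapping nextqueue back to zero after the last queue, and never waiting on a detached queue) can be sketched in isolation. This is only an illustration under stated assumptions, not the code from the patch: DemoQueue, DemoFunnel, DemoResult and demo_funnel_next are hypothetical names, and plain arrays stand in for the shared memory queues.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one worker's tuple queue (not the shm_mq API). */
typedef struct DemoQueue
{
    const int  *items;      /* pending values, standing in for tuples */
    size_t      nitems;     /* number of values in items */
    size_t      next;       /* index of the next value to hand out */
    bool        detached;   /* true once the sending worker has detached */
} DemoQueue;

/* Hypothetical stand-in for the funnel's bookkeeping. */
typedef struct DemoFunnel
{
    DemoQueue  *queues;
    int         nqueues;
    int         nextqueue;  /* queue to poll next */
} DemoFunnel;

typedef enum
{
    DEMO_TUPLE,             /* a value was stored in *value */
    DEMO_WOULD_WAIT,        /* nothing now, but an attached queue may refill */
    DEMO_DONE               /* every queue is empty and detached */
} DemoResult;

/* Poll each queue at most once, starting at funnel->nextqueue. */
static DemoResult
demo_funnel_next(DemoFunnel *funnel, int *value)
{
    bool        any_attached = false;
    int         polled;

    assert(funnel->nqueues > 0);

    for (polled = 0; polled < funnel->nqueues; polled++)
    {
        DemoQueue  *q = &funnel->queues[funnel->nextqueue];

        /* Advance with wrap-around: after the last queue, restart at zero. */
        funnel->nextqueue = (funnel->nextqueue + 1) % funnel->nqueues;

        if (q->next < q->nitems)
        {
            *value = q->items[q->next++];
            return DEMO_TUPLE;
        }

        /* An empty queue is worth waiting for only while still attached. */
        if (!q->detached)
            any_attached = true;
    }

    return any_attached ? DEMO_WOULD_WAIT : DEMO_DONE;
}
```

A caller would sleep only on DEMO_WOULD_WAIT; once all queues are detached and drained it gets DEMO_DONE and can stop, whichever queue happened to be current.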

Thanks for fixing. I no longer see the problems.

Regards,
Amit


#222Rajeev rastogi
rajeev.rastogi@huawei.com
In reply to: Amit Kapila (#220)
Re: Parallel Seq Scan

On 20 March 2015 17:37, Amit Kapila Wrote:

So the patches have to be applied in below sequence:
HEAD Commit-id : 8d1f2390
parallel-mode-v8.1.patch [2]
assess-parallel-safety-v4.patch [1]
parallel-heap-scan.patch [3]
parallel_seqscan_v11.patch (Attached with this mail)

While going through this patch, I noticed an invalid Assert in the function “ExecInitFunnel”:
Assert(outerPlan(node) == NULL);

The outer node of a Funnel node is always non-NULL; currently it is always a PartialSeqScan node.

Perhaps assertions were disabled when the code was built, which would explain why this issue has not been observed yet.
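
To make the point concrete, here is a minimal sketch of the invariant being discussed, using the standard C assert() rather than PostgreSQL's Assert(). DemoPlan, demo_outer_plan, demo_inner_plan and demo_init_funnel are hypothetical names; the only thing taken from PostgreSQL is the convention that the outer plan is the left child and the inner plan is the right child.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, cut-down plan node: outer child on the left, inner on the right. */
typedef struct DemoPlan
{
    const char      *name;
    struct DemoPlan *lefttree;   /* outer plan */
    struct DemoPlan *righttree;  /* inner plan */
} DemoPlan;

#define demo_outer_plan(node) ((node)->lefttree)
#define demo_inner_plan(node) ((node)->righttree)

/*
 * Check the shape of a funnel-like node and return its outer child.
 * The checks vanish when the file is compiled with NDEBUG defined,
 * just as a wrong assertion goes unnoticed in a build without
 * assertion checking.
 */
static DemoPlan *
demo_init_funnel(DemoPlan *funnel)
{
    /* A funnel node has no inner child ... */
    assert(demo_inner_plan(funnel) == NULL);

    /* ... but it must have an outer child (the partial scan). */
    assert(demo_outer_plan(funnel) != NULL);

    return demo_outer_plan(funnel);
}
```

With assertions enabled, asserting that the outer child is NULL would abort on every funnel plan, which is why the check only makes sense on the inner side.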

Thanks and Regards,
Kumar Rajeev Rastogi

#223Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#220)
1 attachment(s)
Re: Parallel Seq Scan

On Fri, Mar 20, 2015 at 5:36 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

So the patches have to be applied in below sequence:
HEAD Commit-id : 8d1f2390
parallel-mode-v8.1.patch [2]
assess-parallel-safety-v4.patch [1]
parallel-heap-scan.patch [3]
parallel_seqscan_v11.patch (Attached with this mail)

The reason for not using the latest commit in HEAD is that latest
version of assess-parallel-safety patch was not getting applied,
so I generated the patch at commit-id where I could apply that
patch successfully.

[1] - /messages/by-id/CA+TgmobJSuefiPOk6+i9WERUgeAB3ggJv7JxLX+r6S5SYydBRQ@mail.gmail.com
[2] - /messages/by-id/CA+TgmoZJjzYnpXChL3gr7NwRUzkAzPMPVKAtDt5sHvC5Cd7RKw@mail.gmail.com
[3] - /messages/by-id/CA+TgmoYJETgeAXUsZROnA7BdtWzPtqExPJNTV1GKcaVMgSdhug@mail.gmail.com

Fixed the issue reported on the assess-parallel-safety thread and another
bug caught while testing joins, and integrated with the latest version of
the parallel-mode patch (parallel-mode-v9 patch).

Apart from that, I have moved the initialization of the DSM segment from
the InitNode phase to ExecFunnel() (on first execution), as per a suggestion
from Robert. The main idea is that, since this creates a large shared memory
segment, the work should be done only when it is really required.
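
The deferred setup described above follows the usual lazy-initialization pattern. The sketch below shows only that pattern under stated assumptions: DemoContext, DemoFunnelState, demo_init_node, demo_exec and demo_end_node are hypothetical names, and a malloc'ed buffer stands in for the dynamic shared memory segment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the parallel context and its large segment. */
typedef struct DemoContext
{
    size_t      segment_size;
    char       *segment;
} DemoContext;

typedef struct DemoFunnelState
{
    DemoContext *pcxt;          /* NULL until the first execution */
    int          init_calls;    /* how often the expensive setup ran */
    int          next_value;    /* next value to return */
    int          limit;         /* number of values to return in total */
} DemoFunnelState;

/* Node initialization stays cheap: no segment is allocated here. */
static void
demo_init_node(DemoFunnelState *state, int limit)
{
    assert(state != NULL);
    state->pcxt = NULL;
    state->init_calls = 0;
    state->next_value = 0;
    state->limit = limit;
}

/* Return the next value; the expensive setup happens on the first call only. */
static bool
demo_exec(DemoFunnelState *state, int *value)
{
    if (state->pcxt == NULL)
    {
        DemoContext *pcxt = malloc(sizeof(DemoContext));

        if (pcxt == NULL)
            abort();
        pcxt->segment_size = 1024 * 1024;
        pcxt->segment = malloc(pcxt->segment_size);
        if (pcxt->segment == NULL)
            abort();
        state->pcxt = pcxt;
        state->init_calls++;
    }

    if (state->next_value >= state->limit)
        return false;
    *value = state->next_value++;
    return true;
}

/* Release the segment only if it was ever created. */
static void
demo_end_node(DemoFunnelState *state)
{
    if (state->pcxt != NULL)
    {
        free(state->pcxt->segment);
        free(state->pcxt);
        state->pcxt = NULL;
    }
}
```

A plan that is initialized but never executed pays nothing for the segment, and the cleanup path has to tolerate the context never having been created.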

HEAD Commit-Id: 11226e38
parallel-mode-v9.patch [2]
assess-parallel-safety-v4.patch [1]
parallel-heap-scan.patch [3]
parallel_seqscan_v12.patch (Attached with this mail)

[1]: /messages/by-id/CA+TgmobJSuefiPOk6+i9WERUgeAB3ggJv7JxLX+r6S5SYydBRQ@mail.gmail.com
[2]: /messages/by-id/CA+TgmoZfSXZhS6qy4Z0786D7iU_AbhBVPQFwLthpSvGieczqHg@mail.gmail.com
[3]: /messages/by-id/CA+TgmoYJETgeAXUsZROnA7BdtWzPtqExPJNTV1GKcaVMgSdhug@mail.gmail.com

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

parallel_seqscan_v12.patch (application/octet-stream)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 6370c1f..22b3cc7 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1595,6 +1595,20 @@ heap_beginscan_parallel(Relation relation, ParallelHeapScanDesc parallel_scan)
 }
 
 /* ----------------
+ *		heap_parallel_rescan		- restart a parallel relation scan
+ * ----------------
+ */
+void
+heap_parallel_rescan(ParallelHeapScanDesc pscan,
+					 HeapScanDesc scan)
+{
+	if (pscan != NULL)
+		scan->rs_parallel = pscan;
+
+	heap_rescan(scan,			/* scan desc */
+				NULL);			/* new scan keys */
+}
+/* ----------------
  *		heap_getnext	- retrieve next tuple in scan
  *
  *		Fix to work with index relations.
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 771f6a8..cdf172c 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -721,6 +721,8 @@ ExplainPreScanNode(PlanState *planstate, Bitmapset **rels_used)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
+		case T_Funnel:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -916,6 +918,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_SeqScan:
 			pname = sname = "Seq Scan";
 			break;
+		case T_PartialSeqScan:
+			pname = sname = "Partial Seq Scan";
+			break;
+		case T_Funnel:
+			pname = sname = "Funnel";
+			break;
 		case T_IndexScan:
 			pname = sname = "Index Scan";
 			break;
@@ -1065,6 +1073,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
+		case T_Funnel:
 		case T_BitmapHeapScan:
 		case T_TidScan:
 		case T_SubqueryScan:
@@ -1206,6 +1216,24 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	}
 
 	/*
+	 * Aggregate instrumentation information of all the backend
+	 * workers for parallel sequence scan.
+	 */
+	if (es->analyze && nodeTag(plan) == T_Funnel)
+	{
+		int i;
+		Instrumentation *instrument_worker;
+		int nworkers = ((FunnelState *)planstate)->pcxt->nworkers;
+		char *inst_info_workers = ((FunnelState *)planstate)->inst_options_space;
+
+		for (i = 0; i < nworkers; i++)
+		{
+			instrument_worker = (Instrumentation *)(inst_info_workers + (i * sizeof(Instrumentation)));
+			InstrAggNode(planstate->instrument, instrument_worker);
+		}
+	}
+
+	/*
 	 * We have to forcibly clean up the instrumentation state because we
 	 * haven't done ExecutorEnd yet.  This is pretty grotty ...
 	 *
@@ -1322,6 +1350,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 				show_tidbitmap_info((BitmapHeapScanState *) planstate, es);
 			break;
 		case T_SeqScan:
+		case T_PartialSeqScan:
 		case T_ValuesScan:
 		case T_CteScan:
 		case T_WorkTableScan:
@@ -1331,6 +1360,14 @@ ExplainNode(PlanState *planstate, List *ancestors,
 				show_instrumentation_count("Rows Removed by Filter", 1,
 										   planstate, es);
 			break;
+		case T_Funnel:
+			show_scan_qual(plan->qual, "Filter", planstate, ancestors, es);
+			if (plan->qual)
+				show_instrumentation_count("Rows Removed by Filter", 1,
+										   planstate, es);
+			ExplainPropertyInteger("Number of Workers",
+				((Funnel *) plan)->num_workers, es);
+			break;
 		case T_FunctionScan:
 			if (es->verbose)
 			{
@@ -2218,6 +2255,8 @@ ExplainTargetRel(Plan *plan, Index rti, ExplainState *es)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
+		case T_Funnel:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index af707b0..991ff51 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -16,14 +16,15 @@ OBJS = execAmi.o execCurrent.o execGrouping.o execJunk.o execMain.o \
        execProcnode.o execQual.o execScan.o execTuples.o \
        execUtils.o functions.o instrument.o nodeAppend.o nodeAgg.o \
        nodeBitmapAnd.o nodeBitmapOr.o \
-       nodeBitmapHeapscan.o nodeBitmapIndexscan.o nodeCustom.o nodeHash.o \
-       nodeHashjoin.o nodeIndexscan.o nodeIndexonlyscan.o \
+       nodeBitmapHeapscan.o nodeBitmapIndexscan.o nodeCustom.o nodeFunnel.o \
+       nodeHash.o nodeHashjoin.o nodeIndexscan.o nodeIndexonlyscan.o \
        nodeLimit.o nodeLockRows.o \
        nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
        nodeNestloop.o nodeFunctionscan.o nodeRecursiveunion.o nodeResult.o \
-       nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
-       nodeValuesscan.o nodeCtescan.o nodeWorktablescan.o \
+       nodeSeqscan.o nodePartialSeqscan.o nodeSetOp.o nodeSort.o \
+       nodeUnique.o nodeValuesscan.o nodeCtescan.o nodeWorktablescan.o \
        nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
-       nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o spi.o
+       nodeForeignscan.o nodeWindowAgg.o tqueue.o tstoreReceiver.o \
+       spi.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 6ebad2f..10dc319 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -24,6 +24,7 @@
 #include "executor/nodeCustom.h"
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeFunctionscan.h"
+#include "executor/nodeFunnel.h"
 #include "executor/nodeGroup.h"
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
@@ -37,6 +38,7 @@
 #include "executor/nodeMergejoin.h"
 #include "executor/nodeModifyTable.h"
 #include "executor/nodeNestloop.h"
+#include "executor/nodePartialSeqscan.h"
 #include "executor/nodeRecursiveunion.h"
 #include "executor/nodeResult.h"
 #include "executor/nodeSeqscan.h"
@@ -155,6 +157,14 @@ ExecReScan(PlanState *node)
 			ExecReScanSeqScan((SeqScanState *) node);
 			break;
 
+		case T_PartialSeqScanState:
+			ExecReScanPartialSeqScan((PartialSeqScanState *) node);
+			break;
+
+		case T_FunnelState:
+			ExecReScanFunnel((FunnelState *) node);
+			break;
+
 		case T_IndexScanState:
 			ExecReScanIndexScan((IndexScanState *) node);
 			break;
@@ -458,6 +468,10 @@ ExecSupportsBackwardScan(Plan *node)
 		case T_CteScan:
 			return TargetListSupportsBackwardScan(node->targetlist);
 
+		case T_Funnel:
+		case T_PartialSeqScan:
+			return false;
+
 		case T_IndexScan:
 			return IndexSupportsBackwardScan(((IndexScan *) node)->indexid) &&
 				TargetListSupportsBackwardScan(node->targetlist);
diff --git a/src/backend/executor/execCurrent.c b/src/backend/executor/execCurrent.c
index d87be96..657b928 100644
--- a/src/backend/executor/execCurrent.c
+++ b/src/backend/executor/execCurrent.c
@@ -261,6 +261,8 @@ search_plan_tree(PlanState *node, Oid table_oid)
 			 * Relation scan nodes can all be treated alike
 			 */
 		case T_SeqScanState:
+		case T_PartialSeqScanState:
+		case T_FunnelState:
 		case T_IndexScanState:
 		case T_IndexOnlyScanState:
 		case T_BitmapHeapScanState:
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 143c56d..d4c9119 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -181,6 +181,8 @@ standard_ExecutorStart(QueryDesc *queryDesc, int eflags)
 		estate->es_param_exec_vals = (ParamExecData *)
 			palloc0(queryDesc->plannedstmt->nParamExec * sizeof(ParamExecData));
 
+	estate->toc = queryDesc->toc;
+
 	/*
 	 * If non-read-only query, set the command ID to mark output tuples with
 	 */
@@ -318,6 +320,9 @@ standard_ExecutorRun(QueryDesc *queryDesc,
 	operation = queryDesc->operation;
 	dest = queryDesc->dest;
 
+	/* inform executor to collect buffer usage stats from parallel workers. */
+	estate->total_time = queryDesc->totaltime ? 1 : 0;
+
 	/*
 	 * startup tuple receiver, if we will be emitting tuples
 	 */
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 9892499..1a1275c 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -100,6 +100,8 @@
 #include "executor/nodeMergejoin.h"
 #include "executor/nodeModifyTable.h"
 #include "executor/nodeNestloop.h"
+#include "executor/nodePartialSeqscan.h"
+#include "executor/nodeFunnel.h"
 #include "executor/nodeRecursiveunion.h"
 #include "executor/nodeResult.h"
 #include "executor/nodeSeqscan.h"
@@ -190,6 +192,16 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												   estate, eflags);
 			break;
 
+		case T_PartialSeqScan:
+			result = (PlanState *) ExecInitPartialSeqScan((PartialSeqScan *) node,
+														  estate, eflags);
+			break;
+
+		case T_Funnel:
+			result = (PlanState *) ExecInitFunnel((Funnel *) node,
+												  estate, eflags);
+			break;
+
 		case T_IndexScan:
 			result = (PlanState *) ExecInitIndexScan((IndexScan *) node,
 													 estate, eflags);
@@ -406,6 +418,14 @@ ExecProcNode(PlanState *node)
 			result = ExecSeqScan((SeqScanState *) node);
 			break;
 
+		case T_PartialSeqScanState:
+			result = ExecPartialSeqScan((PartialSeqScanState *) node);
+			break;
+
+		case T_FunnelState:
+			result = ExecFunnel((FunnelState *) node);
+			break;
+
 		case T_IndexScanState:
 			result = ExecIndexScan((IndexScanState *) node);
 			break;
@@ -644,6 +664,14 @@ ExecEndNode(PlanState *node)
 			ExecEndSeqScan((SeqScanState *) node);
 			break;
 
+		case T_PartialSeqScanState:
+			ExecEndPartialSeqScan((PartialSeqScanState *) node);
+			break;
+
+		case T_FunnelState:
+			ExecEndFunnel((FunnelState *) node);
+			break;
+
 		case T_IndexScanState:
 			ExecEndIndexScan((IndexScanState *) node);
 			break;
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index 022041b..79eeaee 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -145,6 +145,8 @@ CreateExecutorState(void)
 
 	estate->es_auxmodifytables = NIL;
 
+	estate->toc = NULL;
+
 	estate->es_per_tuple_exprcontext = NULL;
 
 	estate->es_epqTuple = NULL;
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index f5351eb..283a136 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -19,9 +19,6 @@
 
 BufferUsage pgBufferUsage;
 
-static void BufferUsageAccumDiff(BufferUsage *dst,
-					 const BufferUsage *add, const BufferUsage *sub);
-
 
 /* Allocate new instrumentation structure(s) */
 Instrumentation *
@@ -127,8 +124,30 @@ InstrEndLoop(Instrumentation *instr)
 	instr->tuplecount = 0;
 }
 
+/*
+ * Aggregate the instrumentation information.  This is used
+ * to aggregate the information of worker backends.  We only
+ * need to sum the buffer usage and tuple count statistics; for
+ * the timing-related statistics, the master backend's
+ * information is sufficient.
+ */
+void
+InstrAggNode(Instrumentation *instr1, Instrumentation *instr2)
+{
+	/* count the returned tuples */
+	instr1->tuplecount += instr2->tuplecount;
+
+	instr1->nfiltered1 += instr2->nfiltered1;
+	instr1->nfiltered2 += instr2->nfiltered2;
+
+	/* Add the worker's buffer usage to the node's totals */
+	if (instr1->need_bufusage)
+		BufferUsageAdd(&instr1->bufusage, &instr2->bufusage);
+
+}
+
 /* dst += add - sub */
-static void
+void
 BufferUsageAccumDiff(BufferUsage *dst,
 					 const BufferUsage *add,
 					 const BufferUsage *sub)
@@ -148,3 +167,21 @@ BufferUsageAccumDiff(BufferUsage *dst,
 	INSTR_TIME_ACCUM_DIFF(dst->blk_write_time,
 						  add->blk_write_time, sub->blk_write_time);
 }
+
+/* dst += add */
+void
+BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
+{
+	dst->shared_blks_hit += add->shared_blks_hit;
+	dst->shared_blks_read += add->shared_blks_read;
+	dst->shared_blks_dirtied += add->shared_blks_dirtied;
+	dst->shared_blks_written += add->shared_blks_written;
+	dst->local_blks_hit += add->local_blks_hit;
+	dst->local_blks_read += add->local_blks_read;
+	dst->local_blks_dirtied += add->local_blks_dirtied;
+	dst->local_blks_written += add->local_blks_written;
+	dst->temp_blks_read += add->temp_blks_read;
+	dst->temp_blks_written += add->temp_blks_written;
+	INSTR_TIME_ADD(dst->blk_read_time, add->blk_read_time);
+	INSTR_TIME_ADD(dst->blk_write_time, add->blk_write_time);
+}
diff --git a/src/backend/executor/nodeFunnel.c b/src/backend/executor/nodeFunnel.c
new file mode 100644
index 0000000..71d4e3f
--- /dev/null
+++ b/src/backend/executor/nodeFunnel.c
@@ -0,0 +1,354 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeFunnel.c
+ *	  Support routines for parallel sequential scans of relations.
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeFunnel.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * INTERFACE ROUTINES
+ *		ExecFunnel				scans a relation.
+ *		ExecInitFunnel			creates and initializes a funnel node.
+ *		ExecEndFunnel			releases any storage allocated.
+ *		ExecReScanFunnel		rescans a relation
+ */
+#include "postgres.h"
+
+#include "access/relscan.h"
+#include "executor/execdebug.h"
+#include "executor/nodeFunnel.h"
+#include "postmaster/backendworker.h"
+#include "utils/rel.h"
+
+
+static TupleTableSlot *funnel_getnext(FunnelState *funnelstate);
+
+/* ----------------------------------------------------------------
+ *						Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		InitFunnel
+ *
+ *		Set up parallel state information
+ * ----------------------------------------------------------------
+ */
+static void
+InitFunnel(FunnelState *node, EState *estate, int eflags)
+{
+	Relation	currentRelation;
+
+	/*
+	 * get the relation object id from the relid'th entry in the range table,
+	 * open that relation and acquire appropriate lock on it.
+	 */
+	currentRelation = ExecOpenScanRelation(estate,
+										   ((SeqScan *) node->ss.ps.plan)->scanrelid,
+										   eflags);
+
+	node->ss.ss_currentRelation = currentRelation;
+
+	/* and report the scan tuple slot's rowtype */
+	ExecAssignScanType(&node->ss, RelationGetDescr(currentRelation));
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitFunnel
+ * ----------------------------------------------------------------
+ */
+FunnelState *
+ExecInitFunnel(Funnel *node, EState *estate, int eflags)
+{
+	FunnelState *funnelstate;
+
+	/* A Funnel node doesn't have an innerPlan node. */
+	Assert(innerPlan(node) == NULL);
+
+	/*
+	 * create state structure
+	 */
+	funnelstate = makeNode(FunnelState);
+	funnelstate->ss.ps.plan = (Plan *) node;
+	funnelstate->ss.ps.state = estate;
+	funnelstate->fs_workersReady = false;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * create expression context for node
+	 */
+	ExecAssignExprContext(estate, &funnelstate->ss.ps);
+
+	/*
+	 * initialize child expressions
+	 */
+	funnelstate->ss.ps.targetlist = (List *)
+		ExecInitExpr((Expr *) node->scan.plan.targetlist,
+					 (PlanState *) funnelstate);
+	funnelstate->ss.ps.qual = (List *)
+		ExecInitExpr((Expr *) node->scan.plan.qual,
+					 (PlanState *) funnelstate);
+
+	/*
+	 * tuple table initialization
+	 */
+	ExecInitResultTupleSlot(estate, &funnelstate->ss.ps);
+	ExecInitScanTupleSlot(estate, &funnelstate->ss);
+
+	InitFunnel(funnelstate, estate, eflags);
+
+	/*
+	 * now initialize outer plan
+	 */
+	outerPlanState(funnelstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+
+	funnelstate->ss.ps.ps_TupFromTlist = false;
+
+	/*
+	 * Initialize result tuple type and projection info.
+	 */
+	ExecAssignResultTypeFromTL(&funnelstate->ss.ps);
+	ExecAssignScanProjectionInfo(&funnelstate->ss);
+
+	return funnelstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecFunnel(node)
+ *
+ *		Scans the relation via multiple workers and returns
+ *		the next qualifying tuple.
+ * ----------------------------------------------------------------
+ */
+TupleTableSlot *
+ExecFunnel(FunnelState *node)
+{
+	int			i;
+	TupleTableSlot *slot;
+
+	/*
+	 * Initialize the parallel context and workers on first execution.
+	 * We do this on first execution rather than during node initialization,
+	 * as it needs to allocate a large dynamic shared memory segment, so it is
+	 * better to do so only if it is really needed.
+	 */
+	if (!node->pcxt)
+	{
+		EState	   *estate = node->ss.ps.state;
+		bool any_worker_launched = false;
+
+		/* Initialize the workers required to perform parallel scan. */
+		InitializeParallelWorkers(node->ss.ps.plan->lefttree,
+								  estate,
+								  node->ss.ss_currentRelation,
+								  &node->inst_options_space,
+								  &node->buffer_usage_space,
+								  &node->responseq,
+								  &node->pcxt,
+								  ((Funnel *)(node->ss.ps.plan))->num_workers);
+
+		outerPlanState(node)->toc = node->pcxt->toc;
+
+		/*
+		 * Register backend workers. If the required number of workers is
+		 * not available, we perform the scan with the available workers;
+		 * if no workers are available at all, the funnel node will
+		 * just scan locally.
+		 */
+		LaunchParallelWorkers(node->pcxt);
+
+		node->funnel = CreateTupleQueueFunnel();
+
+		for (i = 0; i < node->pcxt->nworkers; ++i)
+		{
+			if (node->pcxt->worker[i].bgwhandle)
+			{
+				shm_mq_set_handle((node->responseq)[i], node->pcxt->worker[i].bgwhandle);
+				RegisterTupleQueueOnFunnel(node->funnel, (node->responseq)[i]);
+				any_worker_launched = true;
+			}
+		}
+
+		if (any_worker_launched)
+			node->fs_workersReady = true;
+	}
+	
+	slot = funnel_getnext(node);
+	
+	/*
+	 * if required by plugin, aggregate the buffer usage stats
+	 * from all workers.
+	 */
+	if (TupIsNull(slot))
+	{
+		int i;
+		int nworkers;
+		BufferUsage *buffer_usage_worker;
+		char *buffer_usage;
+
+		if (node->ss.ps.state->total_time)
+		{
+			nworkers = node->pcxt->nworkers;
+			buffer_usage = node->buffer_usage_space;
+
+			for (i = 0; i < nworkers; i++)
+			{
+				buffer_usage_worker = (BufferUsage *)(buffer_usage + (i * sizeof(BufferUsage)));
+				BufferUsageAdd(&pgBufferUsage, buffer_usage_worker);
+			}
+		}
+	}
+	return slot;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndFunnel
+ *
+ *		frees any storage allocated through C routines.
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndFunnel(FunnelState *node)
+{
+	Relation	relation;
+
+	relation = node->ss.ss_currentRelation;
+
+	/*
+	 * Free the exprcontext
+	 */
+	ExecFreeExprContext(&node->ss.ps);
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+
+	/*
+	 * close the heap relation.
+	 */
+	ExecCloseScanRelation(relation);
+
+	ExecEndNode(outerPlanState(node));
+
+	if (node->pcxt)
+	{
+		/*
+		 * Ensure all workers have finished before destroying the parallel
+		 * context to ensure a clean exit.
+		 */
+		if (node->fs_workersReady)
+			WaitForParallelWorkersToFinish(node->pcxt);
+
+		/* destroy the tuple queue */
+		DestroyTupleQueueFunnel(node->funnel);
+
+		/* destroy parallel context. */
+		DestroyParallelContext(node->pcxt);
+	}
+}
+
+/*
+ * funnel_getnext
+ *
+ *	Get the next tuple from the shared memory queues.  This function
+ *	is responsible for fetching tuples from all the queues associated
+ *	with the worker backends used in the funnel scan; if no data is
+ *	available from the queues, or no worker is available, it fetches
+ *	the data from the local node.
+ */
+TupleTableSlot *
+funnel_getnext(FunnelState *funnelstate)
+{
+	PlanState		*outerPlan;
+	TupleTableSlot	*outerTupleSlot;
+	TupleTableSlot	*slot;
+	HeapTuple		tup;
+
+	if (funnelstate->ss.ps.ps_ProjInfo)
+		slot = funnelstate->ss.ps.ps_ProjInfo->pi_slot;
+	else
+		slot = funnelstate->ss.ss_ScanTupleSlot;
+
+	while ((!funnelstate->all_workers_done  && funnelstate->fs_workersReady) ||
+			!funnelstate->local_scan_done)
+	{
+		if (!funnelstate->all_workers_done && funnelstate->fs_workersReady)
+		{
+			/* wait only if local scan is done */
+			tup = TupleQueueFunnelNext(funnelstate->funnel,
+									   !funnelstate->local_scan_done,
+									   &funnelstate->all_workers_done);
+
+			if (HeapTupleIsValid(tup))
+			{
+				ExecStoreTuple(tup,		/* tuple to store */
+							   slot,	/* slot to store in */
+							   InvalidBuffer, /* buffer associated with this
+											   * tuple */
+							   true);	/* pfree this pointer if not from heap */
+
+				return slot;
+			}
+		}
+		if (!funnelstate->local_scan_done)
+		{
+			outerPlan = outerPlanState(funnelstate);
+
+			outerTupleSlot = ExecProcNode(outerPlan);
+
+			if (!TupIsNull(outerTupleSlot))
+				return outerTupleSlot;
+
+			funnelstate->local_scan_done = true;
+		}
+	}
+
+	return ExecClearTuple(slot);
+}
+
+/* ----------------------------------------------------------------
+ *						Join Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecReScanFunnel
+ *
+ *		Rescans a relation.
+ * ----------------------------------------------------------------
+ */
+void
+ExecReScanFunnel(FunnelState *node)
+{
+	/*
+	 * Re-initialize the parallel context and workers to perform
+	 * rescan of relation.
+	 */
+	if (node->pcxt)
+	{
+		/* destroy the tuple queue */
+		DestroyTupleQueueFunnel(node->funnel);
+
+		/* destroy parallel context. */
+		DestroyParallelContext(node->pcxt);
+		node->pcxt = NULL;
+
+		node->fs_workersReady = false;
+		node->all_workers_done = false;
+		node->local_scan_done = false;
+	}
+
+	ExecReScan(node->ss.ps.lefttree);
+}
+
diff --git a/src/backend/executor/nodePartialSeqscan.c b/src/backend/executor/nodePartialSeqscan.c
new file mode 100644
index 0000000..99cd691
--- /dev/null
+++ b/src/backend/executor/nodePartialSeqscan.c
@@ -0,0 +1,319 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodePartialSeqscan.c
+ *	  Support routines for parallel sequential scans of relations.
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodePartialSeqscan.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * INTERFACE ROUTINES
+ *		ExecPartialSeqScan				scans a relation.
+ *		PartialSeqNext					retrieve next tuple from heap.
+ *		ExecInitPartialSeqScan			creates and initializes a partial seqscan node.
+ *		ExecEndPartialSeqScan			releases any storage allocated.
+ */
+#include "postgres.h"
+
+#include "access/relscan.h"
+#include "executor/execdebug.h"
+#include "executor/nodePartialSeqscan.h"
+#include "postmaster/backendworker.h"
+#include "utils/rel.h"
+
+
+
+/* ----------------------------------------------------------------
+ *						Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		PartialSeqNext
+ *
+ *		This is a workhorse for ExecPartialSeqScan
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+PartialSeqNext(PartialSeqScanState *node)
+{
+	HeapTuple	tuple;
+	HeapScanDesc scandesc;
+	EState	   *estate;
+	ScanDirection direction;
+	TupleTableSlot *slot;
+
+	/*
+	 * get information from the estate and scan state
+	 */
+	scandesc = node->ss.ss_currentScanDesc;
+	estate = node->ss.ps.state;
+	direction = estate->es_direction;
+	slot = node->ss.ss_ScanTupleSlot;
+
+	/*
+	 * get the next tuple from the table
+	 */
+	tuple = heap_getnext(scandesc, direction);
+
+	/*
+	 * save the tuple and the buffer returned to us by the access methods in
+	 * our scan tuple slot and return the slot.  Note: we pass 'false' because
+	 * tuples returned by heap_getnext() are pointers onto disk pages and were
+	 * not created with palloc() and so should not be pfree()'d.  Note also
+	 * that ExecStoreTuple will increment the refcount of the buffer; the
+	 * refcount will not be dropped until the tuple table slot is cleared.
+	 */
+	if (tuple)
+		ExecStoreTuple(tuple,	/* tuple to store */
+					   slot,	/* slot to store in */
+					   scandesc->rs_cbuf,		/* buffer associated with this
+												 * tuple */
+					   false);	/* don't pfree this pointer */
+	else
+		ExecClearTuple(slot);
+
+	return slot;
+}
+
+/*
+ * PartialSeqRecheck -- access method routine to recheck a tuple in EvalPlanQual
+ */
+static bool
+PartialSeqRecheck(PartialSeqScanState *node, TupleTableSlot *slot)
+{
+	/*
+	 * Note that unlike IndexScan, PartialSeqScan never uses keys in
+	 * heap_beginscan (and this is very bad), so here we do not check
+	 * whether the keys are ok.
+	 */
+	return true;
+}
+
+/* ----------------------------------------------------------------
+ *		InitPartialScanRelation
+ *
+ *		Set up to access the scan relation.
+ * ----------------------------------------------------------------
+ */
+static void
+InitPartialScanRelation(PartialSeqScanState *node, EState *estate, int eflags)
+{
+	Relation	currentRelation;
+
+	/*
+	 * get the relation object id from the relid'th entry in the range table,
+	 * open that relation and acquire appropriate lock on it.
+	 */
+	currentRelation = ExecOpenScanRelation(estate,
+										   ((Scan *) node->ss.ps.plan)->scanrelid,
+										   eflags);
+
+	/*
+	 * The parallel scan descriptor is initialized and stored in a dynamic
+	 * shared memory segment by the master backend, and parallel workers
+	 * retrieve it from there.  Workers receive the 'toc' (the place to look
+	 * up the parallel scan descriptor) via EState, whereas the master
+	 * backend stores it directly in the partial scan state node.
+	 */
+	if (estate->toc)
+		node->ss.ps.toc = estate->toc;
+
+	node->ss.ss_currentRelation = currentRelation;
+
+	/* and report the scan tuple slot's rowtype */
+	ExecAssignScanType(&node->ss, RelationGetDescr(currentRelation));
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitPartialSeqScan
+ * ----------------------------------------------------------------
+ */
+PartialSeqScanState *
+ExecInitPartialSeqScan(PartialSeqScan *node, EState *estate, int eflags)
+{
+	PartialSeqScanState *scanstate;
+
+	/*
+	 * Once upon a time it was possible to have an outerPlan of a SeqScan, but
+	 * not any more.
+	 */
+	Assert(outerPlan(node) == NULL);
+	Assert(innerPlan(node) == NULL);
+
+	/*
+	 * create state structure
+	 */
+	scanstate = makeNode(PartialSeqScanState);
+	scanstate->ss.ps.plan = (Plan *) node;
+	scanstate->ss.ps.state = estate;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * create expression context for node
+	 */
+	ExecAssignExprContext(estate, &scanstate->ss.ps);
+
+	/*
+	 * initialize child expressions
+	 */
+	scanstate->ss.ps.targetlist = (List *)
+		ExecInitExpr((Expr *) node->plan.targetlist,
+					 (PlanState *) scanstate);
+	scanstate->ss.ps.qual = (List *)
+		ExecInitExpr((Expr *) node->plan.qual,
+					 (PlanState *) scanstate);
+
+	/*
+	 * tuple table initialization
+	 */
+	ExecInitResultTupleSlot(estate, &scanstate->ss.ps);
+	ExecInitScanTupleSlot(estate, &scanstate->ss);
+
+	/*
+	 * initialize scan relation
+	 */
+	InitPartialScanRelation(scanstate, estate, eflags);
+
+	scanstate->ss.ps.ps_TupFromTlist = false;
+
+	/*
+	 * Initialize result tuple type and projection info.
+	 */
+	ExecAssignResultTypeFromTL(&scanstate->ss.ps);
+	ExecAssignScanProjectionInfo(&scanstate->ss);
+
+	return scanstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecPartialSeqScan(node)
+ *
+ *		Scans the relation and returns the next qualifying tuple.
+ *		We call the ExecScan() routine and pass it the appropriate
+ *		access method functions.
+ * ----------------------------------------------------------------
+ */
+TupleTableSlot *
+ExecPartialSeqScan(PartialSeqScanState *node)
+{
+	/*
+	 * Initialize the scan on first execution.  Normally this happens in
+	 * the ExecutorStart phase, but this node needs a ParallelHeapScanDesc
+	 * to begin its scan, and the Funnel node sets that descriptor up only
+	 * during the ExecutorRun phase.
+	 */
+	if (!node->scan_initialized)
+	{
+		ParallelHeapScanDesc pscan;
+
+		/*
+		 * The parallel scan descriptor is initialized and stored in a dynamic
+		 * shared memory segment by the master backend; the parallel workers
+		 * and the master's local scan both retrieve it from shared memory.
+		 * If a heap scan descriptor already exists at this point, this is a
+		 * rescan, so re-initialize it instead of beginning a new scan.
+		 */
+		Assert(node->ss.ps.toc);
+
+		pscan = shm_toc_lookup(node->ss.ps.toc, PARALLEL_KEY_SCAN);
+
+		if (!node->ss.ss_currentScanDesc)
+		{
+			node->ss.ss_currentScanDesc =
+				heap_beginscan_parallel(node->ss.ss_currentRelation, pscan);
+		}
+		else
+		{
+			heap_parallel_rescan(pscan, node->ss.ss_currentScanDesc);
+		}
+
+		node->scan_initialized = true;
+	}
+
+	return ExecScan((ScanState *) node,
+					(ExecScanAccessMtd) PartialSeqNext,
+					(ExecScanRecheckMtd) PartialSeqRecheck);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndPartialSeqScan
+ *
+ *		frees any storage allocated through C routines.
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndPartialSeqScan(PartialSeqScanState *node)
+{
+	Relation	relation;
+	HeapScanDesc scanDesc;
+
+	/*
+	 * get information from node
+	 */
+	relation = node->ss.ss_currentRelation;
+	scanDesc = node->ss.ss_currentScanDesc;
+
+	/*
+	 * Free the exprcontext
+	 */
+	ExecFreeExprContext(&node->ss.ps);
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+
+	/*
+	 * close heap scan
+	 */
+	if (scanDesc)
+		heap_endscan(scanDesc);
+
+	/*
+	 * close the heap relation.
+	 */
+	ExecCloseScanRelation(relation);
+}
+
+/* ----------------------------------------------------------------
+ *						Join Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecReScanPartialSeqScan
+ *
+ *		Rescans the relation.
+ * ----------------------------------------------------------------
+ */
+void
+ExecReScanPartialSeqScan(PartialSeqScanState *node)
+{
+	if (node->scan_initialized)
+	{
+		/*HeapScanDesc scan;
+		ParallelHeapScanDesc pscan;
+		EState	   *estate = node->ss.ps.state;
+
+		Assert(estate->toc);
+
+		pscan = shm_toc_lookup(estate->toc, PARALLEL_KEY_SCAN);
+
+		scan = node->ss.ss_currentScanDesc;
+
+		heap_parallel_rescan(pscan, scan);*/
+
+		node->scan_initialized = false;
+	}
+
+	ExecScanReScan((ScanState *) node);
+}
diff --git a/src/backend/executor/tqueue.c b/src/backend/executor/tqueue.c
new file mode 100644
index 0000000..e4933e6
--- /dev/null
+++ b/src/backend/executor/tqueue.c
@@ -0,0 +1,280 @@
+/*-------------------------------------------------------------------------
+ *
+ * tqueue.c
+ *	  Use shm_mq to send & receive tuples between parallel backends
+ *
+ * A DestReceiver of type DestTupleQueue, which is a TQueueDestReceiver
+ * under the hood, writes tuples from the executor to a shm_mq.
+ *
+ * A TupleQueueFunnel helps manage the process of reading tuples from
+ * one or more shm_mq objects being used as tuple queues.
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/tqueue.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/tqueue.h"
+#include "miscadmin.h"
+
+typedef struct
+{
+	DestReceiver pub;
+	shm_mq_handle *handle;
+} TQueueDestReceiver;
+
+struct TupleQueueFunnel
+{
+	int		nqueues;
+	int		maxqueues;
+	int		nextqueue;
+	shm_mq_handle **queue;
+};
+
+/*
+ * Receive a tuple.
+ */
+static void
+tqueueReceiveSlot(TupleTableSlot *slot, DestReceiver *self)
+{
+	TQueueDestReceiver *tqueue = (TQueueDestReceiver *) self;
+	HeapTuple	tuple;
+	shm_mq_result	result;
+
+	tuple = ExecMaterializeSlot(slot);
+	result = shm_mq_send(tqueue->handle, tuple->t_len, tuple->t_data, false);
+
+	if (result != SHM_MQ_SUCCESS)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("unable to send tuples")));
+}
+
+/*
+ * Prepare to receive tuples from executor.
+ */
+static void
+tqueueStartupReceiver(DestReceiver *self, int operation, TupleDesc typeinfo)
+{
+	/* do nothing */
+}
+
+/*
+ * Clean up at end of an executor run
+ */
+static void
+tqueueShutdownReceiver(DestReceiver *self)
+{
+	/* do nothing */
+}
+
+/*
+ * Destroy receiver when done with it
+ */
+static void
+tqueueDestroyReceiver(DestReceiver *self)
+{
+	pfree(self);
+}
+
+/*
+ * Create a DestReceiver that writes tuples to a tuple queue.
+ */
+DestReceiver *
+CreateTupleQueueDestReceiver(void)
+{
+	TQueueDestReceiver *self;
+
+	self = (TQueueDestReceiver *) palloc0(sizeof(TQueueDestReceiver));
+
+	self->pub.receiveSlot = tqueueReceiveSlot;
+	self->pub.rStartup = tqueueStartupReceiver;
+	self->pub.rShutdown = tqueueShutdownReceiver;
+	self->pub.rDestroy = tqueueDestroyReceiver;
+	self->pub.mydest = DestTupleQueue;
+
+	/* private fields will be set by SetTupleQueueDestReceiverParams */
+
+	return (DestReceiver *) self;
+}
+
+/*
+ * Set parameters for a TupleQueueDestReceiver
+ */
+void
+SetTupleQueueDestReceiverParams(DestReceiver *self,
+								shm_mq_handle *handle)
+{
+	TQueueDestReceiver *myState = (TQueueDestReceiver *) self;
+
+	myState->handle = handle;
+}
+
+/*
+ * Create a tuple queue funnel.
+ */
+TupleQueueFunnel *
+CreateTupleQueueFunnel(void)
+{
+	TupleQueueFunnel *funnel = palloc0(sizeof(TupleQueueFunnel));
+
+	funnel->maxqueues = 8;
+	funnel->queue = palloc(funnel->maxqueues * sizeof(shm_mq_handle *));
+
+	return funnel;
+}
+
+/*
+ * Destroy a tuple queue funnel.
+ */
+void
+DestroyTupleQueueFunnel(TupleQueueFunnel *funnel)
+{
+	if (funnel)
+	{
+		pfree(funnel->queue);
+		pfree(funnel);
+	}
+}
+
+/*
+ * Remember the shared memory queue handle in funnel.
+ */
+void
+RegisterTupleQueueOnFunnel(TupleQueueFunnel *funnel, shm_mq_handle *handle)
+{
+	if (funnel->nqueues < funnel->maxqueues)
+	{
+		funnel->queue[funnel->nqueues++] = handle;
+		return;
+	}
+
+	if (funnel->nqueues >= funnel->maxqueues)
+	{
+		int newsize = funnel->nqueues * 2;
+
+		Assert(funnel->nqueues == funnel->maxqueues);
+
+		funnel->queue = repalloc(funnel->queue,
+								 newsize * sizeof(shm_mq_handle *));
+		funnel->maxqueues = newsize;
+	}
+
+	funnel->queue[funnel->nqueues++] = handle;
+}
+
+/*
+ * Fetch a tuple from a tuple queue funnel.
+ *
+ * We try to read from the queues in round-robin fashion so as to avoid
+ * the situation where some workers get their tuples read expediently while
+ * others are barely ever serviced.
+ *
+ * Even when nowait = false, we read from the individual queues in
+ * non-blocking mode.  Even when shm_mq_receive() returns SHM_MQ_WOULD_BLOCK,
+ * it can still accumulate bytes from a partially-read message, so doing it
+ * this way should outperform doing a blocking read on each queue in turn.
+ *
+ * The return value is NULL if there are no remaining queues or if
+ * nowait = true and no queue returned a tuple without blocking.  *done, if
+ * not NULL, is set to true when there are no remaining queues and false in
+ * any other case.
+ */
+HeapTuple
+TupleQueueFunnelNext(TupleQueueFunnel *funnel, bool nowait, bool *done)
+{
+	int	waitpos = funnel->nextqueue;
+
+	/* Corner case: called before adding any queues, or after all are gone. */
+	if (funnel->nqueues == 0)
+	{
+		if (done != NULL)
+			*done = true;
+		return NULL;
+	}
+
+	if (done != NULL)
+		*done = false;
+
+	for (;;)
+	{
+		shm_mq_handle *mqh = funnel->queue[funnel->nextqueue];
+		shm_mq_result result;
+		Size	nbytes;
+		void   *data;
+
+		/* Attempt to read a message. */
+		result = shm_mq_receive(mqh, &nbytes, &data, true);
+
+		/*
+		 * Normally, we advance funnel->nextqueue to the next queue at this
+		 * point, but if we're pointing to a queue that we've just discovered
+		 * is detached, then forget that queue and leave the pointer where it
+		 * is until the number of remaining queues falls below that pointer and
+		 * at that point make the pointer point to the first queue.
+		 */
+		if (result != SHM_MQ_DETACHED)
+			funnel->nextqueue = (funnel->nextqueue + 1) % funnel->nqueues;
+		else
+		{
+			--funnel->nqueues;
+			if (funnel->nqueues == 0)
+			{
+				if (done != NULL)
+					*done = true;
+				return NULL;
+			}
+
+			memmove(&funnel->queue[funnel->nextqueue],
+					&funnel->queue[funnel->nextqueue + 1],
+					sizeof(shm_mq_handle *)
+						* (funnel->nqueues - funnel->nextqueue));
+
+			if (funnel->nextqueue >= funnel->nqueues)
+				funnel->nextqueue = 0;
+
+			if (funnel->nextqueue < waitpos)
+				--waitpos;
+
+			continue;
+		}
+
+		/* If we got a message, return it. */
+		if (result == SHM_MQ_SUCCESS)
+		{
+			HeapTupleData htup;
+
+			/*
+			 * The tuple data we just read from the queue is only valid
+			 * until we again attempt to read from it.  Copy the tuple into
+			 * a single palloc'd chunk as callers will expect.
+			 */
+			ItemPointerSetInvalid(&htup.t_self);
+			htup.t_tableOid = InvalidOid;
+			htup.t_len = nbytes;
+			htup.t_data = data;
+			return heap_copytuple(&htup);
+		}
+
+		/*
+		 * If we've visited all of the queues, then we should either give up
+		 * and return NULL (if we're in non-blocking mode) or wait for the
+		 * process latch to be set (otherwise).
+		 */
+		if (funnel->nextqueue == waitpos)
+		{
+			if (nowait)
+				return NULL;
+			WaitLatch(MyLatch, WL_LATCH_SET, 0);
+			CHECK_FOR_INTERRUPTS();
+			ResetLatch(MyLatch);
+		}
+	}
+}
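
For readers following the round-robin logic in TupleQueueFunnelNext above, here is a minimal standalone sketch of the same control flow. It is not the patch's code: ToyQueue, ToyFunnel, toy_receive and toy_funnel_next are hypothetical stand-ins for shm_mq_handle, TupleQueueFunnel, shm_mq_receive and the function itself. Tuples are plain ints, and only the nowait = true behaviour is modelled (no latch wait).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the shm_mq result codes; not PostgreSQL's. */
typedef enum { Q_SUCCESS, Q_WOULD_BLOCK, Q_DETACHED } QResult;

/* A toy queue: a fixed array of ints plus a read cursor. */
typedef struct
{
    const int  *items;
    int         nitems;
    int         pos;
    int         detached_when_empty;    /* 1 once the sender has exited */
} ToyQueue;

typedef struct
{
    ToyQueue   *queue[8];
    int         nqueues;
    int         nextqueue;
} ToyFunnel;

/* Non-blocking read from a single toy queue. */
static QResult
toy_receive(ToyQueue *q, int *out)
{
    if (q->pos < q->nitems)
    {
        *out = q->items[q->pos++];
        return Q_SUCCESS;
    }
    return q->detached_when_empty ? Q_DETACHED : Q_WOULD_BLOCK;
}

/*
 * Returns 1 and stores a value in *out if some queue had data.
 * Returns 0 otherwise; *done is set to 1 only when no queues remain.
 */
static int
toy_funnel_next(ToyFunnel *f, int *out, int *done)
{
    int         waitpos = f->nextqueue;

    *done = 0;
    if (f->nqueues == 0)
    {
        *done = 1;
        return 0;
    }

    for (;;)
    {
        QResult     r = toy_receive(f->queue[f->nextqueue], out);

        if (r == Q_DETACHED)
        {
            /* Forget this queue and close the gap in the array. */
            if (--f->nqueues == 0)
            {
                *done = 1;
                return 0;
            }
            memmove(&f->queue[f->nextqueue], &f->queue[f->nextqueue + 1],
                    sizeof(ToyQueue *) * (f->nqueues - f->nextqueue));
            if (f->nextqueue >= f->nqueues)
                f->nextqueue = 0;
            if (f->nextqueue < waitpos)
                --waitpos;
            continue;
        }

        /* Advance first, so the next call starts at a different queue. */
        f->nextqueue = (f->nextqueue + 1) % f->nqueues;

        if (r == Q_SUCCESS)
            return 1;
        if (f->nextqueue == waitpos)
            return 0;           /* visited every queue; would block */
    }
}
```

With three queues holding {1, 2}, {10} and nothing, successive calls return 1, 10, 2, so no single worker's queue is drained ahead of the others. A queue is dropped from the array only when it is found both empty and detached.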
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index d8c9a0e..3c0123a 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -355,6 +355,43 @@ _copySeqScan(const SeqScan *from)
 }
 
 /*
+ * _copyPartialSeqScan
+ */
+static PartialSeqScan *
+_copyPartialSeqScan(const PartialSeqScan *from)
+{
+	PartialSeqScan    *newnode = makeNode(PartialSeqScan);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopyScanFields((const Scan *) from, (Scan *) newnode);
+
+	return newnode;
+}
+
+/*
+ * _copyFunnel
+ */
+static Funnel *
+_copyFunnel(const Funnel *from)
+{
+	Funnel    *newnode = makeNode(Funnel);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopyScanFields((const Scan *) from, (Scan *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(num_workers);
+
+	return newnode;
+}
+
+/*
  * _copyIndexScan
  */
 static IndexScan *
@@ -4049,6 +4086,12 @@ copyObject(const void *from)
 		case T_SeqScan:
 			retval = _copySeqScan(from);
 			break;
+		case T_PartialSeqScan:
+			retval = _copyPartialSeqScan(from);
+			break;
+		case T_Funnel:
+			retval = _copyFunnel(from);
+			break;
 		case T_IndexScan:
 			retval = _copyIndexScan(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 1aa1f55..05d4b3c 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -440,6 +440,24 @@ _outSeqScan(StringInfo str, const SeqScan *node)
 }
 
 static void
+_outPartialSeqScan(StringInfo str, const PartialSeqScan *node)
+{
+	WRITE_NODE_TYPE("PARTIALSEQSCAN");
+
+	_outScanInfo(str, (const Scan *) node);
+}
+
+static void
+_outFunnel(StringInfo str, const Funnel *node)
+{
+	WRITE_NODE_TYPE("FUNNEL");
+
+	_outScanInfo(str, (const Scan *) node);
+
+	WRITE_UINT_FIELD(num_workers);
+}
+
+static void
 _outIndexScan(StringInfo str, const IndexScan *node)
 {
 	WRITE_NODE_TYPE("INDEXSCAN");
@@ -2898,6 +2916,12 @@ _outNode(StringInfo str, const void *obj)
 			case T_SeqScan:
 				_outSeqScan(str, obj);
 				break;
+			case T_PartialSeqScan:
+				_outPartialSeqScan(str, obj);
+				break;
+			case T_Funnel:
+				_outFunnel(str, obj);
+				break;
 			case T_IndexScan:
 				_outIndexScan(str, obj);
 				break;
diff --git a/src/backend/nodes/params.c b/src/backend/nodes/params.c
index fb803f8..6b633d6 100644
--- a/src/backend/nodes/params.c
+++ b/src/backend/nodes/params.c
@@ -16,9 +16,22 @@
 #include "postgres.h"
 
 #include "nodes/params.h"
+#include "storage/shmem.h"
 #include "utils/datum.h"
 #include "utils/lsyscache.h"
 
+/*
+ * For each bind parameter we store this structure, followed by the
+ * parameter's value unless the parameter is pass-by-value.
+ */
+typedef struct SerializedParamExternData
+{
+	Datum		value;			/* pass-by-value datums are stored directly */
+	Size		length;			/* length of parameter value */
+	bool		isnull;			/* is it NULL? */
+	uint16		pflags;			/* flag bits, see params.h */
+	Oid			ptype;			/* parameter's datatype, or 0 */
+} SerializedParamExternData;
 
 /*
  * Copy a ParamListInfo structure.
@@ -73,3 +86,186 @@ copyParamList(ParamListInfo from)
 
 	return retval;
 }
+
+/*
+ * Estimate the amount of space required to serialize the bound
+ * parameters.
+ */
+Size
+EstimateBoundParametersSpace(ParamListInfo paramInfo)
+{
+	Size		size;
+	int			i;
+
+	/* Add space required for saving numParams */
+	size = sizeof(int);
+
+	if (paramInfo)
+	{
+		/* Add space required for saving the param data */
+		for (i = 0; i < paramInfo->numParams; i++)
+		{
+			/*
+			 * for each parameter, calculate the size of fixed part
+			 * of parameter (SerializedParamExternData) and length of
+			 * parameter value.
+			 */
+			ParamExternData *oprm;
+			int16		typLen;
+			bool		typByVal;
+			Size		length;
+
+			length = sizeof(SerializedParamExternData);
+
+			oprm = &paramInfo->params[i];
+
+			get_typlenbyval(oprm->ptype, &typLen, &typByVal);
+
+			/*
+			 * pass-by-value parameters are directly stored in
+			 * SerializedParamExternData, so no need of additional
+			 * space for them.
+			 */
+			if (!(typByVal || oprm->isnull))
+			{
+				length += datumGetSize(oprm->value, typByVal, typLen);
+				size = add_size(size, length);
+
+				/* Allow space for terminating zero-byte */
+				size = add_size(size, 1);
+			}
+			else
+				size = add_size(size, length);
+		}
+	}
+
+	return size;
+}
+
+/*
+ * Serialize the bind parameters into the memory, beginning at start_address.
+ * maxsize should be at least as large as the value returned by
+ * EstimateBoundParametersSpace.
+ */
+void
+SerializeBoundParams(ParamListInfo paramInfo, Size maxsize, char *start_address)
+{
+	char	   *curptr;
+	SerializedParamExternData *retval;
+	int i;
+
+	/*
+	 * First, store the number of bind parameters; if there are none,
+	 * there is no need to store any more information.
+	 */
+	if (paramInfo && paramInfo->numParams > 0)
+		* (int *) start_address = paramInfo->numParams;
+	else
+	{
+		* (int *) start_address = 0;
+		return;
+	}
+	curptr = start_address + sizeof(int);
+
+
+	for (i = 0; i < paramInfo->numParams; i++)
+	{
+		ParamExternData *oprm;
+		int16		typLen;
+		bool		typByVal;
+		Size		datumlength, length;
+		const char	*s;
+
+		Assert(curptr <= start_address + maxsize);
+		retval = (SerializedParamExternData *) curptr;
+		oprm = &paramInfo->params[i];
+
+		retval->isnull = oprm->isnull;
+		retval->pflags = oprm->pflags;
+		retval->ptype = oprm->ptype;
+		retval->value = oprm->value;
+
+		curptr = curptr + sizeof(SerializedParamExternData);
+
+		if (retval->isnull)
+			continue;
+
+		get_typlenbyval(oprm->ptype, &typLen, &typByVal);
+
+		if (!typByVal)
+		{
+			datumlength = datumGetSize(oprm->value, typByVal, typLen);
+			s = (char *) DatumGetPointer(oprm->value);
+			memcpy(curptr, s, datumlength);
+			length = datumlength;
+			curptr[length] = '\0';
+			retval->length = length;
+			curptr += length + 1;
+		}
+	}
+}
+
+/*
+ * RestoreBoundParams
+ *		Restore bind parameters from the specified address.
+ *
+ * The params are palloc'd in CurrentMemoryContext.
+ */
+ParamListInfo
+RestoreBoundParams(char *start_address)
+{
+	ParamListInfo retval;
+	Size		size;
+	int			num_params,i;
+	char	   *curptr;
+
+	num_params = * (int *) start_address;
+
+	if (num_params <= 0)
+		return NULL;
+
+	size = offsetof(ParamListInfoData, params) +
+						num_params * sizeof(ParamExternData);
+	retval = (ParamListInfo) palloc(size);
+	retval->paramFetch = NULL;
+	retval->paramFetchArg = NULL;
+	retval->parserSetup = NULL;
+	retval->parserSetupArg = NULL;
+	retval->numParams = num_params;
+
+	curptr = start_address + sizeof(int);
+
+	for (i = 0; i < num_params; i++)
+	{
+		SerializedParamExternData *nprm;
+		char	*s;
+		int16		typLen;
+		bool		typByVal;
+
+		nprm = (SerializedParamExternData *) curptr;
+
+		/* copy the parameter info */
+		retval->params[i].isnull = nprm->isnull;
+		retval->params[i].pflags = nprm->pflags;
+		retval->params[i].ptype = nprm->ptype;
+		retval->params[i].value = nprm->value;
+
+		curptr = curptr + sizeof(SerializedParamExternData);
+
+		if (nprm->isnull)
+			continue;
+
+		get_typlenbyval(nprm->ptype, &typLen, &typByVal);
+
+		if (!typByVal)
+		{
+			s = palloc(nprm->length + 1);
+			memcpy(s, curptr, nprm->length + 1);
+			retval->params[i].value = PointerGetDatum(s);
+
+			curptr += nprm->length + 1;
+		}
+	}
+
+	return retval;
+}
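
The layout produced by EstimateBoundParametersSpace, SerializeBoundParams and RestoreBoundParams above is a count, then one fixed-size header per parameter, with the value bytes plus a terminating zero byte appended only for non-NULL pass-by-reference values. Below is a rough standalone sketch of that layout, not the patch's code. ToyHeader, ToyParam and the toy_* functions are hypothetical, and every parameter is treated as pass-by-reference. The sketch copies headers with memcpy so that a header following an odd-length value is never read through a misaligned pointer; that is a choice made for this sketch, whereas the patch casts the cursor directly.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical fixed-size part, loosely modelled on SerializedParamExternData. */
typedef struct
{
    size_t      length;         /* payload bytes; 0 for NULL */
    int         isnull;
} ToyHeader;

/* Hypothetical input parameter; data == NULL stands for SQL NULL. */
typedef struct
{
    const char *data;
    size_t      len;
} ToyParam;

/* Space needed: count, then a header per param, plus payload and a zero byte. */
static size_t
toy_estimate(const ToyParam *p, int n)
{
    size_t      size = sizeof(int);

    for (int i = 0; i < n; i++)
    {
        size += sizeof(ToyHeader);
        if (p[i].data != NULL)
            size += p[i].len + 1;
    }
    return size;
}

/* Write the parameters into buf, which must hold toy_estimate() bytes. */
static void
toy_serialize(const ToyParam *p, int n, char *buf)
{
    char       *cur = buf;

    memcpy(cur, &n, sizeof(int));
    cur += sizeof(int);

    for (int i = 0; i < n; i++)
    {
        ToyHeader   h;

        memset(&h, 0, sizeof(h));
        h.isnull = (p[i].data == NULL);
        h.length = h.isnull ? 0 : p[i].len;
        memcpy(cur, &h, sizeof(h));
        cur += sizeof(h);

        if (!h.isnull)
        {
            memcpy(cur, p[i].data, h.length);
            cur[h.length] = '\0';
            cur += h.length + 1;
        }
    }
}

/* Return a pointer to parameter idx inside buf, or NULL for SQL NULL. */
static const char *
toy_restore(const char *buf, int idx, size_t *len)
{
    const char *cur = buf;
    int         n;

    memcpy(&n, cur, sizeof(int));
    cur += sizeof(int);

    for (int i = 0; i < n; i++)
    {
        ToyHeader   h;

        memcpy(&h, cur, sizeof(h));
        cur += sizeof(h);
        if (i == idx)
        {
            *len = h.length;
            return h.isnull ? NULL : cur;
        }
        if (!h.isnull)
            cur += h.length + 1;
    }
    *len = 0;
    return NULL;
}
```

The estimate and the writer have to agree on exactly which parameters carry a payload, which is why both test the same NULL condition.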
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 563209c..d4570f2 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1280,6 +1280,92 @@ _readRangeTblFunction(void)
 	READ_DONE();
 }
 
+/*
+ * _readPlanInvalItem
+ */
+static PlanInvalItem *
+_readPlanInvalItem(void)
+{
+	READ_LOCALS(PlanInvalItem);
+
+	READ_INT_FIELD(cacheId);
+	READ_UINT_FIELD(hashValue);
+
+	READ_DONE();
+}
+
+/*
+ * _readPlannedStmt
+ */
+static PlannedStmt *
+_readPlannedStmt(void)
+{
+	READ_LOCALS(PlannedStmt);
+
+	READ_ENUM_FIELD(commandType, CmdType);
+	READ_UINT_FIELD(queryId);
+	READ_BOOL_FIELD(hasReturning);
+	READ_BOOL_FIELD(hasModifyingCTE);
+	READ_BOOL_FIELD(canSetTag);
+	READ_BOOL_FIELD(transientPlan);
+	READ_NODE_FIELD(planTree);
+	READ_NODE_FIELD(rtable);
+	READ_NODE_FIELD(resultRelations);
+	READ_NODE_FIELD(utilityStmt);
+	READ_NODE_FIELD(subplans);
+	READ_BITMAPSET_FIELD(rewindPlanIDs);
+	READ_NODE_FIELD(rowMarks);
+	READ_NODE_FIELD(relationOids);
+	READ_NODE_FIELD(invalItems);
+	READ_INT_FIELD(nParamExec);
+	READ_BOOL_FIELD(hasRowSecurity);
+	READ_BOOL_FIELD(parallelModeNeeded);
+
+	READ_DONE();
+}
+
+static Plan *
+_readPlan(void)
+{
+	READ_LOCALS(Plan);
+
+	READ_FLOAT_FIELD(startup_cost);
+	READ_FLOAT_FIELD(total_cost);
+	READ_FLOAT_FIELD(plan_rows);
+	READ_INT_FIELD(plan_width);
+	READ_NODE_FIELD(targetlist);
+	READ_NODE_FIELD(qual);
+	READ_NODE_FIELD(lefttree);
+	READ_NODE_FIELD(righttree);
+	READ_NODE_FIELD(initPlan);
+	READ_BITMAPSET_FIELD(extParam);
+	READ_BITMAPSET_FIELD(allParam);
+
+	READ_DONE();
+}
+
+static Scan *
+_readScan(void)
+{
+	Plan *local_plan;
+	READ_LOCALS(PartialSeqScan);
+
+	local_plan = _readPlan();
+	local_node->plan.startup_cost = local_plan->startup_cost;
+	local_node->plan.total_cost = local_plan->total_cost;
+	local_node->plan.plan_rows = local_plan->plan_rows;
+	local_node->plan.plan_width = local_plan->plan_width;
+	local_node->plan.targetlist = local_plan->targetlist;
+	local_node->plan.qual = local_plan->qual;
+	local_node->plan.lefttree = local_plan->lefttree;
+	local_node->plan.righttree = local_plan->righttree;
+	local_node->plan.initPlan = local_plan->initPlan;
+	local_node->plan.extParam = local_plan->extParam;
+	local_node->plan.allParam = local_plan->allParam;
+	READ_UINT_FIELD(scanrelid);
+
+	READ_DONE();
+}
 
 /*
  * parseNodeString
@@ -1409,6 +1495,12 @@ parseNodeString(void)
 		return_value = _readNotifyStmt();
 	else if (MATCH("DECLARECURSOR", 13))
 		return_value = _readDeclareCursorStmt();
+	else if (MATCH("PLANINVALITEM", 13))
+		return_value = _readPlanInvalItem();
+	else if (MATCH("PLANNEDSTMT", 11))
+		return_value = _readPlannedStmt();
+	else if (MATCH("PARTIALSEQSCAN", 14))
+		return_value = _readScan();
 	else
 	{
 		elog(ERROR, "badly formatted node string \"%.32s\"...", token);
diff --git a/src/backend/optimizer/path/Makefile b/src/backend/optimizer/path/Makefile
index 6864a62..6e462b1 100644
--- a/src/backend/optimizer/path/Makefile
+++ b/src/backend/optimizer/path/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = allpaths.o clausesel.o costsize.o equivclass.o indxpath.o \
-       joinpath.o joinrels.o pathkeys.o tidpath.o
+       joinpath.o joinrels.o pathkeys.o parallelpath.o tidpath.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 58d78e6..528727c 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -410,6 +410,9 @@ set_plain_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
 	/* Consider sequential scan */
 	add_path(rel, create_seqscan_path(root, rel, required_outer));
 
+	/* Consider parallel scans */
+	create_parallelscan_paths(root, rel);
+
 	/* Consider index scans */
 	create_index_paths(root, rel);
 
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 1a0d358..874c272 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -11,6 +11,9 @@
  *	cpu_tuple_cost		Cost of typical CPU time to process a tuple
  *	cpu_index_tuple_cost  Cost of typical CPU time to process an index tuple
  *	cpu_operator_cost	Cost of CPU time to execute an operator or function
+ *	cpu_tuple_comm_cost	Cost of CPU time to pass a tuple from worker to master backend
+ *	parallel_setup_cost	Cost of setting up shared memory for parallelism
+ *	parallel_startup_cost	Cost of starting up parallel workers
  *
  * We expect that the kernel will typically do some amount of read-ahead
  * optimization; this in conjunction with seek costs means that seq_page_cost
@@ -101,11 +104,16 @@ double		random_page_cost = DEFAULT_RANDOM_PAGE_COST;
 double		cpu_tuple_cost = DEFAULT_CPU_TUPLE_COST;
 double		cpu_index_tuple_cost = DEFAULT_CPU_INDEX_TUPLE_COST;
 double		cpu_operator_cost = DEFAULT_CPU_OPERATOR_COST;
+double		cpu_tuple_comm_cost = DEFAULT_CPU_TUPLE_COMM_COST;
+double		parallel_setup_cost = DEFAULT_PARALLEL_SETUP_COST;
+double		parallel_startup_cost = DEFAULT_PARALLEL_STARTUP_COST;
 
 int			effective_cache_size = DEFAULT_EFFECTIVE_CACHE_SIZE;
 
 Cost		disable_cost = 1.0e10;
 
+int			parallel_seqscan_degree = 0;
+
 bool		enable_seqscan = true;
 bool		enable_indexscan = true;
 bool		enable_indexonlyscan = true;
@@ -220,6 +228,55 @@ cost_seqscan(Path *path, PlannerInfo *root,
 }
 
 /*
+ * cost_funnel
+ *	  Determines and returns the cost of scanning a relation in parallel.
+ *
+ * 'baserel' is the relation to be scanned
+ * 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
+ */
+void
+cost_funnel(FunnelPath *path, PlannerInfo *root,
+			RelOptInfo *baserel, ParamPathInfo *param_info,
+			int nWorkers)
+{
+	Cost		startup_cost = 0;
+	Cost		run_cost = 0;
+
+	/* Should only be applied to base relations */
+	Assert(baserel->relid > 0);
+	Assert(baserel->rtekind == RTE_RELATION);
+
+	/* Mark the path with the correct row estimate */
+	if (param_info)
+		path->path.rows = param_info->ppi_rows;
+	else
+		path->path.rows = baserel->rows;
+
+	startup_cost = path->subpath->startup_cost;
+
+	run_cost = path->subpath->total_cost - path->subpath->startup_cost;
+
+	/*
+	 * The run cost is shared equally by all workers and the
+	 * master backend (hence nWorkers + 1).  The assumption here is
+	 * that disk access cost is also shared equally, which is
+	 * generally true unless too many workers are working on a
+	 * relatively small number of blocks.  If we come across such
+	 * a case, we can think of changing the current cost model for
+	 * parallel sequential scan.
+	 */
+	run_cost = run_cost / (nWorkers + 1);
+
+	/* Parallel setup and communication cost. */
+	startup_cost += parallel_setup_cost;
+	startup_cost += parallel_startup_cost * nWorkers;
+	run_cost += cpu_tuple_comm_cost * baserel->tuples;
+
+	path->path.startup_cost = startup_cost;
+	path->path.total_cost = (startup_cost + run_cost);
+}
+
+/*
  * cost_index
  *	  Determines and returns the cost of scanning a relation using an index.
  *
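
To make the arithmetic in cost_funnel above easier to check by hand, here is a small standalone sketch of the same formula. ToyCost and toy_cost_funnel are hypothetical names, and the three cost parameters are passed in explicitly because the DEFAULT_* values are not visible in this part of the patch. The numbers in the worked example are made up for illustration only.

```c
#include <assert.h>

/* Hypothetical pair of costs, standing in for the Path cost fields. */
typedef struct
{
    double      startup_cost;
    double      total_cost;
} ToyCost;

/*
 * Same arithmetic as cost_funnel: divide the subpath's run cost among
 * the workers plus the master, then add setup, per-worker startup and
 * per-tuple communication costs.
 */
static ToyCost
toy_cost_funnel(ToyCost subpath, double tuples, int nworkers,
                double setup_cost, double worker_startup_cost,
                double tuple_comm_cost)
{
    ToyCost     result;
    double      startup = subpath.startup_cost;
    double      run = subpath.total_cost - subpath.startup_cost;

    run = run / (nworkers + 1);

    startup += setup_cost;
    startup += worker_startup_cost * nworkers;
    run += tuple_comm_cost * tuples;

    result.startup_cost = startup;
    result.total_cost = startup + run;
    return result;
}
```

For example, with a subpath costing 1000 in total and 0 at startup, 10000 tuples, 3 workers, and assumed costs of 100 (setup), 50 (per-worker startup) and 0.01 (per-tuple communication), the run cost is 1000 / 4 + 100 = 350 and the startup cost is 100 + 150 = 250, giving a total of 600. Note that the communication term is charged on baserel->tuples, that is, every tuple in the relation, whether or not it passes the qual.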
diff --git a/src/backend/optimizer/path/parallelpath.c b/src/backend/optimizer/path/parallelpath.c
new file mode 100644
index 0000000..949e79b
--- /dev/null
+++ b/src/backend/optimizer/path/parallelpath.c
@@ -0,0 +1,80 @@
+/*-------------------------------------------------------------------------
+ *
+ * parallelpath.c
+ *	  Routines to determine which conditions are usable for scanning
+ *	  a given relation, and create ParallelPaths accordingly.
+ *
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/optimizer/path/parallelpath.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heapam.h"
+#include "optimizer/cost.h"
+#include "optimizer/pathnode.h"
+#include "optimizer/paths.h"
+#include "parser/parsetree.h"
+#include "utils/rel.h"
+
+
+/*
+ * create_parallelscan_paths
+ *	  Create paths corresponding to parallel scans of the given rel.
+ *	  Currently we only support parallel sequential scan.
+ *
+ *	  Candidate paths are added to the rel's pathlist (using add_path).
+ */
+void
+create_parallelscan_paths(PlannerInfo *root, RelOptInfo *rel)
+{
+	int num_parallel_workers = 0;
+	Oid			reloid;
+	Relation	relation;
+	Path		*subpath;
+
+	/*
+	 * A parallel scan is possible only if the user has set
+	 * parallel_seqscan_degree to a value greater than 0
+	 * and the query is parallel-safe.
+	 */
+	if (parallel_seqscan_degree <= 0 || !root->glob->parallelModeOK)
+		return;
+
+	reloid = planner_rt_fetch(rel->relid, root)->relid;
+
+	relation = heap_open(reloid, NoLock);
+
+	/*
+	 * Temporary relations can't be scanned by parallel workers as
+	 * they are visible only to local sessions.
+	 */
+	if (RelationUsesLocalBuffers(relation))
+	{
+		heap_close(relation, NoLock);
+		return;
+	}
+
+	heap_close(relation, NoLock);
+
+	/*
+	 * There should be at least one page to scan for each worker.
+	 */
+	if (parallel_seqscan_degree <= rel->pages)
+		num_parallel_workers = parallel_seqscan_degree;
+	else
+		num_parallel_workers = rel->pages;
+
+	/* Create the partial scan path which each worker needs to execute. */
+	subpath = create_partialseqscan_path(root, rel, false);
+
+	/* Create the parallel scan path which master needs to execute. */
+	add_path(rel, (Path *) create_funnel_path(root, rel, subpath,
+											  num_parallel_workers));
+}
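For readers following the worker-count logic in create_parallelscan_paths above, here is a minimal standalone sketch of the same clamp. It is not part of the patch: clamp_parallel_workers and the BlockCount typedef are hypothetical names used only for illustration, and the early return for a non-positive degree mirrors the guard at the top of the function rather than the if/else itself.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the planner's page-count type. */
typedef uint32_t BlockCount;

/*
 * Return the number of workers to plan for: never more than the
 * configured degree and never more than the number of pages, so that
 * each worker has at least one page to scan.
 */
int
clamp_parallel_workers(int degree, BlockCount pages)
{
	if (degree <= 0)
		return 0;				/* parallel scan disabled */
	if ((BlockCount) degree <= pages)
		return degree;
	return (int) pages;
}
```

Note that an empty relation (zero pages) yields zero workers under this rule, which is a case the caller may want to handle before building a funnel path.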
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index cb69c03..c8422c9 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -58,6 +58,11 @@ static Material *create_material_plan(PlannerInfo *root, MaterialPath *best_path
 static Plan *create_unique_plan(PlannerInfo *root, UniquePath *best_path);
 static SeqScan *create_seqscan_plan(PlannerInfo *root, Path *best_path,
 					List *tlist, List *scan_clauses);
+static Scan *create_partialseqscan_plan(PlannerInfo *root, Path *best_path,
+							List *tlist, List *scan_clauses);
+static Scan *create_funnel_plan(PlannerInfo *root,
+								FunnelPath *best_path,
+								List *tlist, List *scan_clauses);
 static Scan *create_indexscan_plan(PlannerInfo *root, IndexPath *best_path,
 					  List *tlist, List *scan_clauses, bool indexonly);
 static BitmapHeapScan *create_bitmap_scan_plan(PlannerInfo *root,
@@ -100,6 +105,12 @@ static List *order_qual_clauses(PlannerInfo *root, List *clauses);
 static void copy_path_costsize(Plan *dest, Path *src);
 static void copy_plan_costsize(Plan *dest, Plan *src);
 static SeqScan *make_seqscan(List *qptlist, List *qpqual, Index scanrelid);
+static PartialSeqScan *make_partialseqscan(List *qptlist,
+										   List *qpqual,
+										   Index scanrelid);
+static Funnel *make_funnel(List *qptlist, List *qpqual,
+						   Index scanrelid, int nworkers,
+						   Plan *subplan);
 static IndexScan *make_indexscan(List *qptlist, List *qpqual, Index scanrelid,
 			   Oid indexid, List *indexqual, List *indexqualorig,
 			   List *indexorderby, List *indexorderbyorig,
@@ -228,6 +239,8 @@ create_plan_recurse(PlannerInfo *root, Path *best_path)
 	switch (best_path->pathtype)
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
+		case T_Funnel:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -343,6 +356,20 @@ create_scan_plan(PlannerInfo *root, Path *best_path)
 												scan_clauses);
 			break;
 
+		case T_PartialSeqScan:
+			plan = (Plan *) create_partialseqscan_plan(root,
+													   best_path,
+													   tlist,
+													   scan_clauses);
+			break;
+
+		case T_Funnel:
+			plan = (Plan *) create_funnel_plan(root,
+											   (FunnelPath *) best_path,
+											   tlist,
+											   scan_clauses);
+			break;
+
 		case T_IndexScan:
 			plan = (Plan *) create_indexscan_plan(root,
 												  (IndexPath *) best_path,
@@ -546,6 +573,8 @@ disuse_physical_tlist(PlannerInfo *root, Plan *plan, Path *path)
 	switch (path->pathtype)
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
+		case T_Funnel:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -1133,6 +1162,87 @@ create_seqscan_plan(PlannerInfo *root, Path *best_path,
 }
 
 /*
+ * create_partialseqscan_plan
+ *
+ * Returns a partial seqscan plan for the base relation scanned by
+ * 'best_path' with restriction clauses 'scan_clauses' and targetlist
+ * 'tlist'.
+ */
+static Scan *
+create_partialseqscan_plan(PlannerInfo *root, Path *best_path,
+						   List *tlist, List *scan_clauses)
+{
+	Scan    *scan_plan;
+	Index		scan_relid = best_path->parent->relid;
+
+	/* it should be a base rel... */
+	Assert(scan_relid > 0);
+	Assert(best_path->parent->rtekind == RTE_RELATION);
+
+	/* Sort clauses into best execution order */
+	scan_clauses = order_qual_clauses(root, scan_clauses);
+
+	/* Reduce RestrictInfo list to bare expressions; ignore pseudoconstants */
+	scan_clauses = extract_actual_clauses(scan_clauses, false);
+
+	/* Replace any outer-relation variables with nestloop params */
+	if (best_path->param_info)
+	{
+		scan_clauses = (List *)
+			replace_nestloop_params(root, (Node *) scan_clauses);
+	}
+
+	scan_plan = (Scan *) make_partialseqscan(tlist,
+											 scan_clauses,
+											 scan_relid);
+
+	copy_path_costsize(&scan_plan->plan, best_path);
+
+	return scan_plan;
+}
+
+/*
+ * create_funnel_plan
+ *
+ * Returns a funnel plan for the base relation scanned by
+ * 'best_path' with restriction clauses 'scan_clauses' and targetlist
+ * 'tlist'.
+ */
+static Scan *
+create_funnel_plan(PlannerInfo *root, FunnelPath *best_path,
+				   List *tlist, List *scan_clauses)
+{
+	Scan    *scan_plan;
+	Plan	   *subplan;
+	Index		scan_relid = best_path->path.parent->relid;
+
+	/* it should be a base rel... */
+	Assert(scan_relid > 0);
+	Assert(best_path->path.parent->rtekind == RTE_RELATION);
+
+	subplan = create_plan_recurse(root, best_path->subpath);
+
+	/*
+	 * quals for the subplan and the top-level plan are the same,
+	 * because either all the quals are pushed down to the subplan
+	 * (partialseqscan plan) or the parallel plan won't be
+	 * chosen.
+	 */
+	scan_plan = (Scan *) make_funnel(tlist,
+									 subplan->qual,
+									 scan_relid,
+									 best_path->num_workers,
+									 subplan);
+
+	copy_path_costsize(&scan_plan->plan, &best_path->path);
+
+	/* use parallel mode for parallel plans. */
+	root->glob->parallelModeNeeded = true;
+
+	return scan_plan;
+}
+
+/*
  * create_indexscan_plan
  *	  Returns an indexscan plan for the base relation scanned by 'best_path'
  *	  with restriction clauses 'scan_clauses' and targetlist 'tlist'.
@@ -3321,6 +3431,45 @@ make_seqscan(List *qptlist,
 	return node;
 }
 
+static PartialSeqScan *
+make_partialseqscan(List *qptlist,
+					List *qpqual,
+					Index scanrelid)
+{
+	PartialSeqScan *node = makeNode(PartialSeqScan);
+	Plan	   *plan = &node->plan;
+
+	/* cost should be inserted by caller */
+	plan->targetlist = qptlist;
+	plan->qual = qpqual;
+	plan->lefttree = NULL;
+	plan->righttree = NULL;
+	node->scanrelid = scanrelid;
+
+	return node;
+}
+
+static Funnel *
+make_funnel(List *qptlist,
+			List *qpqual,
+			Index scanrelid,
+			int nworkers,
+			Plan *subplan)
+{
+	Funnel *node = makeNode(Funnel);
+	Plan	   *plan = &node->scan.plan;
+
+	/* cost should be inserted by caller */
+	plan->targetlist = qptlist;
+	plan->qual = qpqual;
+	plan->lefttree = subplan;
+	plan->righttree = NULL;
+	node->scan.scanrelid = scanrelid;
+	node->num_workers = nworkers;
+
+	return node;
+}
+
 static IndexScan *
 make_indexscan(List *qptlist,
 			   List *qpqual,
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 1824e7b..4717f78 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -275,6 +275,51 @@ standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
 	return result;
 }
 
+PlannedStmt	*
+create_parallel_worker_plannedstmt(PartialSeqScan *partialscan,
+								   List *rangetable)
+{
+	PlannedStmt	*result;
+	ListCell   *tlist;
+
+	/*
+	 * Avoid removing junk entries in worker as those are
+	 * required by upper nodes in master backend.
+	 */
+	foreach(tlist, partialscan->plan.targetlist)
+	{
+		TargetEntry *tle = (TargetEntry *) lfirst(tlist);
+
+		tle->resjunk = false;
+	}
+
+	/* build the PlannedStmt result */
+	result = makeNode(PlannedStmt);
+
+	result->commandType = CMD_SELECT;
+	result->queryId = 0;
+	result->hasReturning = 0;
+	result->hasModifyingCTE = 0;
+	result->canSetTag = 1;
+	result->transientPlan = 0;
+	result->planTree = (Plan*) partialscan;
+	result->rtable = rangetable;
+	result->resultRelations = NIL;
+	result->utilityStmt = NULL;
+	result->subplans = NIL;
+	result->rewindPlanIDs = NULL;
+	result->rowMarks = NIL;
+	result->nParamExec = 0;
+	/*
+	 * Don't bother to set parameters used for invalidation as
+	 * worker backend plans are not saved, so can't be invalidated.
+	 */
+	result->relationOids = NIL;
+	result->invalItems = NIL;
+	result->hasRowSecurity = false;
+
+	return result;
+}
 
 /*--------------------
  * subquery_planner
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index ec828cd..ef8c317 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -435,6 +435,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
 			{
 				SeqScan    *splan = (SeqScan *) plan;
 
@@ -445,6 +446,24 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 					fix_scan_list(root, splan->plan.qual, rtoffset);
 			}
 			break;
+		case T_Funnel:
+			{
+				Funnel    *splan = (Funnel *) plan;
+
+				splan->scan.scanrelid += rtoffset;
+				splan->scan.plan.targetlist =
+					fix_scan_list(root, splan->scan.plan.targetlist, rtoffset);
+				splan->scan.plan.qual =
+					fix_scan_list(root, splan->scan.plan.qual, rtoffset);
+
+				/*
+				 * target list for partial sequence scan (lefttree of funnel plan)
+				 * should be same as for funnel scan as both nodes need to produce
+				 * same projection.
+				 */
+				splan->scan.plan.lefttree->targetlist = splan->scan.plan.targetlist;
+			}
+			break;
 		case T_IndexScan:
 			{
 				IndexScan  *splan = (IndexScan *) plan;
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index acfd0bc..f649639 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2167,6 +2167,8 @@ finalize_plan(PlannerInfo *root, Plan *plan, Bitmapset *valid_params,
 			break;
 
 		case T_SeqScan:
+		case T_PartialSeqScan:
+		case T_Funnel:
 			context.paramids = bms_add_members(context.paramids, scan_params);
 			break;
 
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index faca30b..0e5fd3a 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -706,6 +706,53 @@ create_seqscan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer)
 }
 
 /*
+ * create_partialseqscan_path
+ *	  Creates a path corresponding to a partial sequential scan, returning the
+ *	  pathnode.
+ */
+Path *
+create_partialseqscan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer)
+{
+	Path	   *pathnode = makeNode(Path);
+
+	pathnode->pathtype = T_PartialSeqScan;
+	pathnode->parent = rel;
+	pathnode->param_info = get_baserel_parampathinfo(root, rel,
+													 false);
+	pathnode->pathkeys = NIL;	/* seqscan has unordered result */
+
+	cost_seqscan(pathnode, root, rel, pathnode->param_info);
+
+	return pathnode;
+}
+
+/*
+ * create_funnel_path
+ *
+ *	  Creates a path corresponding to a funnel scan, returning the
+ *	  pathnode.
+ */
+FunnelPath *
+create_funnel_path(PlannerInfo *root, RelOptInfo *rel,
+							Path* subpath, int nWorkers)
+{
+	FunnelPath	   *pathnode = makeNode(FunnelPath);
+
+	pathnode->path.pathtype = T_Funnel;
+	pathnode->path.parent = rel;
+	pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
+													 false);
+	pathnode->path.pathkeys = NIL;	/* seqscan has unordered result */
+
+	pathnode->subpath = subpath;
+	pathnode->num_workers = nWorkers;
+
+	cost_funnel(pathnode, root, rel, pathnode->path.param_info, nWorkers);
+
+	return pathnode;
+}
+
+/*
  * create_index_path
  *	  Creates a path node for an index scan.
  *
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 71c2321..f056bd5 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -12,7 +12,8 @@ subdir = src/backend/postmaster
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = autovacuum.o bgworker.o bgwriter.o checkpointer.o fork_process.o \
-	pgarch.o pgstat.o postmaster.o startup.o syslogger.o walwriter.o
+OBJS = autovacuum.o backendworker.o bgworker.o bgwriter.o checkpointer.o \
+	fork_process.o pgarch.o pgstat.o postmaster.o startup.o syslogger.o \
+	walwriter.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/backendworker.c b/src/backend/postmaster/backendworker.c
new file mode 100644
index 0000000..925bb7a
--- /dev/null
+++ b/src/backend/postmaster/backendworker.c
@@ -0,0 +1,421 @@
+/*-------------------------------------------------------------------------
+ *
+ * backendworker.c
+ *	  Support routines for setting up backend workers.
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/postmaster/backendworker.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * INTERFACE ROUTINES
+ *		InitializeParallelWorkers				Setup dynamic shared memory and parallel backend workers.
+ */
+#include "postgres.h"
+
+#include "executor/nodeFunnel.h"
+#include "optimizer/planmain.h"
+#include "optimizer/planner.h"
+#include "postmaster/backendworker.h"
+#include "tcop/tcopprot.h"
+
+
+#define PARALLEL_TUPLE_QUEUE_SIZE					65536
+
+static void ParallelQueryMain(dsm_segment *seg, shm_toc *toc);
+static void
+EstimateParallelSupportInfoSpace(ParallelContext *pcxt, ParamListInfo params,
+								 int instOptions, Size *params_size);
+static void
+StoreParallelSupportInfo(ParallelContext *pcxt, ParamListInfo params,
+						 int instOptions, int params_size,
+						 char **inst_options_space,
+						 char **buffer_usage_space);
+static void
+EstimatePartialSeqScanSpace(ParallelContext *pcxt, EState *estate,
+							char *plannedstmt_str, Size *plannedstmt_len,
+							Size *pscan_size);
+static void
+StorePartialSeqScan(ParallelContext *pcxt, EState *estate, Relation rel,
+					char *plannedstmt_str, Size plannedstmt_size,
+					Size pscan_size);
+static void EstimateResponseQueueSpace(ParallelContext *pcxt);
+static void
+StoreResponseQueue(ParallelContext *pcxt,
+				   shm_mq_handle ***responseqp);
+static void
+GetPlannedStmt(shm_toc *toc, PlannedStmt **plannedstmt);
+static void
+GetParallelSupportInfo(shm_toc *toc, ParamListInfo *params,
+					   int *inst_options, char **instrument,
+					   char **buffer_usage);
+static void
+SetupResponseQueue(dsm_segment *seg, shm_toc *toc, shm_mq **mq,
+				   shm_mq_handle **responseq);
+
+
+/*
+ * EstimateParallelSupportInfoSpace
+ *
+ * Estimate the amount of space required to record information of
+ * bind parameters and instrumentation information that need to be
+ * retrieved from parallel workers.
+ */
+void
+EstimateParallelSupportInfoSpace(ParallelContext *pcxt, ParamListInfo params,
+								 int instOptions, Size *params_size)
+{
+	*params_size = EstimateBoundParametersSpace(params);
+	shm_toc_estimate_chunk(&pcxt->estimator, *params_size);
+
+	/*
+	 * We expect each worker to populate the BufferUsage structure
+	 * allocated by master backend and then master backend will aggregate
+	 * all the usage along with its own, so account it for each worker.
+	 */
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   sizeof(BufferUsage) * pcxt->nworkers);
+
+	/* account for instrumentation options. */
+	shm_toc_estimate_chunk(&pcxt->estimator, sizeof(int));
+
+	/*
+	 * We expect each worker to populate the instrumentation structure
+	 * allocated by master backend and then master backend will aggregate
+	 * all the information, so account it for each worker.
+	 */
+	if (instOptions)
+	{
+		shm_toc_estimate_chunk(&pcxt->estimator,
+							   sizeof(Instrumentation) * pcxt->nworkers);
+		/* key for instrumentation information. */
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+	}
+
+	/* keys for parallel support information. */
+	shm_toc_estimate_keys(&pcxt->estimator, 3);
+}
+
+/*
+ * StoreParallelSupportInfo
+ * 
+ * Sets up the bind parameters and instrumentation information
+ * required for parallel execution.
+ */
+void
+StoreParallelSupportInfo(ParallelContext *pcxt, ParamListInfo params,
+						 int instOptions, int params_size,
+						 char **inst_options_space,
+						 char **buffer_usage_space)
+{
+	char	*paramsdata;
+	int		*inst_options;
+
+	/*
+	 * Store the bind parameter list in dynamic shared memory.  This is
+	 * used for parameters in prepared queries.
+	 */
+	paramsdata = shm_toc_allocate(pcxt->toc, params_size);
+	SerializeBoundParams(params, params_size, paramsdata);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_PARAMS, paramsdata);
+
+	/*
+	 * Allocate space for BufferUsage information to be filled by
+	 * each worker.
+	 */
+	*buffer_usage_space =
+			shm_toc_allocate(pcxt->toc, sizeof(BufferUsage) * pcxt->nworkers);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFF_USAGE, *buffer_usage_space);
+
+	/* Store instrument options in dynamic shared memory. */
+	inst_options = shm_toc_allocate(pcxt->toc, sizeof(int));
+	*inst_options = instOptions;
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_INST_OPTIONS, inst_options);
+
+	/*
+	 * Allocate space for instrumentation information to be filled by
+	 * each worker.
+	 */
+	if (instOptions)
+	{
+		*inst_options_space =
+			shm_toc_allocate(pcxt->toc, sizeof(Instrumentation) * pcxt->nworkers);
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_INST_INFO, *inst_options_space);
+	}
+}
+
+/*
+ * EstimatePartialSeqScanSpace
+ *
+ * Estimate the amount of space required to record information of
+ * planned statement and parallel heap scan descriptor that need
+ * to be copied to parallel workers.
+ */
+void
+EstimatePartialSeqScanSpace(ParallelContext *pcxt, EState *estate,
+							char *plannedstmt_str, Size *plannedstmt_len,
+							Size *pscan_size)
+{
+	/* Estimate space for partial seq. scan specific contents. */
+	*plannedstmt_len = strlen(plannedstmt_str) + 1;
+	shm_toc_estimate_chunk(&pcxt->estimator, *plannedstmt_len);
+
+	*pscan_size = heap_parallelscan_estimate(estate->es_snapshot);
+	shm_toc_estimate_chunk(&pcxt->estimator, *pscan_size);
+
+	/* keys for parallel support information. */
+	shm_toc_estimate_keys(&pcxt->estimator, 2);
+}
+
+/*
+ * StorePartialSeqScan
+ * 
+ * Sets up the planned statement and block range for parallel
+ * sequence scan.
+ */
+void
+StorePartialSeqScan(ParallelContext *pcxt, EState *estate, Relation rel,
+					char *plannedstmt_str, Size plannedstmt_size,
+					Size pscan_size)
+{
+	char		*plannedstmtdata;
+	ParallelHeapScanDesc pscan;
+
+	/* Store the serialized planned statement in dynamic shared memory. */
+	plannedstmtdata = shm_toc_allocate(pcxt->toc, plannedstmt_size);
+	memcpy(plannedstmtdata, plannedstmt_str, plannedstmt_size);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_PLANNEDSTMT, plannedstmtdata);
+
+	/* Store parallel heap scan descriptor in dynamic shared memory. */
+	pscan = shm_toc_allocate(pcxt->toc, pscan_size);
+	heap_parallelscan_initialize(pscan, rel, estate->es_snapshot);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_SCAN, pscan);
+}
+
+/*
+ * EstimateResponseQueueSpace
+ *
+ * Estimate the amount of space required to record information of
+ * tuple queues that need to be established between parallel workers
+ * and master backend.
+ */
+void
+EstimateResponseQueueSpace(ParallelContext *pcxt)
+{
+	/* Estimate space for parallel seq. scan specific contents. */
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   (Size) PARALLEL_TUPLE_QUEUE_SIZE * pcxt->nworkers);
+
+	/* keys for response queue. */
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/*
+ * StoreResponseQueue
+ * 
+ * Sets up the response queues that backend workers use to
+ * return tuples to the main backend.
+ */
+void
+StoreResponseQueue(ParallelContext *pcxt,
+				   shm_mq_handle ***responseqp)
+{
+	shm_mq		*mq;
+	char		*tuple_queue_space;
+	int			i;
+
+	/* Allocate memory for shared memory queue handles. */
+	*responseqp = (shm_mq_handle**) palloc(pcxt->nworkers * sizeof(shm_mq_handle*));
+
+	/*
+	 * Establish one message queue per worker in dynamic shared memory.
+	 * These queues should be used to transmit tuple data.
+	 */
+	tuple_queue_space =
+	   shm_toc_allocate(pcxt->toc, PARALLEL_TUPLE_QUEUE_SIZE * pcxt->nworkers);
+	for (i = 0; i < pcxt->nworkers; ++i)
+	{
+		mq = shm_mq_create(tuple_queue_space + i * PARALLEL_TUPLE_QUEUE_SIZE,
+						   (Size) PARALLEL_TUPLE_QUEUE_SIZE);
+		
+		shm_mq_set_receiver(mq, MyProc);
+
+		/*
+		 * Attach the queue before launching a worker, so that we'll automatically
+		 * detach the queue if we error out.  (Otherwise, the worker might sit
+		 * there trying to write the queue long after we've gone away.)
+		 */
+		(*responseqp)[i] = shm_mq_attach(mq, pcxt->seg, NULL);
+	}
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_TUPLE_QUEUE, tuple_queue_space);
+}
+
+/*
+ * InitializeParallelWorkers
+ *
+ *	Sets up the required infrastructure for backend workers to
+ *	perform execution and return results to the main backend.
+ */
+void
+InitializeParallelWorkers(Plan *plan, EState *estate, Relation rel,
+						  char **inst_options_space, char **buffer_usage_space,
+						  shm_mq_handle ***responseqp, ParallelContext **pcxtp,
+						  int nWorkers)
+{
+	Size		params_size, pscan_size, plannedstmt_size;
+	char	   *plannedstmt_str;
+	PlannedStmt	*plannedstmt;
+	ParallelContext *pcxt;
+
+	pcxt = CreateParallelContext(ParallelQueryMain, nWorkers);
+
+	plannedstmt = create_parallel_worker_plannedstmt((PartialSeqScan *)plan,
+													 estate->es_range_table);
+	plannedstmt_str = nodeToString(plannedstmt);
+
+	EstimatePartialSeqScanSpace(pcxt, estate, plannedstmt_str,
+								&plannedstmt_size, &pscan_size);
+	EstimateParallelSupportInfoSpace(pcxt, estate->es_param_list_info,
+									 estate->es_instrument, &params_size);
+	EstimateResponseQueueSpace(pcxt);
+
+	InitializeParallelDSM(pcxt);
+	
+	StorePartialSeqScan(pcxt, estate, rel, plannedstmt_str,
+						plannedstmt_size, pscan_size);
+	StoreParallelSupportInfo(pcxt, estate->es_param_list_info,
+							 estate->es_instrument,
+							 params_size, inst_options_space,
+							 buffer_usage_space);
+	StoreResponseQueue(pcxt, responseqp);
+
+	/* Return results to caller. */
+	*pcxtp = pcxt;
+}
+
+/*
+ * GetParallelSupportInfo
+ *
+ * Look up based on keys in dynamic shared memory segment
+ * and get the bind parameters and instrumentation information
+ * required to perform parallel operation.
+ */
+void
+GetParallelSupportInfo(shm_toc *toc, ParamListInfo *params,
+					   int *inst_options, char **instrument,
+					   char **buffer_usage)
+{
+	char		*paramsdata;
+	char		*inst_options_space;
+	char		*buffer_usage_space;
+	int			*instoptions;
+
+	paramsdata = shm_toc_lookup(toc, PARALLEL_KEY_PARAMS);
+	instoptions	= shm_toc_lookup(toc, PARALLEL_KEY_INST_OPTIONS);
+
+	*params = RestoreBoundParams(paramsdata);
+
+	*inst_options = *instoptions;
+	if (*inst_options)
+	{
+		inst_options_space = shm_toc_lookup(toc, PARALLEL_KEY_INST_INFO);
+		*instrument = (inst_options_space +
+			ParallelWorkerNumber * sizeof(Instrumentation));
+	}
+
+	buffer_usage_space = shm_toc_lookup(toc, PARALLEL_KEY_BUFF_USAGE);
+	*buffer_usage = (buffer_usage_space +
+					 ParallelWorkerNumber * sizeof(BufferUsage));
+}
+
+/*
+ * GetPlannedStmt
+ *
+ * Look up based on keys in dynamic shared memory segment
+ * and get the planned statement required to perform
+ * parallel operation.
+ */
+void
+GetPlannedStmt(shm_toc *toc, PlannedStmt **plannedstmt)
+{
+	char		*plannedstmtdata;
+
+	plannedstmtdata = shm_toc_lookup(toc, PARALLEL_KEY_PLANNEDSTMT);
+
+	*plannedstmt = (PlannedStmt *) stringToNode(plannedstmtdata);
+
+	/* Fill in opfuncid values if missing */
+	fix_opfuncids((Node*) (*plannedstmt)->planTree->qual);
+	fix_opfuncids((Node*) (*plannedstmt)->planTree->targetlist);
+}
+
+/*
+ * SetupResponseQueue
+ *
+ * Look up based on keys in dynamic shared memory segment
+ * and get the tuple queue information for a particular worker,
+ * attach to the queue and redirect all further responses from
+ * worker backend via that queue.
+ */
+void
+SetupResponseQueue(dsm_segment *seg, shm_toc *toc, shm_mq **mq,
+				   shm_mq_handle **responseq)
+{
+	char		*tuple_queue_space;
+
+	tuple_queue_space = shm_toc_lookup(toc, PARALLEL_KEY_TUPLE_QUEUE);
+	*mq = (shm_mq *) (tuple_queue_space +
+		ParallelWorkerNumber * PARALLEL_TUPLE_QUEUE_SIZE);
+
+	shm_mq_set_sender(*mq, MyProc);
+	*responseq = shm_mq_attach(*mq, seg, NULL);
+}
+
+/*
+ * ParallelQueryMain
+ *
+ * Execute the operation to return the tuples or other information
+ * to parallelism driving node.
+ */
+void
+ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
+{
+	shm_mq			*mq;
+	shm_mq_handle	*responseq;
+	PlannedStmt		*plannedstmt;
+	ParamListInfo	params;
+	int				inst_options;
+	char			*instrument = NULL;
+	char			*buffer_usage = NULL;
+	ParallelStmt	*parallelstmt;
+
+	SetupResponseQueue(seg, toc, &mq, &responseq);
+
+	GetPlannedStmt(toc, &plannedstmt);
+	GetParallelSupportInfo(toc, &params, &inst_options,
+						   &instrument, &buffer_usage);
+
+	parallelstmt = palloc(sizeof(ParallelStmt));
+
+	parallelstmt->plannedstmt = plannedstmt;
+	parallelstmt->params	= params;
+	parallelstmt->inst_options = inst_options;
+	parallelstmt->instrument = instrument;
+	parallelstmt->buffer_usage = buffer_usage;
+	parallelstmt->toc = toc;
+	parallelstmt->responseq = responseq;
+
+	/* Execute the worker command. */
+	exec_parallel_stmt(parallelstmt);
+
+	/*
+	 * Once we are done with sending tuples, detach from
+	 * shared memory message queue used to send tuples.
+	 */
+	shm_mq_detach(mq);
+}
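The tuple-queue layout used by StoreResponseQueue and SetupResponseQueue above is one contiguous area carved into fixed-size slots, one per worker. The following standalone sketch shows only the offset arithmetic. It is not part of the patch: TUPLE_QUEUE_SIZE mirrors PARALLEL_TUPLE_QUEUE_SIZE, and the two function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors PARALLEL_TUPLE_QUEUE_SIZE from the patch. */
#define TUPLE_QUEUE_SIZE 65536

/* Total bytes the master reserves for all worker queues. */
size_t
tuple_queue_area_size(int nworkers)
{
	return (size_t) TUPLE_QUEUE_SIZE * (size_t) nworkers;
}

/*
 * Byte offset of one worker's queue inside the shared area.  The master
 * uses the loop index; each worker uses its own worker number, so both
 * sides arrive at the same slot without any extra bookkeeping.
 */
size_t
tuple_queue_offset(int worker_number)
{
	return (size_t) worker_number * TUPLE_QUEUE_SIZE;
}
```

Doing the multiplication in size_t, as the estimate function in the patch does with its (Size) cast, avoids int overflow for large worker counts.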
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 9b2e7f3..0c6b481 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -103,6 +103,7 @@
 #include "miscadmin.h"
 #include "pg_getopt.h"
 #include "pgstat.h"
+#include "optimizer/cost.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/bgworker_internals.h"
 #include "postmaster/fork_process.h"
@@ -835,6 +836,12 @@ PostmasterMain(int argc, char *argv[])
 		ereport(ERROR,
 				(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"archive\", \"hot_standby\", or \"logical\"")));
 
+	if (parallel_seqscan_degree >= MaxConnections)
+	{
+		write_stderr("%s: parallel_seqscan_degree must be less than max_connections\n", progname);
+		ExitPostmaster(1);
+	}
+
 	/*
 	 * Other one-time internal sanity checks can go here, if they are fast.
 	 * (Put any slow processing further down, after postmaster.pid creation.)
diff --git a/src/backend/tcop/dest.c b/src/backend/tcop/dest.c
index bcf3895..7a9ce3e 100644
--- a/src/backend/tcop/dest.c
+++ b/src/backend/tcop/dest.c
@@ -34,6 +34,7 @@
 #include "commands/createas.h"
 #include "commands/matview.h"
 #include "executor/functions.h"
+#include "executor/tqueue.h"
 #include "executor/tstoreReceiver.h"
 #include "libpq/libpq.h"
 #include "libpq/pqformat.h"
@@ -129,6 +130,9 @@ CreateDestReceiver(CommandDest dest)
 
 		case DestTransientRel:
 			return CreateTransientRelDestReceiver(InvalidOid);
+
+		case DestTupleQueue:
+			return CreateTupleQueueDestReceiver();
 	}
 
 	/* should never get here */
@@ -162,6 +166,7 @@ EndCommand(const char *commandTag, CommandDest dest)
 		case DestCopyOut:
 		case DestSQLFunction:
 		case DestTransientRel:
+		case DestTupleQueue:
 			break;
 	}
 }
@@ -204,6 +209,7 @@ NullCommand(CommandDest dest)
 		case DestCopyOut:
 		case DestSQLFunction:
 		case DestTransientRel:
+		case DestTupleQueue:
 			break;
 	}
 }
@@ -248,6 +254,7 @@ ReadyForQuery(CommandDest dest)
 		case DestCopyOut:
 		case DestSQLFunction:
 		case DestTransientRel:
+		case DestTupleQueue:
 			break;
 	}
 }
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 7c18298..92da4f8 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -42,6 +42,7 @@
 #include "catalog/pg_type.h"
 #include "commands/async.h"
 #include "commands/prepare.h"
+#include "executor/tqueue.h"
 #include "libpq/libpq.h"
 #include "libpq/pqformat.h"
 #include "libpq/pqsignal.h"
@@ -55,6 +56,7 @@
 #include "pg_getopt.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/postmaster.h"
+#include "postmaster/backendworker.h"
 #include "replication/slot.h"
 #include "replication/walsender.h"
 #include "rewrite/rewriteHandler.h"
@@ -1192,6 +1194,98 @@ exec_simple_query(const char *query_string)
 }
 
 /*
+ * exec_parallel_stmt
+ *
+ * Execute the plan for backend worker.
+ */
+void
+exec_parallel_stmt(ParallelStmt *parallelstmt)
+{
+	DestReceiver *receiver;
+	QueryDesc	*queryDesc;
+	MemoryContext oldcontext;
+	MemoryContext	plancontext;
+	BufferUsage bufusage_start;
+	BufferUsage bufusage_end = {0};
+
+	set_ps_display("SELECT", false);
+
+	/*
+	 * Unlike exec_simple_query(), in backend worker we won't allow
+	 * transaction control statements, so we can allow plancontext
+	 * to be created in TopTransaction context.
+	 */
+	plancontext = AllocSetContextCreate(CurrentMemoryContext,
+										 "worker plan",
+										 ALLOCSET_DEFAULT_MINSIZE,
+										 ALLOCSET_DEFAULT_INITSIZE,
+										 ALLOCSET_DEFAULT_MAXSIZE);
+
+	oldcontext = MemoryContextSwitchTo(plancontext);
+
+	if (parallelstmt->inst_options)
+		receiver = None_Receiver;
+	else
+	{
+		receiver = CreateDestReceiver(DestTupleQueue);
+		SetTupleQueueDestReceiverParams(receiver, parallelstmt->responseq);
+	}
+
+	/* Create a QueryDesc for the query */
+	queryDesc = CreateQueryDesc(parallelstmt->plannedstmt, "",
+								GetActiveSnapshot(), InvalidSnapshot,
+								receiver, parallelstmt->params,
+								parallelstmt->inst_options);
+
+	queryDesc->toc = parallelstmt->toc;
+
+	PushActiveSnapshot(queryDesc->snapshot);
+
+	/* call ExecutorStart to prepare the plan for execution */
+	ExecutorStart(queryDesc, 0);
+
+	/*
+	 * Calculate the buffer usage for this statement run; it is required
+	 * by plugins to report the total usage for statement execution.
+	 */
+	bufusage_start = pgBufferUsage;
+
+	/* run the plan */
+	ExecutorRun(queryDesc, ForwardScanDirection, 0L);
+
+	BufferUsageAccumDiff(&bufusage_end,
+						 &pgBufferUsage, &bufusage_start);
+
+	/* run cleanup too */
+	ExecutorFinish(queryDesc);
+
+	/* copy buffer usage into shared memory. */
+	memcpy(parallelstmt->buffer_usage,
+		   &bufusage_end,
+		   sizeof(BufferUsage));
+
+	/*
+	 * copy instrumentation information into shared memory if requested
+	 * by master backend.
+	 */
+	if (parallelstmt->inst_options)
+		memcpy(parallelstmt->instrument,
+			   queryDesc->planstate->instrument,
+			   sizeof(Instrumentation));
+
+	ExecutorEnd(queryDesc);
+
+	PopActiveSnapshot();
+
+	FreeQueryDesc(queryDesc);
+
+	if (!parallelstmt->inst_options)
+		(*receiver->rDestroy) (receiver);
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
+/*
  * exec_parse_message
  *
  * Execute a "Parse" protocol message.
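The buffer-usage handling in exec_parallel_stmt above follows a snapshot-and-diff pattern: record the counters before the run, subtract afterwards, and copy the delta into the worker's slot in shared memory. A minimal standalone sketch of that pattern follows. It is not part of the patch: UsageCounters is a hypothetical two-field stand-in for BufferUsage, and both function names are illustrative.

```c
#include <assert.h>

/* Hypothetical, reduced stand-in for BufferUsage. */
typedef struct UsageCounters
{
	long		shared_blks_hit;
	long		shared_blks_read;
} UsageCounters;

/* Add (now - start) into dst, in the spirit of BufferUsageAccumDiff. */
void
usage_accum_diff(UsageCounters *dst, const UsageCounters *now,
				 const UsageCounters *start)
{
	dst->shared_blks_hit += now->shared_blks_hit - start->shared_blks_hit;
	dst->shared_blks_read += now->shared_blks_read - start->shared_blks_read;
}

/*
 * Compute this worker's delta and store it in its own slot of the
 * per-worker array, which the master would later aggregate.
 */
void
publish_worker_usage(UsageCounters *shared_slots, int worker_number,
					 const UsageCounters *start, const UsageCounters *now)
{
	UsageCounters delta = {0, 0};

	usage_accum_diff(&delta, now, start);
	shared_slots[worker_number] = delta;
}
```

Because each worker writes only to its own slot, no locking is needed for the write itself in this sketch. The master still has to wait for the workers to finish before reading the slots.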
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 9c14e8a..0bbc67b 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -80,6 +80,7 @@ CreateQueryDesc(PlannedStmt *plannedstmt,
 	qd->params = params;		/* parameter values passed into query */
 	qd->instrument_options = instrument_options;		/* instrumentation
 														 * wanted? */
+	qd->toc = NULL;		/* need to be set by the caller before ExecutorStart */
 
 	/* null these fields until set by ExecutorStart */
 	qd->tupDesc = NULL;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 9c74ed3..fc1d639 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -608,6 +608,8 @@ const char *const config_group_names[] =
 	gettext_noop("Statistics / Query and Index Statistics Collector"),
 	/* AUTOVACUUM */
 	gettext_noop("Autovacuum"),
+	/* PARALLEL_QUERY */
+	gettext_noop("Parallel Query"),
 	/* CLIENT_CONN */
 	gettext_noop("Client Connection Defaults"),
 	/* CLIENT_CONN_STATEMENT */
@@ -2557,6 +2559,16 @@ static struct config_int ConfigureNamesInt[] =
 	},
 
 	{
+		{"parallel_seqscan_degree", PGC_SUSET, PARALLEL_QUERY,
+			gettext_noop("Sets the maximum number of simultaneously running backend worker processes."),
+			NULL
+		},
+		&parallel_seqscan_degree,
+		0, 0, MAX_BACKENDS,
+		NULL, NULL, NULL
+	},
+
+	{
 		{"autovacuum_work_mem", PGC_SIGHUP, RESOURCES_MEM,
 			gettext_noop("Sets the maximum memory to be used by each autovacuum worker process."),
 			NULL,
@@ -2744,6 +2756,36 @@ static struct config_real ConfigureNamesReal[] =
 		DEFAULT_CPU_OPERATOR_COST, 0, DBL_MAX,
 		NULL, NULL, NULL
 	},
+	{
+		{"cpu_tuple_comm_cost", PGC_USERSET, QUERY_TUNING_COST,
+			gettext_noop("Sets the planner's estimate of the cost of "
+						 "passing each tuple (row) from worker to master backend."),
+			NULL
+		},
+		&cpu_tuple_comm_cost,
+		DEFAULT_CPU_TUPLE_COMM_COST, 0, DBL_MAX,
+		NULL, NULL, NULL
+	},
+	{
+		{"parallel_setup_cost", PGC_USERSET, QUERY_TUNING_COST,
+			gettext_noop("Sets the planner's estimate of the cost of "
+						 "setting up environment (shared memory) for parallelism."),
+			NULL
+		},
+		&parallel_setup_cost,
+		DEFAULT_PARALLEL_SETUP_COST, 0, DBL_MAX,
+		NULL, NULL, NULL
+	},
+	{
+		{"parallel_startup_cost", PGC_USERSET, QUERY_TUNING_COST,
+			gettext_noop("Sets the planner's estimate of the cost of "
+						 "starting parallel workers."),
+			NULL
+		},
+		&parallel_startup_cost,
+		DEFAULT_PARALLEL_STARTUP_COST, 0, DBL_MAX,
+		NULL, NULL, NULL
+	},
 
 	{
 		{"cursor_tuple_fraction", PGC_USERSET, QUERY_TUNING_OTHER,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 110983f..06c5969 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -291,6 +291,9 @@
 #cpu_tuple_cost = 0.01			# same scale as above
 #cpu_index_tuple_cost = 0.005		# same scale as above
 #cpu_operator_cost = 0.0025		# same scale as above
+#cpu_tuple_comm_cost = 0.1		# same scale as above
+#parallel_setup_cost = 0.0	# same scale as above
+#parallel_startup_cost = 0.0	# same scale as above
 #effective_cache_size = 4GB
 
 # - Genetic Query Optimizer -
@@ -501,6 +504,11 @@
 					# autovacuum, -1 means use
 					# vacuum_cost_limit
 
+#------------------------------------------------------------------------------
+# PARALLEL_QUERY PARAMETERS
+#------------------------------------------------------------------------------
+
+#parallel_seqscan_degree = 0		# max number of worker backend subprocesses
 
 #------------------------------------------------------------------------------
 # CLIENT CONNECTION DEFAULTS
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index d36e738..0a34b48 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -117,6 +117,7 @@ extern HeapScanDesc heap_beginscan_bm(Relation relation, Snapshot snapshot,
 extern void heap_setscanlimits(HeapScanDesc scan, BlockNumber startBlk,
 		   BlockNumber endBlk);
 extern void heap_rescan(HeapScanDesc scan, ScanKey key);
+extern void heap_parallel_rescan(ParallelHeapScanDesc pscan, HeapScanDesc scan);
 extern void heap_endscan(HeapScanDesc scan);
 extern HeapTuple heap_getnext(HeapScanDesc scan, ScanDirection direction);
 
diff --git a/src/include/executor/execdesc.h b/src/include/executor/execdesc.h
index a2381cd..56b7c75 100644
--- a/src/include/executor/execdesc.h
+++ b/src/include/executor/execdesc.h
@@ -42,6 +42,7 @@ typedef struct QueryDesc
 	DestReceiver *dest;			/* the destination for tuple output */
 	ParamListInfo params;		/* param values being passed in */
 	int			instrument_options;		/* OR of InstrumentOption flags */
+	shm_toc		*toc;			/* to fetch the information from dsm */
 
 	/* These fields are set by ExecutorStart */
 	TupleDesc	tupDesc;		/* descriptor for result tuples */
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 1c3b2b0..0d28606 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -69,5 +69,12 @@ extern Instrumentation *InstrAlloc(int n, int instrument_options);
 extern void InstrStartNode(Instrumentation *instr);
 extern void InstrStopNode(Instrumentation *instr, double nTuples);
 extern void InstrEndLoop(Instrumentation *instr);
+extern void InstrAggNode(Instrumentation *instr1, Instrumentation *instr2);
+extern void
+	InstrAggBufferUsage(BufferUsage *buffer_usage_dst, BufferUsage *buffer_usage_add);
+extern void BufferUsageAccumDiff(BufferUsage *dst,
+					 const BufferUsage *add,
+					 const BufferUsage *sub);
+extern void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
 
 #endif   /* INSTRUMENT_H */
diff --git a/src/include/executor/nodeFunnel.h b/src/include/executor/nodeFunnel.h
new file mode 100644
index 0000000..3af3a0e
--- /dev/null
+++ b/src/include/executor/nodeFunnel.h
@@ -0,0 +1,24 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeFunnel.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeFunnel.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEFUNNEL_H
+#define NODEFUNNEL_H
+
+#include "nodes/execnodes.h"
+
+extern FunnelState *ExecInitFunnel(Funnel *node, EState *estate, int eflags);
+extern TupleTableSlot *ExecFunnel(FunnelState *node);
+extern void ExecEndFunnel(FunnelState *node);
+extern void ExecReScanFunnel(FunnelState *node);
+
+#endif   /* NODEFUNNEL_H */
diff --git a/src/include/executor/nodePartialSeqscan.h b/src/include/executor/nodePartialSeqscan.h
new file mode 100644
index 0000000..cb05be7
--- /dev/null
+++ b/src/include/executor/nodePartialSeqscan.h
@@ -0,0 +1,24 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodePartialSeqscan.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodePartialSeqscan.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEPARTIALSEQSCAN_H
+#define NODEPARTIALSEQSCAN_H
+
+#include "nodes/execnodes.h"
+
+extern PartialSeqScanState *ExecInitPartialSeqScan(PartialSeqScan *node, EState *estate, int eflags);
+extern TupleTableSlot *ExecPartialSeqScan(PartialSeqScanState *node);
+extern void ExecEndPartialSeqScan(PartialSeqScanState *node);
+extern void ExecReScanPartialSeqScan(PartialSeqScanState *node);
+
+#endif   /* NODEPARTIALSEQSCAN_H */
diff --git a/src/include/executor/tqueue.h b/src/include/executor/tqueue.h
new file mode 100644
index 0000000..c979233
--- /dev/null
+++ b/src/include/executor/tqueue.h
@@ -0,0 +1,34 @@
+/*-------------------------------------------------------------------------
+ *
+ * tqueue.h
+ *	  Use shm_mq to send & receive tuples between parallel backends
+ *
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/tqueue.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef TQUEUE_H
+#define TQUEUE_H
+
+#include "storage/shm_mq.h"
+#include "tcop/dest.h"
+
+/* Use this to send tuples to a shm_mq. */
+extern DestReceiver *CreateTupleQueueDestReceiver(void);
+extern void SetTupleQueueDestReceiverParams(DestReceiver *self,
+						shm_mq_handle *handle);
+
+/* Use these to receive tuples from a shm_mq. */
+typedef struct TupleQueueFunnel TupleQueueFunnel;
+extern TupleQueueFunnel *CreateTupleQueueFunnel(void);
+extern void DestroyTupleQueueFunnel(TupleQueueFunnel *funnel);
+extern void RegisterTupleQueueOnFunnel(TupleQueueFunnel *, shm_mq_handle *);
+extern HeapTuple TupleQueueFunnelNext(TupleQueueFunnel *, bool nowait,
+					 bool *done);
+
+#endif   /* TQUEUE_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index ac75f86..cd79588 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -16,7 +16,9 @@
 
 #include "access/genam.h"
 #include "access/heapam.h"
+#include "access/parallel.h"
 #include "executor/instrument.h"
+#include "executor/tqueue.h"
 #include "nodes/params.h"
 #include "nodes/plannodes.h"
 #include "utils/reltrigger.h"
@@ -389,6 +391,18 @@ typedef struct EState
 	List	   *es_auxmodifytables;		/* List of secondary ModifyTableStates */
 
 	/*
+	 * This is required for parallel plan execution to fetch the
+	 * information from dsm.
+	 */
+	shm_toc		*toc;
+
+	/*
+	 * This is required to collect buffer usage stats from parallel
+	 * workers when requested by plugins.
+	 */
+	bool		total_time;	/* total time spent in ExecutorRun */
+
+	/*
 	 * this ExprContext is for per-output-tuple operations, such as constraint
 	 * checks and index-value computations.  It will be reset for each output
 	 * tuple.  Note that it will be created only if needed.
@@ -1016,6 +1030,11 @@ typedef struct PlanState
 	 * State for management of parameter-change-driven rescanning
 	 */
 	Bitmapset  *chgParam;		/* set of IDs of changed Params */
+	/*
+	 * This is required for parallel plan execution to fetch the
+	 * information from dsm.
+	 */
+	shm_toc			*toc;
 
 	/*
 	 * Other run-time state needed by most if not all node types.
@@ -1216,6 +1235,45 @@ typedef struct ScanState
 typedef ScanState SeqScanState;
 
 /*
+ * PartialSeqScanState extends ScanState by storing additional information
+ * related to scan.
+ */
+typedef struct PartialSeqScanState
+{
+	ScanState		ss;				/* its first field is NodeTag */
+	bool			scan_initialized; /* used to determine if the scan is initialized */
+} PartialSeqScanState;
+
+/*
+ * FunnelState extends ScanState by storing additional information
+ * related to parallel workers.
+ *		pcxt				parallel context for managing generic state information
+ *							required for parallelism.
+ *		responseq			shared memory queues to receive data from workers.
+ *		funnel				maintains the runtime information about queues used to
+ *							receive data from parallel workers.
+ *		inst_options_space	to accumulate instrumentation information from all
+ *							parallel workers.
+ *		buffer_usage_space	to accumulate buffer usage information from all
+ *							parallel workers.
+ *		fs_workersReady		indicates that workers are launched.
+ *		all_workers_done	indicates that all the data from workers has been received.
+ *		local_scan_done		indicates that local scan is completed.
+ */
+typedef struct FunnelState
+{
+	ScanState		ss;				/* its first field is NodeTag */
+	ParallelContext *pcxt;
+	shm_mq_handle	**responseq;
+	TupleQueueFunnel *funnel;
+	char			*inst_options_space;
+	char			*buffer_usage_space;
+	bool			fs_workersReady;
+	bool			all_workers_done;
+	bool			local_scan_done;
+} FunnelState;
+
+/*
  * These structs store information about index quals that don't have simple
  * constant right-hand sides.  See comments for ExecIndexBuildScanKeys()
  * for discussion.
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 38469ef..3f3d572 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -51,6 +51,8 @@ typedef enum NodeTag
 	T_BitmapOr,
 	T_Scan,
 	T_SeqScan,
+	T_PartialSeqScan,
+	T_Funnel,
 	T_IndexScan,
 	T_IndexOnlyScan,
 	T_BitmapIndexScan,
@@ -97,6 +99,8 @@ typedef enum NodeTag
 	T_BitmapOrState,
 	T_ScanState,
 	T_SeqScanState,
+	T_PartialSeqScanState,
+	T_FunnelState,
 	T_IndexScanState,
 	T_IndexOnlyScanState,
 	T_BitmapIndexScanState,
@@ -217,6 +221,7 @@ typedef enum NodeTag
 	T_IndexOptInfo,
 	T_ParamPathInfo,
 	T_Path,
+	T_FunnelPath,
 	T_IndexPath,
 	T_BitmapHeapPath,
 	T_BitmapAndPath,
diff --git a/src/include/nodes/params.h b/src/include/nodes/params.h
index a0f7dd0..65b60a0 100644
--- a/src/include/nodes/params.h
+++ b/src/include/nodes/params.h
@@ -103,4 +103,9 @@ typedef struct ParamExecData
 /* Functions found in src/backend/nodes/params.c */
 extern ParamListInfo copyParamList(ParamListInfo from);
 
+extern Size
+EstimateBoundParametersSpace(ParamListInfo params);
+extern void
+SerializeBoundParams(ParamListInfo params, Size maxsize, char *start_address);
+extern ParamListInfo RestoreBoundParams(char *start_address);
 #endif   /* PARAMS_H */
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 4c63b1a..6a94190 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -20,10 +20,15 @@
 #ifndef PARSENODES_H
 #define PARSENODES_H
 
+#include "executor/instrument.h"
 #include "nodes/bitmapset.h"
 #include "nodes/lockoptions.h"
+#include "nodes/params.h"
+#include "nodes/plannodes.h"
 #include "nodes/primnodes.h"
 #include "nodes/value.h"
+#include "storage/shm_toc.h"
+#include "storage/shm_mq.h"
 
 /* Possible sources of a Query */
 typedef enum QuerySource
@@ -156,6 +161,17 @@ typedef struct Query
 								 * depends on to be semantically valid */
 } Query;
 
+/* worker statement required for parallel execution. */
+typedef struct ParallelStmt
+{
+	PlannedStmt		*plannedstmt;
+	ParamListInfo	params;
+	shm_toc			*toc;
+	shm_mq_handle	*responseq;
+	int				inst_options;
+	char			*instrument;
+	char			*buffer_usage;
+} ParallelStmt;
 
 /****************************************************************************
  *	Supporting data structures for Parse Trees
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 5f0ea1c..7cdf632 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -281,6 +281,22 @@ typedef struct Scan
 typedef Scan SeqScan;
 
 /* ----------------
+ *		partial sequential scan node
+ * ----------------
+ */
+typedef SeqScan PartialSeqScan;
+
+/* ----------------
+ *		parallel sequential scan node
+ * ----------------
+ */
+typedef struct Funnel
+{
+	Scan		scan;
+	int			num_workers;
+} Funnel;
+
+/* ----------------
  *		index scan node
  *
  * indexqualorig is an implicitly-ANDed list of index qual expressions, each
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 72eb49b..c3e1f6a 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -741,6 +741,13 @@ typedef struct Path
 	/* pathkeys is a List of PathKey nodes; see above */
 } Path;
 
+typedef struct FunnelPath
+{
+	Path		path;
+	Path	    *subpath;	/* path for each worker */
+	int			num_workers;
+} FunnelPath;
+
 /* Macro for extracting a path's parameterization relids; beware double eval */
 #define PATH_REQ_OUTER(path)  \
 	((path)->param_info ? (path)->param_info->ppi_req_outer : (Relids) NULL)
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 9c2000b..11f0409 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -26,6 +26,14 @@
 #define DEFAULT_CPU_TUPLE_COST	0.01
 #define DEFAULT_CPU_INDEX_TUPLE_COST 0.005
 #define DEFAULT_CPU_OPERATOR_COST  0.0025
+#define DEFAULT_CPU_TUPLE_COMM_COST 0.1
+/*
+ * XXX - We need some experiments to know what could be
+ * appropriate default values for parallel setup and startup
+ * cost.
+ */
+#define	DEFAULT_PARALLEL_SETUP_COST  0.0
+#define	DEFAULT_PARALLEL_STARTUP_COST  0.0
 
 #define DEFAULT_EFFECTIVE_CACHE_SIZE  524288	/* measured in pages */
 
@@ -48,8 +56,12 @@ extern PGDLLIMPORT double random_page_cost;
 extern PGDLLIMPORT double cpu_tuple_cost;
 extern PGDLLIMPORT double cpu_index_tuple_cost;
 extern PGDLLIMPORT double cpu_operator_cost;
+extern PGDLLIMPORT double cpu_tuple_comm_cost;
+extern PGDLLIMPORT double parallel_setup_cost;
+extern PGDLLIMPORT double parallel_startup_cost;
 extern PGDLLIMPORT int effective_cache_size;
 extern Cost disable_cost;
+extern int	parallel_seqscan_degree;
 extern bool enable_seqscan;
 extern bool enable_indexscan;
 extern bool enable_indexonlyscan;
@@ -68,6 +80,8 @@ extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,
 					double index_pages, PlannerInfo *root);
 extern void cost_seqscan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 			 ParamPathInfo *param_info);
+extern void cost_funnel(FunnelPath *path, PlannerInfo *root,
+				RelOptInfo *baserel, ParamPathInfo *param_info, int nWorkers);
 extern void cost_index(IndexPath *path, PlannerInfo *root,
 		   double loop_count);
 extern void cost_bitmap_heap_scan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 9923f0e..7873565 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -32,6 +32,11 @@ extern bool add_path_precheck(RelOptInfo *parent_rel,
 
 extern Path *create_seqscan_path(PlannerInfo *root, RelOptInfo *rel,
 					Relids required_outer);
+extern Path *
+create_partialseqscan_path(PlannerInfo *root, RelOptInfo *rel,
+					Relids required_outer);
+extern FunnelPath *create_funnel_path(PlannerInfo *root,
+						RelOptInfo *rel, Path *subpath, int nWorkers);
 extern IndexPath *create_index_path(PlannerInfo *root,
 				  IndexOptInfo *index,
 				  List *indexclauses,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 6cad92e..391d519 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -46,6 +46,13 @@ extern void debug_print_rel(PlannerInfo *root, RelOptInfo *rel);
 #endif
 
 /*
+ * parallelpath.c
+ *	  routines to generate parallel scan paths
+ */
+
+extern void create_parallelscan_paths(PlannerInfo *root, RelOptInfo *rel);
+
+/*
  * indxpath.c
  *	  routines to generate index paths
  */
diff --git a/src/include/optimizer/planner.h b/src/include/optimizer/planner.h
index b10a504..8d6e350 100644
--- a/src/include/optimizer/planner.h
+++ b/src/include/optimizer/planner.h
@@ -14,6 +14,7 @@
 #ifndef PLANNER_H
 #define PLANNER_H
 
+#include "nodes/parsenodes.h"
 #include "nodes/plannodes.h"
 #include "nodes/relation.h"
 
@@ -29,6 +30,8 @@ extern PlannedStmt *planner(Query *parse, int cursorOptions,
 		ParamListInfo boundParams);
 extern PlannedStmt *standard_planner(Query *parse, int cursorOptions,
 				 ParamListInfo boundParams);
+extern PlannedStmt	*create_parallel_worker_plannedstmt(PartialSeqScan *partialscan,
+											List *rangetable);
 
 extern Plan *subquery_planner(PlannerGlobal *glob, Query *parse,
 				 PlannerInfo *parent_root,
diff --git a/src/include/postmaster/backendworker.h b/src/include/postmaster/backendworker.h
new file mode 100644
index 0000000..bf91824
--- /dev/null
+++ b/src/include/postmaster/backendworker.h
@@ -0,0 +1,40 @@
+/*--------------------------------------------------------------------
+ * backendworker.h
+ *		POSTGRES backend workers interface
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/postmaster/backendworker.h
+ *--------------------------------------------------------------------
+ */
+#ifndef BACKENDWORKER_H
+#define BACKENDWORKER_H
+
+/*---------------------------------------------------------------------
+ * External module API.
+ *---------------------------------------------------------------------
+ */
+
+#include "libpq/pqmq.h"
+
+/* Table-of-contents constants for our dynamic shared memory segment. */
+#define	PARALLEL_KEY_PLANNEDSTMT	0
+#define	PARALLEL_KEY_PARAMS			1
+#define PARALLEL_KEY_BUFF_USAGE		2
+#define PARALLEL_KEY_INST_OPTIONS	3
+#define PARALLEL_KEY_INST_INFO		4
+#define PARALLEL_KEY_TUPLE_QUEUE	5
+#define PARALLEL_KEY_SCAN			6
+
+extern int	parallel_seqscan_degree;
+
+extern void InitializeParallelWorkers(Plan *plan, EState *estate,
+									  Relation rel, char **inst_options_space,
+									  char **buffer_usage_space,
+									  shm_mq_handle ***responseqp,
+									  ParallelContext **pcxtp,
+									  int nWorkers);
+
+#endif   /* BACKENDWORKER_H */
diff --git a/src/include/tcop/dest.h b/src/include/tcop/dest.h
index 5bcca3f..b560672 100644
--- a/src/include/tcop/dest.h
+++ b/src/include/tcop/dest.h
@@ -94,7 +94,8 @@ typedef enum
 	DestIntoRel,				/* results sent to relation (SELECT INTO) */
 	DestCopyOut,				/* results sent to COPY TO code */
 	DestSQLFunction,			/* results sent to SQL-language func mgr */
-	DestTransientRel			/* results sent to transient relation */
+	DestTransientRel,			/* results sent to transient relation */
+	DestTupleQueue				/* results sent to tuple queue */
 } CommandDest;
 
 /* ----------------
diff --git a/src/include/tcop/tcopprot.h b/src/include/tcop/tcopprot.h
index b3c705f..5c25627 100644
--- a/src/include/tcop/tcopprot.h
+++ b/src/include/tcop/tcopprot.h
@@ -84,5 +84,6 @@ extern void set_debug_options(int debug_flag,
 extern bool set_plan_disabling_options(const char *arg,
 						   GucContext context, GucSource source);
 extern const char *get_stats_option_name(const char *arg);
+extern void exec_parallel_stmt(ParallelStmt *parallelscan);
 
 #endif   /* TCOPPROT_H */
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index cf319af..38855e5 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -85,6 +85,7 @@ enum config_group
 	STATS_MONITORING,
 	STATS_COLLECTOR,
 	AUTOVACUUM,
+	PARALLEL_QUERY,
 	CLIENT_CONN,
 	CLIENT_CONN_STATEMENT,
 	CLIENT_CONN_LOCALE,
#224Amit Kapila
amit.kapila16@gmail.com
In reply to: Rajeev rastogi (#222)
Re: Parallel Seq Scan

On Wed, Mar 25, 2015 at 3:47 PM, Rajeev rastogi <rajeev.rastogi@huawei.com>
wrote:

On 20 March 2015 17:37, Amit Kapila Wrote:

So the patches have to be applied in the sequence below:

HEAD Commit-id : 8d1f2390
parallel-mode-v8.1.patch [2]
assess-parallel-safety-v4.patch [1]
parallel-heap-scan.patch [3]
parallel_seqscan_v11.patch (Attached with this mail)

While I was going through this patch, I observed one invalid ASSERT in
the function “ExecInitFunnel”, i.e.

Assert(outerPlan(node) == NULL);

The outer node of a Funnel node is always non-NULL; currently it will be a
PartialSeqScan node.

Which version of the patch are you looking at?

I see the code below in ExecInitFunnel() in Version-11, to which
you replied.

+ /* Funnel node doesn't have innerPlan node. */
+ Assert(innerPlan(node) == NULL);
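
For readers following the assertion discussion: in PostgreSQL plan trees, outerPlan() and innerPlan() are macros for a node's lefttree and righttree. The sketch below uses simplified stand-in types (MockPlan, the mock_* macros and funnel_shape_is_valid() are hypothetical, not the real definitions) to illustrate why a check on innerPlan can hold for a Funnel while a check that outerPlan is NULL cannot, given that the PartialSeqScan hangs on the outer side.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for a plan node (hypothetical, for illustration only). */
typedef struct MockPlan
{
	const char *name;
	struct MockPlan *lefttree;	/* the "outer" child */
	struct MockPlan *righttree;	/* the "inner" child */
} MockPlan;

/* Same shape as the real macros: outer is lefttree, inner is righttree. */
#define mock_outerPlan(node)	((node)->lefttree)
#define mock_innerPlan(node)	((node)->righttree)

/*
 * A Funnel is expected to have exactly one child, the partial scan,
 * attached on the outer side, and nothing on the inner side.
 */
bool
funnel_shape_is_valid(const MockPlan *funnel)
{
	return mock_outerPlan(funnel) != NULL && mock_innerPlan(funnel) == NULL;
}
```

With a Funnel whose lefttree points at a PartialSeqScan, funnel_shape_is_valid() returns true, whereas asserting mock_outerPlan(funnel) == NULL on the same node would fail.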

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#225Rajeev rastogi
rajeev.rastogi@huawei.com
In reply to: Amit Kapila (#224)
Re: Parallel Seq Scan

On 25 March 2015 16:00, Amit Kapila Wrote:

Which version of the patch are you looking at?
I see the code below in ExecInitFunnel() in Version-11, to which
you replied.

+ /* Funnel node doesn't have innerPlan node. */
+ Assert(innerPlan(node) == NULL);

I was looking at version-10.
I just checked version-11 and version-12 and found that it is already fixed.
I should have checked the latest version before sending the report…☺

Thanks and Regards,
Kumar Rajeev Rastogi


#226Amit Kapila
amit.kapila16@gmail.com
In reply to: Rajeev rastogi (#225)
Re: Parallel Seq Scan

On Wed, Mar 25, 2015 at 4:08 PM, Rajeev rastogi <rajeev.rastogi@huawei.com>
wrote:

On 25 March 2015 16:00, Amit Kapila Wrote:

Which version of the patch are you looking at?

I see the code below in ExecInitFunnel() in Version-11, to which
you replied.

+ /* Funnel node doesn't have innerPlan node. */
+ Assert(innerPlan(node) == NULL);

I was looking at version-10.

I just checked version-11 and version-12 and found that it is already fixed.

I should have checked the latest version before sending the report…☺

No problem, thanks for looking into the patch.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#227Thom Brown
thom@linux.com
In reply to: Amit Kapila (#223)
Re: Parallel Seq Scan

On 25 March 2015 at 10:27, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Mar 20, 2015 at 5:36 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

So the patches have to be applied in the sequence below:
HEAD Commit-id : 8d1f2390
parallel-mode-v8.1.patch [2]
assess-parallel-safety-v4.patch [1]
parallel-heap-scan.patch [3]
parallel_seqscan_v11.patch (Attached with this mail)

The reason for not using the latest commit in HEAD is that latest
version of assess-parallel-safety patch was not getting applied,
so I generated the patch at commit-id where I could apply that
patch successfully.

[1] -
/messages/by-id/CA+TgmobJSuefiPOk6+i9WERUgeAB3ggJv7JxLX+r6S5SYydBRQ@mail.gmail.com
[2] -
/messages/by-id/CA+TgmoZJjzYnpXChL3gr7NwRUzkAzPMPVKAtDt5sHvC5Cd7RKw@mail.gmail.com
[3] -
/messages/by-id/CA+TgmoYJETgeAXUsZROnA7BdtWzPtqExPJNTV1GKcaVMgSdhug@mail.gmail.com

Fixed the issue reported on the assess-parallel-safety thread and another
bug caught while testing joins, and integrated the patch with the latest
version of the parallel-mode patch (parallel-mode-v9).

Apart from that, I have moved the initialization of the dsm segment from
the InitNode phase to ExecFunnel() (on first execution), as suggested
by Robert. The main idea is that, since this creates a large shared memory
segment, the work should be done only when it is really required.
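
The deferral described above is ordinary lazy initialization. A minimal sketch, assuming a hypothetical setup_parallel_resources() that stands in for creating the dsm segment and launching workers; the struct and function names here are illustrative and not taken from the patch, although the workers_ready flag plays the same role as fs_workersReady in FunnelState.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the Funnel executor state. */
typedef struct SketchFunnelState
{
	bool		workers_ready;	/* has the expensive setup run yet? */
	int			setup_calls;	/* counts setups, for demonstration only */
	int			tuples_returned;
} SketchFunnelState;

/* Stand-in for creating the dsm segment and launching the workers. */
void
setup_parallel_resources(SketchFunnelState *state)
{
	state->setup_calls++;
	state->workers_ready = true;
}

/* Init phase: cheap, touches no shared memory. */
void
sketch_init_funnel(SketchFunnelState *state)
{
	state->workers_ready = false;
	state->setup_calls = 0;
	state->tuples_returned = 0;
}

/* Exec phase: pay the setup cost only when the first tuple is requested. */
int
sketch_exec_funnel(SketchFunnelState *state)
{
	if (!state->workers_ready)
		setup_parallel_resources(state);
	assert(state->workers_ready);
	return ++state->tuples_returned;
}
```

A plan that is initialized but never executed (for example, one that is only explained) would then never run the setup step at all, and repeated calls to the exec function run it exactly once.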

HEAD Commit-Id: 11226e38
parallel-mode-v9.patch [2]
assess-parallel-safety-v4.patch [1]
parallel-heap-scan.patch [3]
parallel_seqscan_v12.patch (Attached with this mail)

[1] -
/messages/by-id/CA+TgmobJSuefiPOk6+i9WERUgeAB3ggJv7JxLX+r6S5SYydBRQ@mail.gmail.com
[2] -
/messages/by-id/CA+TgmoZfSXZhS6qy4Z0786D7iU_AbhBVPQFwLthpSvGieczqHg@mail.gmail.com
[3] -
/messages/by-id/CA+TgmoYJETgeAXUsZROnA7BdtWzPtqExPJNTV1GKcaVMgSdhug@mail.gmail.com

Okay, with my pgbench_accounts partitioned into 300, I ran:

SELECT DISTINCT bid FROM pgbench_accounts;

The query never returns, and I also get this:

grep -r 'starting background worker process "parallel worker for PID 12165"' postgresql-2015-03-25_112522.log | wc -l
2496

2,496 workers? This is with parallel_seqscan_degree set to 8. If I set it
to 2, this number goes down to 626, and with 16, goes up to 4320.

Here's the query plan:

                                               QUERY PLAN
---------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=38856527.50..38856529.50 rows=200 width=4)
   Group Key: pgbench_accounts.bid
   ->  Append  (cost=0.00..38806370.00 rows=20063001 width=4)
         ->  Seq Scan on pgbench_accounts  (cost=0.00..0.00 rows=1 width=4)
         ->  Funnel on pgbench_accounts_1  (cost=0.00..192333.33 rows=100000 width=4)
               Number of Workers: 8
               ->  Partial Seq Scan on pgbench_accounts_1  (cost=0.00..1641000.00 rows=100000 width=4)
         ->  Funnel on pgbench_accounts_2  (cost=0.00..192333.33 rows=100000 width=4)
               Number of Workers: 8
               ->  Partial Seq Scan on pgbench_accounts_2  (cost=0.00..1641000.00 rows=100000 width=4)
         ->  Funnel on pgbench_accounts_3  (cost=0.00..192333.33 rows=100000 width=4)
               Number of Workers: 8
 ...
               ->  Partial Seq Scan on pgbench_accounts_498  (cost=0.00..10002.10 rows=210 width=4)
         ->  Funnel on pgbench_accounts_499  (cost=0.00..1132.34 rows=210 width=4)
               Number of Workers: 8
               ->  Partial Seq Scan on pgbench_accounts_499  (cost=0.00..10002.10 rows=210 width=4)
         ->  Funnel on pgbench_accounts_500  (cost=0.00..1132.34 rows=210 width=4)
               Number of Workers: 8
               ->  Partial Seq Scan on pgbench_accounts_500  (cost=0.00..10002.10 rows=210 width=4)

Still not sure why 8 workers are needed for each partial scan. I would
expect 8 workers to be used for 8 separate scans. Perhaps this is just my
misunderstanding of how this feature works.
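
One way to read the plan above is that every Funnel node carries its own "Number of Workers", so the configured degree applies per Funnel rather than per query. The sketch below is only arithmetic under that assumption: nominal_worker_requests() is a hypothetical helper, and the launch counts in the log need not match it, since the query never finished and individual launches may fail or be repeated.

```c
#include <assert.h>

/*
 * Nominal number of worker launch requests if each Funnel node under the
 * Append asks for its full complement of workers (an assumption, not a
 * statement about what the patch actually does at run time).
 */
long
nominal_worker_requests(int funnel_nodes, int workers_per_funnel)
{
	assert(funnel_nodes >= 0 && workers_per_funnel >= 0);
	return (long) funnel_nodes * workers_per_funnel;
}
```

For instance, nominal_worker_requests(300, 8) gives 2400, which is far more than the 8 workers one would expect if the degree were a per-query limit.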

--
Thom

#228Thom Brown
thom@linux.com
In reply to: Thom Brown (#227)
Re: Parallel Seq Scan

On 25 March 2015 at 11:46, Thom Brown <thom@linux.com> wrote:

Still not sure why 8 workers are needed for each partial scan. I would
expect 8 workers to be used for 8 separate scans. Perhaps this is just my
misunderstanding of how this feature works.

Another issue:

SELECT * FROM pgb<tab>

*crash*

Logs:

2015-03-25 13:17:49 GMT [22823]: [124-1] user=,db=,client= LOG: registering background worker "parallel worker for PID 24792"
2015-03-25 13:17:49 GMT [22823]: [125-1] user=,db=,client= LOG: registering background worker "parallel worker for PID 24792"
2015-03-25 13:17:49 GMT [22823]: [126-1] user=,db=,client= LOG: registering background worker "parallel worker for PID 24792"
2015-03-25 13:17:49 GMT [22823]: [127-1] user=,db=,client= LOG: registering background worker "parallel worker for PID 24792"
2015-03-25 13:17:49 GMT [22823]: [128-1] user=,db=,client= LOG: registering background worker "parallel worker for PID 24792"
2015-03-25 13:17:49 GMT [22823]: [129-1] user=,db=,client= LOG: registering background worker "parallel worker for PID 24792"
2015-03-25 13:17:49 GMT [22823]: [130-1] user=,db=,client= LOG: registering background worker "parallel worker for PID 24792"
2015-03-25 13:17:49 GMT [22823]: [131-1] user=,db=,client= LOG: registering background worker "parallel worker for PID 24792"
2015-03-25 13:17:49 GMT [22823]: [132-1] user=,db=,client= LOG: starting background worker process "parallel worker for PID 24792"
2015-03-25 13:17:49 GMT [22823]: [133-1] user=,db=,client= LOG: starting background worker process "parallel worker for PID 24792"
2015-03-25 13:17:49 GMT [22823]: [134-1] user=,db=,client= LOG: starting background worker process "parallel worker for PID 24792"
2015-03-25 13:17:49 GMT [22823]: [135-1] user=,db=,client= LOG: starting background worker process "parallel worker for PID 24792"
2015-03-25 13:17:49 GMT [22823]: [136-1] user=,db=,client= LOG: starting background worker process "parallel worker for PID 24792"
2015-03-25 13:17:49 GMT [22823]: [137-1] user=,db=,client= LOG: starting background worker process "parallel worker for PID 24792"
2015-03-25 13:17:49 GMT [22823]: [138-1] user=,db=,client= LOG: starting background worker process "parallel worker for PID 24792"
2015-03-25 13:17:49 GMT [22823]: [139-1] user=,db=,client= LOG: starting background worker process "parallel worker for PID 24792"
2015-03-25 13:17:49 GMT [22823]: [140-1] user=,db=,client= LOG: worker process: parallel worker for PID 24792 (PID 24804) was terminated by signal 11: Segmentation fault
2015-03-25 13:17:49 GMT [22823]: [141-1] user=,db=,client= LOG: terminating any other active server processes
2015-03-25 13:17:49 GMT [24777]: [2-1] user=,db=,client= WARNING: terminating connection because of crash of another server process
2015-03-25 13:17:49 GMT [24777]: [3-1] user=,db=,client= DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2015-03-25 13:17:49 GMT [24777]: [4-1] user=,db=,client= HINT: In a moment you should be able to reconnect to the database and repeat your command.

Backtrace:

#0 GrantLockLocal (locallock=locallock@entry=0xfbe7f0,
owner=owner@entry=0x1046da0)
at lock.c:1544
#1 0x000000000066975c in LockAcquireExtended
(locktag=locktag@entry=0x7fffdcb0ea20,
lockmode=1,
lockmode@entry=<error reading variable: Cannot access memory at address
0x7fffdcb0e9f0>, sessionLock=sessionLock@entry=0 '\000',
dontWait=dontWait@entry=0 '\000',
reportMemoryError=reportMemoryError@entry=1 '\001', ) at lock.c:798
#2 0x000000000066a1c4 in LockAcquire (locktag=locktag@entry=0x7fffdcb0ea20,
lockmode=<error reading variable: Cannot access memory at address
0x7fffdcb0e9f0>,
sessionLock=sessionLock@entry=0 '\000', dontWait=dontWait@entry=0
'\000') at lock.c:680
#3 0x0000000000667c48 in LockRelationOid (relid=<error reading variable:
Cannot access memory at address 0x7fffdcb0e9e8>,
relid@entry=<error reading variable: Cannot access memory at address
0x7fffdcb0ea48>,
lockmode=<error reading variable: Cannot access memory at address
0x7fffdcb0e9f0>,
lockmode@entry=<error reading variable: Cannot access memory at address
0x7fffdcb0ea48>) at lmgr.c:94

But the issue seems to produce a different backtrace each time...

2nd backtrace:

#0 hash_search_with_hash_value (hashp=0x2a2c370,
keyPtr=keyPtr@entry=0x7ffff5ad2230,
hashvalue=hashvalue@entry=2114233864, action=action@entry=HASH_FIND,
foundPtr=foundPtr@entry=0x0) at dynahash.c:918
#1 0x0000000000654d1a in BufTableLookup (tagPtr=tagPtr@entry=0x7ffff5ad2230,
hashcode=hashcode@entry=2114233864) at buf_table.c:96
#2 0x000000000065746b in BufferAlloc (foundPtr=0x7ffff5ad222f <Address
0x7ffff5ad222f out of bounds>, strategy=0x0,
blockNum=<error reading variable: Cannot access memory at address
0x7ffff5ad2204>,
forkNum=<error reading variable: Cannot access memory at address
0x7ffff5ad2208>,
relpersistence=<error reading variable: Cannot access memory at address
0x7ffff5ad2214>, smgr=0x2aaae00) at bufmgr.c:893
#3 ReadBuffer_common (smgr=0x2aaae00, relpersistence=<optimized out>, ) at
bufmgr.c:641
#4 0x0000000000657e40 in ReadBufferExtended (reln=<error reading variable:
Cannot access memory at address 0x7ffff5ad2278>,
reln@entry=<error reading variable: Cannot access memory at address
0x7ffff5ad22f8>, forkNum=MAIN_FORKNUM, blockNum=6, mode=<optimized out>,
strategy=<optimized out>) at bufmgr.c:560

3rd backtrace:

#0 hash_search_with_hash_value (hashp=0x1d97370,
keyPtr=keyPtr@entry=0x7ffff95855f0,
hashvalue=hashvalue@entry=2382868486, action=action@entry=HASH_FIND,
foundPtr=foundPtr@entry=0x0) at dynahash.c:907
#1 0x0000000000654d1a in BufTableLookup (tagPtr=tagPtr@entry=0x7ffff95855f0,
hashcode=hashcode@entry=2382868486) at buf_table.c:96
#2 0x000000000065746b in BufferAlloc (foundPtr=0x7ffff95855ef "",
strategy=0x0, blockNum=9, forkNum=MAIN_FORKNUM, relpersistence=112 'p',
smgr=0x1e15860)
at bufmgr.c:893
#3 ReadBuffer_common (smgr=0x1e15860, relpersistence=<optimized out>,
forkNum=forkNum@entry=MAIN_FORKNUM, blockNum=blockNum@entry=9,
mode=RBM_NORMAL, strategy=0x0,
hit=hit@entry=0x7ffff958569f "") at bufmgr.c:641
#4 0x0000000000657e40 in ReadBufferExtended (reln=reln@entry=0x7f8a17bab2c0,
forkNum=forkNum@entry=MAIN_FORKNUM, blockNum=9, mode=mode@entry=RBM_NORMAL,
strategy=strategy@entry=0x0) at bufmgr.c:560
#5 0x0000000000657f4d in ReadBuffer (blockNum=<optimized out>,
reln=0x7f8a17bab2c0) at bufmgr.c:492
#6 ReleaseAndReadBuffer (buffer=buffer@entry=398111424,
relation=relation@entry=0x1, blockNum=<optimized out>) at bufmgr.c:1403
#7 0x000000000049e6bf in _bt_relandgetbuf (rel=0x1, rel@entry=0x7f8a17bab2c0,
obuf=398111424, blkno=blkno@entry=9, access=access@entry=1) at nbtpage.c:707
#8 0x00000000004a24b4 in _bt_search (rel=rel@entry=0x7f8a17bab2c0,
keysz=keysz@entry=2, scankey=scankey@entry=0x7ffff95858b0,
nextkey=nextkey@entry=0 '\000',
bufP=bufP@entry=0x7ffff95857ac, access=access@entry=1) at
nbtsearch.c:131
#9 0x00000000004a2cb4 in _bt_first (scan=scan@entry=0x1eb2048,
dir=dir@entry=ForwardScanDirection) at nbtsearch.c:940
#10 0x00000000004a1141 in btgettuple (fcinfo=<optimized out>) at
nbtree.c:288
#11 0x0000000000759132 in FunctionCall2Coll (flinfo=flinfo@entry=0x1e34390,
collation=collation@entry=0, arg1=arg1@entry=32186440, arg2=arg2@entry=1)
at fmgr.c:1323
#12 0x000000000049b273 in index_getnext_tid (scan=scan@entry=0x1eb2048,
direction=direction@entry=ForwardScanDirection) at indexam.c:462
#13 0x000000000049b450 in index_getnext (scan=0x1eb2048,
direction=direction@entry=ForwardScanDirection) at indexam.c:602
#14 0x000000000049a9a9 in systable_getnext (sysscan=sysscan@entry=0x1eb1ff8)
at genam.c:416
#15 0x0000000000740452 in SearchCatCache (cache=0x1ddf540, v1=<optimized
out>, v2=<optimized out>, v3=<optimized out>, v4=<optimized out>) at
catcache.c:1248
#16 0x000000000074bd06 in GetSysCacheOid (cacheId=cacheId@entry=44,
key1=key1@entry=140226851237264, key2=<optimized out>, key3=key3@entry=0,
key4=key4@entry=0)
at syscache.c:988
#17 0x000000000074d674 in get_relname_relid
(relname=relname@entry=0x7f891ba7ed90
"pgbench_accounts_3", relnamespace=<optimized out>) at lsyscache.c:1602
#18 0x00000000004e1228 in RelationIsVisible (relid=relid@entry=16428) at
namespace.c:740
#19 0x00000000004e4b6f in pg_table_is_visible (fcinfo=0x1e9dfc8) at
namespace.c:4078
#20 0x0000000000595f72 in ExecMakeFunctionResultNoSets (fcache=0x1e9df58,
econtext=0x1e99848, isNull=0x7ffff95871bf "", isDone=<optimized out>) at
execQual.c:2015
#21 0x000000000059b469 in ExecQual (qual=qual@entry=0x1e9b368,
econtext=econtext@entry=0x1e99848, resultForNull=resultForNull@entry=0
'\000') at execQual.c:5206
#22 0x000000000059b9a6 in ExecScan (node=node@entry=0x1e99738,
accessMtd=accessMtd@entry=0x5ad780 <PartialSeqNext>,
recheckMtd=recheckMtd@entry=0x5ad770 <PartialSeqRecheck>) at
execScan.c:195
#23 0x00000000005ad8d0 in ExecPartialSeqScan (node=node@entry=0x1e99738) at
nodePartialSeqscan.c:241
#24 0x0000000000594f68 in ExecProcNode (node=0x1e99738) at
execProcnode.c:422
#25 0x00000000005a39b6 in funnel_getnext (funnelstate=0x1e943c8) at
nodeFunnel.c:308
#26 ExecFunnel (node=node@entry=0x1e943c8) at nodeFunnel.c:185
#27 0x0000000000594f58 in ExecProcNode (node=0x1e943c8) at
execProcnode.c:426
#28 0x00000000005a0212 in ExecAppend (node=node@entry=0x1e941d8) at
nodeAppend.c:209
#29 0x0000000000594fa8 in ExecProcNode (node=node@entry=0x1e941d8) at
execProcnode.c:399
#30 0x00000000005a0c9e in agg_fill_hash_table (aggstate=0x1e93ba8) at
nodeAgg.c:1353
#31 ExecAgg (node=node@entry=0x1e93ba8) at nodeAgg.c:1115
#32 0x0000000000594e38 in ExecProcNode (node=node@entry=0x1e93ba8) at
execProcnode.c:506
#33 0x00000000005a8144 in ExecLimit (node=node@entry=0x1e93908) at
nodeLimit.c:91
#34 0x0000000000594d98 in ExecProcNode (node=node@entry=0x1e93908) at
execProcnode.c:530
#35 0x0000000000592380 in ExecutePlan (dest=0x7f891bbc9f10,
direction=<optimized out>, numberTuples=0, sendTuples=1 '\001',
operation=CMD_SELECT, planstate=0x1e93908,
#36 standard_ExecutorRun (queryDesc=0x1dbb800, direction=<optimized out>,
count=0) at execMain.c:342
#37 0x000000000067e9a8 in PortalRunSelect (portal=0x1e639e0,
portal@entry=<error
reading variable: Cannot access memory at address 0x7ffff95874c8>,
forward=<optimized out>, count=0, dest=<optimized out>) at pquery.c:947

4th backtrace:

#0 ScanKeywordLookup (text=text@entry=0x1d57fa0
"information_schema_catalog_name", keywords=0x84f220 <ScanKeywords>,
num_keywords=408) at kwlookup.c:64
#1 0x000000000070aa14 in quote_identifier (ident=0x1d57fa0
"information_schema_catalog_name") at ruleutils.c:9009
#2 0x00000000006f54bd in quote_ident (fcinfo=<optimized out>) at quote.c:31
#3 0x0000000000595f72 in ExecMakeFunctionResultNoSets (fcache=0x1d42cb8,
econtext=0x1d3f848, isNull=0x1d42858 "", isDone=<optimized out>) at
execQual.c:2015
#4 0x0000000000595f1d in ExecMakeFunctionResultNoSets (fcache=0x1d424a8,
econtext=0x1d3f848, isNull=0x1d42048 "", isDone=<optimized out>) at
execQual.c:1989
#5 0x0000000000595f1d in ExecMakeFunctionResultNoSets (fcache=0x1d41c98,
econtext=0x1d3f848, isNull=0x7fff0bdc61df "", isDone=<optimized out>) at
execQual.c:1989
#6 0x000000000059b469 in ExecQual (qual=qual@entry=0x1d41368,
econtext=econtext@entry=0x1d3f848, resultForNull=resultForNull@entry=0
'\000') at execQual.c:5206
#7 0x000000000059b9a6 in ExecScan (node=node@entry=0x1d3f738,
accessMtd=accessMtd@entry=0x5ad780 <PartialSeqNext>,
recheckMtd=recheckMtd@entry=0x5ad770 <PartialSeqRecheck>) at
execScan.c:195
#8 0x00000000005ad8d0 in ExecPartialSeqScan (node=node@entry=0x1d3f738) at
nodePartialSeqscan.c:241
#9 0x0000000000594f68 in ExecProcNode (node=0x1d3f738) at
execProcnode.c:422
#10 0x00000000005a39b6 in funnel_getnext (funnelstate=0x1d3a3c8) at
nodeFunnel.c:308
#11 ExecFunnel (node=node@entry=0x1d3a3c8) at nodeFunnel.c:185
#12 0x0000000000594f58 in ExecProcNode (node=0x1d3a3c8) at
execProcnode.c:426
#13 0x00000000005a0212 in ExecAppend (node=node@entry=0x1d3a1d8) at
nodeAppend.c:209
#14 0x0000000000594fa8 in ExecProcNode (node=node@entry=0x1d3a1d8) at
execProcnode.c:399
#15 0x00000000005a0c9e in agg_fill_hash_table (aggstate=0x1d39ba8) at
nodeAgg.c:1353
#16 ExecAgg (node=node@entry=0x1d39ba8) at nodeAgg.c:1115
#17 0x0000000000594e38 in ExecProcNode (node=node@entry=0x1d39ba8) at
execProcnode.c:506
#18 0x00000000005a8144 in ExecLimit (node=node@entry=0x1d39908) at
nodeLimit.c:91
#19 0x0000000000594d98 in ExecProcNode (node=node@entry=0x1d39908) at
execProcnode.c:530
#20 0x0000000000592380 in ExecutePlan (dest=0x7fe8c8a1cf10,
direction=<optimized out>, numberTuples=0, sendTuples=1 '\001',
operation=CMD_SELECT, planstate=0x1d39908,
estate=0x1d01990) at execMain.c:1533
#21 standard_ExecutorRun (queryDesc=0x1c61800, direction=<optimized out>,
count=0) at execMain.c:342
#22 0x000000000067e9a8 in PortalRunSelect (portal=portal@entry=0x1d099e0,
forward=forward@entry=1 '\001', count=0, count@entry=9223372036854775807,
dest=dest@entry=0x7fe8c8a1cf10) at pquery.c:947
#23 0x000000000067fd0f in PortalRun (portal=portal@entry=0x1d099e0,
count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1
'\001',
dest=dest@entry=0x7fe8c8a1cf10, altdest=altdest@entry=0x7fe8c8a1cf10,
completionTag=completionTag@entry=0x7fff0bdc6790 "") at pquery.c:791
#24 0x000000000067dab8 in exec_simple_query (
query_string=0x1caf750 "SELECT pg_catalog.quote_ident(c.relname) FROM
pg_catalog.pg_class c WHERE c.relkind IN ('r', 'S', 'v', 'm', 'f') AND
substring(pg_catalog.quote_ident(c.relname),1,3)='pgb' AND
pg_catalog.pg_table_is_v"...) at postgres.c:1107
#25 PostgresMain (argc=<optimized out>, argv=argv@entry=0x1c3db60,
dbname=0x1c3da18 "pgbench", username=<optimized out>) at postgres.c:4120
#26 0x0000000000462c8e in BackendRun (port=0x1c621f0) at postmaster.c:4148
#27 BackendStartup (port=0x1c621f0) at postmaster.c:3833
#28 ServerLoop () at postmaster.c:1601
#29 0x000000000062e803 in PostmasterMain (argc=argc@entry=1,
argv=argv@entry=0x1c3cca0)
at postmaster.c:1248
#30 0x00000000004636dd in main (argc=1, argv=0x1c3cca0) at main.c:221

5th backtrace:

#0 0x000000000075d757 in hash_search_with_hash_value (hashp=0x1d62310,
keyPtr=keyPtr@entry=0x7fffb686f4a0, hashvalue=hashvalue@entry=171639189,
action=action@entry=HASH_ENTER, foundPtr=foundPtr@entry=0x7fffb686f44f
<Address 0x7fffb686f44f out of bounds>) at dynahash.c:1026
#1 0x0000000000654d52 in BufTableInsert (tagPtr=tagPtr@entry=0x7fffb686f4a0,
hashcode=hashcode@entry=171639189, buf_id=169) at buf_table.c:128
#2 0x0000000000657711 in BufferAlloc (foundPtr=0x7fffb686f49f <Address
0x7fffb686f49f out of bounds>, strategy=0x0, blockNum=11,
forkNum=MAIN_FORKNUM,
relpersistence=<error reading variable: Cannot access memory at address
0x7fffb686f484>,
smgr=<error reading variable: Cannot access memory at address
0x7fffb686f488>) at bufmgr.c:1089
#3 ReadBuffer_common (smgr=<error reading variable: Cannot access memory
at address 0x7fffb686f488>, relpersistence=<optimized out>,
forkNum=MAIN_FORKNUM,
forkNum@entry=<error reading variable: Cannot access memory at address
0x7fffb686f4f0>, blockNum=11,
blockNum@entry=<error reading variable: Cannot access memory at address
0x7fffb686f4f8>, mode=RBM_NORMAL, strategy=0x0,
hit=hit@entry=0x7fffb686f54f <Address 0x7fffb686f54f out of bounds>) at
bufmgr.c:641
#4 0x0000000000657e40 in ReadBufferExtended (reln=<error reading variable:
Cannot access memory at address 0x7fffb686f4e8>,
reln@entry=<error reading variable: Cannot access memory at address
0x7fffb686f568>,
forkNum=<error reading variable: Cannot access memory at address
0x7fffb686f4f0>,
blockNum=<error reading variable: Cannot access memory at address
0x7fffb686f4f8>, mode=<optimized out>, strategy=<optimized out>) at
bufmgr.c:560
--
Thom

#229Amit Kapila
amit.kapila16@gmail.com
In reply to: Thom Brown (#227)
Re: Parallel Seq Scan

On Wed, Mar 25, 2015 at 5:16 PM, Thom Brown <thom@linux.com> wrote:

On 25 March 2015 at 10:27, Amit Kapila <amit.kapila16@gmail.com> wrote:

Fixed the reported issue on assess-parallel-safety thread and another
bug caught while testing joins and integrated with latest version of
parallel-mode patch (parallel-mode-v9 patch).

Apart from that I have moved the initialization of the DSM segment from
the InitNode phase to ExecFunnel() (on first execution) as per suggestion
from Robert. The main idea is that, since it creates a large shared memory
segment, the work is done only when it is really required.

HEAD Commit-Id: 11226e38
parallel-mode-v9.patch [2]
assess-parallel-safety-v4.patch [1]
parallel-heap-scan.patch [3]
parallel_seqscan_v12.patch (Attached with this mail)

[1] -

/messages/by-id/CA+TgmobJSuefiPOk6+i9WERUgeAB3ggJv7JxLX+r6S5SYydBRQ@mail.gmail.com

[2] -

/messages/by-id/CA+TgmoZfSXZhS6qy4Z0786D7iU_AbhBVPQFwLthpSvGieczqHg@mail.gmail.com

[3] -

/messages/by-id/CA+TgmoYJETgeAXUsZROnA7BdtWzPtqExPJNTV1GKcaVMgSdhug@mail.gmail.com

Okay, with my pgbench_accounts partitioned into 300, I ran:

SELECT DISTINCT bid FROM pgbench_accounts;

The query never returns,

You seem to be hitting the issue I pointed out in a nearby thread [1],
which I also mentioned while replying on the assess-parallel-safety
thread. Can you check after applying the patch in mail [1]?

and I also get this:

grep -r 'starting background worker process "parallel worker for PID 12165"' postgresql-2015-03-25_112522.log | wc -l

2496

2,496 workers? This is with parallel_seqscan_degree set to 8. If I set
it to 2, this number goes down to 626, and with 16, goes up to 4320.

..

Still not sure why 8 workers are needed for each partial scan. I would
expect 8 workers to be used for 8 separate scans. Perhaps this is just my
misunderstanding of how this feature works.

The reason is that each table scan tries to use as many workers as
parallel_seqscan_degree allows, if they are available. In this case the
scans of the tables in the inheritance hierarchy happen one after
another, so each scan uses 8 workers. For now the strategy for deciding
the number of workers used in a scan is kept simple; in future we can
try to come up with a better mechanism.
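
The arithmetic behind the worker count in the log can be sketched roughly
as below. This is only an illustrative model, not code from the patch: the
function names and the `available` argument are assumptions made up for the
example. It shows why sequentially executed scans add up to hundreds of
worker launches.

```c
/*
 * Hypothetical model (not from the patch): each scan asks for up to
 * `degree` workers and gets at most `available` of them.
 */
int
toy_workers_for_scan(int degree, int available)
{
	if (degree <= 0 || available <= 0)
		return 0;
	return degree < available ? degree : available;
}

/*
 * The child-table scans run one after another, so the launches of the
 * individual scans simply add up over the whole hierarchy.
 */
long
toy_total_worker_launches(int nscans, int degree, int available)
{
	long		total = 0;

	for (int i = 0; i < nscans; i++)
		total += toy_workers_for_scan(degree, available);
	return total;
}
```

With 300 child scans and a degree of 8 this model gives 2400 launches, the
same order as the 2,496 counted in the log; it does not try to explain the
exact difference.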

[1]: /messages/by-id/CAA4eK1+NwUJ9ik61yGfZBcN85dQuNEvd38_h1zngCdZrGLGQTQ@mail.gmail.com

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#230Thom Brown
thom@linux.com
In reply to: Amit Kapila (#229)
Re: Parallel Seq Scan

On 25 March 2015 at 15:49, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Mar 25, 2015 at 5:16 PM, Thom Brown <thom@linux.com> wrote:

On 25 March 2015 at 10:27, Amit Kapila <amit.kapila16@gmail.com> wrote:

Fixed the reported issue on assess-parallel-safety thread and another
bug caught while testing joins and integrated with latest version of
parallel-mode patch (parallel-mode-v9 patch).

Apart from that I have moved the initialization of the DSM segment from
the InitNode phase to ExecFunnel() (on first execution) as per suggestion
from Robert. The main idea is that, since it creates a large shared memory
segment, the work is done only when it is really required.

HEAD Commit-Id: 11226e38
parallel-mode-v9.patch [2]
assess-parallel-safety-v4.patch [1]
parallel-heap-scan.patch [3]
parallel_seqscan_v12.patch (Attached with this mail)

[1] -

/messages/by-id/CA+TgmobJSuefiPOk6+i9WERUgeAB3ggJv7JxLX+r6S5SYydBRQ@mail.gmail.com

[2] -

/messages/by-id/CA+TgmoZfSXZhS6qy4Z0786D7iU_AbhBVPQFwLthpSvGieczqHg@mail.gmail.com

[3] -

/messages/by-id/CA+TgmoYJETgeAXUsZROnA7BdtWzPtqExPJNTV1GKcaVMgSdhug@mail.gmail.com

Okay, with my pgbench_accounts partitioned into 300, I ran:

SELECT DISTINCT bid FROM pgbench_accounts;

The query never returns,

You seem to be hitting the issue I have pointed in near-by thread [1]
and I have mentioned the same while replying on assess-parallel-safety
thread. Can you check after applying the patch in mail [1]

Ah, okay, here's the patches I've now applied:

parallel-mode-v9.patch
assess-parallel-safety-v4.patch
parallel-heap-scan.patch
parallel_seqscan_v12.patch
release_lock_dsm_v1.patch

(with perl patch for pg_proc.h)

The query now returns successfully.

and I also get this:

grep -r 'starting background worker process "parallel worker for PID 12165"' postgresql-2015-03-25_112522.log | wc -l

2496

2,496 workers? This is with parallel_seqscan_degree set to 8. If I set
it to 2, this number goes down to 626, and with 16, goes up to 4320.

..

Still not sure why 8 workers are needed for each partial scan. I would
expect 8 workers to be used for 8 separate scans. Perhaps this is just my
misunderstanding of how this feature works.

The reason is that for each table scan, it tries to use workers
equal to parallel_seqscan_degree if they are available and in this
case as the scan for inheritance hierarchy (tables in hierarchy) happens
one after another, it uses 8 workers for each scan. I think as of now
the strategy to decide number of workers to be used in scan is kept
simple and in future we can try to come with some better mechanism
to decide number of workers.

Yes, I was expecting the parallel aspect to apply across partitions (a
worker per partition up to parallel_seqscan_degree and reallocate to
another scan once finished with current job), not individual ones, so for
the workers to be above the funnel, not below it. So this is
parallelising, just not in a way that will be a win in this case. :( For
the query I posted (SELECT DISTINCT bid FROM pgbench_partitions), the
parallelised version takes 8 times longer to complete. However, I'm
perhaps premature in what I expect from the feature at this stage.
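
For illustration only, the partition-wise scheduling described above (one
worker per partition, moving on to the next partition when done) could be
modelled like this. Nothing here exists in the patch; the function name,
the `cost` array and the `TOY_MAX_WORKERS` limit are all made up for the
sketch.

```c
#define TOY_MAX_WORKERS 64

/*
 * Hypothetical model: partitions are handed out in order, each one going
 * to whichever worker becomes idle first.  cost[i] is the time needed to
 * scan partition i.  Returns the time at which the last worker finishes,
 * or -1.0 on invalid input.
 */
double
toy_partitionwise_finish_time(const double *cost, int nparts, int nworkers)
{
	double		busy_until[TOY_MAX_WORKERS] = {0.0};
	double		finish = 0.0;

	if (nparts < 0 || nworkers <= 0 || nworkers > TOY_MAX_WORKERS)
		return -1.0;

	for (int p = 0; p < nparts; p++)
	{
		int			idle = 0;

		/* pick the worker that frees up earliest */
		for (int w = 1; w < nworkers; w++)
			if (busy_until[w] < busy_until[idle])
				idle = w;
		busy_until[idle] += cost[p];
	}

	/* the whole scan is done when the busiest worker is done */
	for (int w = 0; w < nworkers; w++)
		if (busy_until[w] > finish)
			finish = busy_until[w];
	return finish;
}
```

In this model the workers are started once and reused across partitions,
rather than being launched again for every child table.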

--
Thom

#231Amit Kapila
amit.kapila16@gmail.com
In reply to: Thom Brown (#230)
Re: Parallel Seq Scan

On Wed, Mar 25, 2015 at 9:53 PM, Thom Brown <thom@linux.com> wrote:

On 25 March 2015 at 15:49, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Mar 25, 2015 at 5:16 PM, Thom Brown <thom@linux.com> wrote:

Okay, with my pgbench_accounts partitioned into 300, I ran:

SELECT DISTINCT bid FROM pgbench_accounts;

The query never returns,

You seem to be hitting the issue I have pointed in near-by thread [1]
and I have mentioned the same while replying on assess-parallel-safety
thread. Can you check after applying the patch in mail [1]

Ah, okay, here's the patches I've now applied:

parallel-mode-v9.patch
assess-parallel-safety-v4.patch
parallel-heap-scan.patch
parallel_seqscan_v12.patch
release_lock_dsm_v1.patch

(with perl patch for pg_proc.h)

The query now returns successfully.

Thanks for verification.

..

Still not sure why 8 workers are needed for each partial scan. I would
expect 8 workers to be used for 8 separate scans. Perhaps this is just my
misunderstanding of how this feature works.

The reason is that for each table scan, it tries to use workers
equal to parallel_seqscan_degree if they are available and in this
case as the scan for inheritance hierarchy (tables in hierarchy) happens
one after another, it uses 8 workers for each scan. I think as of now
the strategy to decide number of workers to be used in scan is kept
simple and in future we can try to come with some better mechanism
to decide number of workers.

Yes, I was expecting the parallel aspect to apply across partitions (a
worker per partition up to parallel_seqscan_degree and reallocate to
another scan once finished with current job), not individual ones,

What you are describing here is something like a parallel partition
scan, which is a related but different feature. This feature
parallelizes the scan of an individual table.

so for the workers to be above the funnel, not below it. So this is
parallelising, just not in a way that will be a win in this case. :( For
the query I posted (SELECT DISTINCT bid FROM pgbench_partitions), the
parallelised version takes 8 times longer to complete.

I think the primary reason it is not performing as expected is that we
have either not set the right values for the new cost parameters or
changed an existing cost parameter (cost_seq_page), which makes the
planner select the parallel plan even though it is costlier.

This is similar to a user intentionally disabling index scans to test a
sequential scan and then reporting that the query performs slower.

If you want to help in this direction, it would be more useful to find
appropriate values for the cost parameters of a parallel scan. We have
introduced 3 parameters (cpu_tuple_comm_cost, parallel_setup_cost,
parallel_startup_cost) for costing parallel plans. If your tests can
determine values for these parameters such that the planner chooses a
parallel plan only when it is better than the non-parallel plan, that
will be really valuable input.
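
To show how the three parameters could tip the decision, here is a toy
cost comparison. The parameter names come from the paragraph above, but
the formula itself is an assumption for illustration (sequential cost
divided among workers, plus setup, per-worker startup and per-tuple
communication charges); it is not the costing code in the patch.

```c
/* The three parameters named above; any values used are illustrative. */
typedef struct ToyParallelCostParams
{
	double		cpu_tuple_comm_cost;	/* per tuple sent to the master */
	double		parallel_setup_cost;	/* one-time shared memory setup */
	double		parallel_startup_cost;	/* assumed to be charged per worker */
} ToyParallelCostParams;

/* Assumed formula, NOT the patch's: divide the scan, add the overheads. */
double
toy_parallel_scan_cost(double seq_scan_cost, double tuples_returned,
					   int nworkers, const ToyParallelCostParams *p)
{
	return p->parallel_setup_cost
		+ p->parallel_startup_cost * nworkers
		+ seq_scan_cost / nworkers
		+ p->cpu_tuple_comm_cost * tuples_returned;
}

/* Returns 1 when the toy model rates the parallel plan as cheaper. */
int
toy_prefers_parallel(double seq_scan_cost, double tuples_returned,
					 int nworkers, const ToyParallelCostParams *p)
{
	if (nworkers <= 0)
		return 0;
	return toy_parallel_scan_cost(seq_scan_cost, tuples_returned,
								  nworkers, p) < seq_scan_cost;
}
```

Under this model a small table never amortizes the fixed overheads, which
is the kind of case well-chosen parameter values should keep on the
non-parallel plan.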

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#232Amit Kapila
amit.kapila16@gmail.com
In reply to: Thom Brown (#228)
1 attachment(s)
Re: Parallel Seq Scan

On Wed, Mar 25, 2015 at 7:09 PM, Thom Brown <thom@linux.com> wrote:

On 25 March 2015 at 11:46, Thom Brown <thom@linux.com> wrote:

Still not sure why 8 workers are needed for each partial scan. I would
expect 8 workers to be used for 8 separate scans. Perhaps this is just my
misunderstanding of how this feature works.

Another issue:

SELECT * FROM pgb<tab>

*crash*

The reason for this problem is that the above tab-completion executes
query [1], which contains a subplan for the Funnel node, and currently
we don't have the capability (enough infrastructure) to support
execution of subplans by parallel workers. One might wonder why we have
chosen a parallel plan (Funnel node) for such a case. The reason is that
subplans are attached after plan generation (SS_finalize_plan()), and
discarding such a plan at that point would be much more costly, tedious
and not worth the effort, as we eventually have to make such a plan
work.

We have two choices for how to proceed: the first is to support
execution of subplans by parallel workers, and the second is to
execute/scan locally for a Funnel node that has a subplan (don't launch
workers).

I have tried to evaluate what it would take to support execution of
subplans by parallel workers. We would need to pass the subplans stored
in the Funnel node (initPlan) and the corresponding subplans stored in
the planned statement (subplans), as the subplans stored in the Funnel
node refer to subplans in the planned statement. Next, readfuncs.c
(functions to read different types of nodes) currently doesn't support
reading any type of plan node, so we would need to add support for
reading all kinds of plan nodes (as a subplan can have any type of plan
node). Similarly, executing any type of plan node in a worker might
need more work (infrastructure).

Currently I have updated the patch to use second approach which
is to execute/scan locally for Funnel node having subplan.
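
In outline, the second approach amounts to a check like the one below. This
is a simplified stand-in, not the code in the attached patch: ToyFunnel and
toy_workers_to_launch() are hypothetical names, with fields loosely modelled
on the Funnel node's num_workers and initPlan.

```c
#include <stddef.h>

/* Hypothetical, cut-down stand-in for the Funnel plan node. */
typedef struct ToyFunnel
{
	int			num_workers;	/* workers the planner asked for */
	const void *init_plan;		/* non-NULL when subplans are attached */
} ToyFunnel;

/*
 * Decide how many workers to launch.  A Funnel node carrying a subplan
 * is scanned locally by the master backend, so no workers are started.
 */
int
toy_workers_to_launch(const ToyFunnel *funnel)
{
	if (funnel->init_plan != NULL)
		return 0;
	return funnel->num_workers;
}
```

The plan shape stays the same in both cases; only the number of launched
workers changes.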

I understand that it is quite interesting if we can have support for
execution of subplans (un-correlated expression subselects) by
parallel workers, but I feel it is better done as a separate patch.

[1]:
SELECT pg_catalog.quote_ident(c.relname) FROM pg_catalog.pg_class c WHERE
c.relkind IN ('r', 'S', 'v', 'm',
'f') AND substring(pg_catalog.quote_ident(c.relname),1,3)='pgb' AND
pg_catalog.pg_table_is_visible(c.oid) AND
c.relnamespace <> (SELECT oid FROM pg_catalog.pg_namespace WHERE nspname =
'pg_catalog') UNION SELECT
pg_catalog.quote_ident(n.nspname) || '.' FROM pg_catalog.pg_namespace n
WHERE substring
(pg_catalog.quote_ident(n.nspname) || '.',1,3)='pgb' AND (SELECT
pg_catalog.count(*) FROM
pg_catalog.pg_namespace WHERE substring(pg_catalog.quote_ident(nspname) ||
'.',1,3) = substring
('pgb',1,pg_catalog.length(pg_catalog.quote_ident(nspname))+1)) > 1 UNION
SELECT pg_catalog.quote_ident
(n.nspname) || '.' || pg_catalog.quote_ident(c.relname) FROM
pg_catalog.pg_class c, pg_catalog.pg_namespace n
WHERE c.relnamespace = n.oid AND c.relkind IN ('r', 'S', 'v', 'm', 'f') AND
substring(pg_catalog.quote_ident
(n.nspname) || '.' || pg_catalog.quote_ident(c.relname),1,3)='pgb' AND
substring(pg_catalog.quote_ident
(n.nspname) || '.',1,3) =
substring('pgb',1,pg_catalog.length(pg_catalog.quote_ident(n.nspname))+1)
AND
(SELECT pg_catalog.count(*) FROM pg_catalog.pg_namespace WHERE
substring(pg_catalog.quote_ident(nspname) ||
'.',1,3) =
substring('pgb',1,pg_catalog.length(pg_catalog.quote_ident(nspname))+1)) =
1 LIMIT 1000;

Query Plan
--------------------------
                                   QUERY PLAN
--------------------------------------------------------------------------------
 Limit  (cost=10715.89..10715.92 rows=3 width=85)
   ->  HashAggregate  (cost=10715.89..10715.92 rows=3 width=85)
         Group Key: (quote_ident((c.relname)::text))
         ->  Append  (cost=8.15..10715.88 rows=3 width=85)
               ->  Funnel on pg_class c  (cost=8.15..9610.67 rows=1 width=64)
                     Filter: ((relnamespace <> $4) AND (relkind = ANY ('{r,S,v,m,f}'::"char"[])) AND ("substring"(quote_ident((relname)::text), 1, 3) = 'pgb'::text) AND pg_table_is_visible(oid))
                     Number of Workers: 1
                     InitPlan 3 (returns $4)
                       ->  Index Scan using pg_namespace_nspname_index on pg_namespace pg_namespace_2  (cost=0.13..8.15 rows=1 width=4)
                             Index Cond: (nspname = 'pg_catalog'::name)
                     ->  Partial Seq Scan on pg_class c  (cost=0.00..19043.43 rows=1 width=64)
                           Filter: ((relnamespace <> $4) AND (relkind = ANY ('{r,S,v,m,f}'::"char"[])) AND ("substring"(quote_ident((relname)::text), 1, 3) = 'pgb'::text) AND pg_table_is_visible(oid))
               ->  Result  (cost=8.52..16.69 rows=1 width=64)
                     One-Time Filter: ($3 > 1)
                     InitPlan 2 (returns $3)
                       ->  Aggregate  (cost=8.37..8.38 rows=1 width=0)
                             ->  Index Only Scan using pg_namespace_nspname_index on pg_namespace pg_namespace_1  (cost=0.13..8.37 rows=1 width=0)
                                   Filter: ("substring"((quote_ident((nspname)::text) || '.'::text), 1, 3) = "substring"('pgb'::text, 1, (length(quote_ident((nspname)::text)) + 1)))
                     ->  Index Only Scan using pg_namespace_nspname_index on pg_namespace n  (cost=0.13..8.30 rows=1 width=64)
                           Filter: ("substring"((quote_ident((nspname)::text) || '.'::text), 1, 3) = 'pgb'::text)
               ->  Result  (cost=8.79..1088.49 rows=1 width=128)
                     One-Time Filter: ($0 = 1)
                     InitPlan 1 (returns $0)
                       ->  Aggregate  (cost=8.37..8.38 rows=1 width=0)
                             ->  Index Only Scan using pg_namespace_nspname_index on pg_namespace  (cost=0.13..8.37 rows=1 width=0)
                                   Filter: ("substring"((quote_ident((nspname)::text) || '.'::text), 1, 3) = "substring"('pgb'::text, 1, (length(quote_ident((nspname)::text)) + 1)))
                     ->  Nested Loop  (cost=0.41..1080.09 rows=1 width=128)
                           ->  Index Scan using pg_namespace_oid_index on pg_namespace n_1  (cost=0.13..12.37 rows=1 width=68)
                                 Filter: ("substring"((quote_ident((nspname)::text) || '.'::text), 1, 3) = "substring"('pgb'::text, 1, (length(quote_ident((nspname)::text)) + 1)))
                           ->  Index Scan using pg_class_relname_nsp_index on pg_class c_1  (cost=0.28..1067.71 rows=1 width=68)
                                 Index Cond: (relnamespace = n_1.oid)
                                 Filter: ((relkind = ANY ('{r,S,v,m,f}'::"char"[])) AND ("substring"(((quote_ident((n_1.nspname)::text) || '.'::text) || quote_ident((relname)::text)), 1, 3) = 'pgb'::text))
(32 rows)

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

parallel_seqscan_v13.patch (application/octet-stream)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 6370c1f..22b3cc7 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1595,6 +1595,20 @@ heap_beginscan_parallel(Relation relation, ParallelHeapScanDesc parallel_scan)
 }
 
 /* ----------------
+ *		heap_parallel_rescan		- restart a parallel relation scan
+ * ----------------
+ */
+void
+heap_parallel_rescan(ParallelHeapScanDesc pscan,
+					 HeapScanDesc scan)
+{
+	if (pscan != NULL)
+		scan->rs_parallel = pscan;
+
+	heap_rescan(scan,			/* scan desc */
+				NULL);			/* new scan keys */
+}
+/* ----------------
  *		heap_getnext	- retrieve next tuple in scan
  *
  *		Fix to work with index relations.
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 771f6a8..cdf172c 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -721,6 +721,8 @@ ExplainPreScanNode(PlanState *planstate, Bitmapset **rels_used)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
+		case T_Funnel:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -916,6 +918,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_SeqScan:
 			pname = sname = "Seq Scan";
 			break;
+		case T_PartialSeqScan:
+			pname = sname = "Partial Seq Scan";
+			break;
+		case T_Funnel:
+			pname = sname = "Funnel";
+			break;
 		case T_IndexScan:
 			pname = sname = "Index Scan";
 			break;
@@ -1065,6 +1073,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
+		case T_Funnel:
 		case T_BitmapHeapScan:
 		case T_TidScan:
 		case T_SubqueryScan:
@@ -1206,6 +1216,24 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	}
 
 	/*
+	 * Aggregate instrumentation information of all the backend
+	 * workers for parallel sequence scan.
+	 */
+	if (es->analyze && nodeTag(plan) == T_Funnel)
+	{
+		int i;
+		Instrumentation *instrument_worker;
+		int nworkers = ((FunnelState *)planstate)->pcxt->nworkers;
+		char *inst_info_workers = ((FunnelState *)planstate)->inst_options_space;
+
+		for (i = 0; i < nworkers; i++)
+		{
+			instrument_worker = (Instrumentation *)(inst_info_workers + (i * sizeof(Instrumentation)));
+			InstrAggNode(planstate->instrument, instrument_worker);
+		}
+	}
+
+	/*
 	 * We have to forcibly clean up the instrumentation state because we
 	 * haven't done ExecutorEnd yet.  This is pretty grotty ...
 	 *
@@ -1322,6 +1350,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 				show_tidbitmap_info((BitmapHeapScanState *) planstate, es);
 			break;
 		case T_SeqScan:
+		case T_PartialSeqScan:
 		case T_ValuesScan:
 		case T_CteScan:
 		case T_WorkTableScan:
@@ -1331,6 +1360,14 @@ ExplainNode(PlanState *planstate, List *ancestors,
 				show_instrumentation_count("Rows Removed by Filter", 1,
 										   planstate, es);
 			break;
+		case T_Funnel:
+			show_scan_qual(plan->qual, "Filter", planstate, ancestors, es);
+			if (plan->qual)
+				show_instrumentation_count("Rows Removed by Filter", 1,
+										   planstate, es);
+			ExplainPropertyInteger("Number of Workers",
+				((Funnel *) plan)->num_workers, es);
+			break;
 		case T_FunctionScan:
 			if (es->verbose)
 			{
@@ -2218,6 +2255,8 @@ ExplainTargetRel(Plan *plan, Index rti, ExplainState *es)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
+		case T_Funnel:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index af707b0..991ff51 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -16,14 +16,15 @@ OBJS = execAmi.o execCurrent.o execGrouping.o execJunk.o execMain.o \
        execProcnode.o execQual.o execScan.o execTuples.o \
        execUtils.o functions.o instrument.o nodeAppend.o nodeAgg.o \
        nodeBitmapAnd.o nodeBitmapOr.o \
-       nodeBitmapHeapscan.o nodeBitmapIndexscan.o nodeCustom.o nodeHash.o \
-       nodeHashjoin.o nodeIndexscan.o nodeIndexonlyscan.o \
+       nodeBitmapHeapscan.o nodeBitmapIndexscan.o nodeCustom.o nodeFunnel.o \
+       nodeHash.o nodeHashjoin.o nodeIndexscan.o nodeIndexonlyscan.o \
        nodeLimit.o nodeLockRows.o \
        nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
        nodeNestloop.o nodeFunctionscan.o nodeRecursiveunion.o nodeResult.o \
-       nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
-       nodeValuesscan.o nodeCtescan.o nodeWorktablescan.o \
+       nodeSeqscan.o nodePartialSeqscan.o nodeSetOp.o nodeSort.o \
+       nodeUnique.o nodeValuesscan.o nodeCtescan.o nodeWorktablescan.o \
        nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
-       nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o spi.o
+       nodeForeignscan.o nodeWindowAgg.o tqueue.o tstoreReceiver.o \
+       spi.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 6ebad2f..10dc319 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -24,6 +24,7 @@
 #include "executor/nodeCustom.h"
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeFunctionscan.h"
+#include "executor/nodeFunnel.h"
 #include "executor/nodeGroup.h"
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
@@ -37,6 +38,7 @@
 #include "executor/nodeMergejoin.h"
 #include "executor/nodeModifyTable.h"
 #include "executor/nodeNestloop.h"
+#include "executor/nodePartialSeqscan.h"
 #include "executor/nodeRecursiveunion.h"
 #include "executor/nodeResult.h"
 #include "executor/nodeSeqscan.h"
@@ -155,6 +157,14 @@ ExecReScan(PlanState *node)
 			ExecReScanSeqScan((SeqScanState *) node);
 			break;
 
+		case T_PartialSeqScanState:
+			ExecReScanPartialSeqScan((PartialSeqScanState *) node);
+			break;
+
+		case T_FunnelState:
+			ExecReScanFunnel((FunnelState *) node);
+			break;
+
 		case T_IndexScanState:
 			ExecReScanIndexScan((IndexScanState *) node);
 			break;
@@ -458,6 +468,10 @@ ExecSupportsBackwardScan(Plan *node)
 		case T_CteScan:
 			return TargetListSupportsBackwardScan(node->targetlist);
 
+		case T_Funnel:
+		case T_PartialSeqScan:
+			return false;
+
 		case T_IndexScan:
 			return IndexSupportsBackwardScan(((IndexScan *) node)->indexid) &&
 				TargetListSupportsBackwardScan(node->targetlist);
diff --git a/src/backend/executor/execCurrent.c b/src/backend/executor/execCurrent.c
index d87be96..657b928 100644
--- a/src/backend/executor/execCurrent.c
+++ b/src/backend/executor/execCurrent.c
@@ -261,6 +261,8 @@ search_plan_tree(PlanState *node, Oid table_oid)
 			 * Relation scan nodes can all be treated alike
 			 */
 		case T_SeqScanState:
+		case T_PartialSeqScanState:
+		case T_FunnelState:
 		case T_IndexScanState:
 		case T_IndexOnlyScanState:
 		case T_BitmapHeapScanState:
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 143c56d..d4c9119 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -181,6 +181,8 @@ standard_ExecutorStart(QueryDesc *queryDesc, int eflags)
 		estate->es_param_exec_vals = (ParamExecData *)
 			palloc0(queryDesc->plannedstmt->nParamExec * sizeof(ParamExecData));
 
+	estate->toc = queryDesc->toc;
+
 	/*
 	 * If non-read-only query, set the command ID to mark output tuples with
 	 */
@@ -318,6 +320,9 @@ standard_ExecutorRun(QueryDesc *queryDesc,
 	operation = queryDesc->operation;
 	dest = queryDesc->dest;
 
+	/* inform executor to collect buffer usage stats from parallel workers. */
+	estate->total_time = queryDesc->totaltime ? 1 : 0;
+
 	/*
 	 * startup tuple receiver, if we will be emitting tuples
 	 */
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 9892499..1a1275c 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -100,6 +100,8 @@
 #include "executor/nodeMergejoin.h"
 #include "executor/nodeModifyTable.h"
 #include "executor/nodeNestloop.h"
+#include "executor/nodePartialSeqscan.h"
+#include "executor/nodeFunnel.h"
 #include "executor/nodeRecursiveunion.h"
 #include "executor/nodeResult.h"
 #include "executor/nodeSeqscan.h"
@@ -190,6 +192,16 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												   estate, eflags);
 			break;
 
+		case T_PartialSeqScan:
+			result = (PlanState *) ExecInitPartialSeqScan((PartialSeqScan *) node,
+														  estate, eflags);
+			break;
+
+		case T_Funnel:
+			result = (PlanState *) ExecInitFunnel((Funnel *) node,
+												  estate, eflags);
+			break;
+
 		case T_IndexScan:
 			result = (PlanState *) ExecInitIndexScan((IndexScan *) node,
 													 estate, eflags);
@@ -406,6 +418,14 @@ ExecProcNode(PlanState *node)
 			result = ExecSeqScan((SeqScanState *) node);
 			break;
 
+		case T_PartialSeqScanState:
+			result = ExecPartialSeqScan((PartialSeqScanState *) node);
+			break;
+
+		case T_FunnelState:
+			result = ExecFunnel((FunnelState *) node);
+			break;
+
 		case T_IndexScanState:
 			result = ExecIndexScan((IndexScanState *) node);
 			break;
@@ -644,6 +664,14 @@ ExecEndNode(PlanState *node)
 			ExecEndSeqScan((SeqScanState *) node);
 			break;
 
+		case T_PartialSeqScanState:
+			ExecEndPartialSeqScan((PartialSeqScanState *) node);
+			break;
+
+		case T_FunnelState:
+			ExecEndFunnel((FunnelState *) node);
+			break;
+
 		case T_IndexScanState:
 			ExecEndIndexScan((IndexScanState *) node);
 			break;
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index 022041b..79eeaee 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -145,6 +145,8 @@ CreateExecutorState(void)
 
 	estate->es_auxmodifytables = NIL;
 
+	estate->toc = NULL;
+
 	estate->es_per_tuple_exprcontext = NULL;
 
 	estate->es_epqTuple = NULL;
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index f5351eb..283a136 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -19,9 +19,6 @@
 
 BufferUsage pgBufferUsage;
 
-static void BufferUsageAccumDiff(BufferUsage *dst,
-					 const BufferUsage *add, const BufferUsage *sub);
-
 
 /* Allocate new instrumentation structure(s) */
 Instrumentation *
@@ -127,8 +124,30 @@ InstrEndLoop(Instrumentation *instr)
 	instr->tuplecount = 0;
 }
 
+/*
+ * Aggregate the instrumentation information of a worker backend
+ * (instr2) into that of the master backend (instr1).  Only the
+ * tuple counts, filter counts and buffer usage are summed; for the
+ * timing-related statistics the master backend's own information
+ * is sufficient.
+ */
+void
+InstrAggNode(Instrumentation *instr1, Instrumentation *instr2)
+{
+	/* count the returned tuples */
+	instr1->tuplecount += instr2->tuplecount;
+
+	instr1->nfiltered1 += instr2->nfiltered1;
+	instr1->nfiltered2 += instr2->nfiltered2;
+
+	/* Add the worker's buffer usage to the node's totals */
+	if (instr1->need_bufusage)
+		BufferUsageAdd(&instr1->bufusage, &instr2->bufusage);
+
+}
+
 /* dst += add - sub */
-static void
+void
 BufferUsageAccumDiff(BufferUsage *dst,
 					 const BufferUsage *add,
 					 const BufferUsage *sub)
@@ -148,3 +167,21 @@ BufferUsageAccumDiff(BufferUsage *dst,
 	INSTR_TIME_ACCUM_DIFF(dst->blk_write_time,
 						  add->blk_write_time, sub->blk_write_time);
 }
+
+/* dst += add */
+void
+BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
+{
+	dst->shared_blks_hit += add->shared_blks_hit;
+	dst->shared_blks_read += add->shared_blks_read;
+	dst->shared_blks_dirtied += add->shared_blks_dirtied;
+	dst->shared_blks_written += add->shared_blks_written;
+	dst->local_blks_hit += add->local_blks_hit;
+	dst->local_blks_read += add->local_blks_read;
+	dst->local_blks_dirtied += add->local_blks_dirtied;
+	dst->local_blks_written += add->local_blks_written;
+	dst->temp_blks_read += add->temp_blks_read;
+	dst->temp_blks_written += add->temp_blks_written;
+	INSTR_TIME_ADD(dst->blk_read_time, add->blk_read_time);
+	INSTR_TIME_ADD(dst->blk_write_time, add->blk_write_time);
+}
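
For anyone following along, the per-worker aggregation done by InstrAggNode and BufferUsageAdd can be tried outside the backend. What follows is only a sketch under assumptions: DemoBufferUsage is a cut-down stand-in for BufferUsage, and every demo_ name is hypothetical and not part of the patch. Each worker owns one fixed-size slot in a flat byte region (as with buffer_usage_space in the patch) and the leader sums the slots.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-in for BufferUsage; not the real struct. */
typedef struct DemoBufferUsage
{
	long		shared_blks_hit;
	long		shared_blks_read;
} DemoBufferUsage;

/* dst += add, following the shape of BufferUsageAdd in the patch */
static void
demo_usage_add(DemoBufferUsage *dst, const DemoBufferUsage *add)
{
	dst->shared_blks_hit += add->shared_blks_hit;
	dst->shared_blks_read += add->shared_blks_read;
}

/*
 * Sum the per-worker slots stored back to back in 'space'.  Slot i starts
 * at byte offset i * sizeof(DemoBufferUsage).  The caller must make sure
 * 'space' is suitably aligned for DemoBufferUsage.
 */
static void
demo_usage_aggregate(DemoBufferUsage *total, const char *space, int nworkers)
{
	int			i;

	assert(nworkers >= 0);

	for (i = 0; i < nworkers; i++)
	{
		const DemoBufferUsage *worker = (const DemoBufferUsage *)
			(space + (size_t) i * sizeof(DemoBufferUsage));

		demo_usage_add(total, worker);
	}
}
```

The totals start from whatever the leader has already counted, so the leader's own usage is kept and the workers' usage is added on top.
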
diff --git a/src/backend/executor/nodeFunnel.c b/src/backend/executor/nodeFunnel.c
new file mode 100644
index 0000000..a9e8524c
--- /dev/null
+++ b/src/backend/executor/nodeFunnel.c
@@ -0,0 +1,366 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeFunnel.c
+ *	  Support routines for parallel sequential scans of relations.
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeFunnel.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * INTERFACE ROUTINES
+ *		ExecFunnel				scans a relation.
+ *		ExecInitFunnel			creates and initializes a funnel node.
+ *		ExecEndFunnel			releases any storage allocated.
+ *		ExecReScanFunnel		rescans a relation
+ */
+#include "postgres.h"
+
+#include "access/relscan.h"
+#include "executor/execdebug.h"
+#include "executor/nodeFunnel.h"
+#include "postmaster/backendworker.h"
+#include "utils/rel.h"
+
+
+static TupleTableSlot *funnel_getnext(FunnelState *funnelstate);
+
+/* ----------------------------------------------------------------
+ *						Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		InitFunnel
+ *
+ *		Set up parallel state information
+ * ----------------------------------------------------------------
+ */
+static void
+InitFunnel(FunnelState *node, EState *estate, int eflags)
+{
+	Relation	currentRelation;
+
+	/*
+	 * get the relation object id from the relid'th entry in the range table,
+	 * open that relation and acquire appropriate lock on it.
+	 */
+	currentRelation = ExecOpenScanRelation(estate,
+										   ((SeqScan *) node->ss.ps.plan)->scanrelid,
+										   eflags);
+
+	node->ss.ss_currentRelation = currentRelation;
+
+	/* and report the scan tuple slot's rowtype */
+	ExecAssignScanType(&node->ss, RelationGetDescr(currentRelation));
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitFunnel
+ * ----------------------------------------------------------------
+ */
+FunnelState *
+ExecInitFunnel(Funnel *node, EState *estate, int eflags)
+{
+	FunnelState *funnelstate;
+
+	/* Funnel node doesn't have innerPlan node. */
+	Assert(innerPlan(node) == NULL);
+
+	/*
+	 * create state structure
+	 */
+	funnelstate = makeNode(FunnelState);
+	funnelstate->ss.ps.plan = (Plan *) node;
+	funnelstate->ss.ps.state = estate;
+	funnelstate->fs_workersReady = false;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * create expression context for node
+	 */
+	ExecAssignExprContext(estate, &funnelstate->ss.ps);
+
+	/*
+	 * initialize child expressions
+	 */
+	funnelstate->ss.ps.targetlist = (List *)
+		ExecInitExpr((Expr *) node->scan.plan.targetlist,
+					 (PlanState *) funnelstate);
+	funnelstate->ss.ps.qual = (List *)
+		ExecInitExpr((Expr *) node->scan.plan.qual,
+					 (PlanState *) funnelstate);
+
+	/*
+	 * tuple table initialization
+	 */
+	ExecInitResultTupleSlot(estate, &funnelstate->ss.ps);
+	ExecInitScanTupleSlot(estate, &funnelstate->ss);
+
+	InitFunnel(funnelstate, estate, eflags);
+
+	/*
+	 * now initialize outer plan
+	 */
+	outerPlanState(funnelstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+
+	funnelstate->ss.ps.ps_TupFromTlist = false;
+
+	/*
+	 * Initialize result tuple type and projection info.
+	 */
+	ExecAssignResultTypeFromTL(&funnelstate->ss.ps);
+	ExecAssignScanProjectionInfo(&funnelstate->ss);
+
+	return funnelstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecFunnel(node)
+ *
+ *		Scans the relation via multiple workers and returns
+ *		the next qualifying tuple.
+ * ----------------------------------------------------------------
+ */
+TupleTableSlot *
+ExecFunnel(FunnelState *node)
+{
+	int			i;
+	TupleTableSlot *slot;
+
+	/*
+	 * Initialize the parallel context and workers on first execution.
+	 * We do this on first execution rather than during node initialization,
+	 * as it needs to allocate a large dynamic shared memory segment, so it
+	 * is better to do it only if it is really needed.
+	 */
+	if (!node->pcxt)
+	{
+		EState	   *estate = node->ss.ps.state;
+		bool any_worker_launched = false;
+
+		/* Initialize the workers required to perform parallel scan. */
+		InitializeParallelWorkers(node->ss.ps.plan->lefttree,
+								  estate,
+								  node->ss.ss_currentRelation,
+								  &node->inst_options_space,
+								  &node->buffer_usage_space,
+								  &node->responseq,
+								  &node->pcxt,
+								  ((Funnel *)(node->ss.ps.plan))->num_workers);
+
+		outerPlanState(node)->toc = node->pcxt->toc;
+
+		/*
+		 * For the Funnel node to support execution of subplans by parallel
+		 * workers, it needs to push down the list of subplans stored in
+		 * the node and the corresponding list of subplans stored in the
+		 * planned statement, as the node's subplans store references to
+		 * subplans in the planned statement.  Currently we don't have enough
+		 * infrastructure to support executing all kinds of nodes in parallel
+		 * workers, so it's better to execute such a plan in the local node.
+		 */
+		if (!node->ss.ps.plan->initPlan)
+		{
+			/*
+			 * Register backend workers.  If the required number of workers
+			 * is not available, we perform the scan with the workers that
+			 * are available; if no workers are available at all, the funnel
+			 * node will just scan locally.
+			 */
+			LaunchParallelWorkers(node->pcxt);
+
+			node->funnel = CreateTupleQueueFunnel();
+
+			for (i = 0; i < node->pcxt->nworkers; ++i)
+			{
+				if (node->pcxt->worker[i].bgwhandle)
+				{
+					shm_mq_set_handle((node->responseq)[i], node->pcxt->worker[i].bgwhandle);
+					RegisterTupleQueueOnFunnel(node->funnel, (node->responseq)[i]);
+					any_worker_launched = true;
+				}
+			}
+		}
+
+		if (any_worker_launched)
+			node->fs_workersReady = true;
+	}
+
+	slot = funnel_getnext(node);
+
+	/*
+	 * if required by plugin, aggregate the buffer usage stats
+	 * from all workers.
+	 */
+	if (TupIsNull(slot))
+	{
+		int i;
+		int nworkers;
+		BufferUsage *buffer_usage_worker;
+		char *buffer_usage;
+
+		if (node->ss.ps.state->total_time)
+		{
+			nworkers = node->pcxt->nworkers;
+			buffer_usage = node->buffer_usage_space;
+
+			for (i = 0; i < nworkers; i++)
+			{
+				buffer_usage_worker = (BufferUsage *)(buffer_usage + (i * sizeof(BufferUsage)));
+				BufferUsageAdd(&pgBufferUsage, buffer_usage_worker);
+			}
+		}
+	}
+	return slot;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndFunnel
+ *
+ *		frees any storage allocated through C routines.
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndFunnel(FunnelState *node)
+{
+	Relation	relation;
+
+	relation = node->ss.ss_currentRelation;
+
+	/*
+	 * Free the exprcontext
+	 */
+	ExecFreeExprContext(&node->ss.ps);
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+
+	/*
+	 * close the heap relation.
+	 */
+	ExecCloseScanRelation(relation);
+
+	ExecEndNode(outerPlanState(node));
+
+	if (node->pcxt)
+	{
+		/*
+		 * Wait for all workers to finish before destroying the parallel
+		 * context, to ensure a clean exit.
+		 */
+		if (node->fs_workersReady)
+			WaitForParallelWorkersToFinish(node->pcxt);
+
+		/* destroy the tuple queue */
+		DestroyTupleQueueFunnel(node->funnel);
+
+		/* destroy parallel context. */
+		DestroyParallelContext(node->pcxt);
+	}
+}
+
+/*
+ * funnel_getnext
+ *
+ *	Get the next tuple from the shared memory queues.  This function
+ *	is responsible for fetching tuples from all the queues associated
+ *	with worker backends used in the funnel scan; if there is no
+ *	data available from the queues or no worker is available, it
+ *	fetches the data from the local node.
+ */
+static TupleTableSlot *
+funnel_getnext(FunnelState *funnelstate)
+{
+	PlanState		*outerPlan;
+	TupleTableSlot	*outerTupleSlot;
+	TupleTableSlot	*slot;
+	HeapTuple		tup;
+
+	if (funnelstate->ss.ps.ps_ProjInfo)
+		slot = funnelstate->ss.ps.ps_ProjInfo->pi_slot;
+	else
+		slot = funnelstate->ss.ss_ScanTupleSlot;
+
+	while ((!funnelstate->all_workers_done && funnelstate->fs_workersReady) ||
+			!funnelstate->local_scan_done)
+	{
+		if (!funnelstate->all_workers_done && funnelstate->fs_workersReady)
+		{
+			/* wait only if local scan is done */
+			tup = TupleQueueFunnelNext(funnelstate->funnel,
+									   !funnelstate->local_scan_done,
+									   &funnelstate->all_workers_done);
+
+			if (HeapTupleIsValid(tup))
+			{
+				ExecStoreTuple(tup,		/* tuple to store */
+							   slot,	/* slot to store in */
+							   InvalidBuffer, /* buffer associated with this
+											   * tuple */
+							   true);	/* pfree this pointer if not from heap */
+
+				return slot;
+			}
+		}
+		if (!funnelstate->local_scan_done)
+		{
+			outerPlan = outerPlanState(funnelstate);
+
+			outerTupleSlot = ExecProcNode(outerPlan);
+
+			if (!TupIsNull(outerTupleSlot))
+				return outerTupleSlot;
+
+			funnelstate->local_scan_done = true;
+		}
+	}
+
+	return ExecClearTuple(slot);
+}
+
+/* ----------------------------------------------------------------
+ *						Join Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecReScanFunnel
+ *
+ *		Rescans a relation.
+ * ----------------------------------------------------------------
+ */
+void
+ExecReScanFunnel(FunnelState *node)
+{
+	/*
+	 * Re-initialize the parallel context and workers to perform
+	 * rescan of relation.
+	 */
+	if (node->pcxt)
+	{
+		/* destroy the tuple queue */
+		DestroyTupleQueueFunnel(node->funnel);
+
+		/* destroy parallel context. */
+		DestroyParallelContext(node->pcxt);
+		node->pcxt = NULL;
+
+		node->fs_workersReady = false;
+		node->all_workers_done = false;
+		node->local_scan_done = false;
+	}
+
+	ExecReScan(node->ss.ps.lefttree);
+}
+
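
The control flow of funnel_getnext above (try the worker queues first, fall back to the local scan, stop once both are exhausted) can be sketched standalone. This is an illustration under stated assumptions, not the patch's code: integers stand in for tuples, a plain array stands in for the shm_mq queues, all Demo/demo_ names are hypothetical, and an empty worker source is treated as "all workers done", whereas the real queues can also report that a read would block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical tuple source: an array of ints consumed front to back. */
typedef struct DemoSource
{
	const int  *items;
	int			nitems;
	int			pos;
} DemoSource;

typedef struct DemoFunnel
{
	DemoSource	workers;			/* stands in for the tuple queues */
	DemoSource	local;				/* stands in for the outer plan */
	bool		workers_ready;		/* like fs_workersReady */
	bool		all_workers_done;
	bool		local_scan_done;
} DemoFunnel;

/* Fetch the next item from a source; false once it is exhausted. */
static bool
demo_source_next(DemoSource *src, int *out)
{
	assert(src->pos <= src->nitems);

	if (src->pos >= src->nitems)
		return false;
	*out = src->items[src->pos++];
	return true;
}

/* Same loop shape as funnel_getnext: workers first, then the local scan. */
static bool
demo_funnel_next(DemoFunnel *f, int *out)
{
	while ((!f->all_workers_done && f->workers_ready) ||
		   !f->local_scan_done)
	{
		if (!f->all_workers_done && f->workers_ready)
		{
			if (demo_source_next(&f->workers, out))
				return true;
			/* simplification: an empty source means the workers are done */
			f->all_workers_done = true;
		}
		if (!f->local_scan_done)
		{
			if (demo_source_next(&f->local, out))
				return true;
			f->local_scan_done = true;
		}
	}
	return false;
}
```

With workers_ready set to false the loop never touches the worker source and degenerates to a purely local scan, which is the fallback the patch relies on when no worker could be launched.
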
diff --git a/src/backend/executor/nodePartialSeqscan.c b/src/backend/executor/nodePartialSeqscan.c
new file mode 100644
index 0000000..99cd691
--- /dev/null
+++ b/src/backend/executor/nodePartialSeqscan.c
@@ -0,0 +1,319 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodePartialSeqscan.c
+ *	  Support routines for parallel sequential scans of relations.
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodePartialSeqscan.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * INTERFACE ROUTINES
+ *		ExecPartialSeqScan				scans a relation.
+ *		PartialSeqNext					retrieve next tuple from heap.
+ *		ExecInitPartialSeqScan			creates and initializes a partial seqscan node.
+ *		ExecEndPartialSeqScan			releases any storage allocated.
+ */
+#include "postgres.h"
+
+#include "access/relscan.h"
+#include "executor/execdebug.h"
+#include "executor/nodePartialSeqscan.h"
+#include "postmaster/backendworker.h"
+#include "utils/rel.h"
+
+
+
+/* ----------------------------------------------------------------
+ *						Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		PartialSeqNext
+ *
+ *		This is a workhorse for ExecPartialSeqScan
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+PartialSeqNext(PartialSeqScanState *node)
+{
+	HeapTuple	tuple;
+	HeapScanDesc scandesc;
+	EState	   *estate;
+	ScanDirection direction;
+	TupleTableSlot *slot;
+
+	/*
+	 * get information from the estate and scan state
+	 */
+	scandesc = node->ss.ss_currentScanDesc;
+	estate = node->ss.ps.state;
+	direction = estate->es_direction;
+	slot = node->ss.ss_ScanTupleSlot;
+
+	/*
+	 * get the next tuple from the table
+	 */
+	tuple = heap_getnext(scandesc, direction);
+
+	/*
+	 * save the tuple and the buffer returned to us by the access methods in
+	 * our scan tuple slot and return the slot.  Note: we pass 'false' because
+	 * tuples returned by heap_getnext() are pointers onto disk pages and were
+	 * not created with palloc() and so should not be pfree()'d.  Note also
+	 * that ExecStoreTuple will increment the refcount of the buffer; the
+	 * refcount will not be dropped until the tuple table slot is cleared.
+	 */
+	if (tuple)
+		ExecStoreTuple(tuple,	/* tuple to store */
+					   slot,	/* slot to store in */
+					   scandesc->rs_cbuf,		/* buffer associated with this
+												 * tuple */
+					   false);	/* don't pfree this pointer */
+	else
+		ExecClearTuple(slot);
+
+	return slot;
+}
+
+/*
+ * PartialSeqRecheck -- access method routine to recheck a tuple in EvalPlanQual
+ */
+static bool
+PartialSeqRecheck(PartialSeqScanState *node, TupleTableSlot *slot)
+{
+	/*
+	 * Note that unlike IndexScan, PartialSeqScan never uses keys in
+	 * heap_beginscan (and this is very bad) - so, here we do not
+	 * check whether the keys are ok or not.
+	 */
+	return true;
+}
+
+/* ----------------------------------------------------------------
+ *		InitPartialScanRelation
+ *
+ *		Set up to access the scan relation.
+ * ----------------------------------------------------------------
+ */
+static void
+InitPartialScanRelation(PartialSeqScanState *node, EState *estate, int eflags)
+{
+	Relation	currentRelation;
+
+	/*
+	 * get the relation object id from the relid'th entry in the range table,
+	 * open that relation and acquire appropriate lock on it.
+	 */
+	currentRelation = ExecOpenScanRelation(estate,
+										   ((Scan *) node->ss.ps.plan)->scanrelid,
+										   eflags);
+
+	/*
+	 * Parallel scan descriptor is initialized and stored in dynamic shared
+	 * memory segment by master backend and parallel workers retrieve it
+	 * from shared memory.  We pass 'toc' (place to lookup parallel scan
+	 * descriptor) via EState for parallel workers whereas master backend
+	 * stores it directly in partial scan state node.
+	 */
+	if (estate->toc)
+		node->ss.ps.toc = estate->toc;
+
+	node->ss.ss_currentRelation = currentRelation;
+
+	/* and report the scan tuple slot's rowtype */
+	ExecAssignScanType(&node->ss, RelationGetDescr(currentRelation));
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitPartialSeqScan
+ * ----------------------------------------------------------------
+ */
+PartialSeqScanState *
+ExecInitPartialSeqScan(PartialSeqScan *node, EState *estate, int eflags)
+{
+	PartialSeqScanState *scanstate;
+
+	/*
+	 * Once upon a time it was possible to have an outerPlan of a SeqScan, but
+	 * not any more.
+	 */
+	Assert(outerPlan(node) == NULL);
+	Assert(innerPlan(node) == NULL);
+
+	/*
+	 * create state structure
+	 */
+	scanstate = makeNode(PartialSeqScanState);
+	scanstate->ss.ps.plan = (Plan *) node;
+	scanstate->ss.ps.state = estate;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * create expression context for node
+	 */
+	ExecAssignExprContext(estate, &scanstate->ss.ps);
+
+	/*
+	 * initialize child expressions
+	 */
+	scanstate->ss.ps.targetlist = (List *)
+		ExecInitExpr((Expr *) node->plan.targetlist,
+					 (PlanState *) scanstate);
+	scanstate->ss.ps.qual = (List *)
+		ExecInitExpr((Expr *) node->plan.qual,
+					 (PlanState *) scanstate);
+
+	/*
+	 * tuple table initialization
+	 */
+	ExecInitResultTupleSlot(estate, &scanstate->ss.ps);
+	ExecInitScanTupleSlot(estate, &scanstate->ss);
+
+	/*
+	 * initialize scan relation
+	 */
+	InitPartialScanRelation(scanstate, estate, eflags);
+
+	scanstate->ss.ps.ps_TupFromTlist = false;
+
+	/*
+	 * Initialize result tuple type and projection info.
+	 */
+	ExecAssignResultTypeFromTL(&scanstate->ss.ps);
+	ExecAssignScanProjectionInfo(&scanstate->ss);
+
+	return scanstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecPartialSeqScan(node)
+ *
+ *		Scans the relation and returns the next qualifying tuple.
+ *		We call the ExecScan() routine and pass it the appropriate
+ *		access method functions.
+ * ----------------------------------------------------------------
+ */
+TupleTableSlot *
+ExecPartialSeqScan(PartialSeqScanState *node)
+{
+	/*
+	 * Initialize the scan on first execution.  Normally we initialize
+	 * it during the ExecutorStart phase, but this node needs a
+	 * ParallelHeapScanDesc to initialize the scan, and that is
+	 * set up by the Funnel node during the ExecutorRun phase.
+	 */
+	if (!node->scan_initialized)
+	{
+		ParallelHeapScanDesc pscan;
+
+		/*
+		 * The parallel scan descriptor is initialized and stored in the
+		 * dynamic shared memory segment by the master backend; the parallel
+		 * workers and the master backend's local scan retrieve it from shared
+		 * memory.  If a heap scan descriptor already exists at this point,
+		 * this is a rescan, so re-initialize it instead of beginning anew.
+		 */
+		Assert(node->ss.ps.toc);
+	
+		pscan = shm_toc_lookup(node->ss.ps.toc, PARALLEL_KEY_SCAN);
+
+		if (!node->ss.ss_currentScanDesc)
+		{
+			node->ss.ss_currentScanDesc =
+				heap_beginscan_parallel(node->ss.ss_currentRelation, pscan);
+		}
+		else
+		{
+			heap_parallel_rescan(pscan, node->ss.ss_currentScanDesc);
+		}
+
+		node->scan_initialized = true;
+	}
+
+	return ExecScan((ScanState *) node,
+					(ExecScanAccessMtd) PartialSeqNext,
+					(ExecScanRecheckMtd) PartialSeqRecheck);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndPartialSeqScan
+ *
+ *		frees any storage allocated through C routines.
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndPartialSeqScan(PartialSeqScanState *node)
+{
+	Relation	relation;
+	HeapScanDesc scanDesc;
+
+	/*
+	 * get information from node
+	 */
+	relation = node->ss.ss_currentRelation;
+	scanDesc = node->ss.ss_currentScanDesc;
+
+	/*
+	 * Free the exprcontext
+	 */
+	ExecFreeExprContext(&node->ss.ps);
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+
+	/*
+	 * close heap scan
+	 */
+	if (scanDesc)
+		heap_endscan(scanDesc);
+
+	/*
+	 * close the heap relation.
+	 */
+	ExecCloseScanRelation(relation);
+}
+
+/* ----------------------------------------------------------------
+ *						Join Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecReScanPartialSeqScan
+ *
+ *		Rescans the relation.
+ * ----------------------------------------------------------------
+ */
+void
+ExecReScanPartialSeqScan(PartialSeqScanState *node)
+{
+	if (node->scan_initialized)
+	{
+		/*HeapScanDesc scan;
+		ParallelHeapScanDesc pscan;
+		EState	   *estate = node->ss.ps.state;
+
+		Assert(estate->toc);
+	
+		pscan = shm_toc_lookup(estate->toc, PARALLEL_KEY_SCAN);
+
+		scan = node->ss.ss_currentScanDesc;
+
+		heap_parallel_rescan(pscan, scan);*/
+
+		node->scan_initialized = false;
+	}
+
+	ExecScanReScan((ScanState *) node);
+}
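
The interplay between ExecPartialSeqScan and ExecReScanPartialSeqScan above is a small state machine: the scan is set up lazily on first execution, and a rescan only clears the flag so that the next execution re-initializes the existing descriptor. A minimal sketch of that pattern, assuming hypothetical demo_ names and counters in place of the real heap_beginscan_parallel and heap_parallel_rescan calls:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the relevant bits of PartialSeqScanState. */
typedef struct DemoScanState
{
	bool		scan_initialized;	/* same role as in the patch */
	bool		has_scan_desc;		/* stands in for ss_currentScanDesc != NULL */
	int			begin_calls;		/* counts "heap_beginscan_parallel" */
	int			rescan_calls;		/* counts "heap_parallel_rescan" */
} DemoScanState;

/* Called on every execution; sets the scan up only when needed. */
static void
demo_scan_exec(DemoScanState *s)
{
	assert(s != NULL);

	if (!s->scan_initialized)
	{
		if (!s->has_scan_desc)
		{
			/* first execution ever: begin a new scan */
			s->has_scan_desc = true;
			s->begin_calls++;
		}
		else
		{
			/* descriptor exists already: restart it */
			s->rescan_calls++;
		}
		s->scan_initialized = true;
	}
}

/* Rescan request: defer the real work to the next execution. */
static void
demo_scan_rescan(DemoScanState *s)
{
	assert(s != NULL);

	if (s->scan_initialized)
		s->scan_initialized = false;
}
```

Deferring the work this way means a rescan requested before the Funnel node has rebuilt its parallel context does not touch shared memory that may no longer exist.
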
diff --git a/src/backend/executor/tqueue.c b/src/backend/executor/tqueue.c
new file mode 100644
index 0000000..e4933e6
--- /dev/null
+++ b/src/backend/executor/tqueue.c
@@ -0,0 +1,280 @@
+/*-------------------------------------------------------------------------
+ *
+ * tqueue.c
+ *	  Use shm_mq to send & receive tuples between parallel backends
+ *
+ * A DestReceiver of type DestTupleQueue, which is a TQueueDestReceiver
+ * under the hood, writes tuples from the executor to a shm_mq.
+ *
+ * A TupleQueueFunnel helps manage the process of reading tuples from
+ * one or more shm_mq objects being used as tuple queues.
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/tqueue.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/tqueue.h"
+#include "miscadmin.h"
+
+typedef struct
+{
+	DestReceiver pub;
+	shm_mq_handle *handle;
+} TQueueDestReceiver;
+
+struct TupleQueueFunnel
+{
+	int		nqueues;
+	int		maxqueues;
+	int		nextqueue;
+	shm_mq_handle **queue;
+};
+
+/*
+ * Receive a tuple.
+ */
+static void
+tqueueReceiveSlot(TupleTableSlot *slot, DestReceiver *self)
+{
+	TQueueDestReceiver *tqueue = (TQueueDestReceiver *) self;
+	HeapTuple	tuple;
+	shm_mq_result	result;
+
+	tuple = ExecMaterializeSlot(slot);
+	result = shm_mq_send(tqueue->handle, tuple->t_len, tuple->t_data, false);
+
+	if (result != SHM_MQ_SUCCESS)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("unable to send tuples")));
+}
+
+/*
+ * Prepare to receive tuples from executor.
+ */
+static void
+tqueueStartupReceiver(DestReceiver *self, int operation, TupleDesc typeinfo)
+{
+	/* do nothing */
+}
+
+/*
+ * Clean up at end of an executor run
+ */
+static void
+tqueueShutdownReceiver(DestReceiver *self)
+{
+	/* do nothing */
+}
+
+/*
+ * Destroy receiver when done with it
+ */
+static void
+tqueueDestroyReceiver(DestReceiver *self)
+{
+	pfree(self);
+}
+
+/*
+ * Create a DestReceiver that writes tuples to a tuple queue.
+ */
+DestReceiver *
+CreateTupleQueueDestReceiver(void)
+{
+	TQueueDestReceiver *self;
+
+	self = (TQueueDestReceiver *) palloc0(sizeof(TQueueDestReceiver));
+
+	self->pub.receiveSlot = tqueueReceiveSlot;
+	self->pub.rStartup = tqueueStartupReceiver;
+	self->pub.rShutdown = tqueueShutdownReceiver;
+	self->pub.rDestroy = tqueueDestroyReceiver;
+	self->pub.mydest = DestTupleQueue;
+
+	/* private fields will be set by SetTupleQueueDestReceiverParams */
+
+	return (DestReceiver *) self;
+}
+
+/*
+ * Set parameters for a TupleQueueDestReceiver
+ */
+void
+SetTupleQueueDestReceiverParams(DestReceiver *self,
+								shm_mq_handle *handle)
+{
+	TQueueDestReceiver *myState = (TQueueDestReceiver *) self;
+
+	myState->handle = handle;
+}
+
+/*
+ * Create a tuple queue funnel.
+ */
+TupleQueueFunnel *
+CreateTupleQueueFunnel(void)
+{
+	TupleQueueFunnel *funnel = palloc0(sizeof(TupleQueueFunnel));
+
+	funnel->maxqueues = 8;
+	funnel->queue = palloc(funnel->maxqueues * sizeof(shm_mq_handle *));
+
+	return funnel;
+}
+
+/*
+ * Destroy a tuple queue funnel.
+ */
+void
+DestroyTupleQueueFunnel(TupleQueueFunnel *funnel)
+{
+	if (funnel)
+	{
+		pfree(funnel->queue);
+		pfree(funnel);
+	}
+}
+
+/*
+ * Remember the shared memory queue handle in funnel.
+ */
+void
+RegisterTupleQueueOnFunnel(TupleQueueFunnel *funnel, shm_mq_handle *handle)
+{
+	if (funnel->nqueues < funnel->maxqueues)
+	{
+		funnel->queue[funnel->nqueues++] = handle;
+		return;
+	}
+
+	if (funnel->nqueues >= funnel->maxqueues)
+	{
+		int newsize = funnel->nqueues * 2;
+
+		Assert(funnel->nqueues == funnel->maxqueues);
+
+		funnel->queue = repalloc(funnel->queue,
+								 newsize * sizeof(shm_mq_handle *));
+		funnel->maxqueues = newsize;
+	}
+
+	funnel->queue[funnel->nqueues++] = handle;
+}
+
+/*
+ * Fetch a tuple from a tuple queue funnel.
+ *
+ * We try to read from the queues in round-robin fashion so as to avoid
+ * the situation where some workers get their tuples read promptly while
+ * others are barely ever serviced.
+ *
+ * Even when nowait = false, we read from the individual queues in
+ * non-blocking mode.  Even when shm_mq_receive() returns SHM_MQ_WOULD_BLOCK,
+ * it can still accumulate bytes from a partially-read message, so doing it
+ * this way should outperform doing a blocking read on each queue in turn.
+ *
+ * The return value is NULL if there are no remaining queues or if
+ * nowait = true and no queue returned a tuple without blocking.  *done, if
+ * not NULL, is set to true when there are no remaining queues and false in
+ * any other case.
+ */
+HeapTuple
+TupleQueueFunnelNext(TupleQueueFunnel *funnel, bool nowait, bool *done)
+{
+	int	waitpos = funnel->nextqueue;
+
+	/* Corner case: called before adding any queues, or after all are gone. */
+	if (funnel->nqueues == 0)
+	{
+		if (done != NULL)
+			*done = true;
+		return NULL;
+	}
+
+	if (done != NULL)
+		*done = false;
+
+	for (;;)
+	{
+		shm_mq_handle *mqh = funnel->queue[funnel->nextqueue];
+		shm_mq_result result;
+		Size	nbytes;
+		void   *data;
+
+		/* Attempt to read a message. */
+		result = shm_mq_receive(mqh, &nbytes, &data, true);
+
+		/*
+		 * Normally, we advance funnel->nextqueue to the next queue at this
+		 * point, but if we're pointing to a queue that we've just discovered
+		 * is detached, then forget that queue and leave the pointer where it
+		 * is until the number of remaining queues falls below that pointer and
+		 * at that point make the pointer point to the first queue.
+		 */
+		if (result != SHM_MQ_DETACHED)
+			funnel->nextqueue = (funnel->nextqueue + 1) % funnel->nqueues;
+		else
+		{
+			--funnel->nqueues;
+			if (funnel->nqueues == 0)
+			{
+				if (done != NULL)
+					*done = true;
+				return NULL;
+			}
+
+			memmove(&funnel->queue[funnel->nextqueue],
+					&funnel->queue[funnel->nextqueue + 1],
+					sizeof(shm_mq_handle *)
+						* (funnel->nqueues - funnel->nextqueue));
+
+			if (funnel->nextqueue >= funnel->nqueues)
+				funnel->nextqueue = 0;
+
+			if (funnel->nextqueue < waitpos)
+				--waitpos;
+
+			continue;
+		}
+
+		/* If we got a message, return it. */
+		if (result == SHM_MQ_SUCCESS)
+		{
+			HeapTupleData htup;
+
+			/*
+			 * The tuple data we just read from the queue is only valid
+			 * until we again attempt to read from it.  Copy the tuple into
+			 * a single palloc'd chunk as callers will expect.
+			 */
+			ItemPointerSetInvalid(&htup.t_self);
+			htup.t_tableOid = InvalidOid;
+			htup.t_len = nbytes;
+			htup.t_data = data;
+			return heap_copytuple(&htup);
+		}
+
+		/*
+		 * If we've visited all of the queues, then we should either give up
+		 * and return NULL (if we're in non-blocking mode) or wait for the
+		 * process latch to be set (otherwise).
+		 */
+		if (funnel->nextqueue == waitpos)
+		{
+			if (nowait)
+				return NULL;
+			WaitLatch(MyLatch, WL_LATCH_SET, 0);
+			CHECK_FOR_INTERRUPTS();
+			ResetLatch(MyLatch);
+		}
+	}
+}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index d8c9a0e..3c0123a 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -355,6 +355,43 @@ _copySeqScan(const SeqScan *from)
 }
 
 /*
+ * _copyPartialSeqScan
+ */
+static PartialSeqScan *
+_copyPartialSeqScan(const PartialSeqScan *from)
+{
+	PartialSeqScan    *newnode = makeNode(PartialSeqScan);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopyScanFields((const Scan *) from, (Scan *) newnode);
+
+	return newnode;
+}
+
+/*
+ * _copyFunnel
+ */
+static Funnel *
+_copyFunnel(const Funnel *from)
+{
+	Funnel    *newnode = makeNode(Funnel);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopyScanFields((const Scan *) from, (Scan *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(num_workers);
+
+	return newnode;
+}
+
+/*
  * _copyIndexScan
  */
 static IndexScan *
@@ -4049,6 +4086,12 @@ copyObject(const void *from)
 		case T_SeqScan:
 			retval = _copySeqScan(from);
 			break;
+		case T_PartialSeqScan:
+			retval = _copyPartialSeqScan(from);
+			break;
+		case T_Funnel:
+			retval = _copyFunnel(from);
+			break;
 		case T_IndexScan:
 			retval = _copyIndexScan(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 1aa1f55..05d4b3c 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -440,6 +440,24 @@ _outSeqScan(StringInfo str, const SeqScan *node)
 }
 
 static void
+_outPartialSeqScan(StringInfo str, const PartialSeqScan *node)
+{
+	WRITE_NODE_TYPE("PARTIALSEQSCAN");
+
+	_outScanInfo(str, (const Scan *) node);
+}
+
+static void
+_outFunnel(StringInfo str, const Funnel *node)
+{
+	WRITE_NODE_TYPE("FUNNEL");
+
+	_outScanInfo(str, (const Scan *) node);
+
+	WRITE_UINT_FIELD(num_workers);
+}
+
+static void
 _outIndexScan(StringInfo str, const IndexScan *node)
 {
 	WRITE_NODE_TYPE("INDEXSCAN");
@@ -2898,6 +2916,12 @@ _outNode(StringInfo str, const void *obj)
 			case T_SeqScan:
 				_outSeqScan(str, obj);
 				break;
+			case T_PartialSeqScan:
+				_outPartialSeqScan(str, obj);
+				break;
+			case T_Funnel:
+				_outFunnel(str, obj);
+				break;
 			case T_IndexScan:
 				_outIndexScan(str, obj);
 				break;
diff --git a/src/backend/nodes/params.c b/src/backend/nodes/params.c
index fb803f8..6b633d6 100644
--- a/src/backend/nodes/params.c
+++ b/src/backend/nodes/params.c
@@ -16,9 +16,22 @@
 #include "postgres.h"
 
 #include "nodes/params.h"
+#include "storage/shmem.h"
 #include "utils/datum.h"
 #include "utils/lsyscache.h"
 
+/*
+ * For each bind parameter, we store this structure; for non-NULL
+ * pass-by-reference parameters it is followed by the parameter value.
+ */
+typedef struct SerializedParamExternData
+{
+	Datum		value;			/* pass-by-value datums are stored directly */
+	Size		length;			/* length of parameter value */
+	bool		isnull;			/* is it NULL? */
+	uint16		pflags;			/* flag bits, see nodes/params.h */
+	Oid			ptype;			/* parameter's datatype, or 0 */
+} SerializedParamExternData;
 
 /*
  * Copy a ParamListInfo structure.
@@ -73,3 +86,186 @@ copyParamList(ParamListInfo from)
 
 	return retval;
 }
+
+/*
+ * Estimate the amount of space required to serialize the bound
+ * parameters.
+ */
+Size
+EstimateBoundParametersSpace(ParamListInfo paramInfo)
+{
+	Size		size;
+	int			i;
+
+	/* Add space required for saving numParams */
+	size = sizeof(int);
+
+	if (paramInfo)
+	{
+		/* Add space required for saving the param data */
+		for (i = 0; i < paramInfo->numParams; i++)
+		{
+			/*
+			 * for each parameter, calculate the size of fixed part
+			 * of parameter (SerializedParamExternData) and length of
+			 * parameter value.
+			 */
+			ParamExternData *oprm;
+			int16		typLen;
+			bool		typByVal;
+			Size		length;
+
+			length = sizeof(SerializedParamExternData);
+
+			oprm = &paramInfo->params[i];
+
+			get_typlenbyval(oprm->ptype, &typLen, &typByVal);
+
+			/*
+			 * Pass-by-value and NULL parameters are fully described by
+			 * SerializedParamExternData, so they need no additional
+			 * space.
+			 */
+			if (!(typByVal || oprm->isnull))
+			{
+				length += datumGetSize(oprm->value, typByVal, typLen);
+				size = add_size(size, length);
+
+				/* Allow space for terminating zero-byte */
+				size = add_size(size, 1);
+			}
+			else
+				size = add_size(size, length);
+		}
+	}
+
+	return size;
+}
+
+/*
+ * Serialize the bind parameters into the memory, beginning at start_address.
+ * maxsize should be at least as large as the value returned by
+ * EstimateBoundParametersSpace.
+ */
+void
+SerializeBoundParams(ParamListInfo paramInfo, Size maxsize, char *start_address)
+{
+	char	   *curptr;
+	SerializedParamExternData *retval;
+	int i;
+
+	/*
+	 * First, store the number of bind parameters.  If there are none,
+	 * there is nothing more to store.
+	 */
+	if (paramInfo && paramInfo->numParams > 0)
+		* (int *) start_address = paramInfo->numParams;
+	else
+	{
+		* (int *) start_address = 0;
+		return;
+	}
+	curptr = start_address + sizeof(int);
+
+
+	for (i = 0; i < paramInfo->numParams; i++)
+	{
+		ParamExternData *oprm;
+		int16		typLen;
+		bool		typByVal;
+		Size		datumlength, length;
+		const char	*s;
+
+		Assert(curptr <= start_address + maxsize);
+		retval = (SerializedParamExternData *) curptr;
+		oprm = &paramInfo->params[i];
+
+		retval->isnull = oprm->isnull;
+		retval->pflags = oprm->pflags;
+		retval->ptype = oprm->ptype;
+		retval->value = oprm->value;
+
+		curptr = curptr + sizeof(SerializedParamExternData);
+
+		if (retval->isnull)
+			continue;
+
+		get_typlenbyval(oprm->ptype, &typLen, &typByVal);
+
+		if (!typByVal)
+		{
+			datumlength = datumGetSize(oprm->value, typByVal, typLen);
+			s = (char *) DatumGetPointer(oprm->value);
+			memcpy(curptr, s, datumlength);
+			length = datumlength;
+			curptr[length] = '\0';
+			retval->length = length;
+			curptr += length + 1;
+		}
+	}
+}
+
+/*
+ * RestoreBoundParams
+ *		Restore bind parameters from the specified address.
+ *
+ * The params are palloc'd in CurrentMemoryContext.
+ */
+ParamListInfo
+RestoreBoundParams(char *start_address)
+{
+	ParamListInfo retval;
+	Size		size;
+	int			num_params,i;
+	char	   *curptr;
+
+	num_params = * (int *) start_address;
+
+	if (num_params <= 0)
+		return NULL;
+
+	size = offsetof(ParamListInfoData, params) +
+						num_params * sizeof(ParamExternData);
+	retval = (ParamListInfo) palloc(size);
+	retval->paramFetch = NULL;
+	retval->paramFetchArg = NULL;
+	retval->parserSetup = NULL;
+	retval->parserSetupArg = NULL;
+	retval->numParams = num_params;
+
+	curptr = start_address + sizeof(int);
+
+	for (i = 0; i < num_params; i++)
+	{
+		SerializedParamExternData *nprm;
+		char	*s;
+		int16		typLen;
+		bool		typByVal;
+
+		nprm = (SerializedParamExternData *) curptr;
+
+		/* copy the parameter info */
+		retval->params[i].isnull = nprm->isnull;
+		retval->params[i].pflags = nprm->pflags;
+		retval->params[i].ptype = nprm->ptype;
+		retval->params[i].value = nprm->value;
+
+		curptr = curptr + sizeof(SerializedParamExternData);
+
+		if (nprm->isnull)
+			continue;
+
+		get_typlenbyval(nprm->ptype, &typLen, &typByVal);
+
+		if (!typByVal)
+		{
+			s = palloc(nprm->length + 1);
+			memcpy(s, curptr, nprm->length + 1);
+			retval->params[i].value = PointerGetDatum(s);
+
+			curptr += nprm->length + 1;
+		}
+	}
+
+	return retval;
+}
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 563209c..d4570f2 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1280,6 +1280,92 @@ _readRangeTblFunction(void)
 	READ_DONE();
 }
 
+/*
+ * _readPlanInvalItem
+ */
+static PlanInvalItem *
+_readPlanInvalItem(void)
+{
+	READ_LOCALS(PlanInvalItem);
+
+	READ_INT_FIELD(cacheId);
+	READ_UINT_FIELD(hashValue);
+
+	READ_DONE();
+}
+
+/*
+ * _readPlannedStmt
+ */
+static PlannedStmt *
+_readPlannedStmt(void)
+{
+	READ_LOCALS(PlannedStmt);
+
+	READ_ENUM_FIELD(commandType, CmdType);
+	READ_UINT_FIELD(queryId);
+	READ_BOOL_FIELD(hasReturning);
+	READ_BOOL_FIELD(hasModifyingCTE);
+	READ_BOOL_FIELD(canSetTag);
+	READ_BOOL_FIELD(transientPlan);
+	READ_NODE_FIELD(planTree);
+	READ_NODE_FIELD(rtable);
+	READ_NODE_FIELD(resultRelations);
+	READ_NODE_FIELD(utilityStmt);
+	READ_NODE_FIELD(subplans);
+	READ_BITMAPSET_FIELD(rewindPlanIDs);
+	READ_NODE_FIELD(rowMarks);
+	READ_NODE_FIELD(relationOids);
+	READ_NODE_FIELD(invalItems);
+	READ_INT_FIELD(nParamExec);
+	READ_BOOL_FIELD(hasRowSecurity);
+	READ_BOOL_FIELD(parallelModeNeeded);
+
+	READ_DONE();
+}
+
+static Plan *
+_readPlan(void)
+{
+	READ_LOCALS(Plan);
+
+	READ_FLOAT_FIELD(startup_cost);
+	READ_FLOAT_FIELD(total_cost);
+	READ_FLOAT_FIELD(plan_rows);
+	READ_INT_FIELD(plan_width);
+	READ_NODE_FIELD(targetlist);
+	READ_NODE_FIELD(qual);
+	READ_NODE_FIELD(lefttree);
+	READ_NODE_FIELD(righttree);
+	READ_NODE_FIELD(initPlan);
+	READ_BITMAPSET_FIELD(extParam);
+	READ_BITMAPSET_FIELD(allParam);
+
+	READ_DONE();
+}
+
+static Scan *
+_readScan(void)
+{
+	Plan *local_plan;
+	READ_LOCALS(PartialSeqScan);
+
+	local_plan = _readPlan();
+	local_node->plan.startup_cost = local_plan->startup_cost;
+	local_node->plan.total_cost = local_plan->total_cost;
+	local_node->plan.plan_rows = local_plan->plan_rows;
+	local_node->plan.plan_width = local_plan->plan_width;
+	local_node->plan.targetlist = local_plan->targetlist;
+	local_node->plan.qual = local_plan->qual;
+	local_node->plan.lefttree = local_plan->lefttree;
+	local_node->plan.righttree = local_plan->righttree;
+	local_node->plan.initPlan = local_plan->initPlan;
+	local_node->plan.extParam = local_plan->extParam;
+	local_node->plan.allParam = local_plan->allParam;
+	READ_UINT_FIELD(scanrelid);
+
+	READ_DONE();
+}
 
 /*
  * parseNodeString
@@ -1409,6 +1495,12 @@ parseNodeString(void)
 		return_value = _readNotifyStmt();
 	else if (MATCH("DECLARECURSOR", 13))
 		return_value = _readDeclareCursorStmt();
+	else if (MATCH("PLANINVALITEM", 13))
+		return_value = _readPlanInvalItem();
+	else if (MATCH("PLANNEDSTMT", 11))
+		return_value = _readPlannedStmt();
+	else if (MATCH("PARTIALSEQSCAN", 14))
+		return_value = _readScan();
 	else
 	{
 		elog(ERROR, "badly formatted node string \"%.32s\"...", token);
diff --git a/src/backend/optimizer/path/Makefile b/src/backend/optimizer/path/Makefile
index 6864a62..6e462b1 100644
--- a/src/backend/optimizer/path/Makefile
+++ b/src/backend/optimizer/path/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = allpaths.o clausesel.o costsize.o equivclass.o indxpath.o \
-       joinpath.o joinrels.o pathkeys.o tidpath.o
+       joinpath.o joinrels.o pathkeys.o parallelpath.o tidpath.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 58d78e6..528727c 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -410,6 +410,9 @@ set_plain_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
 	/* Consider sequential scan */
 	add_path(rel, create_seqscan_path(root, rel, required_outer));
 
+	/* Consider parallel scans */
+	create_parallelscan_paths(root, rel);
+
 	/* Consider index scans */
 	create_index_paths(root, rel);
 
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 1a0d358..874c272 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -11,6 +11,9 @@
  *	cpu_tuple_cost		Cost of typical CPU time to process a tuple
  *	cpu_index_tuple_cost  Cost of typical CPU time to process an index tuple
  *	cpu_operator_cost	Cost of CPU time to execute an operator or function
+ *	cpu_tuple_comm_cost	Cost of CPU time to pass a tuple from worker to master backend
+ *	parallel_setup_cost	Cost of setting up shared memory for parallelism
+ *	parallel_startup_cost	Cost of starting up parallel workers
  *
  * We expect that the kernel will typically do some amount of read-ahead
  * optimization; this in conjunction with seek costs means that seq_page_cost
@@ -101,11 +104,16 @@ double		random_page_cost = DEFAULT_RANDOM_PAGE_COST;
 double		cpu_tuple_cost = DEFAULT_CPU_TUPLE_COST;
 double		cpu_index_tuple_cost = DEFAULT_CPU_INDEX_TUPLE_COST;
 double		cpu_operator_cost = DEFAULT_CPU_OPERATOR_COST;
+double		cpu_tuple_comm_cost = DEFAULT_CPU_TUPLE_COMM_COST;
+double		parallel_setup_cost = DEFAULT_PARALLEL_SETUP_COST;
+double		parallel_startup_cost = DEFAULT_PARALLEL_STARTUP_COST;
 
 int			effective_cache_size = DEFAULT_EFFECTIVE_CACHE_SIZE;
 
 Cost		disable_cost = 1.0e10;
 
+int	parallel_seqscan_degree = 0;
+
 bool		enable_seqscan = true;
 bool		enable_indexscan = true;
 bool		enable_indexonlyscan = true;
@@ -220,6 +228,55 @@ cost_seqscan(Path *path, PlannerInfo *root,
 }
 
 /*
+ * cost_funnel
+ *	  Determines and returns the cost of scanning a relation in parallel.
+ *
+ * 'baserel' is the relation to be scanned
+ * 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
+ */
+void
+cost_funnel(FunnelPath *path, PlannerInfo *root,
+			RelOptInfo *baserel, ParamPathInfo *param_info,
+			int nWorkers)
+{
+	Cost		startup_cost = 0;
+	Cost		run_cost = 0;
+
+	/* Should only be applied to base relations */
+	Assert(baserel->relid > 0);
+	Assert(baserel->rtekind == RTE_RELATION);
+
+	/* Mark the path with the correct row estimate */
+	if (param_info)
+		path->path.rows = param_info->ppi_rows;
+	else
+		path->path.rows = baserel->rows;
+
+	startup_cost = path->subpath->startup_cost;
+
+	run_cost = path->subpath->total_cost - path->subpath->startup_cost;
+
+	/*
+	 * Runtime cost will be shared equally by all workers.
+	 * The assumption here is that disk access cost will also be
+	 * shared equally between workers, which is generally true
+	 * unless there are too many workers working on a relatively
+	 * small number of blocks.  If we come across any such case,
+	 * then we can think of changing the current cost model for
+	 * parallel sequential scan.
+	 */
+	run_cost = run_cost / (nWorkers + 1);
+
+	/* Parallel setup and communication cost. */
+	startup_cost += parallel_setup_cost;
+	startup_cost += parallel_startup_cost * nWorkers;
+	run_cost += cpu_tuple_comm_cost * baserel->tuples;
+
+	path->path.startup_cost = startup_cost;
+	path->path.total_cost = (startup_cost + run_cost);
+}
+
+/*
  * cost_index
  *	  Determines and returns the cost of scanning a relation using an index.
  *
diff --git a/src/backend/optimizer/path/parallelpath.c b/src/backend/optimizer/path/parallelpath.c
new file mode 100644
index 0000000..949e79b
--- /dev/null
+++ b/src/backend/optimizer/path/parallelpath.c
@@ -0,0 +1,80 @@
+/*-------------------------------------------------------------------------
+ *
+ * parallelpath.c
+ *	  Routines to determine which conditions are usable for scanning
+ *	  a given relation, and create ParallelPaths accordingly.
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/optimizer/path/parallelpath.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heapam.h"
+#include "optimizer/cost.h"
+#include "optimizer/pathnode.h"
+#include "optimizer/paths.h"
+#include "parser/parsetree.h"
+#include "utils/rel.h"
+
+
+/*
+ * create_parallelscan_paths
+ *	  Create paths corresponding to parallel scans of the given rel.
+ *	  Currently we only support parallel sequential scan.
+ *
+ *	  Candidate paths are added to the rel's pathlist (using add_path).
+ */
+void
+create_parallelscan_paths(PlannerInfo *root, RelOptInfo *rel)
+{
+	int num_parallel_workers = 0;
+	Oid			reloid;
+	Relation	relation;
+	Path		*subpath;
+
+	/*
+	 * Parallel scan is possible only if the user has set
+	 * parallel_seqscan_degree to a value greater than 0
+	 * and the query is parallel-safe.
+	 */
+	if (parallel_seqscan_degree <= 0 || !root->glob->parallelModeOK)
+		return;
+
+	reloid = planner_rt_fetch(rel->relid, root)->relid;
+
+	relation = heap_open(reloid, NoLock);
+
+	/*
+	 * Temporary relations can't be scanned by parallel workers as
+	 * they are visible only to local sessions.
+	 */
+	if (RelationUsesLocalBuffers(relation))
+	{
+		heap_close(relation, NoLock);
+		return;
+	}
+
+	heap_close(relation, NoLock);
+
+	/*
+	 * There should be at least one page to scan for each worker.
+	 */
+	if (parallel_seqscan_degree <= rel->pages)
+		num_parallel_workers = parallel_seqscan_degree;
+	else
+		num_parallel_workers = rel->pages;
+
+	/* Create the partial scan path which each worker needs to execute. */
+	subpath = create_partialseqscan_path(root, rel, NULL);
+
+	/* Create the parallel scan path which master needs to execute. */
+	add_path(rel, (Path *) create_funnel_path(root, rel, subpath,
+											  num_parallel_workers));
+}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index cb69c03..c8422c9 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -58,6 +58,11 @@ static Material *create_material_plan(PlannerInfo *root, MaterialPath *best_path
 static Plan *create_unique_plan(PlannerInfo *root, UniquePath *best_path);
 static SeqScan *create_seqscan_plan(PlannerInfo *root, Path *best_path,
 					List *tlist, List *scan_clauses);
+static Scan *create_partialseqscan_plan(PlannerInfo *root, Path *best_path,
+							List *tlist, List *scan_clauses);
+static Scan *create_funnel_plan(PlannerInfo *root,
+								FunnelPath *best_path,
+								List *tlist, List *scan_clauses);
 static Scan *create_indexscan_plan(PlannerInfo *root, IndexPath *best_path,
 					  List *tlist, List *scan_clauses, bool indexonly);
 static BitmapHeapScan *create_bitmap_scan_plan(PlannerInfo *root,
@@ -100,6 +105,12 @@ static List *order_qual_clauses(PlannerInfo *root, List *clauses);
 static void copy_path_costsize(Plan *dest, Path *src);
 static void copy_plan_costsize(Plan *dest, Plan *src);
 static SeqScan *make_seqscan(List *qptlist, List *qpqual, Index scanrelid);
+static PartialSeqScan *make_partialseqscan(List *qptlist,
+										   List *qpqual,
+										   Index scanrelid);
+static Funnel *make_funnel(List *qptlist, List *qpqual,
+						   Index scanrelid, int nworkers,
+						   Plan *subplan);
 static IndexScan *make_indexscan(List *qptlist, List *qpqual, Index scanrelid,
 			   Oid indexid, List *indexqual, List *indexqualorig,
 			   List *indexorderby, List *indexorderbyorig,
@@ -228,6 +239,8 @@ create_plan_recurse(PlannerInfo *root, Path *best_path)
 	switch (best_path->pathtype)
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
+		case T_Funnel:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -343,6 +356,20 @@ create_scan_plan(PlannerInfo *root, Path *best_path)
 												scan_clauses);
 			break;
 
+		case T_PartialSeqScan:
+			plan = (Plan *) create_partialseqscan_plan(root,
+													   best_path,
+													   tlist,
+													   scan_clauses);
+			break;
+
+		case T_Funnel:
+			plan = (Plan *) create_funnel_plan(root,
+											   (FunnelPath *) best_path,
+											   tlist,
+											   scan_clauses);
+			break;
+
 		case T_IndexScan:
 			plan = (Plan *) create_indexscan_plan(root,
 												  (IndexPath *) best_path,
@@ -546,6 +573,8 @@ disuse_physical_tlist(PlannerInfo *root, Plan *plan, Path *path)
 	switch (path->pathtype)
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
+		case T_Funnel:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -1133,6 +1162,87 @@ create_seqscan_plan(PlannerInfo *root, Path *best_path,
 }
 
 /*
+ * create_partialseqscan_plan
+ *
+ * Returns a partial seqscan plan for the base relation scanned by
+ * 'best_path' with restriction clauses 'scan_clauses' and targetlist
+ * 'tlist'.
+ */
+static Scan *
+create_partialseqscan_plan(PlannerInfo *root, Path *best_path,
+						   List *tlist, List *scan_clauses)
+{
+	Scan    *scan_plan;
+	Index		scan_relid = best_path->parent->relid;
+
+	/* it should be a base rel... */
+	Assert(scan_relid > 0);
+	Assert(best_path->parent->rtekind == RTE_RELATION);
+
+	/* Sort clauses into best execution order */
+	scan_clauses = order_qual_clauses(root, scan_clauses);
+
+	/* Reduce RestrictInfo list to bare expressions; ignore pseudoconstants */
+	scan_clauses = extract_actual_clauses(scan_clauses, false);
+
+	/* Replace any outer-relation variables with nestloop params */
+	if (best_path->param_info)
+	{
+		scan_clauses = (List *)
+			replace_nestloop_params(root, (Node *) scan_clauses);
+	}
+
+	scan_plan = (Scan *) make_partialseqscan(tlist,
+											 scan_clauses,
+											 scan_relid);
+
+	copy_path_costsize(&scan_plan->plan, best_path);
+
+	return scan_plan;
+}
+
+/*
+ * create_funnel_plan
+ *
+ * Returns a funnel plan for the base relation scanned by
+ * 'best_path' with restriction clauses 'scan_clauses' and targetlist
+ * 'tlist'.
+ */
+static Scan *
+create_funnel_plan(PlannerInfo *root, FunnelPath *best_path,
+				   List *tlist, List *scan_clauses)
+{
+	Scan    *scan_plan;
+	Plan	   *subplan;
+	Index		scan_relid = best_path->path.parent->relid;
+
+	/* it should be a base rel... */
+	Assert(scan_relid > 0);
+	Assert(best_path->path.parent->rtekind == RTE_RELATION);
+
+	subplan = create_plan_recurse(root, best_path->subpath);
+
+	/*
+	 * The quals for the subplan and the top-level plan are the
+	 * same, because either all the quals are pushed to the subplan
+	 * (partialseqscan plan) or a parallel plan won't be
+	 * chosen.
+	 */
+	scan_plan = (Scan *) make_funnel(tlist,
+									 subplan->qual,
+									 scan_relid,
+									 best_path->num_workers,
+									 subplan);
+
+	copy_path_costsize(&scan_plan->plan, &best_path->path);
+
+	/* use parallel mode for parallel plans. */
+	root->glob->parallelModeNeeded = true;
+
+	return scan_plan;
+}
+
+/*
  * create_indexscan_plan
  *	  Returns an indexscan plan for the base relation scanned by 'best_path'
  *	  with restriction clauses 'scan_clauses' and targetlist 'tlist'.
@@ -3321,6 +3431,45 @@ make_seqscan(List *qptlist,
 	return node;
 }
 
+static PartialSeqScan *
+make_partialseqscan(List *qptlist,
+					List *qpqual,
+					Index scanrelid)
+{
+	PartialSeqScan *node = makeNode(PartialSeqScan);
+	Plan	   *plan = &node->plan;
+
+	/* cost should be inserted by caller */
+	plan->targetlist = qptlist;
+	plan->qual = qpqual;
+	plan->lefttree = NULL;
+	plan->righttree = NULL;
+	node->scanrelid = scanrelid;
+
+	return node;
+}
+
+static Funnel *
+make_funnel(List *qptlist,
+			List *qpqual,
+			Index scanrelid,
+			int nworkers,
+			Plan *subplan)
+{
+	Funnel *node = makeNode(Funnel);
+	Plan	   *plan = &node->scan.plan;
+
+	/* cost should be inserted by caller */
+	plan->targetlist = qptlist;
+	plan->qual = qpqual;
+	plan->lefttree = subplan;
+	plan->righttree = NULL;
+	node->scan.scanrelid = scanrelid;
+	node->num_workers = nworkers;
+
+	return node;
+}
+
 static IndexScan *
 make_indexscan(List *qptlist,
 			   List *qpqual,
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 1824e7b..4717f78 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -275,6 +275,51 @@ standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
 	return result;
 }
 
+PlannedStmt	*
+create_parallel_worker_plannedstmt(PartialSeqScan *partialscan,
+								   List *rangetable)
+{
+	PlannedStmt	*result;
+	ListCell   *tlist;
+
+	/*
+	 * Avoid removing junk entries in worker as those are
+	 * required by upper nodes in master backend.
+	 */
+	foreach(tlist, partialscan->plan.targetlist)
+	{
+		TargetEntry *tle = (TargetEntry *) lfirst(tlist);
+
+		tle->resjunk = false;
+	}
+
+	/* build the PlannedStmt result */
+	result = makeNode(PlannedStmt);
+
+	result->commandType = CMD_SELECT;
+	result->queryId = 0;
+	result->hasReturning = false;
+	result->hasModifyingCTE = false;
+	result->canSetTag = true;
+	result->transientPlan = false;
+	result->planTree = (Plan *) partialscan;
+	result->rtable = rangetable;
+	result->resultRelations = NIL;
+	result->utilityStmt = NULL;
+	result->subplans = NIL;
+	result->rewindPlanIDs = NULL;
+	result->rowMarks = NIL;
+	result->nParamExec = 0;
+	/*
+	 * Don't bother to set parameters used for invalidation as
+	 * worker backend plans are not saved, so can't be invalidated.
+	 */
+	result->relationOids = NIL;
+	result->invalItems = NIL;
+	result->hasRowSecurity = false;
+
+	return result;
+}
 
 /*--------------------
  * subquery_planner
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index ec828cd..ef8c317 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -435,6 +435,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
 			{
 				SeqScan    *splan = (SeqScan *) plan;
 
@@ -445,6 +446,24 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 					fix_scan_list(root, splan->plan.qual, rtoffset);
 			}
 			break;
+		case T_Funnel:
+			{
+				Funnel    *splan = (Funnel *) plan;
+
+				splan->scan.scanrelid += rtoffset;
+				splan->scan.plan.targetlist =
+					fix_scan_list(root, splan->scan.plan.targetlist, rtoffset);
+				splan->scan.plan.qual =
+					fix_scan_list(root, splan->scan.plan.qual, rtoffset);
+
+				/*
+				 * target list for partial sequence scan (lefttree of funnel plan)
+				 * should be same as for funnel scan as both nodes need to produce
+				 * same projection.
+				 */
+				splan->scan.plan.lefttree->targetlist = splan->scan.plan.targetlist;
+			}
+			break;
 		case T_IndexScan:
 			{
 				IndexScan  *splan = (IndexScan *) plan;
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index acfd0bc..f649639 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2167,6 +2167,8 @@ finalize_plan(PlannerInfo *root, Plan *plan, Bitmapset *valid_params,
 			break;
 
 		case T_SeqScan:
+		case T_PartialSeqScan:
+		case T_Funnel:
 			context.paramids = bms_add_members(context.paramids, scan_params);
 			break;
 
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index faca30b..0e5fd3a 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -706,6 +706,53 @@ create_seqscan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer)
 }
 
 /*
+ * create_partialseqscan_path
+ *	  Creates a path corresponding to a partial sequential scan, returning the
+ *	  pathnode.
+ */
+Path *
+create_partialseqscan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer)
+{
+	Path	   *pathnode = makeNode(Path);
+
+	pathnode->pathtype = T_PartialSeqScan;
+	pathnode->parent = rel;
+	pathnode->param_info = get_baserel_parampathinfo(root, rel,
+													 required_outer);
+	pathnode->pathkeys = NIL;	/* seqscan has unordered result */
+
+	cost_seqscan(pathnode, root, rel, pathnode->param_info);
+
+	return pathnode;
+}
+
+/*
+ * create_funnel_path
+ *
+ *	  Creates a path corresponding to a funnel scan, returning the
+ *	  pathnode.
+ */
+FunnelPath *
+create_funnel_path(PlannerInfo *root, RelOptInfo *rel,
+							Path* subpath, int nWorkers)
+{
+	FunnelPath	   *pathnode = makeNode(FunnelPath);
+
+	pathnode->path.pathtype = T_Funnel;
+	pathnode->path.parent = rel;
+	pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
+													 false);
+	pathnode->path.pathkeys = NIL;	/* seqscan has unordered result */
+
+	pathnode->subpath = subpath;
+	pathnode->num_workers = nWorkers;
+
+	cost_funnel(pathnode, root, rel, pathnode->path.param_info, nWorkers);
+
+	return pathnode;
+}
+
+/*
  * create_index_path
  *	  Creates a path node for an index scan.
  *
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 71c2321..f056bd5 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -12,7 +12,8 @@ subdir = src/backend/postmaster
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = autovacuum.o bgworker.o bgwriter.o checkpointer.o fork_process.o \
-	pgarch.o pgstat.o postmaster.o startup.o syslogger.o walwriter.o
+OBJS = autovacuum.o backendworker.o bgworker.o bgwriter.o checkpointer.o \
+	fork_process.o pgarch.o pgstat.o postmaster.o startup.o syslogger.o \
+	walwriter.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/backendworker.c b/src/backend/postmaster/backendworker.c
new file mode 100644
index 0000000..925bb7a
--- /dev/null
+++ b/src/backend/postmaster/backendworker.c
@@ -0,0 +1,421 @@
+/*-------------------------------------------------------------------------
+ *
+ * backendworker.c
+ *	  Support routines for setting up backend workers.
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/postmaster/backendworker.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * INTERFACE ROUTINES
+ *		InitializeParallelWorkers				Setup dynamic shared memory and parallel backend workers.
+ */
+#include "postgres.h"
+
+#include "executor/nodeFunnel.h"
+#include "optimizer/planmain.h"
+#include "optimizer/planner.h"
+#include "postmaster/backendworker.h"
+#include "tcop/tcopprot.h"
+
+
+#define PARALLEL_TUPLE_QUEUE_SIZE					65536
+
+static void ParallelQueryMain(dsm_segment *seg, shm_toc *toc);
+static void
+EstimateParallelSupportInfoSpace(ParallelContext *pcxt, ParamListInfo params,
+								 int instOptions, Size *params_size);
+static void
+StoreParallelSupportInfo(ParallelContext *pcxt, ParamListInfo params,
+						 int instOptions, int params_size,
+						 char **inst_options_space,
+						 char **buffer_usage_space);
+static void
+EstimatePartialSeqScanSpace(ParallelContext *pcxt, EState *estate,
+							char *plannedstmt_str, Size *plannedstmt_len,
+							Size *pscan_size);
+static void
+StorePartialSeqScan(ParallelContext *pcxt, EState *estate, Relation rel,
+					char *plannedstmt_str, Size plannedstmt_size,
+					Size pscan_size);
+static void EstimateResponseQueueSpace(ParallelContext *pcxt);
+static void
+StoreResponseQueue(ParallelContext *pcxt,
+				   shm_mq_handle ***responseqp);
+static void
+GetPlannedStmt(shm_toc *toc, PlannedStmt **plannedstmt);
+static void
+GetParallelSupportInfo(shm_toc *toc, ParamListInfo *params,
+					   int *inst_options, char **instrument,
+					   char **buffer_usage);
+static void
+SetupResponseQueue(dsm_segment *seg, shm_toc *toc, shm_mq **mq,
+				   shm_mq_handle **responseq);
+
+
+/*
+ * EstimateParallelSupportInfoSpace
+ *
+ * Estimate the amount of space required to record information of
+ * bind parameters and instrumentation information that need to be
+ * retrieved from parallel workers.
+ */
+void
+EstimateParallelSupportInfoSpace(ParallelContext *pcxt, ParamListInfo params,
+								 int instOptions, Size *params_size)
+{
+	*params_size = EstimateBoundParametersSpace(params);
+	shm_toc_estimate_chunk(&pcxt->estimator, *params_size);
+
+	/*
+	 * We expect each worker to populate the BufferUsage structure
+	 * allocated by master backend and then master backend will aggregate
+	 * all the usage along with its own, so account it for each worker.
+	 */
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   sizeof(BufferUsage) * pcxt->nworkers);
+
+	/* account for instrumentation options. */
+	shm_toc_estimate_chunk(&pcxt->estimator, sizeof(int));
+
+	/*
+	 * We expect each worker to populate the instrumentation structure
+	 * allocated by master backend and then master backend will aggregate
+	 * all the information, so account it for each worker.
+	 */
+	if (instOptions)
+	{
+		shm_toc_estimate_chunk(&pcxt->estimator,
+							   sizeof(Instrumentation) * pcxt->nworkers);
+		/* keys for parallel support information. */
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+	}
+
+	/* keys for parallel support information. */
+	shm_toc_estimate_keys(&pcxt->estimator, 3);
+}
+
+/*
+ * StoreParallelSupportInfo
+ * 
+ * Sets up the bind parameters and instrumentation information
+ * required for parallel execution.
+ */
+void
+StoreParallelSupportInfo(ParallelContext *pcxt, ParamListInfo params,
+						 int instOptions, int params_size,
+						 char **inst_options_space,
+						 char **buffer_usage_space)
+{
+	char	*paramsdata;
+	int		*inst_options;
+
+	/*
+	 * Store the bind parameter list in dynamic shared memory.  This is
+	 * used for parameters in a prepared query.
+	 */
+	paramsdata = shm_toc_allocate(pcxt->toc, params_size);
+	SerializeBoundParams(params, params_size, paramsdata);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_PARAMS, paramsdata);
+
+	/*
+	 * Allocate space for BufferUsage information to be filled by
+	 * each worker.
+	 */
+	*buffer_usage_space =
+			shm_toc_allocate(pcxt->toc, sizeof(BufferUsage) * pcxt->nworkers);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFF_USAGE, *buffer_usage_space);
+
+	/* Store instrument options in dynamic shared memory. */
+	inst_options = shm_toc_allocate(pcxt->toc, sizeof(int));
+	*inst_options = instOptions;
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_INST_OPTIONS, inst_options);
+
+	/*
+	 * Allocate space for instrumentation information to be filled by
+	 * each worker.
+	 */
+	if (instOptions)
+	{
+		*inst_options_space =
+			shm_toc_allocate(pcxt->toc, sizeof(Instrumentation) * pcxt->nworkers);
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_INST_INFO, *inst_options_space);
+	}
+}
+
+/*
+ * EstimatePartialSeqScanSpace
+ *
+ * Estimate the amount of space required to record information of
+ * planned statement and parallel heap scan descriptor that need
+ * to be copied to parallel workers.
+ */
+void
+EstimatePartialSeqScanSpace(ParallelContext *pcxt, EState *estate,
+							char *plannedstmt_str, Size *plannedstmt_len,
+							Size *pscan_size)
+{
+	/* Estimate space for partial seq. scan specific contents. */
+	*plannedstmt_len = strlen(plannedstmt_str) + 1;
+	shm_toc_estimate_chunk(&pcxt->estimator, *plannedstmt_len);
+
+	*pscan_size = heap_parallelscan_estimate(estate->es_snapshot);
+	shm_toc_estimate_chunk(&pcxt->estimator, *pscan_size);
+
+	/* keys for parallel support information. */
+	shm_toc_estimate_keys(&pcxt->estimator, 2);
+}
+
+/*
+ * StorePartialSeqScan
+ * 
+ * Sets up the planned statement and the parallel heap scan descriptor
+ * for parallel sequential scan.
+ */
+void
+StorePartialSeqScan(ParallelContext *pcxt, EState *estate, Relation rel,
+					char *plannedstmt_str, Size plannedstmt_size,
+					Size pscan_size)
+{
+	char		*plannedstmtdata;
+	ParallelHeapScanDesc pscan;
+
+	/* Store the serialized planned statement in dynamic shared memory. */
+	plannedstmtdata = shm_toc_allocate(pcxt->toc, plannedstmt_size);
+	memcpy(plannedstmtdata, plannedstmt_str, plannedstmt_size);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_PLANNEDSTMT, plannedstmtdata);
+
+	/* Store parallel heap scan descriptor in dynamic shared memory. */
+	pscan = shm_toc_allocate(pcxt->toc, pscan_size);
+	heap_parallelscan_initialize(pscan, rel, estate->es_snapshot);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_SCAN, pscan);
+}
+
+/*
+ * EstimateResponseQueueSpace
+ *
+ * Estimate the amount of space required to record information of
+ * tuple queues that need to be established between parallel workers
+ * and master backend.
+ */
+void
+EstimateResponseQueueSpace(ParallelContext *pcxt)
+{
+	/* Estimate space for the per-worker tuple queues. */
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   (Size) PARALLEL_TUPLE_QUEUE_SIZE * pcxt->nworkers);
+
+	/* keys for response queue. */
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/*
+ * StoreResponseQueue
+ * 
+ * Sets up the response queues that backend workers use to
+ * return tuples to the main backend.
+ */
+void
+StoreResponseQueue(ParallelContext *pcxt,
+				   shm_mq_handle ***responseqp)
+{
+	shm_mq		*mq;
+	char		*tuple_queue_space;
+	int			i;
+
+	/* Allocate memory for shared memory queue handles. */
+	*responseqp = (shm_mq_handle**) palloc(pcxt->nworkers * sizeof(shm_mq_handle*));
+
+	/*
+	 * Establish one message queue per worker in dynamic shared memory.
+	 * These queues should be used to transmit tuple data.
+	 */
+	tuple_queue_space =
+	   shm_toc_allocate(pcxt->toc, PARALLEL_TUPLE_QUEUE_SIZE * pcxt->nworkers);
+	for (i = 0; i < pcxt->nworkers; ++i)
+	{
+		mq = shm_mq_create(tuple_queue_space + i * PARALLEL_TUPLE_QUEUE_SIZE,
+						   (Size) PARALLEL_TUPLE_QUEUE_SIZE);
+		
+		shm_mq_set_receiver(mq, MyProc);
+
+		/*
+		 * Attach the queue before launching a worker, so that we'll automatically
+		 * detach the queue if we error out.  (Otherwise, the worker might sit
+		 * there trying to write the queue long after we've gone away.)
+		 */
+		(*responseqp)[i] = shm_mq_attach(mq, pcxt->seg, NULL);
+	}
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_TUPLE_QUEUE, tuple_queue_space);
+}
+
+/*
+ * InitializeParallelWorkers
+ *
+ *	Sets up the required infrastructure for backend workers to
+ *	perform execution and return results to the main backend.
+ */
+void
+InitializeParallelWorkers(Plan *plan, EState *estate, Relation rel,
+						  char **inst_options_space, char **buffer_usage_space,
+						  shm_mq_handle ***responseqp, ParallelContext **pcxtp,
+						  int nWorkers)
+{
+	Size		params_size, pscan_size, plannedstmt_size;
+	char	   *plannedstmt_str;
+	PlannedStmt	*plannedstmt;
+	ParallelContext *pcxt;
+
+	pcxt = CreateParallelContext(ParallelQueryMain, nWorkers);
+
+	plannedstmt = create_parallel_worker_plannedstmt((PartialSeqScan *)plan,
+													 estate->es_range_table);
+	plannedstmt_str = nodeToString(plannedstmt);
+
+	EstimatePartialSeqScanSpace(pcxt, estate, plannedstmt_str,
+								&plannedstmt_size, &pscan_size);
+	EstimateParallelSupportInfoSpace(pcxt, estate->es_param_list_info,
+									 estate->es_instrument, &params_size);
+	EstimateResponseQueueSpace(pcxt);
+
+	InitializeParallelDSM(pcxt);
+	
+	StorePartialSeqScan(pcxt, estate, rel, plannedstmt_str,
+						plannedstmt_size, pscan_size);
+	StoreParallelSupportInfo(pcxt, estate->es_param_list_info,
+							 estate->es_instrument,
+							 params_size, inst_options_space,
+							 buffer_usage_space);
+	StoreResponseQueue(pcxt, responseqp);
+
+	/* Return results to caller. */
+	*pcxtp = pcxt;
+}
+
+/*
+ * GetParallelSupportInfo
+ *
+ * Look up based on keys in dynamic shared memory segment
+ * and get the bind parameters and instrumentation information
+ * required to perform parallel operation.
+ */
+void
+GetParallelSupportInfo(shm_toc *toc, ParamListInfo *params,
+					   int *inst_options, char **instrument,
+					   char **buffer_usage)
+{
+	char		*paramsdata;
+	char		*inst_options_space;
+	char		*buffer_usage_space;
+	int			*instoptions;
+
+	paramsdata = shm_toc_lookup(toc, PARALLEL_KEY_PARAMS);
+	instoptions	= shm_toc_lookup(toc, PARALLEL_KEY_INST_OPTIONS);
+
+	*params = RestoreBoundParams(paramsdata);
+
+	*inst_options = *instoptions;
+	if (*inst_options)
+	{
+		inst_options_space = shm_toc_lookup(toc, PARALLEL_KEY_INST_INFO);
+		*instrument = (inst_options_space +
+			ParallelWorkerNumber * sizeof(Instrumentation));
+	}
+
+	buffer_usage_space = shm_toc_lookup(toc, PARALLEL_KEY_BUFF_USAGE);
+	*buffer_usage = (buffer_usage_space +
+					 ParallelWorkerNumber * sizeof(BufferUsage));
+}
+
+/*
+ * GetPlannedStmt
+ *
+ * Look up based on keys in dynamic shared memory segment
+ * and get the planned statement required to perform
+ * parallel operation.
+ */
+void
+GetPlannedStmt(shm_toc *toc, PlannedStmt **plannedstmt)
+{
+	char		*plannedstmtdata;
+
+	plannedstmtdata = shm_toc_lookup(toc, PARALLEL_KEY_PLANNEDSTMT);
+
+	*plannedstmt = (PlannedStmt *) stringToNode(plannedstmtdata);
+
+	/* Fill in opfuncid values if missing */
+	fix_opfuncids((Node*) (*plannedstmt)->planTree->qual);
+	fix_opfuncids((Node*) (*plannedstmt)->planTree->targetlist);
+}
+
+/*
+ * SetupResponseQueue
+ *
+ * Look up based on keys in dynamic shared memory segment
+ * and get the tuple queue information for a particular worker,
+ * attach to the queue and redirect all further responses from
+ * worker backend via that queue.
+ */
+void
+SetupResponseQueue(dsm_segment *seg, shm_toc *toc, shm_mq **mq,
+				   shm_mq_handle **responseq)
+{
+	char		*tuple_queue_space;
+
+	tuple_queue_space = shm_toc_lookup(toc, PARALLEL_KEY_TUPLE_QUEUE);
+	*mq = (shm_mq *) (tuple_queue_space +
+		ParallelWorkerNumber * PARALLEL_TUPLE_QUEUE_SIZE);
+
+	shm_mq_set_sender(*mq, MyProc);
+	*responseq = shm_mq_attach(*mq, seg, NULL);
+}
+
+/*
+ * ParallelQueryMain
+ *
+ * Execute the operation to return the tuples or other information
+ * to parallelism driving node.
+ */
+void
+ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
+{
+	shm_mq			*mq;
+	shm_mq_handle	*responseq;
+	PlannedStmt		*plannedstmt;
+	ParamListInfo	params;
+	int				inst_options;
+	char			*instrument = NULL;
+	char			*buffer_usage = NULL;
+	ParallelStmt	*parallelstmt;
+
+	SetupResponseQueue(seg, toc, &mq, &responseq);
+
+	GetPlannedStmt(toc, &plannedstmt);
+	GetParallelSupportInfo(toc, &params, &inst_options,
+						   &instrument, &buffer_usage);
+
+	parallelstmt = palloc(sizeof(ParallelStmt));
+
+	parallelstmt->plannedstmt = plannedstmt;
+	parallelstmt->params	= params;
+	parallelstmt->inst_options = inst_options;
+	parallelstmt->instrument = instrument;
+	parallelstmt->buffer_usage = buffer_usage;
+	parallelstmt->toc = toc;
+	parallelstmt->responseq = responseq;
+
+	/* Execute the worker command. */
+	exec_parallel_stmt(parallelstmt);
+
+	/*
+	 * Once we are done with sending tuples, detach from
+	 * shared memory message queue used to send tuples.
+	 */
+	shm_mq_detach(mq);
+}
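A side note on the layout used in backendworker.c above: the BufferUsage, Instrumentation and tuple-queue areas are each one contiguous chunk sized as (slot size * nworkers), and a worker finds its own slot at (base + ParallelWorkerNumber * slot size). The following is only a minimal standalone sketch of that addressing scheme, not code from the patch. WorkerStats, stats_region_size, stats_slot and stats_total are hypothetical names invented for illustration, and an ordinary heap buffer stands in for the dynamic shared memory segment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-worker record; a stand-in, not PostgreSQL's BufferUsage. */
typedef struct WorkerStats
{
	long		blocks_hit;
	long		blocks_read;
} WorkerStats;

/* Size of one contiguous region holding a slot for every worker. */
static size_t
stats_region_size(int nworkers)
{
	return sizeof(WorkerStats) * (size_t) nworkers;
}

/* Address of the slot owned by worker_number inside the region. */
static WorkerStats *
stats_slot(char *region, int worker_number)
{
	assert(region != NULL && worker_number >= 0);
	return (WorkerStats *) (region +
							(size_t) worker_number * sizeof(WorkerStats));
}

/* Leader side: add up every worker's slot into a single total. */
static WorkerStats
stats_total(char *region, int nworkers)
{
	WorkerStats total = {0, 0};
	int			i;

	for (i = 0; i < nworkers; i++)
	{
		WorkerStats *slot = stats_slot(region, i);

		total.blocks_hit += slot->blocks_hit;
		total.blocks_read += slot->blocks_read;
	}
	return total;
}
```

Because each worker writes only to its own slot, this arrangement needs no locking between workers; the leader reads the slots once the workers have finished.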
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 9b2e7f3..0c6b481 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -103,6 +103,7 @@
 #include "miscadmin.h"
 #include "pg_getopt.h"
 #include "pgstat.h"
+#include "optimizer/cost.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/bgworker_internals.h"
 #include "postmaster/fork_process.h"
@@ -835,6 +836,12 @@ PostmasterMain(int argc, char *argv[])
 		ereport(ERROR,
 				(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"archive\", \"hot_standby\", or \"logical\"")));
 
+	if (parallel_seqscan_degree >= MaxConnections)
+	{
+		write_stderr("%s: parallel_seqscan_degree must be less than max_connections\n", progname);
+		ExitPostmaster(1);
+	}
+
 	/*
 	 * Other one-time internal sanity checks can go here, if they are fast.
 	 * (Put any slow processing further down, after postmaster.pid creation.)
diff --git a/src/backend/tcop/dest.c b/src/backend/tcop/dest.c
index bcf3895..7a9ce3e 100644
--- a/src/backend/tcop/dest.c
+++ b/src/backend/tcop/dest.c
@@ -34,6 +34,7 @@
 #include "commands/createas.h"
 #include "commands/matview.h"
 #include "executor/functions.h"
+#include "executor/tqueue.h"
 #include "executor/tstoreReceiver.h"
 #include "libpq/libpq.h"
 #include "libpq/pqformat.h"
@@ -129,6 +130,9 @@ CreateDestReceiver(CommandDest dest)
 
 		case DestTransientRel:
 			return CreateTransientRelDestReceiver(InvalidOid);
+
+		case DestTupleQueue:
+			return CreateTupleQueueDestReceiver();
 	}
 
 	/* should never get here */
@@ -162,6 +166,7 @@ EndCommand(const char *commandTag, CommandDest dest)
 		case DestCopyOut:
 		case DestSQLFunction:
 		case DestTransientRel:
+		case DestTupleQueue:
 			break;
 	}
 }
@@ -204,6 +209,7 @@ NullCommand(CommandDest dest)
 		case DestCopyOut:
 		case DestSQLFunction:
 		case DestTransientRel:
+		case DestTupleQueue:
 			break;
 	}
 }
@@ -248,6 +254,7 @@ ReadyForQuery(CommandDest dest)
 		case DestCopyOut:
 		case DestSQLFunction:
 		case DestTransientRel:
+		case DestTupleQueue:
 			break;
 	}
 }
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 7c18298..92da4f8 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -42,6 +42,7 @@
 #include "catalog/pg_type.h"
 #include "commands/async.h"
 #include "commands/prepare.h"
+#include "executor/tqueue.h"
 #include "libpq/libpq.h"
 #include "libpq/pqformat.h"
 #include "libpq/pqsignal.h"
@@ -55,6 +56,7 @@
 #include "pg_getopt.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/postmaster.h"
+#include "postmaster/backendworker.h"
 #include "replication/slot.h"
 #include "replication/walsender.h"
 #include "rewrite/rewriteHandler.h"
@@ -1192,6 +1194,98 @@ exec_simple_query(const char *query_string)
 }
 
 /*
+ * exec_parallel_stmt
+ *
+ * Execute the plan for backend worker.
+ */
+void
+exec_parallel_stmt(ParallelStmt *parallelstmt)
+{
+	DestReceiver *receiver;
+	QueryDesc	*queryDesc;
+	MemoryContext oldcontext;
+	MemoryContext	plancontext;
+	BufferUsage bufusage_start;
+	BufferUsage bufusage_end = {0};
+
+	set_ps_display("SELECT", false);
+
+	/*
+	 * Unlike exec_simple_query(), in backend worker we won't allow
+	 * transaction control statements, so we can allow plancontext
+	 * to be created in TopTransaction context.
+	 */
+	plancontext = AllocSetContextCreate(CurrentMemoryContext,
+										 "worker plan",
+										 ALLOCSET_DEFAULT_MINSIZE,
+										 ALLOCSET_DEFAULT_INITSIZE,
+										 ALLOCSET_DEFAULT_MAXSIZE);
+
+	oldcontext = MemoryContextSwitchTo(plancontext);
+
+	if (parallelstmt->inst_options)
+		receiver = None_Receiver;
+	else
+	{
+		receiver = CreateDestReceiver(DestTupleQueue);
+		SetTupleQueueDestReceiverParams(receiver, parallelstmt->responseq);
+	}
+
+	/* Create a QueryDesc for the query */
+	queryDesc = CreateQueryDesc(parallelstmt->plannedstmt, "",
+								GetActiveSnapshot(), InvalidSnapshot,
+								receiver, parallelstmt->params,
+								parallelstmt->inst_options);
+
+	queryDesc->toc = parallelstmt->toc;
+
+	PushActiveSnapshot(queryDesc->snapshot);
+
+	/* call ExecutorStart to prepare the plan for execution */
+	ExecutorStart(queryDesc, 0);
+
+	/*
+	 * Calculate the buffer usage for this statement run, it is required
+	 * by plugins to report the total usage for statement execution.
+	 */
+	bufusage_start = pgBufferUsage;
+
+	/* run the plan */
+	ExecutorRun(queryDesc, ForwardScanDirection, 0L);
+
+	BufferUsageAccumDiff(&bufusage_end,
+						 &pgBufferUsage, &bufusage_start);
+
+	/* run cleanup too */
+	ExecutorFinish(queryDesc);
+
+	/* copy buffer usage into shared memory. */
+	memcpy(parallelstmt->buffer_usage,
+		   &bufusage_end,
+		   sizeof(BufferUsage));
+
+	/*
+	 * copy instrumentation information into shared memory if requested
+	 * by master backend.
+	 */
+	if (parallelstmt->inst_options)
+		memcpy(parallelstmt->instrument,
+			   queryDesc->planstate->instrument,
+			   sizeof(Instrumentation));
+
+	ExecutorEnd(queryDesc);
+
+	PopActiveSnapshot();
+
+	FreeQueryDesc(queryDesc);
+
+	if (!parallelstmt->inst_options)
+		(*receiver->rDestroy) (receiver);
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
+/*
  * exec_parse_message
  *
  * Execute a "Parse" protocol message.
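A side note on the buffer usage bookkeeping in exec_parallel_stmt above: the worker snapshots the running counters before ExecutorRun, then accumulates (current - snapshot) into a zeroed structure, so only the work done by this statement is reported. The following is a minimal standalone sketch of that snapshot-and-diff pattern, not code from the patch. UsageCounters, usage_accum_diff, usage_measure and sample_scan are hypothetical names invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical counters, standing in for a usage record such as BufferUsage. */
typedef struct UsageCounters
{
	long		blocks_hit;
	long		blocks_read;
} UsageCounters;

/* dst += add - sub, field by field. */
static void
usage_accum_diff(UsageCounters *dst, const UsageCounters *add,
				 const UsageCounters *sub)
{
	dst->blocks_hit += add->blocks_hit - sub->blocks_hit;
	dst->blocks_read += add->blocks_read - sub->blocks_read;
}

/* Example workload: pretend a scan hit 3 blocks and read 2 from disk. */
static void
sample_scan(UsageCounters *running)
{
	running->blocks_hit += 3;
	running->blocks_read += 2;
}

/* Report only what `work` added to the running counters. */
static UsageCounters
usage_measure(UsageCounters *running, void (*work) (UsageCounters *))
{
	UsageCounters start;
	UsageCounters delta = {0, 0};

	assert(running != NULL && work != NULL);
	start = *running;			/* snapshot before the run */
	work(running);
	usage_accum_diff(&delta, running, &start);
	return delta;
}
```

The delta, rather than the raw running counters, is what gets copied into the worker's slot, so activity from before the statement is not counted against it.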
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 9c14e8a..0bbc67b 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -80,6 +80,7 @@ CreateQueryDesc(PlannedStmt *plannedstmt,
 	qd->params = params;		/* parameter values passed into query */
 	qd->instrument_options = instrument_options;		/* instrumentation
 														 * wanted? */
+	qd->toc = NULL;		/* needs to be set by the caller before ExecutorStart */
 
 	/* null these fields until set by ExecutorStart */
 	qd->tupDesc = NULL;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 9c74ed3..fc1d639 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -608,6 +608,8 @@ const char *const config_group_names[] =
 	gettext_noop("Statistics / Query and Index Statistics Collector"),
 	/* AUTOVACUUM */
 	gettext_noop("Autovacuum"),
+	/* PARALLEL_QUERY */
+	gettext_noop("Parallel Query"),
 	/* CLIENT_CONN */
 	gettext_noop("Client Connection Defaults"),
 	/* CLIENT_CONN_STATEMENT */
@@ -2557,6 +2559,16 @@ static struct config_int ConfigureNamesInt[] =
 	},
 
 	{
+		{"parallel_seqscan_degree", PGC_SUSET, PARALLEL_QUERY,
+			gettext_noop("Sets the maximum number of simultaneously running backend worker processes."),
+			NULL
+		},
+		&parallel_seqscan_degree,
+		0, 0, MAX_BACKENDS,
+		NULL, NULL, NULL
+	},
+
+	{
 		{"autovacuum_work_mem", PGC_SIGHUP, RESOURCES_MEM,
 			gettext_noop("Sets the maximum memory to be used by each autovacuum worker process."),
 			NULL,
@@ -2744,6 +2756,36 @@ static struct config_real ConfigureNamesReal[] =
 		DEFAULT_CPU_OPERATOR_COST, 0, DBL_MAX,
 		NULL, NULL, NULL
 	},
+	{
+		{"cpu_tuple_comm_cost", PGC_USERSET, QUERY_TUNING_COST,
+			gettext_noop("Sets the planner's estimate of the cost of "
+						 "passing each tuple (row) from worker to master backend."),
+			NULL
+		},
+		&cpu_tuple_comm_cost,
+		DEFAULT_CPU_TUPLE_COMM_COST, 0, DBL_MAX,
+		NULL, NULL, NULL
+	},
+	{
+		{"parallel_setup_cost", PGC_USERSET, QUERY_TUNING_COST,
+			gettext_noop("Sets the planner's estimate of the cost of "
+						 "setting up environment (shared memory) for parallelism."),
+			NULL
+		},
+		&parallel_setup_cost,
+		DEFAULT_PARALLEL_SETUP_COST, 0, DBL_MAX,
+		NULL, NULL, NULL
+	},
+	{
+		{"parallel_startup_cost", PGC_USERSET, QUERY_TUNING_COST,
+			gettext_noop("Sets the planner's estimate of the cost of "
+						 "starting parallel workers."),
+			NULL
+		},
+		&parallel_startup_cost,
+		DEFAULT_PARALLEL_STARTUP_COST, 0, DBL_MAX,
+		NULL, NULL, NULL
+	},
 
 	{
 		{"cursor_tuple_fraction", PGC_USERSET, QUERY_TUNING_OTHER,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 110983f..06c5969 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -291,6 +291,9 @@
 #cpu_tuple_cost = 0.01			# same scale as above
 #cpu_index_tuple_cost = 0.005		# same scale as above
 #cpu_operator_cost = 0.0025		# same scale as above
+#cpu_tuple_comm_cost = 0.1		# same scale as above
+#parallel_setup_cost = 0.0	# same scale as above
+#parallel_startup_cost = 0.0	# same scale as above
 #effective_cache_size = 4GB
 
 # - Genetic Query Optimizer -
@@ -501,6 +504,11 @@
 					# autovacuum, -1 means use
 					# vacuum_cost_limit
 
+#------------------------------------------------------------------------------
+# PARALLEL_QUERY PARAMETERS
+#------------------------------------------------------------------------------
+
+#parallel_seqscan_degree = 0		# max number of worker backend subprocesses
 
 #------------------------------------------------------------------------------
 # CLIENT CONNECTION DEFAULTS
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index d36e738..0a34b48 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -117,6 +117,7 @@ extern HeapScanDesc heap_beginscan_bm(Relation relation, Snapshot snapshot,
 extern void heap_setscanlimits(HeapScanDesc scan, BlockNumber startBlk,
 		   BlockNumber endBlk);
 extern void heap_rescan(HeapScanDesc scan, ScanKey key);
+extern void heap_parallel_rescan(ParallelHeapScanDesc pscan, HeapScanDesc scan);
 extern void heap_endscan(HeapScanDesc scan);
 extern HeapTuple heap_getnext(HeapScanDesc scan, ScanDirection direction);
 
diff --git a/src/include/executor/execdesc.h b/src/include/executor/execdesc.h
index a2381cd..56b7c75 100644
--- a/src/include/executor/execdesc.h
+++ b/src/include/executor/execdesc.h
@@ -42,6 +42,7 @@ typedef struct QueryDesc
 	DestReceiver *dest;			/* the destination for tuple output */
 	ParamListInfo params;		/* param values being passed in */
 	int			instrument_options;		/* OR of InstrumentOption flags */
+	shm_toc		*toc;			/* to fetch the information from dsm */
 
 	/* These fields are set by ExecutorStart */
 	TupleDesc	tupDesc;		/* descriptor for result tuples */
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 1c3b2b0..0d28606 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -69,5 +69,12 @@ extern Instrumentation *InstrAlloc(int n, int instrument_options);
 extern void InstrStartNode(Instrumentation *instr);
 extern void InstrStopNode(Instrumentation *instr, double nTuples);
 extern void InstrEndLoop(Instrumentation *instr);
+extern void InstrAggNode(Instrumentation *instr1, Instrumentation *instr2);
+extern void
+	InstrAggBufferUsage(BufferUsage *buffer_usage_dst, BufferUsage *buffer_usage_add);
+extern void BufferUsageAccumDiff(BufferUsage *dst,
+					 const BufferUsage *add,
+					 const BufferUsage *sub);
+extern void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
 
 #endif   /* INSTRUMENT_H */
diff --git a/src/include/executor/nodeFunnel.h b/src/include/executor/nodeFunnel.h
new file mode 100644
index 0000000..3af3a0e
--- /dev/null
+++ b/src/include/executor/nodeFunnel.h
@@ -0,0 +1,24 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeFunnel.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeFunnel.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEFUNNEL_H
+#define NODEFUNNEL_H
+
+#include "nodes/execnodes.h"
+
+extern FunnelState *ExecInitFunnel(Funnel *node, EState *estate, int eflags);
+extern TupleTableSlot *ExecFunnel(FunnelState *node);
+extern void ExecEndFunnel(FunnelState *node);
+extern void ExecReScanFunnel(FunnelState *node);
+
+#endif   /* NODEFUNNEL_H */
diff --git a/src/include/executor/nodePartialSeqscan.h b/src/include/executor/nodePartialSeqscan.h
new file mode 100644
index 0000000..cb05be7
--- /dev/null
+++ b/src/include/executor/nodePartialSeqscan.h
@@ -0,0 +1,24 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodePartialSeqscan.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodePartialSeqscan.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEPARTIALSEQSCAN_H
+#define NODEPARTIALSEQSCAN_H
+
+#include "nodes/execnodes.h"
+
+extern PartialSeqScanState *ExecInitPartialSeqScan(PartialSeqScan *node, EState *estate, int eflags);
+extern TupleTableSlot *ExecPartialSeqScan(PartialSeqScanState *node);
+extern void ExecEndPartialSeqScan(PartialSeqScanState *node);
+extern void ExecReScanPartialSeqScan(PartialSeqScanState *node);
+
+#endif   /* NODEPARTIALSEQSCAN_H */
diff --git a/src/include/executor/tqueue.h b/src/include/executor/tqueue.h
new file mode 100644
index 0000000..c979233
--- /dev/null
+++ b/src/include/executor/tqueue.h
@@ -0,0 +1,34 @@
+/*-------------------------------------------------------------------------
+ *
+ * tqueue.h
+ *	  Use shm_mq to send & receive tuples between parallel backends
+ *
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/tqueue.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef TQUEUE_H
+#define TQUEUE_H
+
+#include "storage/shm_mq.h"
+#include "tcop/dest.h"
+
+/* Use this to send tuples to a shm_mq. */
+extern DestReceiver *CreateTupleQueueDestReceiver(void);
+extern void SetTupleQueueDestReceiverParams(DestReceiver *self,
+						shm_mq_handle *handle);
+
+/* Use these to receive tuples from a shm_mq. */
+typedef struct TupleQueueFunnel TupleQueueFunnel;
+extern TupleQueueFunnel *CreateTupleQueueFunnel(void);
+extern void DestroyTupleQueueFunnel(TupleQueueFunnel *funnel);
+extern void RegisterTupleQueueOnFunnel(TupleQueueFunnel *, shm_mq_handle *);
+extern HeapTuple TupleQueueFunnelNext(TupleQueueFunnel *, bool nowait,
+					 bool *done);
+
+#endif   /* TQUEUE_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index ac75f86..cd79588 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -16,7 +16,9 @@
 
 #include "access/genam.h"
 #include "access/heapam.h"
+#include "access/parallel.h"
 #include "executor/instrument.h"
+#include "executor/tqueue.h"
 #include "nodes/params.h"
 #include "nodes/plannodes.h"
 #include "utils/reltrigger.h"
@@ -389,6 +391,18 @@ typedef struct EState
 	List	   *es_auxmodifytables;		/* List of secondary ModifyTableStates */
 
 	/*
+	 * This is required for parallel plan execution to fetch the
+	 * information from dsm.
+	 */
+	shm_toc		*toc;
+
+	/*
+	 * This is required to collect buffer usage stats from parallel
+	 * workers when requested by plugins.
+	 */
+	bool		total_time;	/* total time spent in ExecutorRun */
+
+	/*
 	 * this ExprContext is for per-output-tuple operations, such as constraint
 	 * checks and index-value computations.  It will be reset for each output
 	 * tuple.  Note that it will be created only if needed.
@@ -1016,6 +1030,11 @@ typedef struct PlanState
 	 * State for management of parameter-change-driven rescanning
 	 */
 	Bitmapset  *chgParam;		/* set of IDs of changed Params */
+	/*
+	 * This is required for parallel plan execution to fetch the
+	 * information from dsm.
+	 */
+	shm_toc			*toc;
 
 	/*
 	 * Other run-time state needed by most if not all node types.
@@ -1216,6 +1235,45 @@ typedef struct ScanState
 typedef ScanState SeqScanState;
 
 /*
+ * PartialSeqScanState extends ScanState by storing additional information
+ * related to scan.
+ */
+typedef struct PartialSeqScanState
+{
+	ScanState		ss;				/* its first field is NodeTag */
+	bool			scan_initialized; /* used to determine if the scan is initialized */
+} PartialSeqScanState;
+
+/*
+ * FunnelState extends ScanState by storing additional information
+ * related to parallel workers.
+ *		pcxt				parallel context for managing generic state information
+ *							required for parallelism.
+ *		responseq			shared memory queues to receive data from workers.
+ *		funnel				maintains the runtime information about queues used to
+ *							receive data from parallel workers.
+ *		inst_options_space	to accumulate instrumentation information from all
+ *							parallel workers.
+ *		buffer_usage_space	to accumulate buffer usage information from all
+ *							parallel workers.
+ *		fs_workersReady		indicates that workers are launched.
+ *		all_workers_done	indicates that all the data from workers has been received.
+ *		local_scan_done		indicates that local scan is completed.
+ */
+typedef struct FunnelState
+{
+	ScanState		ss;				/* its first field is NodeTag */
+	ParallelContext *pcxt;
+	shm_mq_handle	**responseq;
+	TupleQueueFunnel *funnel;
+	char			*inst_options_space;
+	char			*buffer_usage_space;
+	bool			fs_workersReady;
+	bool			all_workers_done;
+	bool			local_scan_done;
+} FunnelState;
+
+/*
  * These structs store information about index quals that don't have simple
  * constant right-hand sides.  See comments for ExecIndexBuildScanKeys()
  * for discussion.
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 38469ef..3f3d572 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -51,6 +51,8 @@ typedef enum NodeTag
 	T_BitmapOr,
 	T_Scan,
 	T_SeqScan,
+	T_PartialSeqScan,
+	T_Funnel,
 	T_IndexScan,
 	T_IndexOnlyScan,
 	T_BitmapIndexScan,
@@ -97,6 +99,8 @@ typedef enum NodeTag
 	T_BitmapOrState,
 	T_ScanState,
 	T_SeqScanState,
+	T_PartialSeqScanState,
+	T_FunnelState,
 	T_IndexScanState,
 	T_IndexOnlyScanState,
 	T_BitmapIndexScanState,
@@ -217,6 +221,7 @@ typedef enum NodeTag
 	T_IndexOptInfo,
 	T_ParamPathInfo,
 	T_Path,
+	T_FunnelPath,
 	T_IndexPath,
 	T_BitmapHeapPath,
 	T_BitmapAndPath,
diff --git a/src/include/nodes/params.h b/src/include/nodes/params.h
index a0f7dd0..65b60a0 100644
--- a/src/include/nodes/params.h
+++ b/src/include/nodes/params.h
@@ -103,4 +103,9 @@ typedef struct ParamExecData
 /* Functions found in src/backend/nodes/params.c */
 extern ParamListInfo copyParamList(ParamListInfo from);
 
+extern Size
+EstimateBoundParametersSpace(ParamListInfo params);
+extern void
+SerializeBoundParams(ParamListInfo params, Size maxsize, char *start_address);
+extern ParamListInfo RestoreBoundParams(char *start_address);
 #endif   /* PARAMS_H */
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 4c63b1a..6a94190 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -20,10 +20,15 @@
 #ifndef PARSENODES_H
 #define PARSENODES_H
 
+#include "executor/instrument.h"
 #include "nodes/bitmapset.h"
 #include "nodes/lockoptions.h"
+#include "nodes/params.h"
+#include "nodes/plannodes.h"
 #include "nodes/primnodes.h"
 #include "nodes/value.h"
+#include "storage/shm_toc.h"
+#include "storage/shm_mq.h"
 
 /* Possible sources of a Query */
 typedef enum QuerySource
@@ -156,6 +161,17 @@ typedef struct Query
 								 * depends on to be semantically valid */
 } Query;
 
+/* worker statement required for parallel execution. */
+typedef struct ParallelStmt
+{
+	PlannedStmt		*plannedstmt;
+	ParamListInfo	params;
+	shm_toc			*toc;
+	shm_mq_handle	*responseq;
+	int				inst_options;
+	char			*instrument;
+	char			*buffer_usage;
+} ParallelStmt;
 
 /****************************************************************************
  *	Supporting data structures for Parse Trees
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 5f0ea1c..7cdf632 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -281,6 +281,22 @@ typedef struct Scan
 typedef Scan SeqScan;
 
 /* ----------------
+ *		partial sequential scan node
+ * ----------------
+ */
+typedef SeqScan PartialSeqScan;
+
+/* ----------------
+ *		parallel sequential scan node
+ * ----------------
+ */
+typedef struct Funnel
+{
+	Scan		scan;
+	int			num_workers;
+} Funnel;
+
+/* ----------------
  *		index scan node
  *
  * indexqualorig is an implicitly-ANDed list of index qual expressions, each
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 72eb49b..c3e1f6a 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -741,6 +741,13 @@ typedef struct Path
 	/* pathkeys is a List of PathKey nodes; see above */
 } Path;
 
+typedef struct FunnelPath
+{
+	Path		path;
+	Path	    *subpath;	/* path for each worker */
+	int			num_workers;
+} FunnelPath;
+
 /* Macro for extracting a path's parameterization relids; beware double eval */
 #define PATH_REQ_OUTER(path)  \
 	((path)->param_info ? (path)->param_info->ppi_req_outer : (Relids) NULL)
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 9c2000b..11f0409 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -26,6 +26,14 @@
 #define DEFAULT_CPU_TUPLE_COST	0.01
 #define DEFAULT_CPU_INDEX_TUPLE_COST 0.005
 #define DEFAULT_CPU_OPERATOR_COST  0.0025
+#define DEFAULT_CPU_TUPLE_COMM_COST 0.1
+/*
+ * XXX - We need some experiments to know what could be
+ * appropriate default values for parallel setup and startup
+ * cost.
+ */
+#define	DEFAULT_PARALLEL_SETUP_COST  0.0
+#define	DEFAULT_PARALLEL_STARTUP_COST  0.0
 
 #define DEFAULT_EFFECTIVE_CACHE_SIZE  524288	/* measured in pages */
 
@@ -48,8 +56,12 @@ extern PGDLLIMPORT double random_page_cost;
 extern PGDLLIMPORT double cpu_tuple_cost;
 extern PGDLLIMPORT double cpu_index_tuple_cost;
 extern PGDLLIMPORT double cpu_operator_cost;
+extern PGDLLIMPORT double cpu_tuple_comm_cost;
+extern PGDLLIMPORT double parallel_setup_cost;
+extern PGDLLIMPORT double parallel_startup_cost;
 extern PGDLLIMPORT int effective_cache_size;
 extern Cost disable_cost;
+extern int	parallel_seqscan_degree;
 extern bool enable_seqscan;
 extern bool enable_indexscan;
 extern bool enable_indexonlyscan;
@@ -68,6 +80,8 @@ extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,
 					double index_pages, PlannerInfo *root);
 extern void cost_seqscan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 			 ParamPathInfo *param_info);
+extern void cost_funnel(FunnelPath *path, PlannerInfo *root,
+				RelOptInfo *baserel, ParamPathInfo *param_info, int nWorkers);
 extern void cost_index(IndexPath *path, PlannerInfo *root,
 		   double loop_count);
 extern void cost_bitmap_heap_scan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 9923f0e..7873565 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -32,6 +32,11 @@ extern bool add_path_precheck(RelOptInfo *parent_rel,
 
 extern Path *create_seqscan_path(PlannerInfo *root, RelOptInfo *rel,
 					Relids required_outer);
+extern Path *
+create_partialseqscan_path(PlannerInfo *root, RelOptInfo *rel,
+					Relids required_outer);
+extern FunnelPath *create_funnel_path(PlannerInfo *root,
+						RelOptInfo *rel, Path *subpath, int nWorkers);
 extern IndexPath *create_index_path(PlannerInfo *root,
 				  IndexOptInfo *index,
 				  List *indexclauses,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 6cad92e..391d519 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -46,6 +46,13 @@ extern void debug_print_rel(PlannerInfo *root, RelOptInfo *rel);
 #endif
 
 /*
+ * parallelpath.c
+ *	  routines to generate parallel scan paths
+ */
+
+extern void create_parallelscan_paths(PlannerInfo *root, RelOptInfo *rel);
+
+/*
  * indxpath.c
  *	  routines to generate index paths
  */
diff --git a/src/include/optimizer/planner.h b/src/include/optimizer/planner.h
index b10a504..8d6e350 100644
--- a/src/include/optimizer/planner.h
+++ b/src/include/optimizer/planner.h
@@ -14,6 +14,7 @@
 #ifndef PLANNER_H
 #define PLANNER_H
 
+#include "nodes/parsenodes.h"
 #include "nodes/plannodes.h"
 #include "nodes/relation.h"
 
@@ -29,6 +30,8 @@ extern PlannedStmt *planner(Query *parse, int cursorOptions,
 		ParamListInfo boundParams);
 extern PlannedStmt *standard_planner(Query *parse, int cursorOptions,
 				 ParamListInfo boundParams);
+extern PlannedStmt	*create_parallel_worker_plannedstmt(PartialSeqScan *partialscan,
+											List *rangetable);
 
 extern Plan *subquery_planner(PlannerGlobal *glob, Query *parse,
 				 PlannerInfo *parent_root,
diff --git a/src/include/postmaster/backendworker.h b/src/include/postmaster/backendworker.h
new file mode 100644
index 0000000..bf91824
--- /dev/null
+++ b/src/include/postmaster/backendworker.h
@@ -0,0 +1,40 @@
+/*--------------------------------------------------------------------
+ * backendworker.h
+ *		POSTGRES backend workers interface
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/postmaster/backendworker.h
+ *--------------------------------------------------------------------
+ */
+#ifndef BACKENDWORKER_H
+#define BACKENDWORKER_H
+
+/*---------------------------------------------------------------------
+ * External module API.
+ *---------------------------------------------------------------------
+ */
+
+#include "libpq/pqmq.h"
+
+/* Table-of-contents constants for our dynamic shared memory segment. */
+#define	PARALLEL_KEY_PLANNEDSTMT	0
+#define	PARALLEL_KEY_PARAMS			1
+#define PARALLEL_KEY_BUFF_USAGE		2
+#define PARALLEL_KEY_INST_OPTIONS	3
+#define PARALLEL_KEY_INST_INFO		4
+#define PARALLEL_KEY_TUPLE_QUEUE	5
+#define PARALLEL_KEY_SCAN			6
+
+extern int	parallel_seqscan_degree;
+
+extern void InitializeParallelWorkers(Plan *plan, EState *estate,
+									  Relation rel, char **inst_options_space,
+									  char **buffer_usage_space,
+									  shm_mq_handle ***responseqp,
+									  ParallelContext **pcxtp,
+									  int nWorkers);
+
+#endif   /* BACKENDWORKER_H */
diff --git a/src/include/tcop/dest.h b/src/include/tcop/dest.h
index 5bcca3f..b560672 100644
--- a/src/include/tcop/dest.h
+++ b/src/include/tcop/dest.h
@@ -94,7 +94,8 @@ typedef enum
 	DestIntoRel,				/* results sent to relation (SELECT INTO) */
 	DestCopyOut,				/* results sent to COPY TO code */
 	DestSQLFunction,			/* results sent to SQL-language func mgr */
-	DestTransientRel			/* results sent to transient relation */
+	DestTransientRel,			/* results sent to transient relation */
+	DestTupleQueue				/* results sent to tuple queue */
 } CommandDest;
 
 /* ----------------
diff --git a/src/include/tcop/tcopprot.h b/src/include/tcop/tcopprot.h
index b3c705f..5c25627 100644
--- a/src/include/tcop/tcopprot.h
+++ b/src/include/tcop/tcopprot.h
@@ -84,5 +84,6 @@ extern void set_debug_options(int debug_flag,
 extern bool set_plan_disabling_options(const char *arg,
 						   GucContext context, GucSource source);
 extern const char *get_stats_option_name(const char *arg);
+extern void exec_parallel_stmt(ParallelStmt *parallelscan);
 
 #endif   /* TCOPPROT_H */
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index cf319af..38855e5 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -85,6 +85,7 @@ enum config_group
 	STATS_MONITORING,
 	STATS_COLLECTOR,
 	AUTOVACUUM,
+	PARALLEL_QUERY,
 	CLIENT_CONN,
 	CLIENT_CONN_STATEMENT,
 	CLIENT_CONN_LOCALE,
#233Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#232)
Re: Parallel Seq Scan

On Fri, Mar 27, 2015 at 2:34 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

The reason for this problem is that the above tab-completion executes
query [1], which contains a subplan for the funnel node, and currently
we don't have the capability (enough infrastructure) to support execution
of subplans by parallel workers. Here one might wonder why we
have chosen a Parallel Plan (Funnel node) for such a case; the
reason is that subplans are attached after Plan generation
(SS_finalize_plan()), and if we want to discard such a plan, it will be
much more costly, tedious and not worth the effort, as we have to
eventually make such a plan work.

Here we have two choices to proceed, first one is to support execution
of subplans by parallel workers and second is execute/scan locally for
Funnel node having subplan (don't launch workers).

It looks to me like this is an InitPlan, not a subplan. There
shouldn't be any problem with a Funnel node having an InitPlan; it
looks to me like all of the InitPlan stuff is handled by common code
within the executor (grep for initPlan), so it ought to work here the
same as it does for anything else. What I suspect is failing
(although you aren't being very clear about it here) is the passing
down of the parameters set by the InitPlan to the workers. I think we
need to make that work; it's an integral piece of the executor
infrastructure and we shouldn't leave it out just because it requires
a bit more IPC.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#234Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#218)
Re: Parallel Seq Scan

On Wed, Mar 18, 2015 at 11:43 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think I figured out the problem. That fix only helps in the case
where the postmaster noticed the new registration previously but
didn't start the worker, and then later notices the termination.
What's much more likely to happen is that the worker is started and
terminated so quickly that both happen before we create a
RegisteredBgWorker for it. The attached patch fixes that case, too.

Patch fixes the problem and now for Rescan, we don't need to Wait
for workers to finish.

I realized that there is a problem with this. If an error occurs in
one of the workers just as we're deciding to kill them all, then the
error won't be reported. Also, the new code to propagate
XactLastRecEnd won't work right, either. I think we need to find a
way to shut down the workers cleanly. The idea generally speaking
should be:

1. Tell all of the workers that we want them to shut down gracefully
without finishing the scan.

2. Wait for them to exit via WaitForParallelWorkersToFinish().

My first idea about how to implement this is to have the master detach
all of the tuple queues via a new function TupleQueueFunnelShutdown().
Then, we should change tqueueReceiveSlot() so that it does not throw
an error when shm_mq_send() returns SHM_MQ_DETACHED. We could modify
the receiveSlot method of a DestReceiver to return bool rather than
void; a "true" value can mean "continue processing" whereas a "false"
value can mean "stop early, just as if we'd reached the end of the
scan".

This design will cause each parallel worker to finish producing the
tuple it's currently in the middle of generating, and then shut down.
You can imagine cases where we'd want the worker to respond faster
than that, though; for example, if it's applying a highly selective
filter condition, we'd like it to stop the scan right away, not when
it finds the next matching tuple. I can't immediately see a real
clean way of accomplishing that, though.
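
The early-stop idea above can be sketched in isolation. This is a minimal model, not PostgreSQL code: TupleQueue, MqResult, queue_send(), receive_slot() and run_worker_scan() are hypothetical stand-ins for the tuple queue, the shm_mq_send() result, and the receiveSlot method discussed in this mail. It only shows the control flow of a receiver that returns false once the master has detached, so the worker stops after the tuple it was producing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the result of sending to a shared queue. */
typedef enum { MQ_SUCCESS, MQ_DETACHED } MqResult;

/* Hypothetical queue: the master "detaches" by clearing attached. */
typedef struct
{
    bool attached;
    int  sent;          /* tuples actually delivered to the master */
} TupleQueue;

static MqResult
queue_send(TupleQueue *q, int tuple)
{
    (void) tuple;       /* payload is irrelevant for this sketch */
    if (!q->attached)
        return MQ_DETACHED;
    q->sent++;
    return MQ_SUCCESS;
}

/*
 * Receiver callback returning bool instead of void:
 * true means "continue processing", false means "stop early".
 * A detached queue is not treated as an error.
 */
static bool
receive_slot(TupleQueue *q, int tuple)
{
    return queue_send(q, tuple) == MQ_SUCCESS;
}

/*
 * Simulated worker scan. The master is assumed to detach once
 * detach_after tuples have been delivered (pass -1 for "never").
 * Returns the number of tuples the worker generated.
 */
static int
run_worker_scan(TupleQueue *q, int ntuples, int detach_after)
{
    int produced = 0;

    assert(q != NULL);
    for (int i = 0; i < ntuples; i++)
    {
        if (q->sent == detach_after)
            q->attached = false;    /* master detaches "concurrently" */

        produced++;                 /* the tuple in progress is finished */
        if (!receive_slot(q, i))
            break;                  /* as if the scan had reached its end */
    }
    return produced;
}
```

With ten tuples and a detach after three deliveries, the worker generates a fourth tuple, fails to send it, and exits the loop without raising an error. The sketch does not address the selective-filter case mentioned above, where the worker might scan for a long time before producing its next tuple.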

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#235Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#223)
Re: Parallel Seq Scan

On Wed, Mar 25, 2015 at 6:27 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Apart from that, I have moved the initialization of the dsm segment from
the InitNode phase to ExecFunnel() (on first execution) as per the suggestion
from Robert. The main idea is that, since it creates a large shared memory
segment, we do the work only when it is really required.

So, suppose we have a plan like this:

Append
  -> Funnel
    -> Partial Seq Scan
  -> Funnel
    -> Partial Seq Scan
(repeated many times)

In earlier versions of this patch, that was chewing up lots of DSM
segments. But it seems to me, on further reflection, that it should
never use more than one at a time. The first funnel node should
initialize its workers and then when it finishes, all those workers
should get shut down cleanly and the DSM destroyed before the next
scan is initialized.

Obviously we could do better here: if we put the Funnel on top of the
Append instead of underneath it, we could avoid shutting down and
restarting workers for every child node. But even without that, I'm
hoping it's no longer the case that this uses more than one DSM at a
time. If that's not the case, we should see if we can't fix that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#236Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#233)
Re: Parallel Seq Scan

On Mon, Mar 30, 2015 at 8:11 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Mar 27, 2015 at 2:34 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

The reason for this problem is that the above tab-completion executes
query [1], which contains a subplan for the funnel node, and currently
we don't have the capability (enough infrastructure) to support execution
of subplans by parallel workers. Here one might wonder why we
have chosen a Parallel Plan (Funnel node) for such a case; the
reason is that subplans are attached after Plan generation
(SS_finalize_plan()), and if we want to discard such a plan, it will be
much more costly, tedious and not worth the effort, as we have to
eventually make such a plan work.

Here we have two choices to proceed, first one is to support execution
of subplans by parallel workers and second is execute/scan locally for
Funnel node having subplan (don't launch workers).

It looks to me like this is an InitPlan, not a subplan. There
shouldn't be any problem with a Funnel node having an InitPlan; it
looks to me like all of the InitPlan stuff is handled by common code
within the executor (grep for initPlan), so it ought to work here the
same as it does for anything else. What I suspect is failing
(although you aren't being very clear about it here) is the passing
down of the parameters set by the InitPlan to the workers.

It is failing because we are not passing InitPlan itself (InitPlan is
nothing but a list of SubPlan), and I tried to describe in a previous
mail [1] what we need to do to achieve the same. In short, it is not
difficult to pass down the required parameters (like plan->InitPlan or
plannedstmt->subplans); rather, the main missing part is the handling
of such parameters on the worker side (mainly we need to provide support
for all plan nodes which can be passed as part of InitPlan in readfuncs.c).
I am not against supporting InitPlans on the worker side, but just wanted to
say that if possible, why not leave that for the first version.

[1]
I have tried to evaluate what it would take us to support execution
of subplans by parallel workers. We need to pass the subplans
stored in the Funnel node (initPlan) and the corresponding subplans stored
in the planned statement (subplans), as the subplans stored in the Funnel node
have references to subplans in the planned statement. Next, currently
readfuncs.c (functions to read different types of nodes) doesn't support
reading any type of plan node, so we need to add support for reading all
kinds of plan nodes (as a subplan can have any type of plan node), and similarly
to execute any type of Plan node, we might need more work (infrastructure).

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#237Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#235)
Re: Parallel Seq Scan

On Mon, Mar 30, 2015 at 8:35 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Mar 25, 2015 at 6:27 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Apart from that, I have moved the initialization of the dsm segment from
the InitNode phase to ExecFunnel() (on first execution) as per the suggestion
from Robert. The main idea is that, since it creates a large shared memory
segment, we do the work only when it is really required.

So, suppose we have a plan like this:

Append
  -> Funnel
    -> Partial Seq Scan
  -> Funnel
    -> Partial Seq Scan
(repeated many times)

In earlier versions of this patch, that was chewing up lots of DSM
segments. But it seems to me, on further reflection, that it should
never use more than one at a time. The first funnel node should
initialize its workers and then when it finishes, all those workers
should get shut down cleanly and the DSM destroyed before the next
scan is initialized.

Obviously we could do better here: if we put the Funnel on top of the
Append instead of underneath it, we could avoid shutting down and
restarting workers for every child node. But even without that, I'm
hoping it's no longer the case that this uses more than one DSM at a
time. If that's not the case, we should see if we can't fix that.

Currently it doesn't behave as you are expecting; it destroys the DSM and
performs a clean shutdown of workers (DestroyParallelContext()) at the
time of ExecEndFunnel(), which in this case happens when we finish
execution of the Append node.

One way to change it is to do the cleanup for the parallel context when we
fetch the last tuple from the Funnel node (in ExecFunnel), as at that point
we are sure that we don't need the workers or the dsm anymore. Does that
sound reasonable to you?
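
To illustrate the difference between the two cleanup points, here is a small self-contained simulation. It is only a sketch under stated assumptions: FunnelSim, SegmentStats, funnel_next() and append_peak_segments() are hypothetical names invented for this example and do not correspond to executor functions. Each simulated Funnel sets up one "segment" lazily on its first fetch; with eager cleanup it releases the segment as soon as it returns no more tuples, otherwise it holds the segment until an end-of-plan pass.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_CHILDREN 16

/* Hypothetical model of one Funnel node under an Append. */
typedef struct
{
    int  remaining;     /* tuples still to return */
    bool started;       /* segment and workers set up */
    bool released;      /* segment and workers torn down */
} FunnelSim;

/* Tracks how many simulated segments are alive at once. */
typedef struct
{
    int live;
    int peak;
} SegmentStats;

static void
funnel_release(FunnelSim *f, SegmentStats *s)
{
    if (f->started && !f->released)
    {
        f->released = true;
        s->live--;
    }
}

/* Returns true if a tuple was produced, false at end of scan. */
static bool
funnel_next(FunnelSim *f, SegmentStats *s, bool eager_cleanup)
{
    if (!f->started)
    {
        /* lazy setup on first execution */
        f->started = true;
        s->live++;
        if (s->live > s->peak)
            s->peak = s->live;
    }
    if (f->remaining > 0)
    {
        f->remaining--;
        return true;
    }
    if (eager_cleanup)
        funnel_release(f, s);   /* clean up when the last tuple is fetched */
    return false;
}

/* Runs an Append over nchildren Funnels; returns the peak segment count. */
static int
append_peak_segments(int nchildren, int tuples_per_child, bool eager_cleanup)
{
    FunnelSim    children[MAX_CHILDREN];
    SegmentStats stats = { 0, 0 };

    assert(nchildren > 0 && nchildren <= MAX_CHILDREN);
    for (int i = 0; i < nchildren; i++)
    {
        children[i].remaining = tuples_per_child;
        children[i].started = false;
        children[i].released = false;
    }

    /* Append drains each child completely before moving to the next. */
    for (int i = 0; i < nchildren; i++)
        while (funnel_next(&children[i], &stats, eager_cleanup))
            ;

    /* End-of-plan cleanup releases whatever is still held. */
    for (int i = 0; i < nchildren; i++)
        funnel_release(&children[i], &stats);

    assert(stats.live == 0);
    return stats.peak;
}
```

In this model, five children with eager cleanup never hold more than one segment at a time, while deferring cleanup to the end-of-plan pass holds all five simultaneously.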

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#238Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#234)
Re: Parallel Seq Scan

On Mon, Mar 30, 2015 at 8:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Mar 18, 2015 at 11:43 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think I figured out the problem. That fix only helps in the case
where the postmaster noticed the new registration previously but
didn't start the worker, and then later notices the termination.
What's much more likely to happen is that the worker is started and
terminated so quickly that both happen before we create a
RegisteredBgWorker for it. The attached patch fixes that case, too.

Patch fixes the problem and now for Rescan, we don't need to Wait
for workers to finish.

I realized that there is a problem with this. If an error occurs in
one of the workers just as we're deciding to kill them all, then the
error won't be reported.

We are sending SIGTERM to the worker to terminate it, so
if the error occurs before the signal is received, it should be
sent to the master backend. Am I missing something here?

Also, the new code to propagate
XactLastRecEnd won't work right, either.

As we generate a FATAL error on termination of the worker
(bgworker_die()), won't it be handled in the AbortTransaction path
by the code below from the parallel-mode patch?

+    if (!parallel)
+        latestXid = RecordTransactionAbort(false);
+    else
+    {
+        latestXid = InvalidTransactionId;
+
+        /*
+         * Since the parallel master won't get our value of XactLastRecEnd in this
+         * case, we nudge WAL-writer ourselves in this case.  See related comments in
+         * RecordTransactionAbort for why this matters.
+         */
+        XLogSetAsyncXactLSN(XactLastRecEnd);
+    }

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#239Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#236)
Re: Parallel Seq Scan

On Tue, Mar 31, 2015 at 8:53 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

It looks to me like this is an InitPlan, not a subplan. There
shouldn't be any problem with a Funnel node having an InitPlan; it
looks to me like all of the InitPlan stuff is handled by common code
within the executor (grep for initPlan), so it ought to work here the
same as it does for anything else. What I suspect is failing
(although you aren't being very clear about it here) is the passing
down of the parameters set by the InitPlan to the workers.

It is failing because we are not passing InitPlan itself (InitPlan is
nothing but a list of SubPlan), and I tried to describe in a previous
mail [1] what we need to do to achieve the same. In short, it is not
difficult to pass down the required parameters (like plan->InitPlan or
plannedstmt->subplans); rather, the main missing part is the handling
of such parameters on the worker side (mainly we need to provide support
for all plan nodes which can be passed as part of InitPlan in readfuncs.c).
I am not against supporting InitPlans on the worker side, but just wanted to
say that if possible, why not leave that for the first version.

Well, if we *don't* handle it, we're going to need to insert some hack
to ensure that the planner doesn't create such plans. And that seems
pretty unappealing. Maybe it'll significantly compromise plan
quality, and maybe it won't, but at the least, it's ugly.

[1]
I have tried to evaluate what it would take us to support execution
of subplans by parallel workers. We need to pass the sub plans
stored in Funnel Node (initPlan) and corresponding subplans stored
in planned statement (subplans) as subplan's stored in Funnel node
has reference to subplans in planned statement. Next currently
readfuncs.c (functions to read different type of nodes) doesn't support
reading any type of plan node, so we need to add support for reading all kinds
of plan nodes (as subplan can have any type of plan node) and similarly
to execute any type of Plan node, we might need more work (infrastructure).

I don't think you need to do anything that complicated. I'm not
proposing to *run* the initPlan in the workers, just to pass the
parameter values down.
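
As a rough illustration of "passing the parameter values down" rather than shipping the plan that computes them, the sketch below flattens already-computed values into a byte buffer (standing in for shared memory) and restores them on the other side. The patch earlier in this thread declares EstimateBoundParametersSpace(), SerializeBoundParams() and RestoreBoundParams() for bound parameters; the code here does not reproduce them. ParamValue, estimate_param_space(), serialize_params() and restore_params() are hypothetical names, and real parameter values are not limited to fixed-width integers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified parameter: a fixed-width value plus a null flag. */
typedef struct
{
    int64_t value;
    bool    isnull;
} ParamValue;

/* Bytes needed to hold the count followed by nparams values. */
static size_t
estimate_param_space(int nparams)
{
    return sizeof(int) + (size_t) nparams * sizeof(ParamValue);
}

/* Master side: copy the computed values into the shared buffer. */
static void
serialize_params(const ParamValue *params, int nparams,
                 char *buf, size_t bufsize)
{
    assert(nparams >= 0);
    assert(bufsize >= estimate_param_space(nparams));

    memcpy(buf, &nparams, sizeof(int));
    if (nparams > 0)
        memcpy(buf + sizeof(int), params,
               (size_t) nparams * sizeof(ParamValue));
}

/* Worker side: read the values back; returns how many were restored. */
static int
restore_params(const char *buf, ParamValue *out, int maxparams)
{
    int nparams;

    memcpy(&nparams, buf, sizeof(int));
    assert(nparams >= 0 && nparams <= maxparams);
    if (nparams > 0)
        memcpy(out, buf + sizeof(int),
               (size_t) nparams * sizeof(ParamValue));
    return nparams;
}
```

The point of the sketch is that the worker only needs the values: whatever produced them (here, an InitPlan evaluated by the master) never has to be serialized or executed on the worker side.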

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#240Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#237)
Re: Parallel Seq Scan

On Wed, Apr 1, 2015 at 6:30 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Mar 30, 2015 at 8:35 PM, Robert Haas <robertmhaas@gmail.com> wrote:

So, suppose we have a plan like this:

Append
  -> Funnel
    -> Partial Seq Scan
  -> Funnel
    -> Partial Seq Scan
(repeated many times)

In earlier versions of this patch, that was chewing up lots of DSM
segments. But it seems to me, on further reflection, that it should
never use more than one at a time. The first funnel node should
initialize its workers and then when it finishes, all those workers
should get shut down cleanly and the DSM destroyed before the next
scan is initialized.

Obviously we could do better here: if we put the Funnel on top of the
Append instead of underneath it, we could avoid shutting down and
restarting workers for every child node. But even without that, I'm
hoping it's no longer the case that this uses more than one DSM at a
time. If that's not the case, we should see if we can't fix that.

Currently it doesn't behave as you are expecting; it destroys the DSM and
performs a clean shutdown of workers (DestroyParallelContext()) at the
time of ExecEndFunnel(), which in this case happens when we finish
execution of the Append node.

One way to change it is to do the cleanup for the parallel context when we
fetch the last tuple from the Funnel node (in ExecFunnel), as at that point
we are sure that we don't need the workers or the dsm anymore. Does that
sound reasonable to you?

Yeah, I think that's exactly what we should do.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#241Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#238)
Re: Parallel Seq Scan

On Wed, Apr 1, 2015 at 7:30 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Patch fixes the problem and now for Rescan, we don't need to Wait
for workers to finish.

I realized that there is a problem with this. If an error occurs in
one of the workers just as we're deciding to kill them all, then the
error won't be reported.

We are sending SIGTERM to the worker to terminate it, so
if the error occurs before the signal is received, it should be
sent to the master backend. Am I missing something here?

The master only checks for messages at intervals - each
CHECK_FOR_INTERRUPTS(), basically. So when the master terminates the
workers, any errors generated after the last check for messages will
be lost.
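
The window described above can be modelled with a toy simulation. This is a sketch only: WorkerSim, check_for_messages(), terminate_abruptly() and shutdown_gracefully() are hypothetical names, and nothing here uses real signals or shared memory. It contrasts killing workers outright, where errors queued since the last check are discarded, with asking workers to stop, waiting for them to exit, and then checking for messages one final time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical worker model: errors raised but not yet read by the master. */
typedef struct
{
    int  unread_errors;
    bool running;
} WorkerSim;

/* Stand-in for the master's periodic check: reads all queued errors. */
static int
check_for_messages(WorkerSim *workers, int nworkers)
{
    int seen = 0;

    assert(workers != NULL);
    for (int i = 0; i < nworkers; i++)
    {
        seen += workers[i].unread_errors;
        workers[i].unread_errors = 0;
    }
    return seen;
}

/* Abrupt termination: queues are discarded. Returns errors lost. */
static int
terminate_abruptly(WorkerSim *workers, int nworkers)
{
    int lost = 0;

    assert(workers != NULL);
    for (int i = 0; i < nworkers; i++)
    {
        lost += workers[i].unread_errors;
        workers[i].unread_errors = 0;
        workers[i].running = false;
    }
    return lost;
}

/* Graceful shutdown: stop, wait for exit, final check. Returns errors seen. */
static int
shutdown_gracefully(WorkerSim *workers, int nworkers)
{
    assert(workers != NULL);
    for (int i = 0; i < nworkers; i++)
        workers[i].running = false;     /* "ask to stop, then wait for exit" */

    return check_for_messages(workers, nworkers);
}
```

With one worker holding an unread error, the abrupt path reports nothing and loses one error, whereas the graceful path reports it during the final check.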

Also, the new code to propagate
XactLastRecEnd won't work right, either.

As we generate a FATAL error on termination of the worker
(bgworker_die()), won't it be handled in the AbortTransaction path
by the code below from the parallel-mode patch?

That will asynchronously flush the WAL, but if the master goes on to
commit, we need to wait synchronously for WAL flush, and possibly sync rep.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#242Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#239)
Re: Parallel Seq Scan

On Wed, Apr 1, 2015 at 6:03 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Mar 31, 2015 at 8:53 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

It looks to me like this is an InitPlan, not a subplan. There
shouldn't be any problem with a Funnel node having an InitPlan; it
looks to me like all of the InitPlan stuff is handled by common code
within the executor (grep for initPlan), so it ought to work here the
same as it does for anything else. What I suspect is failing
(although you aren't being very clear about it here) is the passing
down of the parameters set by the InitPlan to the workers.

It is failing because we are not passing InitPlan itself (InitPlan is
nothing but a list of SubPlan), and I tried to describe in a previous
mail [1] what we need to do to achieve the same. In short, it is not
difficult to pass down the required parameters (like plan->InitPlan or
plannedstmt->subplans); rather, the main missing part is the handling
of such parameters on the worker side (mainly we need to provide support
for all plan nodes which can be passed as part of InitPlan in readfuncs.c).
I am not against supporting InitPlans on the worker side, but just wanted to
say that if possible, why not leave that for the first version.

Well, if we *don't* handle it, we're going to need to insert some hack
to ensure that the planner doesn't create such plans. And that seems
pretty unappealing. Maybe it'll significantly compromise plan
quality, and maybe it won't, but at the least, it's ugly.

I also think changing anything in the planner related to this is not a
good idea, but what about detecting this during execution (in
ExecFunnel) and then just running the plan locally (by the master backend)?

[1]
I have tried to evaluate what it would take us to support execution
of subplans by parallel workers. We need to pass the sub plans
stored in Funnel Node (initPlan) and corresponding subplans stored
in planned statement (subplans) as subplan's stored in Funnel node
has reference to subplans in planned statement. Next currently
readfuncs.c (functions to read different type of nodes) doesn't support
reading any type of plan node, so we need to add support for reading all kinds
of plan nodes (as subplan can have any type of plan node) and similarly
to execute any type of Plan node, we might need more work (infrastructure).

I don't think you need to do anything that complicated. I'm not
proposing to *run* the initPlan in the workers, just to pass the
parameter values down.

Sorry, but I am not able to understand how it will help if just the parameters
are passed to the workers (if I understand correctly, you are referring to
Bitmapset *extParam; Bitmapset *allParam; in the Plan structure). I
think they are already getting passed; only the initPlan and the related SubPlan
in the planned statement are not passed, and the reason is that ss_finalize_plan()
attaches the initPlan to the top node (which in this case is the Funnel node and not
PartialSeqScan).

By any chance, do you mean that we run the part of the statement in
workers and then run initPlan in master backend?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#243Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#242)
Re: Parallel Seq Scan

On Wed, Apr 1, 2015 at 10:28 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Well, if we *don't* handle it, we're going to need to insert some hack
to ensure that the planner doesn't create plans. And that seems
pretty unappealing. Maybe it'll significantly compromise plan
quality, and maybe it won't, but at the least, it's ugly.

I also think changing anything in planner related to this is not a
good idea, but what about detecting this during execution (into
ExecFunnel) and then just run the plan locally (by master backend)?

That seems like an even bigger hack; we want to minimize the number of
cases where we create a parallel plan and then don't go parallel.
Doing that in the hopefully-rare case where we manage to blow out the
DSM segments seems OK, but doing it every time a plan of a certain
type gets created doesn't seem very appealing to me.

[1]
I have tried to evaluate what it would take us to support execution
of subplans by parallel workers. We need to pass the subplans
stored in the Funnel node (initPlan) and the corresponding subplans stored
in the planned statement (subplans), as the subplans stored in the Funnel
node reference subplans in the planned statement. Next, readfuncs.c
(functions to read different types of nodes) currently doesn't support
reading any type of plan node, so we need to add support for reading all
kinds of plan nodes (as a subplan can have any type of plan node), and
similarly, to execute any type of plan node, we might need more work
(infrastructure).

I don't think you need to do anything that complicated. I'm not
proposing to *run* the initPlan in the workers, just to pass the
parameter values down.

Sorry, but I am not able to understand how it will help if just the parameters
are passed to the workers (if I understand correctly, you are referring to
Bitmapset *extParam; Bitmapset *allParam; in the Plan structure). I
think they are already getting passed; only the initPlan and the related SubPlan
in the planned statement are not passed, and the reason is that ss_finalize_plan()
attaches the initPlan to the top node (which in this case is the Funnel node and not
PartialSeqScan).

By any chance, do you mean that we run the part of the statement in
workers and then run initPlan in master backend?

If I'm not confused, it would be the other way around. We would run
the initPlan in the master backend *first* and then the rest in the
workers.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#244Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#243)
Re: Parallel Seq Scan

On Wed, Apr 1, 2015 at 8:18 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I don't think you need to do anything that complicated. I'm not
proposing to *run* the initPlan in the workers, just to pass the
parameter values down.

Sorry, but I am not able to understand how it will help if just the parameters
are passed to the workers (if I understand correctly, you are referring to
Bitmapset *extParam; Bitmapset *allParam; in the Plan structure). I
think they are already getting passed; only the initPlan and the related SubPlan
in the planned statement are not passed, and the reason is that ss_finalize_plan()
attaches the initPlan to the top node (which in this case is the Funnel node and not
PartialSeqScan).

By any chance, do you mean that we run the part of the statement in
workers and then run initPlan in master backend?

If I'm not confused, it would be the other way around. We would run
the initPlan in the master backend *first* and then the rest in the
workers.

One of us is confused, so let me try to describe my understanding in
somewhat more detail, using the tab completion query [1]. In this query,
the initPlan is generated for the qualification expression [2], so it
will be executed during qualification and the call stack will look like:

postgres.exe!ExecSeqScan(ScanState * node=0x000000000c33bce8) Line 113 C
postgres.exe!ExecProcNode(PlanState * node=0x000000000c33bce8) Line 418 + 0xa bytes C
postgres.exe!ExecSetParamPlan(SubPlanState * node=0x000000000c343930, ExprContext * econtext=0x000000000c33de50) Line 1001 + 0xa bytes C
postgres.exe!ExecEvalParamExec(ExprState * exprstate=0x000000000c33f980, ExprContext * econtext=0x000000000c33de50, char * isNull=0x000000000c33f481, ExprDoneCond * isDone=0x0000000000000000) Line 1111 C
postgres.exe!ExecMakeFunctionResultNoSets(FuncExprState * fcache=0x000000000c33f0d0, ExprContext * econtext=0x000000000c33de50, char * isNull=0x000000000042f1c8, ExprDoneCond * isDone=0x0000000000000000) Line 1992 + 0x2d bytes C
postgres.exe!ExecEvalOper(FuncExprState * fcache=0x000000000c33f0d0, ExprContext * econtext=0x000000000c33de50, char * isNull=0x000000000042f1c8, ExprDoneCond * isDone=0x0000000000000000) Line 2443 C
postgres.exe!ExecQual(List * qual=0x000000000c33fa08, ExprContext * econtext=0x000000000c33de50, char resultForNull=0) Line 5206 + 0x1a bytes C
postgres.exe!ExecScan(ScanState * node=0x000000000c33dd38, TupleTableSlot * (ScanState *)* accessMtd=0x0000000140232940, char (ScanState *, TupleTableSlot *)* recheckMtd=0x00000001402329e0) Line 195 + 0x1a bytes C
postgres.exe!ExecSeqScan(ScanState * node=0x000000000c33dd38) Line 114 C

Basically here initPlan is getting executed during Qualification.

So now the point I am not able to understand from your explanation
is that how the worker will perform qualification without the knowledge
of initPlan?

[1]: SELECT pg_catalog.quote_ident(c.relname) FROM pg_catalog.pg_class c WHERE c.relkind IN ('r', 'S', 'v', 'm', 'f') AND substring(pg_catalog.quote_ident(c.relname),1,3)='pgb' AND pg_catalog.pg_table_is_visible(c.oid) AND c.relnamespace <> (SELECT oid FROM pg_catalog.pg_namespace WHERE nspname = 'pg_catalog') UNION SELECT pg_catalog.quote_ident(n.nspname) || '.' FROM pg_catalog.pg_namespace n WHERE substring (pg_catalog.quote_ident(n.nspname) || '.',1,3)='pgb' AND (SELECT pg_catalog.count(*) FROM pg_catalog.pg_namespace WHERE substring(pg_catalog.quote_ident(nspname) || '.',1,3) = substring ('pgb',1,pg_catalog.length(pg_catalog.quote_ident(nspname))+1)) > 1 UNION SELECT pg_catalog.quote_ident (n.nspname) || '.' || pg_catalog.quote_ident(c.relname) FROM pg_catalog.pg_class c, pg_catalog.pg_namespace n WHERE c.relnamespace = n.oid AND c.relkind IN ('r', 'S', 'v', 'm', 'f') AND substring(pg_catalog.quote_ident (n.nspname) || '.' || pg_catalog.quote_ident(c.relname),1,3)='pgb' AND substring(pg_catalog.quote_ident (n.nspname) || '.',1,3) = substring('pgb',1,pg_catalog.length(pg_catalog.quote_ident(n.nspname))+1) AND (SELECT pg_catalog.count(*) FROM pg_catalog.pg_namespace WHERE substring(pg_catalog.quote_ident(nspname) || '.',1,3) = substring('pgb',1,pg_catalog.length(pg_catalog.quote_ident(nspname))+1)) = 1 LIMIT 1000;

[2]: (SELECT pg_catalog.count(*) FROM pg_catalog.pg_namespace WHERE substring(pg_catalog.quote_ident(nspname) || '.',1,3) = substring('pgb',1,pg_catalog.length(pg_catalog.quote_ident(nspname))+1)

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#245Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#241)
Re: Parallel Seq Scan

On Wed, Apr 1, 2015 at 6:11 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Apr 1, 2015 at 7:30 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Also, the new code to propagate
XactLastRecEnd won't work right, either.

Since we generate a FATAL error on termination of the worker
(bgworker_die()), won't it be handled in the AbortTransaction path
by the code below from the parallel-mode patch?

That will asynchronously flush the WAL, but if the master goes on to
commit, we have to wait synchronously for WAL flush, and possibly sync rep.

Okay, so you mean that if the master backend later commits, there is
a chance of losing WAL data written by the worker.
Can't we report the location to the master, as the patch does in the case
of a commit in the worker?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#246Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#244)
Re: Parallel Seq Scan

On Thu, Apr 2, 2015 at 2:36 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

If I'm not confused, it would be the other way around. We would run
the initPlan in the master backend *first* and then the rest in the
workers.

Either one of us is confused, let me try to describe my understanding in
somewhat more detail. Let me try to explain w.r.t the tab completion
query [1]. In this, the initPlan is generated for Qualification expression
[2], so it will be executed during qualification and the callstack will
look like:

postgres.exe!ExecSeqScan(ScanState * node=0x000000000c33bce8) Line 113 C
postgres.exe!ExecProcNode(PlanState * node=0x000000000c33bce8) Line 418 + 0xa bytes C
postgres.exe!ExecSetParamPlan(SubPlanState * node=0x000000000c343930, ExprContext * econtext=0x000000000c33de50) Line 1001 + 0xa bytes C
postgres.exe!ExecEvalParamExec(ExprState * exprstate=0x000000000c33f980, ExprContext * econtext=0x000000000c33de50, char * isNull=0x000000000c33f481, ExprDoneCond * isDone=0x0000000000000000) Line 1111 C
postgres.exe!ExecMakeFunctionResultNoSets(FuncExprState * fcache=0x000000000c33f0d0, ExprContext * econtext=0x000000000c33de50, char * isNull=0x000000000042f1c8, ExprDoneCond * isDone=0x0000000000000000) Line 1992 + 0x2d bytes C
postgres.exe!ExecEvalOper(FuncExprState * fcache=0x000000000c33f0d0, ExprContext * econtext=0x000000000c33de50, char * isNull=0x000000000042f1c8, ExprDoneCond * isDone=0x0000000000000000) Line 2443 C
postgres.exe!ExecQual(List * qual=0x000000000c33fa08, ExprContext * econtext=0x000000000c33de50, char resultForNull=0) Line 5206 + 0x1a bytes C
postgres.exe!ExecScan(ScanState * node=0x000000000c33dd38, TupleTableSlot * (ScanState *)* accessMtd=0x0000000140232940, char (ScanState *, TupleTableSlot *)* recheckMtd=0x00000001402329e0) Line 195 + 0x1a bytes C
postgres.exe!ExecSeqScan(ScanState * node=0x000000000c33dd38) Line 114 C

Basically here initPlan is getting executed during Qualification.

OK, I failed to realize that the initPlan doesn't get evaluated until
first use. Maybe in the case of a funnel node we should force all of
the initplans to be run before starting parallelism, so that we can
pass down the resulting value to each worker. If we try to push the
whole plan tree down to the workers then, aside from the issue of
needing to copy the plan tree, it'll get evaluated N times instead of
once.
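
To make the "run it once in the master, then pass the value down" idea concrete, here is a minimal C sketch. It is not code from the patch: ParamValue, InitPlanFn, force_initplans() and pass_params_to_worker() are hypothetical names, and the initPlan is modelled as a plain function returning a long rather than a real plan tree.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an executor parameter slot (not PostgreSQL's struct). */
typedef struct ParamValue
{
    int  is_set;   /* has the initPlan produced its value yet? */
    long value;    /* result of the initPlan */
} ParamValue;

/* Hypothetical: an initPlan modelled as a function returning one scalar. */
typedef long (*InitPlanFn)(void);

/* Counts how often the example initPlan is actually executed. */
int initplan_runs = 0;

/* Example initPlan: stands in for an expensive scalar subquery. */
long count_namespaces(void)
{
    initplan_runs++;
    return 7;
}

/* Master side: run every initPlan that has not been evaluated yet. */
void force_initplans(ParamValue *params, const InitPlanFn *plans, size_t n)
{
    for (size_t i = 0; i < n; i++)
    {
        if (!params[i].is_set)
        {
            params[i].value = plans[i]();
            params[i].is_set = 1;
        }
    }
}

/* Copy the already-computed values into one worker's parameter slots. */
void pass_params_to_worker(const ParamValue *master, ParamValue *worker, size_t n)
{
    for (size_t i = 0; i < n; i++)
    {
        assert(master[i].is_set);   /* workers never run the initPlan themselves */
        worker[i] = master[i];
    }
}
```

With two workers, count_namespaces() runs once in the master and both workers see the same value, instead of each worker evaluating the subquery again.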

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#247Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#245)
Re: Parallel Seq Scan

On Thu, Apr 2, 2015 at 3:07 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Apr 1, 2015 at 6:11 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Apr 1, 2015 at 7:30 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

Also, the new code to propagate
XactLastRecEnd won't work right, either.

Since we generate a FATAL error on termination of the worker
(bgworker_die()), won't it be handled in the AbortTransaction path
by the code below from the parallel-mode patch?

That will asynchronously flush the WAL, but if the master goes on to
commit, we have to wait synchronously for WAL flush, and possibly sync rep.

Okay, so you mean that if the master backend later commits, there is
a chance of losing WAL data written by the worker.
Can't we report the location to the master, as the patch does in the case
of a commit in the worker?

That's exactly why I think it needs to call
WaitForParallelWorkersToFinish() - because it will do just that. We
only need to invent a way of telling the worker to stop the scan and
shut down cleanly.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#248David Rowley
dgrowleyml@gmail.com
In reply to: Andres Freund (#168)
Re: Parallel Seq Scan

So I've just finished reading the impressive 244 emails (so far) about
Parallel Seq scan, and I've had a quick skim over the latest patch.

It's quite exciting to think that one day we'll have parallel query in
PostgreSQL, but I have to say that a major point about the proposed
implementation seems to have been forgotten, and I can't help but think
the feature won't get far off the ground unless more thought goes into it.

On 11 February 2015 at 09:56, Andres Freund <andres@2ndquadrant.com> wrote:

I think we're getting to the point where having a unique mapping from
the plan to the execution tree is proving to be rather limiting
anyway. Check for example discussion about join removal. But even for
current code, showing only the custom plans for the first five EXPLAIN
EXECUTEs is pretty nasty (Try explain that to somebody that doesn't know
pg internals. Their looks are worth gold and can kill you at the same
time) and should be done differently.

Going over the previous emails in this thread I see that it has been a long
time since anyone discussed anything around how we might decide at planning
time how many workers should be used for the query, and from the emails I
don't recall anyone proposing a good idea about how this might be done, and
I for one can't see how this is at all possible to do at planning time.

I think that the planner should know nothing of parallel query at all, and
the planner quite possibly should go completely unmodified for this patch.
One major problem I can see is that, given a query such as:

SELECT * FROM million_row_product_table WHERE category = 'ELECTRONICS';

Where we have a non-unique index on category, some plans which may be
considered might be:

1. Index scan on the category index to get all rows matching 'ELECTRONICS'
2. Sequence scan on the table, filter matching rows.
3. Parallel plan which performs a series of partial sequence scans pulling
out all matching rows.

I really think that if we end up choosing things like plan 3, when plan 2 was
thrown out because of its cost, then we'll end up consuming more CPU and
I/O than we can possibly justify using. The environmentalist in me screams
that this is wrong. What if we kicked off 128 worker processes on some
high-end hardware to do this? I certainly wouldn't want to pay the power
bill. I understand there's costing built in to perhaps stop this, but I
still think it's wrong-headed, and we still need to choose the fastest
non-parallel plan and only consider parallelising that later.

Instead what I think should happen is:

The following link has been seen before on this thread, but I'll post it
again:
http://docs.oracle.com/cd/A57673_01/DOC/server/doc/A48506/pqoconce.htm

There's one key sentence in there that should not be ignored:

"It is important to note that the query is parallelized dynamically at
execution time."

"dynamically at execution time"... I think this also needs to happen in
PostgreSQL. If we attempt to do this parallel stuff at plan time, and we
happen to plan at some quiet period, or perhaps worse, some application's
start-up process happens to PREPARE a load of queries when the database is
nice and quiet, then quite possibly we'll end up with some highly parallel
queries. Then perhaps come the time these queries are actually executed the
server is very busy... Things will fall apart quite quickly due to the
masses of IPC and context switches that would be going on.

I completely understand that this parallel query stuff is all quite new to
us all and we're likely still trying to nail down the correct
infrastructure for it to work well, so this is why I'm proposing that the
planner should know nothing of parallel query, instead I think it should
work more along the lines of:

* Planner should be completely oblivious to what parallel query is.
* Before executor startup the plan is passed to a function which decides if
we should parallelise it, and does so if the plan meets the correct
requirements. This should likely have a very fast exit path such as:

    if root node's cost < parallel_query_cost_threshold
        return; /* the query is not expensive enough to attempt to make parallel */

The above check will allow us to have an almost zero overhead for small low
cost queries.

This function would likely also have some sort of logic in order to
determine if the server has enough spare resource at the current point in
time to allow queries to be parallelised (Likely this is not too important
to nail this down for a first implementation).

* The plan should then be completely traversed node by node to determine
which nodes can be made parallel. This would likely require an interface
function to each node which returns true or false, depending on if it's
safe to parallelise. For seq scan this could be a simple test to see if
we're scanning a temp table.

* Before any changes are made to the plan, a complete copy of it should be
made.

* Funnel nodes could then be injected below the last node in each branch
which supports parallelism. If more than one branch exists with parallel
enabled nodes, then it should be up to this function to determine, based on
cost, which nodes will benefit the most from the additional workers.
Certain other node types would need something else below the Funnel node,
e.g. partial aggregation would need a new node below the Funnel to complete
the aggregation.

* The first parallel enabled nodes should be passed off to the worker
processes for execution.
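
The first steps of the outline above (the fast exit path and the per-node safety check) could be sketched in C roughly as follows. This is only an illustration under stated assumptions: PlanNode, NodeKind, node_is_parallel_safe() and should_parallelise() are hypothetical and are not PostgreSQL structures or functions, the threshold value is arbitrary, and only sequential scans on non-temp tables are treated as parallelisable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified plan tree; not PostgreSQL's Plan struct. */
typedef enum NodeKind { NODE_SEQSCAN, NODE_INDEXSCAN, NODE_SORT } NodeKind;

typedef struct PlanNode
{
    NodeKind         kind;
    double           total_cost;
    bool             scans_temp_table;
    struct PlanNode *left;
    struct PlanNode *right;
} PlanNode;

/* Hypothetical setting named after the threshold in the pseudocode above. */
double parallel_query_cost_threshold = 1000.0;

/* Per-node interface function: is it safe to run this node in a worker? */
bool node_is_parallel_safe(const PlanNode *node)
{
    /* Assumption: only seq scans qualify, and temp tables are excluded. */
    return node->kind == NODE_SEQSCAN && !node->scans_temp_table;
}

/* Walk the whole tree and count the nodes that could be made parallel. */
int count_parallel_candidates(const PlanNode *node)
{
    if (node == NULL)
        return 0;
    return (node_is_parallel_safe(node) ? 1 : 0)
        + count_parallel_candidates(node->left)
        + count_parallel_candidates(node->right);
}

/* Called before executor startup; returns true if the plan is worth rewriting. */
bool should_parallelise(const PlanNode *root)
{
    assert(root != NULL);

    /* Fast exit path: cheap queries pay almost nothing for this check. */
    if (root->total_cost < parallel_query_cost_threshold)
        return false;

    return count_parallel_candidates(root) > 0;
}
```

Copying the plan, injecting Funnel nodes, and checking for spare server resources are deliberately left out; the sketch only shows that a cheap query returns before any tree traversal happens.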

So I quite strongly agree with Andres' comment above that we really need to
move away from this 1:1 assumption about the relationship between plan
nodes and executor nodes. Tom did mention some possible reasons here in his
response to my INNER JOIN removals patch ->
/messages/by-id/32139.1427667410@sss.pgh.pa.us

Tom wrote:
"What you're doing here violates the rule that planstate trees have a
one-to-one relationship to plan trees. EXPLAIN used to iterate over those
trees in lockstep, and there probably still is code that does similar
things (in third-party modules if not core), so I don't think we should
abandon that principle."

So perhaps this needs analysis. If it's not possible, then perhaps the
parallel nodes could be inserted at the end of planning, providing the
executor could be coded in such a way that the parallel plan can still work
with 0 worker processes. Unfortunately it seems that transitions through
nodes that don't do anything are not free, so with this method there would
be a slowdown of parallel enabled plans when they're executed without any
worker processes.

Also here ->
https://technet.microsoft.com/en-us/library/ms178065%28v=sql.105%29.aspx
There's some text that says:

"The SQL Server query optimizer does not use a parallel execution plan for
a query if any one of the following conditions is true:"
"* A serial execution plan is considered faster than any possible parallel
execution plan for the particular query."

I'm finding it a bit hard to get a true meaning from that, but if I'm not
mistaken it means that the serial plan will be preferred over a parallel
plan, as if the parallel plan does not get allocated any workers at
execution time, then we don't want to be left with a slow plan...

Apologies if any of this has already been discussed and designed around; I
just didn't see anything in the emails to indicate that it has.

Regards

David Rowley

#249Kevin Grittner
kgrittn@ymail.com
In reply to: David Rowley (#248)
Re: Parallel Seq Scan

David Rowley <dgrowleyml@gmail.com> wrote:

If we attempt to do this parallel stuff at plan time, and we
happen to plan at some quiet period, or perhaps worse, some
application's start-up process happens to PREPARE a load of
queries when the database is nice and quiet, then quite possibly
we'll end up with some highly parallel queries. Then perhaps come
the time these queries are actually executed the server is very
busy... Things will fall apart quite quickly due to the masses of
IPC and context switches that would be going on.

I completely understand that this parallel query stuff is all
quite new to us all and we're likely still trying to nail down
the correct infrastructure for it to work well, so this is why
I'm proposing that the planner should know nothing of parallel
query, instead I think it should work more along the lines of:

* Planner should be completely oblivious to what parallel query
is.
* Before executor startup the plan is passed to a function which
decides if we should parallelise it, and does so if the plan
meets the correct requirements. This should likely have a very
fast exit path such as:
if root node's cost < parallel_query_cost_threshold
return; /* the query is not expensive enough to attempt to make parallel */

The above check will allow us to have an almost zero overhead for
small low cost queries.

This function would likely also have some sort of logic in order
to determine if the server has enough spare resource at the
current point in time to allow queries to be parallelised

There is a lot to like about this suggestion.

I've seen enough performance crashes due to too many concurrent
processes (even when each connection can only use a single process)
to doubt that, for a plan which will be saved, it is possible to
know at planning time whether parallelization will be a nice win or
a devastating over-saturation of resources during some later
execution phase.

Another thing to consider is that this is not entirely unrelated to
the concept of admission control policies. Perhaps this phase
could be a more general execution start-up admission control phase,
where parallel processing would be one adjustment that could be
considered. Initially it might be the *only* consideration, but it
might be good to try to frame it in a way that allowed
implementation of other policies, too.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#250Robert Haas
robertmhaas@gmail.com
In reply to: David Rowley (#248)
Re: Parallel Seq Scan

On Sat, Apr 4, 2015 at 5:19 AM, David Rowley <dgrowleyml@gmail.com> wrote:

Going over the previous emails in this thread I see that it has been a long
time since anyone discussed anything around how we might decide at planning
time how many workers should be used for the query, and from the emails I
don't recall anyone proposing a good idea about how this might be done, and
I for one can't see how this is at all possible to do at planning time.

I think that the planner should know nothing of parallel query at all, and
the planner quite possibly should go completely unmodified for this patch.
One major problem I can see is that, given a query such as:

SELECT * FROM million_row_product_table WHERE category = 'ELECTRONICS';

Where we have a non-unique index on category, some plans which may be
considered might be:

1. Index scan on the category index to get all rows matching 'ELECTRONICS'
2. Sequence scan on the table, filter matching rows.
3. Parallel plan which performs a series of partial sequence scans pulling
out all matching rows.

I really think that if we end up choosing things like plan 3, when plan 2 was
thrown out because of its cost, then we'll end up consuming more CPU and I/O
than we can possibly justify using. The environmentalist in me screams that
this is wrong. What if we kicked off 128 worker processes on some high-end
hardware to do this? I certainly wouldn't want to pay the power bill. I
understand there's costing built in to perhaps stop this, but I still think
it's wrong-headed, and we still need to choose the fastest non-parallel plan
and only consider parallelising that later.

I agree that this is an area that needs more thought. I don't
(currently, anyway) agree that the planner shouldn't know anything
about parallelism. The problem with that is that there's lots of
relevant stuff that can only be known at plan time. For example,
consider the query you mention above on a table with no index. If the
WHERE clause is highly selective, a parallel plan may well be best.
But if the selectivity is only, say, 50%, a parallel plan is stupid:
the IPC costs of shipping many rows back to the master will overwhelm
any benefit we could possibly have hoped to get, and the overall
result will likely be that the parallel plan both runs slower and uses
more resources. At plan time, we have the selectivity information
conveniently at hand, and can use that as part of the cost model to
make educated decisions. Execution time is way too late to be
thinking about those kinds of questions.
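
To illustrate the selectivity argument with numbers, here is a toy cost model in C. It is a sketch under loud assumptions, not the planner's cost model: the function names, the per-tuple CPU and IPC costs, and the idea that the scan work divides evenly among workers while every returned row pays a fixed IPC cost in the master are all hypothetical.

```c
#include <assert.h>

/* Toy serial cost: every tuple is examined once by a single backend. */
double serial_scan_cost(double ntuples, double cpu_cost_per_tuple)
{
    return ntuples * cpu_cost_per_tuple;
}

/*
 * Toy parallel cost: the scan work is split evenly across the workers,
 * but every tuple that passes the WHERE clause must be shipped back to
 * the master, which is modelled as a serial per-tuple IPC charge.
 */
double parallel_scan_cost(double ntuples, double selectivity, int nworkers,
                          double cpu_cost_per_tuple, double ipc_cost_per_tuple)
{
    assert(nworkers > 0);
    assert(selectivity >= 0.0 && selectivity <= 1.0);

    double scan = ntuples * cpu_cost_per_tuple / nworkers;
    double ipc = ntuples * selectivity * ipc_cost_per_tuple;

    return scan + ipc;
}
```

With one million tuples, 4 workers, a CPU cost of 0.01 and an (assumed) IPC cost of 0.1 per row, a filter that keeps 0.1% of the rows makes the parallel scan far cheaper than the serial one, while a filter that keeps 50% makes it several times more expensive. The selectivity estimate that drives this difference is only available at plan time.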

I think one of the philosophical questions that has to be answered
here is "what does it mean to talk about the cost of a parallel
plan?". For a non-parallel plan, the cost of the plan means both "the
amount of effort we will spend executing the plan" and also "the
amount of time we think the plan will take to complete", but those two
things are different for parallel plans. I'm inclined to think it's
right to view the cost of a parallel plan as a proxy for execution
time, because the fundamental principle of the planner is that we pick
the lowest-cost plan. But there also clearly needs to be some way to
prevent the selection of a plan which runs slightly faster at the cost
of using vastly more resources.

Currently, the planner tracks the best unsorted path for each relation
as well as the best path for each useful sort order. Suppose we treat
parallelism as another axis for judging the quality of a plan: we keep
the best unsorted, non-parallel path; the best non-parallel path for
each useful sort order; the best unsorted, parallel path; and the best
parallel path for each sort order. Each time we plan a node, we
generate non-parallel paths first, and then parallel paths. But, if a
parallel plan isn't markedly faster than the non-parallel plan for the
same sort order, then we discard it. I'm not sure exactly what the
thresholds should be here, and they probably need to be configurable,
because on a single-user system with excess capacity available it may
be absolutely desirable to use ten times the resources to get an
answer 25% faster, but on a heavily loaded system that will stink.

Some ideas for GUCs:

max_parallel_degree = The largest number of processes we'll consider
using for a single query.
min_parallel_speedup = The minimum percentage by which a parallel path
must be cheaper (in terms of execution time) than a non-parallel path
in order to survive. I'm imagining the default here might be
something like 15%.
min_parallel_speedup_per_worker = Like the previous one, but per
worker. e.g. if this is 5%, which might be a sensible default, then a
plan with 4 workers must be at least 20% better to survive, but a plan
using only 2 workers only needs to be 10% better.
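
As a rough sketch of how the last two settings might be applied when deciding whether a parallel path survives, consider the C function below. The function name and signature are hypothetical, and combining the two thresholds by taking the larger requirement is an assumption; nothing above says how they would interact. Speedups are expressed as fractions (0.05 means 5%).

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical check: does a parallel path beat the non-parallel path for
 * the same sort order by enough to be kept?
 *
 * min_speedup            overall minimum, as a fraction of the serial cost
 * min_speedup_per_worker additional requirement per worker used
 */
bool parallel_path_survives(double serial_cost, double parallel_cost,
                            int nworkers, double min_speedup,
                            double min_speedup_per_worker)
{
    assert(serial_cost > 0.0);
    assert(nworkers > 0);

    /* Fraction by which the parallel path is cheaper than the serial one. */
    double speedup = (serial_cost - parallel_cost) / serial_cost;

    /* Assumption: both limits apply, so the stricter one wins. */
    double required = min_speedup_per_worker * nworkers;
    if (required < min_speedup)
        required = min_speedup;

    return speedup >= required;
}
```

With min_speedup set to 0 and min_speedup_per_worker set to 0.05, a 4-worker path must be at least 20% cheaper while a 2-worker path needs only 10%, which matches the example in the text. Setting min_speedup to 0.15 as well would raise the 2-worker requirement to 15% under the stricter-wins assumption.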

An additional benefit of this line of thinking is that planning would
always produce a best non-parallel path. And sometimes, there would
also be a best parallel path that is expected to run faster. We could
then choose between them dynamically at execution time.

I think it's pretty hard to imagine a scenario as extreme as the one
you mention above ever actually occurring in practice. I mean, even
the most naive implementation of parallel query will presumably have
something like max_parallel_degree, and you probably won't have that
set to 128. For starters, it can't possibly make sense unless your
server has at least 128 CPUs, and even then it only makes sense if you
don't mind a single query using all of them, and even if the first of
those things is true, the second one probably isn't. I don't doubt
that less extreme variants of this scenario are possible, though.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#251Amit Kapila
amit.kapila16@gmail.com
In reply to: Kevin Grittner (#249)
Re: Parallel Seq Scan

On Wed, Apr 8, 2015 at 1:53 AM, Kevin Grittner <kgrittn@ymail.com> wrote:

David Rowley <dgrowleyml@gmail.com> wrote:

If we attempt to do this parallel stuff at plan time, and we
happen to plan at some quiet period, or perhaps worse, some
application's start-up process happens to PREPARE a load of
queries when the database is nice and quiet, then quite possibly
we'll end up with some highly parallel queries. Then perhaps come
the time these queries are actually executed the server is very
busy... Things will fall apart quite quickly due to the masses of
IPC and context switches that would be going on.

I completely understand that this parallel query stuff is all
quite new to us all and we're likely still trying to nail down
the correct infrastructure for it to work well, so this is why
I'm proposing that the planner should know nothing of parallel
query, instead I think it should work more along the lines of:

* Planner should be completely oblivious to what parallel query
is.
* Before executor startup the plan is passed to a function which
decides if we should parallelise it, and does so if the plan
meets the correct requirements. This should likely have a very
fast exit path such as:
if root node's cost < parallel_query_cost_threshold
    return; /* the query is not expensive enough to attempt to make parallel */

The above check will allow us to have an almost zero overhead for
small low cost queries.

This function would likely also have some sort of logic in order
to determine if the server has enough spare resource at the
current point in time to allow queries to be parallelised
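
The proposal quoted above could be sketched roughly as follows. This is only an illustration under stated assumptions: Plan, maybe_parallelise, and every parameter name are hypothetical, and the "spare resource" check is reduced to a count of idle workers.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Plan:
    total_cost: float      # cost of the root node
    parallel_safe: bool    # hypothetical flag: can this plan be parallelised?
    workers: int = 0       # 0 means a purely serial plan


def maybe_parallelise(plan, cost_threshold, idle_workers, max_degree):
    """Decide just before executor startup whether to parallelise a plan."""
    # Fast exit path: cheap queries pay almost nothing for this check.
    if plan.total_cost < cost_threshold:
        return plan
    # No point going further if the plan or the server cannot support it.
    if not plan.parallel_safe or idle_workers <= 0:
        return plan
    return replace(plan, workers=min(idle_workers, max_degree))
```

The plan is returned unchanged in every "no" case, so a caller can always execute whatever comes back.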

There is a lot to like about this suggestion.

I've seen enough performance crashes due to too many concurrent
processes (even when each connection can only use a single process)
to believe that, for a plan which will be saved, it is not possible to
know at planning time whether parallelization will be a nice win or
a devastating over-saturation of resources during some later
execution phase.

Another thing to consider is that this is not entirely unrelated to
the concept of admission control policies. Perhaps this phase
could be a more general execution start-up admission control phase,
where parallel processing would be one adjustment that could be
considered.

I think there is always a chance that resources (like parallel workers)
won't be available at run time even if we decide about them in the
executor-start phase, unless we reserve them for that node's usage. OTOH,
if we reserve those resources (by allocating them) during the executor-start
phase, then we might end up reserving them too early, or maybe they won't
even get used if we decide not to execute that node. On that basis, it seems
to me the current strategy is not bad: we decide at planning time, and later,
at execution time, if not all resources (particularly parallel workers) are
available, we use only the available ones to execute the plan.
Going forward, I think we can improve on this if we decide not to shut down
parallel workers until postmaster shutdown once they are started, and
then just allocate them during the executor-start phase.
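
A rough sketch of that strategy, for illustration only: the plan asks for some number of workers, execution takes whatever is actually free, and the blocks are split over the workers obtained. The function names are hypothetical; the equal split with the last worker taking the remainder follows the description at the start of this thread.

```python
def workers_to_use(planned, available):
    """Use the planned number of workers, or fewer if not all are free."""
    return max(0, min(planned, available))


def split_blocks(nblocks, nworkers):
    """Return (start_block, block_count) per worker.

    Blocks are divided equally; the last worker also scans the remainder.
    """
    if nworkers < 1:
        raise ValueError("need at least one worker to split the scan")
    per_worker = nblocks // nworkers
    ranges = []
    start = 0
    for i in range(nworkers):
        is_last = (i == nworkers - 1)
        count = nblocks - start if is_last else per_worker
        ranges.append((start, count))
        start += count
    return ranges
```

For example, a plan that wanted 4 workers but found only 3 free would call split_blocks(10, workers_to_use(4, 3)) for a 10-block relation. What to do when no workers at all are free (presumably scan in the backend itself) is left to the caller in this sketch.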

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#252Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#250)
Re: Parallel Seq Scan

On Wed, Apr 8, 2015 at 7:54 AM, Robert Haas <robertmhaas@gmail.com> wrote:

I agree that this is an area that needs more thought. I don't
(currently, anyway) agree that the planner shouldn't know anything
about parallelism. The problem with that is that there's lots of
relevant stuff that can only be known at plan time. For example,
consider the query you mention above on a table with no index. If the
WHERE clause is highly selective, a parallel plan may well be best.
But if the selectivity is only, say, 50%, a parallel plan is stupid:
the IPC costs of shipping many rows back to the master will overwhelm
any benefit we could possibly have hoped to get, and the overall
result will likely be that the parallel plan both runs slower and uses
more resources. At plan time, we have the selectivity information
conveniently at hand, and can use that as part of the cost model to
make educated decisions. Execution time is way too late to be
thinking about those kinds of questions.

I think one of the philosophical questions that has to be answered
here is "what does it mean to talk about the cost of a parallel
plan?". For a non-parallel plan, the cost of the plan means both "the
amount of effort we will spend executing the plan" and also "the
amount of time we think the plan will take to complete", but those two
things are different for parallel plans. I'm inclined to think it's
right to view the cost of a parallel plan as a proxy for execution
time, because the fundamental principle of the planner is that we pick
the lowest-cost plan. But there also clearly needs to be some way to
prevent the selection of a plan which runs slightly faster at the cost
of using vastly more resources.

Currently, the planner tracks the best unsorted path for each relation
as well as the best path for each useful sort order. Suppose we treat
parallelism as another axis for judging the quality of a plan: we keep
the best unsorted, non-parallel path; the best non-parallel path for
each useful sort order; the best unsorted, parallel path; and the best
parallel path for each sort order. Each time we plan a node, we
generate non-parallel paths first, and then parallel paths. But, if a
parallel plan isn't markedly faster than the non-parallel plan for the
same sort order, then we discard it.
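
As a sketch of the bookkeeping described above (not of the actual planner code), one could keep the cheapest path per (sort order, parallel?) pair and reject a parallel path that is not markedly cheaper than its non-parallel counterpart. The class name is hypothetical, the 15% margin is borrowed from the default floated earlier in the thread, and the sketch assumes non-parallel paths are added first, as the text says.

```python
class PathSet:
    """Cheapest known cost per (sort_order, is_parallel) bucket."""

    def __init__(self, min_speedup=0.15):
        self.best = {}               # (sort_order, is_parallel) -> cost
        self.min_speedup = min_speedup

    def add(self, sort_order, is_parallel, cost):
        """Offer a path; return True if it was kept."""
        if is_parallel:
            serial = self.best.get((sort_order, False))
            # Discard a parallel path that is not markedly faster than
            # the non-parallel path with the same sort order.
            if serial is not None and cost > serial * (1 - self.min_speedup):
                return False
        key = (sort_order, is_parallel)
        if key in self.best and self.best[key] <= cost:
            return False
        self.best[key] = cost
        return True
```

Here sort_order=None stands for "unsorted". The number of buckets at most doubles compared with tracking sort order alone.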

One disadvantage of retaining parallel paths could be that it can
increase the number of combinations the planner might need to evaluate
during planning (in particular during join path evaluation), unless we
do some special handling to avoid evaluating such combinations.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#253David Rowley
dgrowleyml@gmail.com
In reply to: Robert Haas (#250)
Re: Parallel Seq Scan

On 8 April 2015 at 14:24, Robert Haas <robertmhaas@gmail.com> wrote:

I think one of the philosophical questions that has to be answered
here is "what does it mean to talk about the cost of a parallel
plan?". For a non-parallel plan, the cost of the plan means both "the
amount of effort we will spend executing the plan" and also "the
amount of time we think the plan will take to complete", but those two
things are different for parallel plans. I'm inclined to think it's
right to view the cost of a parallel plan as a proxy for execution
time, because the fundamental principle of the planner is that we pick
the lowest-cost plan. But there also clearly needs to be some way to
prevent the selection of a plan which runs slightly faster at the cost
of using vastly more resources.

I'd agree with that as far as CPU costs go, or maybe I'd just disagree with
the alternative: if we costed in <cost of individual worker's work> *
<number of workers>, then we'd never choose a parallel plan, because by the
time we costed in tuple communication costs between the processes, a parallel
plan would always cost more than the serial equivalent. I/O costs are
different; I'd imagine these shouldn't be divided by the estimated number
of workers.
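
A small sketch of the two ways of costing being contrasted here, with made-up cost units. The function and parameter names are hypothetical and the formulas are a simplification for illustration, not what any patch in this thread implements.

```python
def cost_as_elapsed_time(cpu_cost, io_cost, workers,
                         tuples_returned, comm_cost_per_tuple):
    """Cost as a proxy for run time: CPU work is shared, I/O is not divided."""
    return (cpu_cost / workers
            + io_cost
            + tuples_returned * comm_cost_per_tuple)


def cost_as_total_effort(cpu_cost, io_cost,
                         tuples_returned, comm_cost_per_tuple):
    """Cost as total work done by all processes: never below the serial cost."""
    return cpu_cost + io_cost + tuples_returned * comm_cost_per_tuple


serial = 1000.0 + 500.0                                        # CPU + I/O
elapsed = cost_as_elapsed_time(1000.0, 500.0, 4, 100, 0.5)     # 250 + 500 + 50
effort = cost_as_total_effort(1000.0, 500.0, 100, 0.5)         # 1000 + 500 + 50
```

Under the total-effort view the parallel plan always loses to the serial plan as soon as there is any communication cost, which is the point being made above.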

Currently, the planner tracks the best unsorted path for each relation
as well as the best path for each useful sort order. Suppose we treat
parallelism as another axis for judging the quality of a plan: we keep
the best unsorted, non-parallel path; the best non-parallel path for
each useful sort order; the best unsorted, parallel path; and the best
parallel path for each sort order. Each time we plan a node, we
generate non-parallel paths first, and then parallel paths. But, if a
parallel plan isn't markedly faster than the non-parallel plan for the
same sort order, then we discard it. I'm not sure exactly what the
thresholds should be here, and they probably need to be configurable,
because on a single-user system with excess capacity available it may
be absolutely desirable to use ten times the resources to get an
answer 25% faster, but on a heavily loaded system that will stink.

But with this, and the parallel costing model above, to know the cost of a
parallel path, you need to know how many workers will be available later at
execution time in order to know what that percentage is, or would we just
always assume we'd get max_parallel_degree each time the plan is executed,
similar to how the latest patch works?

Some ideas for GUCs:

max_parallel_degree = The largest number of processes we'll consider
using for a single query.
min_parallel_speedup = The minimum percentage by which a parallel path
must be cheaper (in terms of execution time) than a non-parallel path
in order to survive. I'm imagining the default here might be
something like 15%.
min_parallel_speedup_per_worker = Like the previous one, but per
worker. e.g. if this is 5%, which might be a sensible default, then a
plan with 4 workers must be at least 20% better to survive, but a plan
using only 2 workers only needs to be 10% better.

max_parallel_degree feels awfully like it would have to be set
conservatively, similar to how work_mem is today. Like with work_mem,
during quiet periods it sure would be nice if it could magically increase.

An additional benefit of this line of thinking is that planning would
always produce a best non-parallel path. And sometimes, there would
also be a best parallel path that is expected to run faster. We could
then choose between them dynamically at execution time.

Actually store 2 plans within the plan? Like with an AlternativePlanNode?

I think it's pretty hard to imagine a scenario as extreme as the one
you mention above ever actually occurring in practice. I mean, even
the most naive implementation of parallel query will presumably have
something like max_parallel_degree, and you probably won't have that
set to 128. For starters, it can't possibly make sense unless your
server has at least 128 CPUs, and even then it only makes sense if you
don't mind a single query using all of them, and even if the first of
those things is true, the second one probably isn't. I don't doubt
that less extreme variants of this scenario are possible, though.

Yeah maybe, it does seem quite extreme, but maybe less so as the years roll
on a bit... perhaps in 5-10 years it might be quite common to have that
many spare CPU cores to throw at a task.

I think if we have this percentage GUC you mentioned to prefer parallel
plans if they're within a % threshold of the serial plan, then we could end
up with problems with I/O and buffers getting thrown out of caches due to
the extra I/O involved in parallel plans going with seq scans instead of
serial plans choosing index scans.

In summary it sounds like with my idea we get:

Pros
* Optimal plan if no workers are available at execution time.
* Parallelism possible if the chosen optimal plan happens to support
parallelism, e.g. not an index scan.
* No planning overhead

Cons:
* The plan "Parallelizer" must make changes to the plan just before
execution time, which ruins the 1 to 1 ratio of plan/executor nodes by the
time you inject Funnel nodes.

If we parallelise during planning time:

Pros
* More chance of getting a parallel friendly plan which could end up being
very fast if we get enough workers at executor time.

Cons:
* May produce non optimal plans if no worker processes are available during
execution time.
* Planning overhead for considering parallel paths.
* The parallel plan may blow out buffer caches due to increased I/O of
parallel plan.

Of course please say if I've missed any pro or con.

Regards

David Rowley

#254Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Amit Kapila (#251)
Re: Parallel Seq Scan

On 08-04-2015 PM 12:46, Amit Kapila wrote:

Going forward, I think we can improve the same if we decide not to shutdown
parallel workers till postmaster shutdown once they are started and
then just allocate them during executor-start phase.

I wonder if it makes sense to invent the notion of a global pool of workers
with configurable number of workers that are created at postmaster start and
destroyed at shutdown and requested for use when a query uses parallelizable
nodes. That way, parallel costing model might be better able to factor in the
available-resources-for-parallelization aspect, too. Though, I'm not quite
sure how that helps solve (if at all) the problem of occasional unjustifiable
resource consumption due to parallelization.

Thanks,
Amit

#255David Rowley
dgrowleyml@gmail.com
In reply to: Amit Kapila (#251)
Re: Parallel Seq Scan

On 8 April 2015 at 15:46, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Apr 8, 2015 at 1:53 AM, Kevin Grittner <kgrittn@ymail.com> wrote:

David Rowley <dgrowleyml@gmail.com> wrote:

If we attempt to do this parallel stuff at plan time, and we
happen to plan at some quiet period, or perhaps worse, some
application's start-up process happens to PREPARE a load of
queries when the database is nice and quiet, then quite possibly
we'll end up with some highly parallel queries. Then perhaps come
the time these queries are actually executed the server is very
busy... Things will fall apart quite quickly due to the masses of
IPC and context switches that would be going on.

I completely understand that this parallel query stuff is all
quite new to us all and we're likely still trying to nail down
the correct infrastructure for it to work well, so this is why
I'm proposing that the planner should know nothing of parallel
query, instead I think it should work more along the lines of:

* Planner should be completely oblivious to what parallel query
is.
* Before executor startup the plan is passed to a function which
decides if we should parallelise it, and does so if the plan
meets the correct requirements. This should likely have a very
fast exit path such as:
if root node's cost < parallel_query_cost_threshold
    return; /* the query is not expensive enough to attempt to make parallel */

The above check will allow us to have an almost zero overhead for
small low cost queries.

This function would likely also have some sort of logic in order
to determine if the server has enough spare resource at the
current point in time to allow queries to be parallelised

There is a lot to like about this suggestion.

I've seen enough performance crashes due to too many concurrent
processes (even when each connection can only use a single process)
to believe that, for a plan which will be saved, it is not possible to
know at planning time whether parallelization will be a nice win or
a devastating over-saturation of resources during some later
execution phase.

Another thing to consider is that this is not entirely unrelated to
the concept of admission control policies. Perhaps this phase
could be a more general execution start-up admission control phase,
where parallel processing would be one adjustment that could be
considered.

I think there is always a chance that resources (like parallel workers)
won't be available at run time even if we decide about them in the
executor-start phase, unless we reserve them for that node's usage. OTOH,
if we reserve those resources (by allocating them) during the executor-start
phase, then we might end up reserving them too early, or maybe they won't
even get used if we decide not to execute that node. On that basis, it seems
to me the current strategy is not bad: we decide at planning time, and later,
at execution time, if not all resources (particularly parallel workers) are
available, we use only the available ones to execute the plan.
Going forward, I think we can improve on this if we decide not to shut down
parallel workers until postmaster shutdown once they are started, and
then just allocate them during the executor-start phase.

Yeah, but what about when workers are not available in cases when the plan
was only a win because the planner thought there would be lots of
workers... There could have been a more optimal serial plan already thrown
out by the planner which is no longer available to the executor.

If the planner didn't know about parallelism then we'd already have the
most optimal plan and it would be no great loss if no workers were around
to help.

Regards

David Rowley

#256Amit Kapila
amit.kapila16@gmail.com
In reply to: David Rowley (#255)
Re: Parallel Seq Scan

On Wed, Apr 8, 2015 at 3:30 PM, David Rowley <dgrowleyml@gmail.com> wrote:

On 8 April 2015 at 15:46, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think there is always a chance that resources (like parallel workers)
won't be available at run time even if we decide about them in the
executor-start phase, unless we reserve them for that node's usage. OTOH,
if we reserve those resources (by allocating them) during the executor-start
phase, then we might end up reserving them too early, or maybe they won't
even get used if we decide not to execute that node. On that basis, it seems
to me the current strategy is not bad: we decide at planning time, and later,
at execution time, if not all resources (particularly parallel workers) are
available, we use only the available ones to execute the plan.
Going forward, I think we can improve on this if we decide not to shut down
parallel workers until postmaster shutdown once they are started, and
then just allocate them during the executor-start phase.

Yeah, but what about when workers are not available in cases when the plan
was only a win because the planner thought there would be lots of
workers... There could have been a more optimal serial plan already thrown
out by the planner which is no longer available to the executor.

That could also happen even if we decide in executor-start phase.
I agree that there is a chance of loss in case appropriate resources
are not available during execution, but the same is true for work_mem
as well for a non-parallel plan. I think we need some advanced way
to handle the case when resources are not available during execution,
either by re-planning the statement or by some other means, but that can
also be done separately.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#257David Rowley
dgrowleyml@gmail.com
In reply to: Amit Kapila (#256)
Re: Parallel Seq Scan

On 9 April 2015 at 00:12, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Apr 8, 2015 at 3:30 PM, David Rowley <dgrowleyml@gmail.com> wrote:

On 8 April 2015 at 15:46, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think there is always a chance that resources (like parallel workers)
won't be available at run time even if we decide about them in the
executor-start phase, unless we reserve them for that node's usage. OTOH,
if we reserve those resources (by allocating them) during the executor-start
phase, then we might end up reserving them too early, or maybe they won't
even get used if we decide not to execute that node. On that basis, it seems
to me the current strategy is not bad: we decide at planning time, and later,
at execution time, if not all resources (particularly parallel workers) are
available, we use only the available ones to execute the plan.
Going forward, I think we can improve on this if we decide not to shut down
parallel workers until postmaster shutdown once they are started, and
then just allocate them during the executor-start phase.

Yeah, but what about when workers are not available in cases when the plan
was only a win because the planner thought there would be lots of
workers... There could have been a more optimal serial plan already thrown
out by the planner which is no longer available to the executor.

That could also happen even if we decide in executor-start phase.

Yes this is true, but if we already have the most optimal serial plan, then
there's no issue.

I agree that there is a chance of loss in case appropriate resources
are not available during execution, but the same is true for work_mem
as well for a non-parallel plan. I think we need some advanced way
to handle the case when resources are not available during execution,
either by re-planning the statement or by some other means, but that can
also be done separately.

There was some talk of re-planning queries over on the Removing INNER JOINs
thread:
/messages/by-id/CA+TgmoaHi8tq7haZCf46O_NUHT8w=P0Z_N59DC0yOjfMucS9bg@mail.gmail.com

Regards

David Rowley

#258Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#252)
Re: Parallel Seq Scan

On Tue, Apr 7, 2015 at 11:58 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

One disadvantage of retaining parallel paths could be that it can
increase the number of combinations the planner might need to evaluate
during planning (in particular during join path evaluation), unless we
do some special handling to avoid evaluating such combinations.

Yes, that's true. But the overhead might not be very much. In the
common case, many baserels and joinrels will have no parallel paths
because the non-parallel path is known to be better anyway. Also, if
parallelism does seem to be winning, we're probably planning a query
that involves accessing a fair amount of data, so a little extra
planner overhead may not be so bad.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#259Robert Haas
robertmhaas@gmail.com
In reply to: David Rowley (#253)
Re: Parallel Seq Scan

On Wed, Apr 8, 2015 at 3:34 AM, David Rowley <dgrowleyml@gmail.com> wrote:

On 8 April 2015 at 14:24, Robert Haas <robertmhaas@gmail.com> wrote:

I think one of the philosophical questions that has to be answered
here is "what does it mean to talk about the cost of a parallel
plan?". For a non-parallel plan, the cost of the plan means both "the
amount of effort we will spend executing the plan" and also "the
amount of time we think the plan will take to complete", but those two
things are different for parallel plans. I'm inclined to think it's
right to view the cost of a parallel plan as a proxy for execution
time, because the fundamental principle of the planner is that we pick
the lowest-cost plan. But there also clearly needs to be some way to
prevent the selection of a plan which runs slightly faster at the cost
of using vastly more resources.

I'd agree with that as far as CPU costs go, or maybe I'd just disagree with the
alternative: if we costed in <cost of individual worker's work> * <number
of workers>, then we'd never choose a parallel plan, because by the time we
costed in tuple communication costs between the processes, a parallel plan
would always cost more than the serial equivalent. I/O costs are different;
I'd imagine these shouldn't be divided by the estimated number of workers.

It's hard to say. If the I/O is from the OS buffer cache, then
there's no reason why several workers can't run in parallel. And
even if it's from the actual storage, we don't know what degree of I/O
parallelism will be possible. Maybe effective_io_concurrency should
play into the costing formula somehow, but it's not very clear to me
that captures the information we care about. In general, I'm not sure
how common it is for the execution speed of a sequential scan to be
limited by I/O.

For example, on a pgbench database, scale factor 300, on a POWERPC
machine provided by IBM for performance testing (thanks, IBM!) a
cached read of the pgbench_accounts files took 1.122 seconds. After
dropping the caches, it took 10.427 seconds. "select * from
pgbench_accounts where abalance > 30000" took 10.244 seconds with a
cold cache and 5.029 seconds with a warm cache. So on this particular
hardware, on this particular test, parallelism is useless if the cache
is cold, but it could be right to use ~4-5 processes for the scan if
the cache is warm. However, we have no way of knowing whether the
cache will be cold or warm at execution time.
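
One way to read those numbers, as a back-of-the-envelope sketch: if simply reading the data is treated as a floor that extra processes cannot lower (an assumption made here for illustration, not something measured above), then the ratio of single-process query time to raw read time suggests roughly how many processes could be kept usefully busy. The function name is hypothetical; the timings are the ones quoted above.

```python
# Timings in seconds, copied from the message above.
CACHED_READ = 1.122
COLD_READ = 10.427
QUERY_COLD = 10.244
QUERY_WARM = 5.029


def useful_processes(query_seconds, read_seconds):
    """Ratio of single-process query time to raw read time.

    Assumption: the read time is a floor that more processes cannot reduce.
    """
    return query_seconds / read_seconds


warm = useful_processes(QUERY_WARM, CACHED_READ)
cold = useful_processes(QUERY_COLD, COLD_READ)
print(f"warm cache: ~{warm:.1f} processes, cold cache: ~{cold:.1f}")
```

The warm-cache ratio lands between 4 and 5, in line with the "~4-5 processes" estimate, while the cold-cache ratio is about 1, meaning the single process is already waiting on I/O.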

This isn't a new problem. As it is, the user has to set seq_page_cost
and random_page_cost based on either a cold-cache assumption or a
warm-cache assumption, and if they guess wrong, their costing
estimates will be off (on this platform, on this test case) by 4-5x.
That's pretty bad, and it's totally unclear to me what to do about it.
I'm guessing it's unclear to other people, too, or we would likely
have done something about it by now.

Some ideas for GUCs:

max_parallel_degree = The largest number of processes we'll consider
using for a single query.
min_parallel_speedup = The minimum percentage by which a parallel path
must be cheaper (in terms of execution time) than a non-parallel path
in order to survive. I'm imagining the default here might be
something like 15%.
min_parallel_speedup_per_worker = Like the previous one, but per
worker. e.g. if this is 5%, which might be a sensible default, then a
plan with 4 workers must be at least 20% better to survive, but a plan
using only 2 workers only needs to be 10% better.

max_parallel_degree feels awfully like it would have to be set
conservatively, similar to how work_mem is today. Like with work_mem, during
quiet periods it sure would be nice if it could magically increase.

Absolutely. But, similar to work_mem, that's a really hard problem.
We can't know at plan time how much work memory, or how many CPUs,
will be available at execution time. And even if we did, it need not
be constant throughout the whole of query execution. It could be that
when execution starts, there's lots of memory available, so we do a
quicksort rather than a tape-sort. But midway through the machine
comes under intense memory pressure and there's no way for the system
to switch strategies.

Now, having said that, I absolutely believe that it's correct for the
planner to make the initial decisions in this area. Parallelism
changes the cost of execution nodes, and it's completely wrong to
assume that this couldn't alter planner decisions at higher levels of
the plan tree. At the same time, it's pretty clear that it would be a
great thing for the executor to be able to adjust the strategy if the
planner's assumptions don't pan out, or if conditions have changed.

For example, if we choose a seq-scan-sort-and-filter over an
index-scan-and-filter thinking that we'll be able to do a quicksort,
and then it turns out that we're short on memory, it's too late to
switch gears and adopt the index-scan-and-filter plan after all.
That's long since been discarded. But it's still better to switch to
a heap sort than to persist with a quicksort that's either going to
fail outright, or (maybe worse) succeed but drive the machine into
swap, which will just utterly obliterate performance.
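
As a toy illustration of the executor adjusting its strategy when the planner's memory assumption does not hold: sort in one pass if the input fits, otherwise fall back to sorted runs plus a merge. This is a stand-in, not PostgreSQL's tuplesort; Python's built-in sort is used in place of quicksort, a run-based merge via heapq.merge stands in for the fallback, and max_rows_in_memory is a hypothetical substitute for a memory budget.

```python
import heapq


def adaptive_sort(rows, max_rows_in_memory):
    """Sort rows, choosing the strategy at execution time."""
    if max_rows_in_memory < 1:
        raise ValueError("max_rows_in_memory must be at least 1")
    if len(rows) <= max_rows_in_memory:
        # The planner's assumption held: everything fits, sort in one go.
        return sorted(rows)
    # Fallback: sort bounded-size runs, then merge them lazily.
    runs = [sorted(rows[i:i + max_rows_in_memory])
            for i in range(0, len(rows), max_rows_in_memory)]
    return list(heapq.merge(*runs))
```

The plan shape (seq scan, then sort, then filter) stays the same in both branches; only the sort's internal strategy changes, which is the kind of late adjustment that remains possible after alternative plans have been discarded.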

An additional benefit of this line of thinking is that planning would
always produce a best non-parallel path. And sometimes, there would
also be a best parallel path that is expected to run faster. We could
then choose between them dynamically at execution time.

Actually store 2 plans within the plan? Like with an AlternativePlanNode?

Yeah. I'm not positive that's a good idea, but it seems like it might be.
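
Purely as a sketch of what "store 2 plans within the plan" could mean: a node that holds both alternatives and picks one when execution starts. AlternativePlan, its fields, and the min_workers rule are all hypothetical; nothing in this thread defines such a node.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlternativePlan:
    serial_plan: str     # stand-in for the best non-parallel plan tree
    parallel_plan: str   # stand-in for the best parallel plan tree
    min_workers: int     # hypothetical: workers needed for parallel to win

    def choose(self, workers_available):
        """Pick a plan at execution time based on what is actually free."""
        if workers_available >= self.min_workers:
            return self.parallel_plan
        return self.serial_plan


node = AlternativePlan("IndexScan", "Funnel(ParallelSeqScan)", min_workers=2)
```

With this shape, node.choose(0) falls back to the serial plan rather than running a parallel plan with nobody to help.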

I think it's pretty hard to imagine a scenario as extreme as the one
you mention above ever actually occurring in practice. I mean, even
the most naive implementation of parallel query will presumably have
something like max_parallel_degree, and you probably won't have that
set to 128. For starters, it can't possibly make sense unless your
server has at least 128 CPUs, and even then it only makes sense if you
don't mind a single query using all of them, and even if the first of
those things is true, the second one probably isn't. I don't doubt
that less extreme variants of this scenario are possible, though.

Yeah maybe, it does seem quite extreme, but maybe less so as the years roll
on a bit... perhaps in 5-10 years it might be quite common to have that many
spare CPU cores to throw at a task.

That is certainly possible, but we need to start small. It's
completely OK for the first version of this feature to have some rough
edges that get improved later. Indeed, it's absolutely vital, or
we'll never get this thing off the ground.

I think if we have this percentage GUC you mentioned to prefer parallel
plans if they're within a % threshold of the serial plan, then we could end
up with problems with I/O and buffers getting thrown out of caches due to
the extra I/O involved in parallel plans going with seq scans instead of
serial plans choosing index scans.

That's possible, but the non-parallel planner doesn't account for
caching effects, either.

In summary it sounds like with my idea we get:

Pros
* Optimal plan if no workers are available at execution time.
* Parallelism possible if the chosen optimal plan happens to support
parallelism, e.g. not an index scan.
* No planning overhead

The third one isn't really true. You've just moved some of the
planning to execution time.

Cons:
* The plan "Parallelizer" must make changes to the plan just before
execution time, which ruins the 1 to 1 ratio of plan/executor nodes by the
time you inject Funnel nodes.

If we parallelise during planning time:

Pros
* More chance of getting a parallel friendly plan which could end up being
very fast if we get enough workers at executor time.

This, to me, is by far the biggest "con" of trying to do something at
execution time. If planning doesn't take into account the gains that
are possible from parallelism, then you'll only be able to come up
with the best parallel plan when it happens to be a parallelized
version of the best serial plan. So long as the only parallel
operator is parallel seq scan, that will probably be a common
scenario. But once we assemble a decent selection of parallel
operators, and a reasonably intelligent parallel query optimizer, I'm
not so sure it'll still be true.

Cons:
* May produce non optimal plans if no worker processes are available during
execution time.
* Planning overhead for considering parallel paths.
* The parallel plan may blow out buffer caches due to increased I/O of
parallel plan.

Of course please say if I've missed any pro or con.

I think I generally agree with your list; but we might not agree on
the relative importance of the items on it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#260Robert Haas
robertmhaas@gmail.com
In reply to: Amit Langote (#254)
Re: Parallel Seq Scan

On Wed, Apr 8, 2015 at 3:38 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 08-04-2015 PM 12:46, Amit Kapila wrote:

Going forward, I think we can improve the same if we decide not to shutdown
parallel workers till postmaster shutdown once they are started and
then just allocate them during executor-start phase.

I wonder if it makes sense to invent the notion of a global pool of workers
with configurable number of workers that are created at postmaster start and
destroyed at shutdown and requested for use when a query uses parallelizable
nodes.

Short answer: Yes, but not for the first version of this feature.

Longer answer: We can't actually very reasonably have a "global" pool
of workers so long as we retain the restriction that a backend
connected to one database cannot subsequently disconnect from it and
connect to some other database instead. However, it's certainly a
good idea to reuse the same workers for subsequent operations on the
same database, especially if they are also by the same user. At the
very minimum, it would be good to reuse the same workers for
subsequent operations within the same query, instead of destroying the
old ones and creating new ones. Notwithstanding the obvious value of
all of these ideas, I don't think we should do any of them for the
first version of this feature. This is too big a thing to get perfect
on the first try.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#261Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Robert Haas (#260)
Re: Parallel Seq Scan

On 2015-04-21 AM 03:29, Robert Haas wrote:

On Wed, Apr 8, 2015 at 3:38 AM, Amit Langote wrote:

On 08-04-2015 PM 12:46, Amit Kapila wrote:

Going forward, I think we can improve the same if we decide not to shutdown
parallel workers till postmaster shutdown once they are started and
then just allocate them during executor-start phase.

I wonder if it makes sense to invent the notion of a global pool of workers
with configurable number of workers that are created at postmaster start and
destroyed at shutdown and requested for use when a query uses parallelizable
nodes.

Short answer: Yes, but not for the first version of this feature.

Longer answer: We can't actually very reasonably have a "global" pool
of workers so long as we retain the restriction that a backend
connected to one database cannot subsequently disconnect from it and
connect to some other database instead. However, it's certainly a
good idea to reuse the same workers for subsequent operations on the
same database, especially if they are also by the same user. At the
very minimum, it would be good to reuse the same workers for
subsequent operations within the same query, instead of destroying the
old ones and creating new ones. Notwithstanding the obvious value of
all of these ideas, I don't think we should do any of them for the
first version of this feature. This is too big a thing to get perfect
on the first try.

Agreed.

Perhaps, Amit has worked (is working) on "reuse the same workers for
subsequent operations within the same query"

Thanks,
Amit


#262David Rowley
dgrowleyml@gmail.com
In reply to: Robert Haas (#259)
Re: Parallel Seq Scan

On 21 April 2015 at 06:26, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Apr 8, 2015 at 3:34 AM, David Rowley <dgrowleyml@gmail.com> wrote:

In summary it sounds like with my idea we get:

Pros
* Optimal plan if no workers are available at execution time.
* Parallelism possible if the chosen optimal plan happens to support
parallelism, e.g. not an index scan.
* No planning overhead

The third one isn't really true. You've just moved some of the
planning to execution time.

Hmm, sorry, I meant no planner overhead during normal planning.
My point was more that low cost queries don't
have to pay the price for the planner considering parallel paths. This
"parallelizer" that I keep talking about would only be asked to do anything
if the root node's cost was above some GUC like parallel_cost_threshold,
and a likely default for this would be a cost that translates into
a query taking, say, roughly anything over 1 second. This way super fast
1 millisecond plans don't have to suffer from extra time taken to consider
parallel paths. Once we're processing queries that are above this parallel
threshold then the cost of the parallelizer invocation would be drowned out
by the actual execution cost anyway.
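
A minimal sketch of the threshold check described above, purely for illustration. The variable name follows the "parallel_cost_threshold" GUC suggested in this mail, but its default value and the function plan_worth_parallelizing() are assumptions and are not part of any patch on this thread.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical setting: the mail only proposes "some GUC like
 * parallel_cost_threshold" and does not fix a default, so 1000.0 is an
 * arbitrary placeholder.
 */
static double parallel_cost_threshold = 1000.0;

/*
 * Return true when a finished serial plan is expensive enough that a
 * post-planning "parallelizer" pass would be worth invoking.  Cheap plans
 * skip the pass entirely and pay no extra overhead.
 */
static bool
plan_worth_parallelizing(double root_total_cost)
{
    assert(root_total_cost >= 0.0);
    return root_total_cost > parallel_cost_threshold;
}
```

With this shape, a plan whose root cost is at or below the threshold would be executed as-is, and only more expensive plans would be handed to the parallelizer.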

Cons:
* The plan "Parallelizer" must make changes to the plan just before
execution time, which ruins the 1 to 1 ratio of plan/executor nodes by the
time you inject Funnel nodes.

If we parallelise during planning time:

Pros
* More chance of getting a parallel friendly plan which could end up being
very fast if we get enough workers at executor time.

This, to me, is by far the biggest "con" of trying to do something at
execution time. If planning doesn't take into account the gains that
are possible from parallelism, then you'll only be able to come up
with the best parallel plan when it happens to be a parallelized
version of the best serial plan. So long as the only parallel
operator is parallel seq scan, that will probably be a common
scenario. But once we assemble a decent selection of parallel
operators, and a reasonably intelligent parallel query optimizer, I'm
not so sure it'll still be true.

I agree with that. It's a tough one.
I was hoping that this might be offset by the fact that we won't have to
pay the high price when the planner spits out a parallel plan when the
executor has no spare workers to execute it as intended, and also that we
wouldn't have to be nearly as conservative with the max_parallel_degree
GUC, that could just be set to the number of logical CPUs in the machine,
and we could just use that value minus number of active backends during
execution.
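
The arithmetic suggested here (the configured degree minus the number of active backends, evaluated at execution time) could be sketched as follows. The function name workers_available() is hypothetical, and clamping at zero is an assumption about how a shortage would be handled.

```c
#include <assert.h>

/*
 * Number of workers a query could ask for at execution time, assuming
 * max_parallel_degree is set to the number of logical CPUs and that every
 * active backend occupies one CPU.  Never returns a negative count.
 */
static int
workers_available(int max_parallel_degree, int active_backends)
{
    assert(max_parallel_degree >= 0 && active_backends >= 0);

    int spare = max_parallel_degree - active_backends;

    return spare > 0 ? spare : 0;
}
```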

Cons:
* May produce non optimal plans if no worker processes are available during
execution time.
* Planning overhead for considering parallel paths.
* The parallel plan may blow out buffer caches due to increased I/O of
parallel plan.

Of course please say if I've missed any pro or con.

I think I generally agree with your list; but we might not agree on
the relative importance of the items on it.

I've also been thinking about how, instead of having to have a special
PartialSeqScan node which contains a bunch of code to store tuples in a
shared memory queue, could we not have a "TupleBuffer", or
"ParallelTupleReader" node, one of which would always be the root node of a
plan branch that's handed off to a worker process. This node would just try
to keep its shared tuple store full, and perhaps once it fills it could
have a bit of a sleep and be woken up when there's a bit more space on the
queue. When no more tuples were available from the node below this, then
the worker could exit. (providing there was no rescan required)
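
To make the "keep its shared tuple store full" behaviour concrete, here is a single-process sketch of a bounded queue. It is only an illustration: tuples are modelled as plain ints, there is no shared memory or locking, and all names (TupleQueue, queue_put, queue_get, fill_queue) are hypothetical rather than taken from the patch, which uses shm_mq for this purpose.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_CAPACITY 4

/* Fixed-size ring buffer standing in for the shared tuple store. */
typedef struct TupleQueue
{
    int     items[QUEUE_CAPACITY];  /* stand-in for serialized tuples */
    size_t  head;                   /* index of the next item to read */
    size_t  count;                  /* number of items currently queued */
} TupleQueue;

static void
queue_init(TupleQueue *q)
{
    q->head = 0;
    q->count = 0;
}

/* Returns false when the queue is full; a real worker would sleep here. */
static bool
queue_put(TupleQueue *q, int tuple)
{
    if (q->count == QUEUE_CAPACITY)
        return false;
    q->items[(q->head + q->count) % QUEUE_CAPACITY] = tuple;
    q->count++;
    return true;
}

/* Returns false when the queue is empty. */
static bool
queue_get(TupleQueue *q, int *tuple)
{
    if (q->count == 0)
        return false;
    *tuple = q->items[q->head];
    q->head = (q->head + 1) % QUEUE_CAPACITY;
    q->count--;
    return true;
}

/*
 * Worker side: push tuples from the node below until either the source is
 * exhausted or the queue is full.  *next tracks the position in the source;
 * the return value is the number of tuples pushed by this call.
 */
static int
fill_queue(TupleQueue *q, const int *source, int nsource, int *next)
{
    int pushed = 0;

    assert(q != NULL && source != NULL && next != NULL);
    while (*next < nsource && queue_put(q, source[*next]))
    {
        (*next)++;
        pushed++;
    }
    return pushed;
}
```

In this model the worker would call fill_queue() again each time the reader has drained some space, and could exit once *next reaches nsource.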

I think between the Funnel node and a ParallelTupleReader we could actually
parallelise plans that don't even have parallel safe nodes.... Let me
explain:

Let's say we have a 4 way join, and the join order must be {a,b}, {c,d} =>
{a,b,c,d}. Assuming the cost of joining a to b and c to d are around the
same, the Parallelizer may notice this and decide to inject a Funnel and
then ParallelTupleReader just below the node for c join d and have c join d
in parallel. Meanwhile the main worker process could be executing the root
node, as normal. This way the main worker wouldn't have to go to the
trouble of joining c to d itself as the worker would have done all that
hard work.
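
As a rough sketch of the plan rewrite described above, the following models a plan tree and wraps one subtree in a Funnel plus ParallelTupleReader pair. The PlanNode struct and inject_funnel() are simplified, hypothetical stand-ins and do not correspond to PostgreSQL's actual plan node definitions.

```c
#include <assert.h>
#include <stddef.h>

typedef enum
{
    NODE_SCAN,
    NODE_JOIN,
    NODE_FUNNEL,
    NODE_TUPLE_READER
} NodeKind;

/* Simplified binary plan node; real plan nodes carry far more state. */
typedef struct PlanNode
{
    NodeKind         kind;
    struct PlanNode *left;
    struct PlanNode *right;
} PlanNode;

/*
 * Wrap "subtree" so that it becomes Funnel -> ParallelTupleReader -> subtree.
 * The caller supplies storage for the two new nodes and is expected to
 * re-link the returned Funnel into the parent in place of the subtree.
 */
static PlanNode *
inject_funnel(PlanNode *subtree, PlanNode *funnel, PlanNode *reader)
{
    assert(subtree != NULL && funnel != NULL && reader != NULL);

    reader->kind = NODE_TUPLE_READER;
    reader->left = subtree;
    reader->right = NULL;

    funnel->kind = NODE_FUNNEL;
    funnel->left = reader;
    funnel->right = NULL;

    return funnel;
}
```

For the 4 way join example, the parent of the "c join d" node would replace its child pointer with the result of inject_funnel(), leaving the "a join b" side to be executed by the main process as usual.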

I know the current patch is still very early in the evolution of
PostgreSQL's parallel query, but how would that work with the current
method of selecting which parts of the plan to parallelise? I really think
the plan needs to be a complete plan before it can be best analysed on how
to divide the workload between workers, and also, it would be quite useful
to know how many workers are going to be able to lend a hand in order to
know best how to divide the plan up as evenly as possible.

Apologies if this seems like complete rubbish, or if it seems like parallel
query mark 3, when we're not done yet with mark 1. I just can't see how,
with the current approach, we could parallelise normal plans like
the 4 way join I describe above, and I think it would be a shame if we
developed down a path that made this not possible.

Regards

David Rowley

#263Amit Kapila
amit.kapila16@gmail.com
In reply to: David Rowley (#262)
Re: Parallel Seq Scan

On Tue, Apr 21, 2015 at 2:29 PM, David Rowley <dgrowleyml@gmail.com> wrote:

I've also been thinking about how, instead of having to have a special
PartialSeqScan node which contains a bunch of code to store tuples in a
shared memory queue, could we not have a "TupleBuffer", or
"ParallelTupleReader" node, one of which would always be the root node of a
plan branch that's handed off to a worker process. This node would just try
to keep its shared tuple store full, and perhaps once it fills it could
have a bit of a sleep and be woken up when there's a bit more space on the
queue. When no more tuples were available from the node below this, then
the worker could exit. (providing there was no rescan required)

I think between the Funnel node and a ParallelTupleReader we could
actually parallelise plans that don't even have parallel safe nodes.... Let
me explain:

Let's say we have a 4 way join, and the join order must be {a,b}, {c,d} =>
{a,b,c,d}. Assuming the cost of joining a to b and c to d are around the
same, the Parallelizer may notice this and decide to inject a Funnel and
then ParallelTupleReader just below the node for c join d and have c join d
in parallel. Meanwhile the main worker process could be executing the root
node, as normal. This way the main worker wouldn't have to go to the
trouble of joining c to d itself as the worker would have done all that
hard work.

I know the current patch is still very early in the evolution of
PostgreSQL's parallel query, but how would that work with the current
method of selecting which parts of the plan to parallelise?

The Funnel node is quite generic and can handle the case as
described by you if we add Funnel on top of join node (c join d).
It currently passes plannedstmt to worker which can contain any
type of plan (though we need to add some more code to make it
work if we want to execute any node other than Result or PartialSeqScan
node.)

I really think the plan needs to be a complete plan before it can be best
analysed on how to divide the workload between workers, and also, it would
be quite useful to know how many workers are going to be able to lend a
hand in order to know best how to divide the plan up as evenly as possible.

I think there is some advantage of changing an already built plan
to parallel plan based on resources and there is some literature
about the same, but I think we will lose much more by not considering
parallelism during planning time. If I remember correctly, then some
of the other databases do tackle this problem of shortage of resources
during execution as mentioned by me upthread, but I think for that it
is not necessary to have a Parallel Planner as a separate layer.
I believe it is important to have some way to handle shortage of resources
during execution, but it can be done at later stage.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#264Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Langote (#261)
Re: Parallel Seq Scan

On Tue, Apr 21, 2015 at 6:34 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 2015-04-21 AM 03:29, Robert Haas wrote:

On Wed, Apr 8, 2015 at 3:38 AM, Amit Langote wrote:

On 08-04-2015 PM 12:46, Amit Kapila wrote:

Going forward, I think we can improve the same if we decide not to shutdown
parallel workers till postmaster shutdown once they are started and
then just allocate them during executor-start phase.

I wonder if it makes sense to invent the notion of a global pool of workers
with configurable number of workers that are created at postmaster start and
destroyed at shutdown and requested for use when a query uses parallelizable
nodes.

Short answer: Yes, but not for the first version of this feature.

Agreed.

Perhaps, Amit has worked (is working) on "reuse the same workers for
subsequent operations within the same query"

What I am planning to do is destroy the resources (parallel context) once
we have fetched all the tuples from the Funnel node, so that we don't hold
all the resources until the end of execution. We can't call that reuse; rather,
it will allow multiple nodes in the same statement to use workers when there
is a restriction on the total number of workers (max_worker_processes) that
can be used.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#265Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#258)
Re: Parallel Seq Scan

On Mon, Apr 20, 2015 at 10:08 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Apr 7, 2015 at 11:58 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

One disadvantage of retaining parallel-paths could be that it can
increase the number of combinations planner might need to evaluate
during planning (in particular during join path evaluation) unless we
do some special handling to avoid evaluation of such combinations.

Yes, that's true. But the overhead might not be very much. In the
common case, many baserels and joinrels will have no parallel paths
because the non-parallel path is known to be better anyway. Also, if
parallelism does seem to be winning, we're probably planning a query
that involves accessing a fair amount of data,

Am I right in understanding that you mean we should retain both the
parallel and non-parallel paths only if the parallel path wins over the
non-parallel path?

If yes, then I can see the advantage of retaining both
parallel and non-parallel paths; otherwise, could you explain a bit more
why you think it is advantageous to retain the parallel path even when it
loses to the serial path in the beginning?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#266Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#265)
Re: Parallel Seq Scan

On Tue, Apr 21, 2015 at 9:38 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Apr 20, 2015 at 10:08 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Apr 7, 2015 at 11:58 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

One disadvantage of retaining parallel-paths could be that it can
increase the number of combinations planner might need to evaluate
during planning (in particular during join path evaluation) unless we
do some special handling to avoid evaluation of such combinations.

Yes, that's true. But the overhead might not be very much. In the
common case, many baserels and joinrels will have no parallel paths
because the non-parallel path is known to be better anyway. Also, if
parallelism does seem to be winning, we're probably planning a query
that involves accessing a fair amount of data,

Am I right in understanding that you mean we should retain both the
parallel and non-parallel paths only if the parallel path wins over the
non-parallel path?

Yes.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#267Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#234)
1 attachment(s)
Re: Parallel Seq Scan

On Mon, Mar 30, 2015 at 8:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Mar 18, 2015 at 11:43 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think I figured out the problem. That fix only helps in the case
where the postmaster noticed the new registration previously but
didn't start the worker, and then later notices the termination.
What's much more likely to happen is that the worker is started and
terminated so quickly that both happen before we create a
RegisteredBgWorker for it. The attached patch fixes that case, too.

Patch fixes the problem and now for Rescan, we don't need to Wait
for workers to finish.

I realized that there is a problem with this. If an error occurs in
one of the workers just as we're deciding to kill them all, then the
error won't be reported. Also, the new code to propagate
XactLastRecEnd won't work right, either. I think we need to find a
way to shut down the workers cleanly. The idea generally speaking
should be:

1. Tell all of the workers that we want them to shut down gracefully
without finishing the scan.

2. Wait for them to exit via WaitForParallelWorkersToFinish().

My first idea about how to implement this is to have the master detach
all of the tuple queues via a new function TupleQueueFunnelShutdown().
Then, we should change tqueueReceiveSlot() so that it does not throw
an error when shm_mq_send() returns SHM_MQ_DETACHED. We could modify
the receiveSlot method of a DestReceiver to return bool rather than
void; a "true" value can mean "continue processing" whereas a "false"
value can mean "stop early, just as if we'd reached the end of the
scan".
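
The proposed change to the receive callback can be illustrated with a small self-contained model. Everything here is a simplified assumption: Receiver stands in for DestReceiver, tuples are plain ints, and the "limit" field merely simulates the point at which the tuple queue has been detached. It is a sketch of the control flow only, not the code in the attached patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct Receiver Receiver;

/* Simplified stand-in for a DestReceiver whose callback returns bool. */
struct Receiver
{
    bool (*receive) (int tuple, Receiver *self);
    int  accepted;  /* tuples accepted so far */
    int  limit;     /* simulates the queue being detached after this many */
};

/*
 * Returns true for "continue processing" and false for "stop early, just
 * as if we'd reached the end of the scan".
 */
static bool
limited_receive(int tuple, Receiver *self)
{
    (void) tuple;               /* payload is irrelevant to the sketch */
    if (self->accepted >= self->limit)
        return false;
    self->accepted++;
    return true;
}

/*
 * Producer loop: stop sending as soon as the receiver asks us to, instead
 * of raising an error.  Returns the number of tuples delivered.
 */
static int
run_scan(const int *tuples, int ntuples, Receiver *dest)
{
    int sent = 0;

    assert(tuples != NULL && dest != NULL && dest->receive != NULL);
    for (int i = 0; i < ntuples; i++)
    {
        if (!dest->receive(tuples[i], dest))
            break;
        sent++;
    }
    return sent;
}
```

The design point is that an early stop becomes an ordinary return path, so a worker can exit cleanly and still report any error or state it has accumulated.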

I have implemented this idea (note that I had to expose a new API,
shm_mq_from_handle, as TupleQueueFunnel stores shm_mq_handle* and
we need shm_mq* to call shm_mq_detach), and apart from this I have fixed
other problems reported on this thread:

1. Execution of initPlan by master backend and then pass the
required PARAM_EXEC parameter values to workers.
2. Avoid consuming dsm's by freeing the parallel context after
the last tuple is fetched.
3. Allow execution of Result node in worker backend as that can
be added as a gating filter on top of PartialSeqScan.
4. Merged parallel heap scan descriptor patch

To apply the patch, please follow the sequence below:

HEAD Commit-Id: 4d930eee
parallel-mode-v9.patch [1]
assess-parallel-safety-v4.patch [2] (don't forget to run fixpgproc.pl in
the patch)
parallel_seqscan_v14.patch (Attached with this mail)

[1]: /messages/by-id/CA+TgmoZfSXZhS6qy4Z0786D7iU_AbhBVPQFwLthpSvGieczqHg@mail.gmail.com
[2]: /messages/by-id/CA+TgmobJSuefiPOk6+i9WERUgeAB3ggJv7JxLX+r6S5SYydBRQ@mail.gmail.com

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

parallel_seqscan_v14.patch (application/octet-stream)
diff --git a/src/backend/access/common/printtup.c b/src/backend/access/common/printtup.c
index baed981..639451a 100644
--- a/src/backend/access/common/printtup.c
+++ b/src/backend/access/common/printtup.c
@@ -26,9 +26,9 @@
 
 static void printtup_startup(DestReceiver *self, int operation,
 				 TupleDesc typeinfo);
-static void printtup(TupleTableSlot *slot, DestReceiver *self);
-static void printtup_20(TupleTableSlot *slot, DestReceiver *self);
-static void printtup_internal_20(TupleTableSlot *slot, DestReceiver *self);
+static bool printtup(TupleTableSlot *slot, DestReceiver *self);
+static bool printtup_20(TupleTableSlot *slot, DestReceiver *self);
+static bool printtup_internal_20(TupleTableSlot *slot, DestReceiver *self);
 static void printtup_shutdown(DestReceiver *self);
 static void printtup_destroy(DestReceiver *self);
 
@@ -299,7 +299,7 @@ printtup_prepare_info(DR_printtup *myState, TupleDesc typeinfo, int numAttrs)
  *		printtup --- print a tuple in protocol 3.0
  * ----------------
  */
-static void
+static bool
 printtup(TupleTableSlot *slot, DestReceiver *self)
 {
 	TupleDesc	typeinfo = slot->tts_tupleDescriptor;
@@ -376,13 +376,15 @@ printtup(TupleTableSlot *slot, DestReceiver *self)
 	/* Return to caller's context, and flush row's temporary memory */
 	MemoryContextSwitchTo(oldcontext);
 	MemoryContextReset(myState->tmpcontext);
+
+	return true;
 }
 
 /* ----------------
  *		printtup_20 --- print a tuple in protocol 2.0
  * ----------------
  */
-static void
+static bool
 printtup_20(TupleTableSlot *slot, DestReceiver *self)
 {
 	TupleDesc	typeinfo = slot->tts_tupleDescriptor;
@@ -452,6 +454,8 @@ printtup_20(TupleTableSlot *slot, DestReceiver *self)
 	/* Return to caller's context, and flush row's temporary memory */
 	MemoryContextSwitchTo(oldcontext);
 	MemoryContextReset(myState->tmpcontext);
+
+	return true;
 }
 
 /* ----------------
@@ -528,7 +532,7 @@ debugStartup(DestReceiver *self, int operation, TupleDesc typeinfo)
  *		debugtup - print one tuple for an interactive backend
  * ----------------
  */
-void
+bool
 debugtup(TupleTableSlot *slot, DestReceiver *self)
 {
 	TupleDesc	typeinfo = slot->tts_tupleDescriptor;
@@ -553,6 +557,8 @@ debugtup(TupleTableSlot *slot, DestReceiver *self)
 		printatt((unsigned) i + 1, typeinfo->attrs[i], value);
 	}
 	printf("\t----\n");
+
+	return true;
 }
 
 /* ----------------
@@ -564,7 +570,7 @@ debugtup(TupleTableSlot *slot, DestReceiver *self)
  * This is largely same as printtup_20, except we use binary formatting.
  * ----------------
  */
-static void
+static bool
 printtup_internal_20(TupleTableSlot *slot, DestReceiver *self)
 {
 	TupleDesc	typeinfo = slot->tts_tupleDescriptor;
@@ -636,4 +642,6 @@ printtup_internal_20(TupleTableSlot *slot, DestReceiver *self)
 	/* Return to caller's context, and flush row's temporary memory */
 	MemoryContextSwitchTo(oldcontext);
 	MemoryContextReset(myState->tmpcontext);
+
+	return true;
 }
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index da0b70e..388a8c6 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -63,6 +63,7 @@
 #include "storage/predicate.h"
 #include "storage/procarray.h"
 #include "storage/smgr.h"
+#include "storage/spin.h"
 #include "storage/standby.h"
 #include "utils/datum.h"
 #include "utils/inval.h"
@@ -80,8 +81,10 @@ bool		synchronize_seqscans = true;
 static HeapScanDesc heap_beginscan_internal(Relation relation,
 						Snapshot snapshot,
 						int nkeys, ScanKey key,
+						ParallelHeapScanDesc parallel_scan,
 						bool allow_strat, bool allow_sync,
 						bool is_bitmapscan, bool temp_snap);
+static BlockNumber heap_parallelscan_nextpage(ParallelHeapScanDesc);
 static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
 					TransactionId xid, CommandId cid, int options);
 static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
@@ -222,7 +225,10 @@ initscan(HeapScanDesc scan, ScanKey key, bool is_rescan)
 	 * results for a non-MVCC snapshot, the caller must hold some higher-level
 	 * lock that ensures the interesting tuple(s) won't change.)
 	 */
-	scan->rs_nblocks = RelationGetNumberOfBlocks(scan->rs_rd);
+	if (scan->rs_parallel != NULL)
+		scan->rs_nblocks = scan->rs_parallel->phs_nblocks;
+	else
+		scan->rs_nblocks = RelationGetNumberOfBlocks(scan->rs_rd);
 
 	/*
 	 * If the table is large relative to NBuffers, use a bulk-read access
@@ -481,7 +487,18 @@ heapgettup(HeapScanDesc scan,
 				tuple->t_data = NULL;
 				return;
 			}
-			page = scan->rs_startblock; /* first page */
+			if (scan->rs_parallel != NULL)
+			{
+				page = heap_parallelscan_nextpage(scan->rs_parallel);
+				if (page >= scan->rs_nblocks)
+				{
+					Assert(!BufferIsValid(scan->rs_cbuf));
+					tuple->t_data = NULL;
+					return;
+				}
+			}
+			else
+				page = scan->rs_startblock; /* first page */
 			heapgetpage(scan, page);
 			lineoff = FirstOffsetNumber;		/* first offnum */
 			scan->rs_inited = true;
@@ -504,6 +521,9 @@ heapgettup(HeapScanDesc scan,
 	}
 	else if (backward)
 	{
+		/* backward parallel scan not supported */
+		Assert(scan->rs_parallel == NULL);
+
 		if (!scan->rs_inited)
 		{
 			/*
@@ -656,11 +676,19 @@ heapgettup(HeapScanDesc scan,
 		}
 		else
 		{
-			page++;
-			if (page >= scan->rs_nblocks)
-				page = 0;
-			finished = (page == scan->rs_startblock) ||
-				(scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
+			if (scan->rs_parallel != NULL)
+			{
+				page = heap_parallelscan_nextpage(scan->rs_parallel);
+				finished = (page >= scan->rs_nblocks);
+			}
+			else
+			{
+				page++;
+				if (page >= scan->rs_nblocks)
+					page = 0;
+				finished = (page == scan->rs_startblock) ||
+					(scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
+			}
 
 			/*
 			 * Report our new scan position for synchronization purposes. We
@@ -758,7 +786,18 @@ heapgettup_pagemode(HeapScanDesc scan,
 				tuple->t_data = NULL;
 				return;
 			}
-			page = scan->rs_startblock; /* first page */
+			if (scan->rs_parallel != NULL)
+			{
+				page = heap_parallelscan_nextpage(scan->rs_parallel);
+				if (page >= scan->rs_nblocks)
+				{
+					Assert(!BufferIsValid(scan->rs_cbuf));
+					tuple->t_data = NULL;
+					return;
+				}
+			}
+			else
+				page = scan->rs_startblock; /* first page */
 			heapgetpage(scan, page);
 			lineindex = 0;
 			scan->rs_inited = true;
@@ -778,6 +817,9 @@ heapgettup_pagemode(HeapScanDesc scan,
 	}
 	else if (backward)
 	{
+		/* backward parallel scan not supported */
+		Assert(scan->rs_parallel == NULL);
+
 		if (!scan->rs_inited)
 		{
 			/*
@@ -919,11 +961,19 @@ heapgettup_pagemode(HeapScanDesc scan,
 		}
 		else
 		{
-			page++;
-			if (page >= scan->rs_nblocks)
-				page = 0;
-			finished = (page == scan->rs_startblock) ||
-				(scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
+			if (scan->rs_parallel != NULL)
+			{
+				page = heap_parallelscan_nextpage(scan->rs_parallel);
+				finished = (page >= scan->rs_nblocks);
+			}
+			else
+			{
+				page++;
+				if (page >= scan->rs_nblocks)
+					page = 0;
+				finished = (page == scan->rs_startblock) ||
+					(scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
+			}
 
 			/*
 			 * Report our new scan position for synchronization purposes. We
@@ -1316,7 +1366,7 @@ HeapScanDesc
 heap_beginscan(Relation relation, Snapshot snapshot,
 			   int nkeys, ScanKey key)
 {
-	return heap_beginscan_internal(relation, snapshot, nkeys, key,
+	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
 								   true, true, false, false);
 }
 
@@ -1326,7 +1376,7 @@ heap_beginscan_catalog(Relation relation, int nkeys, ScanKey key)
 	Oid			relid = RelationGetRelid(relation);
 	Snapshot	snapshot = RegisterSnapshot(GetCatalogSnapshot(relid));
 
-	return heap_beginscan_internal(relation, snapshot, nkeys, key,
+	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
 								   true, true, false, true);
 }
 
@@ -1335,7 +1385,7 @@ heap_beginscan_strat(Relation relation, Snapshot snapshot,
 					 int nkeys, ScanKey key,
 					 bool allow_strat, bool allow_sync)
 {
-	return heap_beginscan_internal(relation, snapshot, nkeys, key,
+	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
 								   allow_strat, allow_sync, false, false);
 }
 
@@ -1343,13 +1393,14 @@ HeapScanDesc
 heap_beginscan_bm(Relation relation, Snapshot snapshot,
 				  int nkeys, ScanKey key)
 {
-	return heap_beginscan_internal(relation, snapshot, nkeys, key,
+	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
 								   false, false, true, false);
 }
 
 static HeapScanDesc
 heap_beginscan_internal(Relation relation, Snapshot snapshot,
 						int nkeys, ScanKey key,
+						ParallelHeapScanDesc parallel_scan,
 						bool allow_strat, bool allow_sync,
 						bool is_bitmapscan, bool temp_snap)
 {
@@ -1377,6 +1428,7 @@ heap_beginscan_internal(Relation relation, Snapshot snapshot,
 	scan->rs_allow_strat = allow_strat;
 	scan->rs_allow_sync = allow_sync;
 	scan->rs_temp_snap = temp_snap;
+	scan->rs_parallel = parallel_scan;
 
 	/*
 	 * we can use page-at-a-time mode if it's an MVCC-safe snapshot
@@ -1470,6 +1522,93 @@ heap_endscan(HeapScanDesc scan)
 }
 
 /* ----------------
+ *		heap_parallelscan_estimate - estimate storage for ParallelHeapScanDesc
+ *
+ *		Sadly, this doesn't reduce to a constant, because the size required
+ *		to serialize the snapshot can vary.
+ * ----------------
+ */
+Size
+heap_parallelscan_estimate(Snapshot snapshot)
+{
+	return add_size(offsetof(ParallelHeapScanDescData, phs_snapshot_data),
+					EstimateSnapshotSpace(snapshot));
+}
+
+/* ----------------
+ *		heap_parallelscan_initialize - initialize ParallelHeapScanDesc
+ *
+ *		Must allow as many bytes of shared memory as returned by
+ *		heap_parallelscan_estimate.  Call this just once in the leader
+ *		process; then, individual workers attach via heap_beginscan_parallel.
+ * ----------------
+ */
+void
+heap_parallelscan_initialize(ParallelHeapScanDesc target, Relation relation,
+							 Snapshot snapshot)
+{
+	target->phs_relid = RelationGetRelid(relation);
+	target->phs_nblocks = RelationGetNumberOfBlocks(relation);
+	SpinLockInit(&target->phs_mutex);
+	target->phs_cblock = 0;
+	SerializeSnapshot(snapshot, target->phs_snapshot_data);
+}
+/* ----------------
+ *		heap_parallelscan_nextpage - get the next page to scan
+ *
+ *		A return value larger than the number of blocks to be scanned
+ *		indicates end of scan.  Note, however, that other backends could still
+ *		be scanning if they grabbed a page to scan and aren't done with it yet.
+ * ----------------
+ */
+static BlockNumber
+heap_parallelscan_nextpage(ParallelHeapScanDesc parallel_scan)
+{
+	BlockNumber	page = InvalidBlockNumber;
+
+	/* we treat InvalidBlockNumber specially here to avoid overflow */
+	SpinLockAcquire(&parallel_scan->phs_mutex);
+	if (parallel_scan->phs_cblock != InvalidBlockNumber)
+		page = parallel_scan->phs_cblock++;
+	SpinLockRelease(&parallel_scan->phs_mutex);
+
+	return page;
+}
+
+/* ----------------
+ *		heap_beginscan_parallel - join a parallel scan
+ *
+ *		Caller must hold a suitable lock on the correct relation.
+ * ----------------
+ */
+HeapScanDesc
+heap_beginscan_parallel(Relation relation, ParallelHeapScanDesc parallel_scan)
+{
+	Snapshot		snapshot;
+
+	Assert(RelationGetRelid(relation) == parallel_scan->phs_relid);
+	snapshot = RestoreSnapshot(parallel_scan->phs_snapshot_data);
+	RegisterSnapshot(snapshot);
+
+	return heap_beginscan_internal(relation, snapshot, 0, NULL, parallel_scan,
+								   true, true, false, true);
+}
+
+/* ----------------
+ *		heap_parallel_rescan		- restart a parallel relation scan
+ * ----------------
+ */
+void
+heap_parallel_rescan(ParallelHeapScanDesc pscan,
+					 HeapScanDesc scan)
+{
+	if (pscan != NULL)
+		scan->rs_parallel = pscan;
+
+	heap_rescan(scan,			/* scan desc */
+				NULL);			/* new scan keys */
+}
+/* ----------------
  *		heap_getnext	- retrieve next tuple in scan
  *
  *		Fix to work with index relations.
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 0d3721a..612c469 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -4393,7 +4393,7 @@ copy_dest_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
 /*
  * copy_dest_receive --- receive one tuple
  */
-static void
+static bool
 copy_dest_receive(TupleTableSlot *slot, DestReceiver *self)
 {
 	DR_copy    *myState = (DR_copy *) self;
@@ -4405,6 +4405,8 @@ copy_dest_receive(TupleTableSlot *slot, DestReceiver *self)
 	/* And send the data */
 	CopyOneRowTo(cstate, InvalidOid, slot->tts_values, slot->tts_isnull);
 	myState->processed++;
+
+	return true;
 }
 
 /*
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 54b2f38..68db546 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -62,7 +62,7 @@ typedef struct
 static ObjectAddress CreateAsReladdr = {InvalidOid, InvalidOid, 0};
 
 static void intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo);
-static void intorel_receive(TupleTableSlot *slot, DestReceiver *self);
+static bool intorel_receive(TupleTableSlot *slot, DestReceiver *self);
 static void intorel_shutdown(DestReceiver *self);
 static void intorel_destroy(DestReceiver *self);
 
@@ -482,7 +482,7 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
 /*
  * intorel_receive --- receive one tuple
  */
-static void
+static bool
 intorel_receive(TupleTableSlot *slot, DestReceiver *self)
 {
 	DR_intorel *myState = (DR_intorel *) self;
@@ -507,6 +507,8 @@ intorel_receive(TupleTableSlot *slot, DestReceiver *self)
 				myState->bistate);
 
 	/* We know this is a newly created relation, so there are no indexes */
+
+	return true;
 }
 
 /*
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 771f6a8..cdf172c 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -721,6 +721,8 @@ ExplainPreScanNode(PlanState *planstate, Bitmapset **rels_used)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
+		case T_Funnel:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -916,6 +918,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_SeqScan:
 			pname = sname = "Seq Scan";
 			break;
+		case T_PartialSeqScan:
+			pname = sname = "Partial Seq Scan";
+			break;
+		case T_Funnel:
+			pname = sname = "Funnel";
+			break;
 		case T_IndexScan:
 			pname = sname = "Index Scan";
 			break;
@@ -1065,6 +1073,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
+		case T_Funnel:
 		case T_BitmapHeapScan:
 		case T_TidScan:
 		case T_SubqueryScan:
@@ -1206,6 +1216,24 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	}
 
 	/*
+	 * Aggregate instrumentation information of all the backend
+	 * workers for parallel sequence scan.
+	 */
+	if (es->analyze && nodeTag(plan) == T_Funnel)
+	{
+		int i;
+		Instrumentation *instrument_worker;
+		int nworkers = ((FunnelState *)planstate)->pcxt->nworkers;
+		char *inst_info_workers = ((FunnelState *)planstate)->inst_options_space;
+
+		for (i = 0; i < nworkers; i++)
+		{
+			instrument_worker = (Instrumentation *)(inst_info_workers + (i * sizeof(Instrumentation)));
+			InstrAggNode(planstate->instrument, instrument_worker);
+		}
+	}
+
+	/*
 	 * We have to forcibly clean up the instrumentation state because we
 	 * haven't done ExecutorEnd yet.  This is pretty grotty ...
 	 *
@@ -1322,6 +1350,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 				show_tidbitmap_info((BitmapHeapScanState *) planstate, es);
 			break;
 		case T_SeqScan:
+		case T_PartialSeqScan:
 		case T_ValuesScan:
 		case T_CteScan:
 		case T_WorkTableScan:
@@ -1331,6 +1360,14 @@ ExplainNode(PlanState *planstate, List *ancestors,
 				show_instrumentation_count("Rows Removed by Filter", 1,
 										   planstate, es);
 			break;
+		case T_Funnel:
+			show_scan_qual(plan->qual, "Filter", planstate, ancestors, es);
+			if (plan->qual)
+				show_instrumentation_count("Rows Removed by Filter", 1,
+										   planstate, es);
+			ExplainPropertyInteger("Number of Workers",
+				((Funnel *) plan)->num_workers, es);
+			break;
 		case T_FunctionScan:
 			if (es->verbose)
 			{
@@ -2218,6 +2255,8 @@ ExplainTargetRel(Plan *plan, Index rti, ExplainState *es)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
+		case T_Funnel:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index eb16bb3..78f822b 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -56,7 +56,7 @@ typedef struct
 static int	matview_maintenance_depth = 0;
 
 static void transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo);
-static void transientrel_receive(TupleTableSlot *slot, DestReceiver *self);
+static bool transientrel_receive(TupleTableSlot *slot, DestReceiver *self);
 static void transientrel_shutdown(DestReceiver *self);
 static void transientrel_destroy(DestReceiver *self);
 static void refresh_matview_datafill(DestReceiver *dest, Query *query,
@@ -422,7 +422,7 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
 /*
  * transientrel_receive --- receive one tuple
  */
-static void
+static bool
 transientrel_receive(TupleTableSlot *slot, DestReceiver *self)
 {
 	DR_transientrel *myState = (DR_transientrel *) self;
@@ -441,6 +441,8 @@ transientrel_receive(TupleTableSlot *slot, DestReceiver *self)
 				myState->bistate);
 
 	/* We know this is a newly created relation, so there are no indexes */
+
+	return true;
 }
 
 /*
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index af707b0..991ff51 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -16,14 +16,15 @@ OBJS = execAmi.o execCurrent.o execGrouping.o execJunk.o execMain.o \
        execProcnode.o execQual.o execScan.o execTuples.o \
        execUtils.o functions.o instrument.o nodeAppend.o nodeAgg.o \
        nodeBitmapAnd.o nodeBitmapOr.o \
-       nodeBitmapHeapscan.o nodeBitmapIndexscan.o nodeCustom.o nodeHash.o \
-       nodeHashjoin.o nodeIndexscan.o nodeIndexonlyscan.o \
+       nodeBitmapHeapscan.o nodeBitmapIndexscan.o nodeCustom.o nodeFunnel.o \
+       nodeHash.o nodeHashjoin.o nodeIndexscan.o nodeIndexonlyscan.o \
        nodeLimit.o nodeLockRows.o \
        nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
        nodeNestloop.o nodeFunctionscan.o nodeRecursiveunion.o nodeResult.o \
-       nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
-       nodeValuesscan.o nodeCtescan.o nodeWorktablescan.o \
+       nodeSeqscan.o nodePartialSeqscan.o nodeSetOp.o nodeSort.o \
+       nodeUnique.o nodeValuesscan.o nodeCtescan.o nodeWorktablescan.o \
        nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
-       nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o spi.o
+       nodeForeignscan.o nodeWindowAgg.o tqueue.o tstoreReceiver.o \
+       spi.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 6ebad2f..10dc319 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -24,6 +24,7 @@
 #include "executor/nodeCustom.h"
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeFunctionscan.h"
+#include "executor/nodeFunnel.h"
 #include "executor/nodeGroup.h"
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
@@ -37,6 +38,7 @@
 #include "executor/nodeMergejoin.h"
 #include "executor/nodeModifyTable.h"
 #include "executor/nodeNestloop.h"
+#include "executor/nodePartialSeqscan.h"
 #include "executor/nodeRecursiveunion.h"
 #include "executor/nodeResult.h"
 #include "executor/nodeSeqscan.h"
@@ -155,6 +157,14 @@ ExecReScan(PlanState *node)
 			ExecReScanSeqScan((SeqScanState *) node);
 			break;
 
+		case T_PartialSeqScanState:
+			ExecReScanPartialSeqScan((PartialSeqScanState *) node);
+			break;
+
+		case T_FunnelState:
+			ExecReScanFunnel((FunnelState *) node);
+			break;
+
 		case T_IndexScanState:
 			ExecReScanIndexScan((IndexScanState *) node);
 			break;
@@ -458,6 +468,10 @@ ExecSupportsBackwardScan(Plan *node)
 		case T_CteScan:
 			return TargetListSupportsBackwardScan(node->targetlist);
 
+		case T_Funnel:
+		case T_PartialSeqScan:
+			return false;
+
 		case T_IndexScan:
 			return IndexSupportsBackwardScan(((IndexScan *) node)->indexid) &&
 				TargetListSupportsBackwardScan(node->targetlist);
diff --git a/src/backend/executor/execCurrent.c b/src/backend/executor/execCurrent.c
index d87be96..657b928 100644
--- a/src/backend/executor/execCurrent.c
+++ b/src/backend/executor/execCurrent.c
@@ -261,6 +261,8 @@ search_plan_tree(PlanState *node, Oid table_oid)
 			 * Relation scan nodes can all be treated alike
 			 */
 		case T_SeqScanState:
+		case T_PartialSeqScanState:
+		case T_FunnelState:
 		case T_IndexScanState:
 		case T_IndexOnlyScanState:
 		case T_BitmapHeapScanState:
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index dc7d506..9338591 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -318,6 +318,9 @@ standard_ExecutorRun(QueryDesc *queryDesc,
 	operation = queryDesc->operation;
 	dest = queryDesc->dest;
 
+	/* Inform the executor to collect buffer usage stats from parallel workers. */
+	estate->total_time = queryDesc->totaltime ? 1 : 0;
+
 	/*
 	 * startup tuple receiver, if we will be emitting tuples
 	 */
@@ -1550,7 +1553,15 @@ ExecutePlan(EState *estate,
 		 * practice, this is probably always the case at this point.)
 		 */
 		if (sendTuples)
-			(*dest->receiveSlot) (slot, dest);
+		{
+			/*
+			 * If we are not able to send the tuple, we assume that the
+			 * destination has closed and we won't be able to send any more
+			 * tuples, so we just end the loop.
+			 */
+			if (!((*dest->receiveSlot) (slot, dest)))
+				break;
+		}
 
 		/*
 		 * Count tuples processed, if this is a SELECT.  (For other operation
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 9892499..1a1275c 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -100,6 +100,8 @@
 #include "executor/nodeMergejoin.h"
 #include "executor/nodeModifyTable.h"
 #include "executor/nodeNestloop.h"
+#include "executor/nodePartialSeqscan.h"
+#include "executor/nodeFunnel.h"
 #include "executor/nodeRecursiveunion.h"
 #include "executor/nodeResult.h"
 #include "executor/nodeSeqscan.h"
@@ -190,6 +192,16 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												   estate, eflags);
 			break;
 
+		case T_PartialSeqScan:
+			result = (PlanState *) ExecInitPartialSeqScan((PartialSeqScan *) node,
+														  estate, eflags);
+			break;
+
+		case T_Funnel:
+			result = (PlanState *) ExecInitFunnel((Funnel *) node,
+												  estate, eflags);
+			break;
+
 		case T_IndexScan:
 			result = (PlanState *) ExecInitIndexScan((IndexScan *) node,
 													 estate, eflags);
@@ -406,6 +418,14 @@ ExecProcNode(PlanState *node)
 			result = ExecSeqScan((SeqScanState *) node);
 			break;
 
+		case T_PartialSeqScanState:
+			result = ExecPartialSeqScan((PartialSeqScanState *) node);
+			break;
+
+		case T_FunnelState:
+			result = ExecFunnel((FunnelState *) node);
+			break;
+
 		case T_IndexScanState:
 			result = ExecIndexScan((IndexScanState *) node);
 			break;
@@ -644,6 +664,14 @@ ExecEndNode(PlanState *node)
 			ExecEndSeqScan((SeqScanState *) node);
 			break;
 
+		case T_PartialSeqScanState:
+			ExecEndPartialSeqScan((PartialSeqScanState *) node);
+			break;
+
+		case T_FunnelState:
+			ExecEndFunnel((FunnelState *) node);
+			break;
+
 		case T_IndexScanState:
 			ExecEndIndexScan((IndexScanState *) node);
 			break;
diff --git a/src/backend/executor/execTuples.c b/src/backend/executor/execTuples.c
index 753754d..c874a27 100644
--- a/src/backend/executor/execTuples.c
+++ b/src/backend/executor/execTuples.c
@@ -1266,7 +1266,7 @@ do_tup_output(TupOutputState *tstate, Datum *values, bool *isnull)
 	ExecStoreVirtualTuple(slot);
 
 	/* send the tuple to the receiver */
-	(*tstate->dest->receiveSlot) (slot, tstate->dest);
+	(void) (*tstate->dest->receiveSlot) (slot, tstate->dest);
 
 	/* clean up */
 	ExecClearTuple(slot);
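Note on the receiveSlot change above: the callbacks now return bool, and ExecutePlan stops its loop as soon as a receiver returns false. A minimal standalone sketch of that convention follows, assuming nothing from PostgreSQL itself: DemoReceiver, demo_limited_receive and demo_send_all are hypothetical names used only for illustration, with a receiver that "closes" after a fixed number of items.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical receiver: accepts at most 'capacity' items, then closes. */
typedef struct DemoReceiver
{
	bool		(*receive) (int item, struct DemoReceiver *self);
	int			capacity;
	int			received;
} DemoReceiver;

/* Returns false once the destination can take no more items. */
static bool
demo_limited_receive(int item, DemoReceiver *self)
{
	(void) item;				/* the payload is irrelevant to the sketch */
	if (self->received >= self->capacity)
		return false;			/* destination has closed */
	self->received++;
	return true;
}

/*
 * Send items until they run out or the receiver reports that it is
 * closed.  Returns the number of items the receiver accepted.
 */
static size_t
demo_send_all(const int *items, size_t nitems, DemoReceiver *dest)
{
	size_t		sent = 0;
	size_t		i;

	for (i = 0; i < nitems; i++)
	{
		if (!(*dest->receive) (items[i], dest))
			break;				/* same early exit as in ExecutePlan */
		sent++;
	}
	return sent;
}
```

Receivers that can never close (as copy_dest_receive or intorel_receive do in the patch) would simply always return true, so callers that ignore the result keep their old behaviour.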
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index 0736d2a..fdb2c82 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -1512,3 +1512,28 @@ ShutdownExprContext(ExprContext *econtext, bool isCommit)
 
 	MemoryContextSwitchTo(oldcontext);
 }
+
+/*
+ * Populate the values of PARAM_EXEC parameters.
+ *
+ * This is used by worker backends to fill in the values of
+ * PARAM_EXEC parameters after fetching them from dynamic
+ * shared memory.  It needs to be called before
+ * ExecutorRun.
+ */
+void
+PopulateParamExecParams(QueryDesc *queryDesc,
+						List *serialized_param_exec_vals)
+{
+	ListCell	*lparam;
+
+	foreach(lparam, serialized_param_exec_vals)
+	{
+		SerializedParamExecData *param_val = (SerializedParamExecData *) lfirst(lparam);
+
+		queryDesc->estate->es_param_exec_vals[param_val->paramid].value =
+																param_val->value;
+		queryDesc->estate->es_param_exec_vals[param_val->paramid].isnull =
+																param_val->isnull;
+	}
+}
diff --git a/src/backend/executor/functions.c b/src/backend/executor/functions.c
index 812a610..863bd64 100644
--- a/src/backend/executor/functions.c
+++ b/src/backend/executor/functions.c
@@ -167,7 +167,7 @@ static Datum postquel_get_single_result(TupleTableSlot *slot,
 static void sql_exec_error_callback(void *arg);
 static void ShutdownSQLFunction(Datum arg);
 static void sqlfunction_startup(DestReceiver *self, int operation, TupleDesc typeinfo);
-static void sqlfunction_receive(TupleTableSlot *slot, DestReceiver *self);
+static bool sqlfunction_receive(TupleTableSlot *slot, DestReceiver *self);
 static void sqlfunction_shutdown(DestReceiver *self);
 static void sqlfunction_destroy(DestReceiver *self);
 
@@ -1903,7 +1903,7 @@ sqlfunction_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
 /*
  * sqlfunction_receive --- receive one tuple
  */
-static void
+static bool
 sqlfunction_receive(TupleTableSlot *slot, DestReceiver *self)
 {
 	DR_sqlfunction *myState = (DR_sqlfunction *) self;
@@ -1913,6 +1913,8 @@ sqlfunction_receive(TupleTableSlot *slot, DestReceiver *self)
 
 	/* Store the filtered tuple into the tuplestore */
 	tuplestore_puttupleslot(myState->tstore, slot);
+
+	return true;
 }
 
 /*
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index f5351eb..283a136 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -19,9 +19,6 @@
 
 BufferUsage pgBufferUsage;
 
-static void BufferUsageAccumDiff(BufferUsage *dst,
-					 const BufferUsage *add, const BufferUsage *sub);
-
 
 /* Allocate new instrumentation structure(s) */
 Instrumentation *
@@ -127,8 +124,30 @@ InstrEndLoop(Instrumentation *instr)
 	instr->tuplecount = 0;
 }
 
+/*
+ * Aggregate the instrumentation information of a worker backend
+ * (instr2) into that of the master backend (instr1).  Only the
+ * tuple counts and buffer usage are summed; for timing-related
+ * statistics the master backend's own information is
+ * sufficient.
+ */
+void
+InstrAggNode(Instrumentation *instr1, Instrumentation *instr2)
+{
+	/* count the returned tuples */
+	instr1->tuplecount += instr2->tuplecount;
+
+	instr1->nfiltered1 += instr2->nfiltered1;
+	instr1->nfiltered2 += instr2->nfiltered2;
+
+	/* Add the worker's buffer usage to the node's totals */
+	if (instr1->need_bufusage)
+		BufferUsageAdd(&instr1->bufusage, &instr2->bufusage);
+
+}
+
 /* dst += add - sub */
-static void
+void
 BufferUsageAccumDiff(BufferUsage *dst,
 					 const BufferUsage *add,
 					 const BufferUsage *sub)
@@ -148,3 +167,21 @@ BufferUsageAccumDiff(BufferUsage *dst,
 	INSTR_TIME_ACCUM_DIFF(dst->blk_write_time,
 						  add->blk_write_time, sub->blk_write_time);
 }
+
+/* dst += add */
+void
+BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
+{
+	dst->shared_blks_hit += add->shared_blks_hit;
+	dst->shared_blks_read += add->shared_blks_read;
+	dst->shared_blks_dirtied += add->shared_blks_dirtied;
+	dst->shared_blks_written += add->shared_blks_written;
+	dst->local_blks_hit += add->local_blks_hit;
+	dst->local_blks_read += add->local_blks_read;
+	dst->local_blks_dirtied += add->local_blks_dirtied;
+	dst->local_blks_written += add->local_blks_written;
+	dst->temp_blks_read += add->temp_blks_read;
+	dst->temp_blks_written += add->temp_blks_written;
+	INSTR_TIME_ADD(dst->blk_read_time, add->blk_read_time);
+	INSTR_TIME_ADD(dst->blk_write_time, add->blk_write_time);
+}
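Note on the aggregation above: both ExplainNode and ExecFunnel walk a raw byte area holding one fixed-size record per worker and fold each record into a running total. The sketch below shows that pattern in isolation. It is only an illustration under stated assumptions: DemoUsage is a hypothetical, cut-down stand-in for BufferUsage, and the helper names are not part of the patch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, cut-down stand-in for BufferUsage: two counters only. */
typedef struct DemoUsage
{
	long		blks_hit;
	long		blks_read;
} DemoUsage;

/* dst += add, mirroring the shape of BufferUsageAdd() in the patch. */
static void
demo_usage_add(DemoUsage *dst, const DemoUsage *add)
{
	dst->blks_hit += add->blks_hit;
	dst->blks_read += add->blks_read;
}

/*
 * Sum nworkers records laid out back to back in a raw byte area, the
 * way the patch walks buffer_usage_space.
 */
static void
demo_usage_aggregate(DemoUsage *total, const char *space, int nworkers)
{
	int			i;

	for (i = 0; i < nworkers; i++)
	{
		DemoUsage	worker;

		/* memcpy avoids assuming the byte area is suitably aligned */
		memcpy(&worker, space + (size_t) i * sizeof(DemoUsage),
			   sizeof(worker));
		demo_usage_add(total, &worker);
	}
}
```

The patch casts the offset pointer directly instead of copying; the memcpy here is a choice made for the sketch so that it does not depend on how the area was allocated.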
diff --git a/src/backend/executor/nodeFunnel.c b/src/backend/executor/nodeFunnel.c
new file mode 100644
index 0000000..a08e24c
--- /dev/null
+++ b/src/backend/executor/nodeFunnel.c
@@ -0,0 +1,394 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeFunnel.c
+ *	  Support routines for parallel sequential scans of relations.
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeFunnel.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * INTERFACE ROUTINES
+ *		ExecFunnel				scans a relation.
+ *		ExecInitFunnel			creates and initializes a funnel node.
+ *		ExecEndFunnel			releases any storage allocated.
+ *		ExecReScanFunnel		rescans a relation
+ */
+#include "postgres.h"
+
+#include "access/relscan.h"
+#include "executor/execdebug.h"
+#include "executor/nodeFunnel.h"
+#include "executor/nodeSubplan.h"
+#include "postmaster/backendworker.h"
+#include "utils/rel.h"
+
+
+static TupleTableSlot *funnel_getnext(FunnelState *funnelstate);
+
+/* ----------------------------------------------------------------
+ *						Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		InitFunnel
+ *
+ *		Set up parallel state information
+ * ----------------------------------------------------------------
+ */
+static void
+InitFunnel(FunnelState *node, EState *estate, int eflags)
+{
+	Relation	currentRelation;
+
+	/*
+	 * get the relation object id from the relid'th entry in the range table,
+	 * open that relation and acquire appropriate lock on it.
+	 */
+	currentRelation = ExecOpenScanRelation(estate,
+										   ((SeqScan *) node->ss.ps.plan)->scanrelid,
+										   eflags);
+
+	node->ss.ss_currentRelation = currentRelation;
+
+	/* and report the scan tuple slot's rowtype */
+	ExecAssignScanType(&node->ss, RelationGetDescr(currentRelation));
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitFunnel
+ * ----------------------------------------------------------------
+ */
+FunnelState *
+ExecInitFunnel(Funnel *node, EState *estate, int eflags)
+{
+	FunnelState *funnelstate;
+
+	/* Funnel node doesn't have an innerPlan node. */
+	Assert(innerPlan(node) == NULL);
+
+	/*
+	 * create state structure
+	 */
+	funnelstate = makeNode(FunnelState);
+	funnelstate->ss.ps.plan = (Plan *) node;
+	funnelstate->ss.ps.state = estate;
+	funnelstate->fs_workersReady = false;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * create expression context for node
+	 */
+	ExecAssignExprContext(estate, &funnelstate->ss.ps);
+
+	/*
+	 * initialize child expressions
+	 */
+	funnelstate->ss.ps.targetlist = (List *)
+		ExecInitExpr((Expr *) node->scan.plan.targetlist,
+					 (PlanState *) funnelstate);
+	funnelstate->ss.ps.qual = (List *)
+		ExecInitExpr((Expr *) node->scan.plan.qual,
+					 (PlanState *) funnelstate);
+
+	/*
+	 * tuple table initialization
+	 */
+	ExecInitResultTupleSlot(estate, &funnelstate->ss.ps);
+	ExecInitScanTupleSlot(estate, &funnelstate->ss);
+
+	InitFunnel(funnelstate, estate, eflags);
+
+	/*
+	 * now initialize outer plan
+	 */
+	outerPlanState(funnelstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+
+	funnelstate->ss.ps.ps_TupFromTlist = false;
+
+	/*
+	 * Initialize result tuple type and projection info.
+	 */
+	ExecAssignResultTypeFromTL(&funnelstate->ss.ps);
+	ExecAssignScanProjectionInfo(&funnelstate->ss);
+
+	return funnelstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecFunnel(node)
+ *
+ *		Scans the relation via multiple workers and returns
+ *		the next qualifying tuple.
+ * ----------------------------------------------------------------
+ */
+TupleTableSlot *
+ExecFunnel(FunnelState *node)
+{
+	int			i;
+	TupleTableSlot *slot;
+
+	/*
+	 * Initialize the parallel context and workers on first execution.
+	 * We do this on first execution rather than during node initialization
+	 * because it needs to allocate a large dynamic shared memory segment,
+	 * so it is better to do it only if it is really needed.
+	 */
+	if (!node->pcxt)
+	{
+		EState		*estate = node->ss.ps.state;
+		ExprContext *econtext = node->ss.ps.ps_ExprContext;
+		bool		any_worker_launched = false;
+		List		*serialized_param_exec;
+
+		/*
+		 * Evaluate the InitPlans and pass the PARAM_EXEC params, so that
+		 * the values can be shared with the worker backends.  This differs
+		 * from the lazy evaluation of InitPlans used elsewhere: instead
+		 * of sharing the InitPlans with all the workers and letting them
+		 * execute them, we pass the values, which the worker backends
+		 * can use directly.
+		 */
+		serialized_param_exec = ExecAndFormSerializeParamExec(econtext,
+											node->ss.ps.plan->lefttree->allParam);
+
+		/* Initialize the workers required to execute funnel node. */
+		InitializeParallelWorkers(node->ss.ps.plan->lefttree,
+								  serialized_param_exec,
+								  estate,
+								  node->ss.ss_currentRelation,
+								  &node->inst_options_space,
+								  &node->buffer_usage_space,
+								  &node->responseq,
+								  &node->pcxt,
+								  ((Funnel *)(node->ss.ps.plan))->num_workers);
+
+		outerPlanState(node)->toc = node->pcxt->toc;
+
+		/*
+		 * Register backend workers.  If the required number of workers is
+		 * not available, we perform the scan with the workers that are
+		 * available; if no workers are available at all, the funnel node
+		 * just scans locally.
+		 */
+		LaunchParallelWorkers(node->pcxt);
+
+		node->funnel = CreateTupleQueueFunnel();
+
+		for (i = 0; i < node->pcxt->nworkers; ++i)
+		{
+			if (node->pcxt->worker[i].bgwhandle)
+			{
+				shm_mq_set_handle((node->responseq)[i], node->pcxt->worker[i].bgwhandle);
+				RegisterTupleQueueOnFunnel(node->funnel, (node->responseq)[i]);
+				any_worker_launched = true;
+			}
+		}
+
+		if (any_worker_launched)
+			node->fs_workersReady = true;
+	}
+
+	slot = funnel_getnext(node);
+
+	/*
+	 * If required by a plugin, aggregate the buffer usage stats
+	 * from all workers.
+	 */
+	if (TupIsNull(slot))
+	{
+		int i;
+		int nworkers;
+		BufferUsage *buffer_usage_worker;
+		char *buffer_usage;
+
+		if (node->ss.ps.state->total_time)
+		{
+			nworkers = node->pcxt->nworkers;
+			buffer_usage = node->buffer_usage_space;
+
+			for (i = 0; i < nworkers; i++)
+			{
+				buffer_usage_worker = (BufferUsage *)(buffer_usage + (i * sizeof(BufferUsage)));
+				BufferUsageAdd(&pgBufferUsage, buffer_usage_worker);
+			}
+		}
+
+		/*
+		 * Destroy the parallel context once we have fetched all the
+		 * tuples.  This ensures that if the same statement needs
+		 * Funnel nodes for multiple parts of the statement, it
+		 * won't accumulate a lot of DSM segments, and the workers
+		 * become available to other parts of the statement.
+		 */
+		if (node->pcxt)
+		{
+			/*
+			 * Ensure all workers have finished before destroying the parallel
+			 * context to ensure a clean exit.
+			 */
+			if (node->fs_workersReady)
+				WaitForParallelWorkersToFinish(node->pcxt);
+
+			/* destroy the tuple queue */
+			DestroyTupleQueueFunnel(node->funnel);
+			node->funnel = NULL;
+
+			/* destroy parallel context. */
+			DestroyParallelContext(node->pcxt);
+			node->pcxt = NULL;
+
+			node->fs_workersReady = false;
+			node->all_workers_done = false;
+			node->local_scan_done = false;
+		}
+	}
+	return slot;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndFunnel
+ *
+ *		frees any storage allocated through C routines.
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndFunnel(FunnelState *node)
+{
+	Relation	relation;
+
+	relation = node->ss.ss_currentRelation;
+
+	/*
+	 * Free the exprcontext
+	 */
+	ExecFreeExprContext(&node->ss.ps);
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+
+	/*
+	 * close the heap relation.
+	 */
+	ExecCloseScanRelation(relation);
+
+	ExecEndNode(outerPlanState(node));
+}
+
+/*
+ * funnel_getnext
+ *
+ *	Get the next tuple from the shared memory queues.  This function
+ *	is responsible for fetching tuples from all the queues associated
+ *	with the worker backends used in the funnel scan; if no data is
+ *	available from the queues or no worker is available, it fetches
+ *	the data from the local node.
+ */
+static TupleTableSlot *
+funnel_getnext(FunnelState *funnelstate)
+{
+	PlanState		*outerPlan;
+	TupleTableSlot	*outerTupleSlot;
+	TupleTableSlot	*slot;
+	HeapTuple		tup;
+
+	if (funnelstate->ss.ps.ps_ProjInfo)
+		slot = funnelstate->ss.ps.ps_ProjInfo->pi_slot;
+	else
+		slot = funnelstate->ss.ss_ScanTupleSlot;
+
+	while ((!funnelstate->all_workers_done  && funnelstate->fs_workersReady) ||
+			!funnelstate->local_scan_done)
+	{
+		if (!funnelstate->all_workers_done && funnelstate->fs_workersReady)
+		{
+			/* wait only if local scan is done */
+			tup = TupleQueueFunnelNext(funnelstate->funnel,
+									   !funnelstate->local_scan_done,
+									   &funnelstate->all_workers_done);
+
+			if (HeapTupleIsValid(tup))
+			{
+				ExecStoreTuple(tup,		/* tuple to store */
+							   slot,	/* slot to store in */
+							   InvalidBuffer, /* buffer associated with this
+											   * tuple */
+							   true);	/* pfree this pointer if not from heap */
+
+				return slot;
+			}
+		}
+		if (!funnelstate->local_scan_done)
+		{
+			outerPlan = outerPlanState(funnelstate);
+
+			outerTupleSlot = ExecProcNode(outerPlan);
+
+			if (!TupIsNull(outerTupleSlot))
+				return outerTupleSlot;
+
+			funnelstate->local_scan_done = true;
+		}
+	}
+
+	return ExecClearTuple(slot);
+}
+
+/* ----------------------------------------------------------------
+ *						Join Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecReScanFunnel
+ *
+ *		Rescans a relation.
+ * ----------------------------------------------------------------
+ */
+void
+ExecReScanFunnel(FunnelState *node)
+{
+	/*
+	 * Re-initialize the parallel context and workers to perform
+	 * rescan of relation.
+	 */
+	if (node->pcxt)
+	{
+		/*
+		 * Shut down all the workers gracefully, so that they are able to
+		 * propagate any error or other information to the master backend
+		 * before exiting.
+		 */
+		if (node->fs_workersReady)
+		{
+			TupleQueueFunnelShutdown(node->funnel);
+			WaitForParallelWorkersToFinish(node->pcxt);
+		}
+
+		/* destroy the tuple queue */
+		DestroyTupleQueueFunnel(node->funnel);
+		node->funnel = NULL;
+
+		/* destroy parallel context. */
+		DestroyParallelContext(node->pcxt);
+		node->pcxt = NULL;
+
+		node->fs_workersReady = false;
+		node->all_workers_done = false;
+		node->local_scan_done = false;
+	}
+
+	ExecReScan(node->ss.ps.lefttree);
+}
+
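Note on funnel_getnext above: the loop keeps going while either the worker queues or the local scan can still produce tuples, tries the queues first, and falls back to the local scan. The standalone sketch below reproduces only that control flow. It is a simplification under stated assumptions: DemoSource and DemoFunnel are hypothetical types, integers stand in for tuples, and an empty worker source is treated as "all workers done" (the real TupleQueueFunnelNext can also return nothing without being done when asked not to wait).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical source of integers standing in for a tuple stream. */
typedef struct DemoSource
{
	const int  *items;
	size_t		nitems;
	size_t		next;
} DemoSource;

typedef struct DemoFunnel
{
	DemoSource	workers;		/* stands in for the tuple queues */
	DemoSource	local;			/* stands in for the outer plan */
	bool		workers_ready;
	bool		all_workers_done;
	bool		local_scan_done;
} DemoFunnel;

/* Pop the next item; returns false when the source is exhausted. */
static bool
demo_source_next(DemoSource *src, int *out)
{
	if (src->next >= src->nitems)
		return false;
	*out = src->items[src->next++];
	return true;
}

/* Returns true and sets *out while data remains, false at end of scan. */
static bool
demo_funnel_getnext(DemoFunnel *f, int *out)
{
	while ((!f->all_workers_done && f->workers_ready) ||
		   !f->local_scan_done)
	{
		if (!f->all_workers_done && f->workers_ready)
		{
			if (demo_source_next(&f->workers, out))
				return true;
			/* sketch assumption: an empty worker source means "done" */
			f->all_workers_done = true;
		}
		if (!f->local_scan_done)
		{
			if (demo_source_next(&f->local, out))
				return true;
			f->local_scan_done = true;
		}
	}
	return false;
}
```

With workers_ready left false, the same function degenerates to a purely local scan, which matches the fallback the patch describes for the case where no worker could be launched.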
diff --git a/src/backend/executor/nodeNestloop.c b/src/backend/executor/nodeNestloop.c
index e66bcda..c447062 100644
--- a/src/backend/executor/nodeNestloop.c
+++ b/src/backend/executor/nodeNestloop.c
@@ -144,6 +144,7 @@ ExecNestLoop(NestLoopState *node)
 			{
 				NestLoopParam *nlp = (NestLoopParam *) lfirst(lc);
 				int			paramno = nlp->paramno;
+				TupleDesc	tdesc = outerTupleSlot->tts_tupleDescriptor;
 				ParamExecData *prm;
 
 				prm = &(econtext->ecxt_param_exec_vals[paramno]);
@@ -154,6 +155,7 @@ ExecNestLoop(NestLoopState *node)
 				prm->value = slot_getattr(outerTupleSlot,
 										  nlp->paramval->varattno,
 										  &(prm->isnull));
+				prm->ptype = tdesc->attrs[nlp->paramval->varattno-1]->atttypid;
 				/* Flag parameter value as changed */
 				innerPlan->chgParam = bms_add_member(innerPlan->chgParam,
 													 paramno);
diff --git a/src/backend/executor/nodePartialSeqscan.c b/src/backend/executor/nodePartialSeqscan.c
new file mode 100644
index 0000000..730ee2e
--- /dev/null
+++ b/src/backend/executor/nodePartialSeqscan.c
@@ -0,0 +1,308 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodePartialSeqscan.c
+ *	  Support routines for parallel sequential scans of relations.
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodePartialSeqscan.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * INTERFACE ROUTINES
+ *		ExecPartialSeqScan				scans a relation.
+ *		PartialSeqNext					retrieve next tuple from the heap.
+ *		ExecInitPartialSeqScan			creates and initializes a partial seqscan node.
+ *		ExecEndPartialSeqScan			releases any storage allocated.
+ */
+#include "postgres.h"
+
+#include "access/relscan.h"
+#include "executor/execdebug.h"
+#include "executor/nodePartialSeqscan.h"
+#include "postmaster/backendworker.h"
+#include "utils/rel.h"
+
+
+
+/* ----------------------------------------------------------------
+ *						Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		PartialSeqNext
+ *
+ *		This is a workhorse for ExecPartialSeqScan
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+PartialSeqNext(PartialSeqScanState *node)
+{
+	HeapTuple	tuple;
+	HeapScanDesc scandesc;
+	EState	   *estate;
+	ScanDirection direction;
+	TupleTableSlot *slot;
+
+	/*
+	 * get information from the estate and scan state
+	 */
+	scandesc = node->ss.ss_currentScanDesc;
+	estate = node->ss.ps.state;
+	direction = estate->es_direction;
+	slot = node->ss.ss_ScanTupleSlot;
+
+	/*
+	 * get the next tuple from the table
+	 */
+	tuple = heap_getnext(scandesc, direction);
+
+	/*
+	 * save the tuple and the buffer returned to us by the access methods in
+	 * our scan tuple slot and return the slot.  Note: we pass 'false' because
+	 * tuples returned by heap_getnext() are pointers onto disk pages and were
+	 * not created with palloc() and so should not be pfree()'d.  Note also
+	 * that ExecStoreTuple will increment the refcount of the buffer; the
+	 * refcount will not be dropped until the tuple table slot is cleared.
+	 */
+	if (tuple)
+		ExecStoreTuple(tuple,	/* tuple to store */
+					   slot,	/* slot to store in */
+					   scandesc->rs_cbuf,		/* buffer associated with this
+												 * tuple */
+					   false);	/* don't pfree this pointer */
+	else
+		ExecClearTuple(slot);
+
+	return slot;
+}
+
+/*
+ * PartialSeqRecheck -- access method routine to recheck a tuple in EvalPlanQual
+ */
+static bool
+PartialSeqRecheck(PartialSeqScanState *node, TupleTableSlot *slot)
+{
+	/*
+	 * Note that unlike IndexScan, PartialSeqScan never uses keys in
+	 * heap_beginscan (and this is very bad), so there are no keys
+	 * to recheck here.
+	 */
+	return true;
+}
+
+/* ----------------------------------------------------------------
+ *		InitPartialScanRelation
+ *
+ *		Set up to access the scan relation.
+ * ----------------------------------------------------------------
+ */
+static void
+InitPartialScanRelation(PartialSeqScanState *node, EState *estate, int eflags)
+{
+	Relation	currentRelation;
+	shm_toc		*toc;
+
+	/*
+	 * get the relation object id from the relid'th entry in the range table,
+	 * open that relation and acquire appropriate lock on it.
+	 */
+	currentRelation = ExecOpenScanRelation(estate,
+										   ((Scan *) node->ss.ps.plan)->scanrelid,
+										   eflags);
+
+	/*
+	 * The parallel scan descriptor is initialized and stored in a dynamic
+	 * shared memory segment by the master backend, and parallel workers
+	 * retrieve it from shared memory.  For parallel workers, we set 'toc'
+	 * (the place to look up the parallel scan descriptor) as retrieved by
+	 * attaching to the DSM segment, whereas the master backend stores it
+	 * directly in the partial scan state node after initializing workers.
+	 */
+	toc = GetParallelShmToc();
+	if (toc)
+		node->ss.ps.toc = toc;
+
+	node->ss.ss_currentRelation = currentRelation;
+
+	/* and report the scan tuple slot's rowtype */
+	ExecAssignScanType(&node->ss, RelationGetDescr(currentRelation));
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitPartialSeqScan
+ * ----------------------------------------------------------------
+ */
+PartialSeqScanState *
+ExecInitPartialSeqScan(PartialSeqScan *node, EState *estate, int eflags)
+{
+	PartialSeqScanState *scanstate;
+
+	/*
+	 * Once upon a time it was possible to have an outerPlan of a SeqScan, but
+	 * not any more.
+	 */
+	Assert(outerPlan(node) == NULL);
+	Assert(innerPlan(node) == NULL);
+
+	/*
+	 * create state structure
+	 */
+	scanstate = makeNode(PartialSeqScanState);
+	scanstate->ss.ps.plan = (Plan *) node;
+	scanstate->ss.ps.state = estate;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * create expression context for node
+	 */
+	ExecAssignExprContext(estate, &scanstate->ss.ps);
+
+	/*
+	 * initialize child expressions
+	 */
+	scanstate->ss.ps.targetlist = (List *)
+		ExecInitExpr((Expr *) node->plan.targetlist,
+					 (PlanState *) scanstate);
+	scanstate->ss.ps.qual = (List *)
+		ExecInitExpr((Expr *) node->plan.qual,
+					 (PlanState *) scanstate);
+
+	/*
+	 * tuple table initialization
+	 */
+	ExecInitResultTupleSlot(estate, &scanstate->ss.ps);
+	ExecInitScanTupleSlot(estate, &scanstate->ss);
+
+	/*
+	 * initialize scan relation
+	 */
+	InitPartialScanRelation(scanstate, estate, eflags);
+
+	scanstate->ss.ps.ps_TupFromTlist = false;
+
+	/*
+	 * Initialize result tuple type and projection info.
+	 */
+	ExecAssignResultTypeFromTL(&scanstate->ss.ps);
+	ExecAssignScanProjectionInfo(&scanstate->ss);
+
+	return scanstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecPartialSeqScan(node)
+ *
+ *		Scans the relation and returns the next qualifying tuple.
+ *		We call the ExecScan() routine and pass it the appropriate
+ *		access method functions.
+ * ----------------------------------------------------------------
+ */
+TupleTableSlot *
+ExecPartialSeqScan(PartialSeqScanState *node)
+{
+	/*
+	 * Initialize the scan on first execution.  Normally this is done in
+	 * the ExecutorStart phase, but this node needs a ParallelHeapScanDesc
+	 * to initialize the scan, and the Funnel node sets that up only
+	 * during the ExecutorRun phase.
+	 */
+	if (!node->scan_initialized)
+	{
+		ParallelHeapScanDesc pscan;
+
+		/*
+		 * The parallel scan descriptor is initialized and stored in a
+		 * dynamic shared memory segment by the master backend; parallel
+		 * workers and the master backend's local scan both retrieve it from
+		 * there.  If a heap scan descriptor already exists at this point,
+		 * this is a rescan and we re-initialize it instead of creating one.
+		 */
+		Assert(node->ss.ps.toc);
+
+		pscan = shm_toc_lookup(node->ss.ps.toc, PARALLEL_KEY_SCAN);
+
+		if (!node->ss.ss_currentScanDesc)
+		{
+			node->ss.ss_currentScanDesc =
+				heap_beginscan_parallel(node->ss.ss_currentRelation, pscan);
+		}
+		else
+		{
+			heap_parallel_rescan(pscan, node->ss.ss_currentScanDesc);
+		}
+
+		node->scan_initialized = true;
+	}
+
+	return ExecScan((ScanState *) node,
+					(ExecScanAccessMtd) PartialSeqNext,
+					(ExecScanRecheckMtd) PartialSeqRecheck);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndPartialSeqScan
+ *
+ *		frees any storage allocated through C routines.
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndPartialSeqScan(PartialSeqScanState *node)
+{
+	Relation	relation;
+	HeapScanDesc scanDesc;
+
+	/*
+	 * get information from node
+	 */
+	relation = node->ss.ss_currentRelation;
+	scanDesc = node->ss.ss_currentScanDesc;
+
+	/*
+	 * Free the exprcontext
+	 */
+	ExecFreeExprContext(&node->ss.ps);
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+
+	/*
+	 * close heap scan
+	 */
+	if (scanDesc)
+		heap_endscan(scanDesc);
+
+	/*
+	 * close the heap relation.
+	 */
+	ExecCloseScanRelation(relation);
+}
+
+/* ----------------------------------------------------------------
+ *						Join Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecReScanPartialSeqScan
+ *
+ *		Rescans the relation.
+ * ----------------------------------------------------------------
+ */
+void
+ExecReScanPartialSeqScan(PartialSeqScanState *node)
+{
+	if (node->scan_initialized)
+		node->scan_initialized = false;
+
+	ExecScanReScan((ScanState *) node);
+}
diff --git a/src/backend/executor/nodeResult.c b/src/backend/executor/nodeResult.c
index 8d3dde0..b348bfd 100644
--- a/src/backend/executor/nodeResult.c
+++ b/src/backend/executor/nodeResult.c
@@ -75,6 +75,13 @@ ExecResult(ResultState *node)
 	econtext = node->ps.ps_ExprContext;
 
 	/*
+	 * A Result node can be added as a gating node on top of a PartialSeqScan
+	 * node, so we need to pass the toc down to the outer node.
+	 */
+	if (node->ps.toc)
+		outerPlanState(node)->toc = node->ps.toc;
+
+	/*
 	 * check constant qualifications like (2 > 1), if not already done
 	 */
 	if (node->rs_checkqual)
diff --git a/src/backend/executor/nodeSubplan.c b/src/backend/executor/nodeSubplan.c
index 9eb4d63..6afd55a 100644
--- a/src/backend/executor/nodeSubplan.c
+++ b/src/backend/executor/nodeSubplan.c
@@ -30,11 +30,14 @@
 #include <math.h>
 
 #include "access/htup_details.h"
+#include "catalog/pg_type.h"
 #include "executor/executor.h"
 #include "executor/nodeSubplan.h"
 #include "nodes/makefuncs.h"
+#include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
 #include "utils/array.h"
+#include "utils/datum.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 
@@ -281,12 +284,14 @@ ExecScanSubPlan(SubPlanState *node,
 	forboth(l, subplan->parParam, pvar, node->args)
 	{
 		int			paramid = lfirst_int(l);
+		ExprState	*exprstate = (ExprState *) lfirst(pvar);
 		ParamExecData *prm = &(econtext->ecxt_param_exec_vals[paramid]);
 
-		prm->value = ExecEvalExprSwitchContext((ExprState *) lfirst(pvar),
+		prm->value = ExecEvalExprSwitchContext(exprstate,
 											   econtext,
 											   &(prm->isnull),
 											   NULL);
+		prm->ptype = exprType((Node *) exprstate->expr);
 		planstate->chgParam = bms_add_member(planstate->chgParam, paramid);
 	}
 
@@ -399,6 +404,7 @@ ExecScanSubPlan(SubPlanState *node,
 			prmdata = &(econtext->ecxt_param_exec_vals[paramid]);
 			Assert(prmdata->execPlan == NULL);
 			prmdata->value = slot_getattr(slot, col, &(prmdata->isnull));
+			prmdata->ptype = tdesc->attrs[col-1]->atttypid;
 			col++;
 		}
 
@@ -551,6 +557,7 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
 		 !TupIsNull(slot);
 		 slot = ExecProcNode(planstate))
 	{
+		TupleDesc	tdesc = slot->tts_tupleDescriptor;
 		int			col = 1;
 		ListCell   *plst;
 		bool		isnew;
@@ -568,6 +575,7 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
 			Assert(prmdata->execPlan == NULL);
 			prmdata->value = slot_getattr(slot, col,
 										  &(prmdata->isnull));
+			prmdata->ptype = tdesc->attrs[col-1]->atttypid;
 			col++;
 		}
 		slot = ExecProject(node->projRight, NULL);
@@ -954,6 +962,7 @@ ExecSetParamPlan(SubPlanState *node, ExprContext *econtext)
 	ListCell   *l;
 	bool		found = false;
 	ArrayBuildStateAny *astate = NULL;
+	Oid			ptype;
 
 	if (subLinkType == ANY_SUBLINK ||
 		subLinkType == ALL_SUBLINK)
@@ -961,6 +970,8 @@ ExecSetParamPlan(SubPlanState *node, ExprContext *econtext)
 	if (subLinkType == CTE_SUBLINK)
 		elog(ERROR, "CTE subplans should not be executed via ExecSetParamPlan");
 
+	ptype = exprType((Node *) node->xprstate.expr);
+
 	/* Initialize ArrayBuildStateAny in caller's context, if needed */
 	if (subLinkType == ARRAY_SUBLINK)
 		astate = initArrayResultAny(subplan->firstColType,
@@ -983,12 +994,14 @@ ExecSetParamPlan(SubPlanState *node, ExprContext *econtext)
 	forboth(l, subplan->parParam, pvar, node->args)
 	{
 		int			paramid = lfirst_int(l);
+		ExprState	*exprstate = (ExprState *) lfirst(pvar);
 		ParamExecData *prm = &(econtext->ecxt_param_exec_vals[paramid]);
 
-		prm->value = ExecEvalExprSwitchContext((ExprState *) lfirst(pvar),
+		prm->value = ExecEvalExprSwitchContext(exprstate,
 											   econtext,
 											   &(prm->isnull),
 											   NULL);
+		prm->ptype = exprType((Node *) exprstate->expr);
 		planstate->chgParam = bms_add_member(planstate->chgParam, paramid);
 	}
 
@@ -1011,6 +1024,7 @@ ExecSetParamPlan(SubPlanState *node, ExprContext *econtext)
 
 			prm->execPlan = NULL;
 			prm->value = BoolGetDatum(true);
+			prm->ptype = ptype;
 			prm->isnull = false;
 			found = true;
 			break;
@@ -1062,6 +1076,7 @@ ExecSetParamPlan(SubPlanState *node, ExprContext *econtext)
 			prm->execPlan = NULL;
 			prm->value = heap_getattr(node->curTuple, i, tdesc,
 									  &(prm->isnull));
+			prm->ptype = tdesc->attrs[i-1]->atttypid;
 			i++;
 		}
 	}
@@ -1084,6 +1099,7 @@ ExecSetParamPlan(SubPlanState *node, ExprContext *econtext)
 											true);
 		prm->execPlan = NULL;
 		prm->value = node->curArray;
+		prm->ptype = ptype;
 		prm->isnull = false;
 	}
 	else if (!found)
@@ -1096,6 +1112,7 @@ ExecSetParamPlan(SubPlanState *node, ExprContext *econtext)
 
 			prm->execPlan = NULL;
 			prm->value = BoolGetDatum(false);
+			prm->ptype = ptype;
 			prm->isnull = false;
 		}
 		else
@@ -1108,6 +1125,7 @@ ExecSetParamPlan(SubPlanState *node, ExprContext *econtext)
 
 				prm->execPlan = NULL;
 				prm->value = (Datum) 0;
+				prm->ptype = VOIDOID;
 				prm->isnull = true;
 			}
 		}
@@ -1238,3 +1256,47 @@ ExecAlternativeSubPlan(AlternativeSubPlanState *node,
 					   isNull,
 					   isDone);
 }
+
+/*
+ * ExecAndFormSerializeParamExec
+ *
+ * Execute the subplan behind each PARAM_EXEC param that has not been
+ * evaluated yet, and build the list of serialized structures to be passed
+ * to the worker backends.
+ */
+List *
+ExecAndFormSerializeParamExec(ExprContext *econtext, Bitmapset *params)
+{
+	List	*lparam = NIL;
+	SerializedParamExecData *sparamdata;
+	ParamExecData *prm;
+	int		paramid;
+
+	paramid = -1;
+	while ((paramid = bms_next_member(params, paramid)) >= 0)
+	{
+		/*
+		 * PARAM_EXEC params (internal executor parameters) are stored in the
+		 * ecxt_param_exec_vals array, and can be accessed by array index.
+		 */
+		sparamdata = palloc0(sizeof(SerializedParamExecData));
+
+		prm = &(econtext->ecxt_param_exec_vals[paramid]);
+		if (prm->execPlan != NULL)
+		{
+			/* Parameter not evaluated yet, so go do it */
+			ExecSetParamPlan(prm->execPlan, econtext);
+			/* ExecSetParamPlan should have processed this param... */
+			Assert(prm->execPlan == NULL);
+		}
+
+		sparamdata->paramid = paramid;
+		sparamdata->ptype = prm->ptype;
+		sparamdata->value = prm->value;
+		sparamdata->isnull = prm->isnull;
+
+		lparam = lappend(lparam, sparamdata);
+	}
+
+	return lparam;
+}
diff --git a/src/backend/executor/spi.c b/src/backend/executor/spi.c
index 557d153..60cfab4 100644
--- a/src/backend/executor/spi.c
+++ b/src/backend/executor/spi.c
@@ -1753,7 +1753,7 @@ spi_dest_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
  *		store tuple retrieved by Executor into SPITupleTable
  *		of current SPI procedure
  */
-void
+bool
 spi_printtup(TupleTableSlot *slot, DestReceiver *self)
 {
 	SPITupleTable *tuptable;
@@ -1787,6 +1787,8 @@ spi_printtup(TupleTableSlot *slot, DestReceiver *self)
 	(tuptable->free)--;
 
 	MemoryContextSwitchTo(oldcxt);
+
+	return true;
 }
 
 /*
diff --git a/src/backend/executor/tqueue.c b/src/backend/executor/tqueue.c
new file mode 100644
index 0000000..2be48f4
--- /dev/null
+++ b/src/backend/executor/tqueue.c
@@ -0,0 +1,304 @@
+/*-------------------------------------------------------------------------
+ *
+ * tqueue.c
+ *	  Use shm_mq to send & receive tuples between parallel backends
+ *
+ * A DestReceiver of type DestTupleQueue, which is a TQueueDestReceiver
+ * under the hood, writes tuples from the executor to a shm_mq.
+ *
+ * A TupleQueueFunnel helps manage the process of reading tuples from
+ * one or more shm_mq objects being used as tuple queues.
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/tqueue.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/tqueue.h"
+#include "miscadmin.h"
+
+typedef struct
+{
+	DestReceiver pub;
+	shm_mq_handle *handle;
+} TQueueDestReceiver;
+
+struct TupleQueueFunnel
+{
+	int		nqueues;
+	int		maxqueues;
+	int		nextqueue;
+	shm_mq_handle **queue;
+};
+
+/*
+ * Send a tuple to the queue; returns false if the receiver has detached.
+ */
+static bool
+tqueueReceiveSlot(TupleTableSlot *slot, DestReceiver *self)
+{
+	TQueueDestReceiver *tqueue = (TQueueDestReceiver *) self;
+	HeapTuple	tuple;
+	shm_mq_result	result;
+
+	tuple = ExecMaterializeSlot(slot);
+	result = shm_mq_send(tqueue->handle, tuple->t_len, tuple->t_data, false);
+
+	if (result == SHM_MQ_DETACHED)
+		return false;
+	else if (result != SHM_MQ_SUCCESS)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("unable to send tuples")));
+
+	return true;
+}
+
+/*
+ * Prepare to receive tuples from executor.
+ */
+static void
+tqueueStartupReceiver(DestReceiver *self, int operation, TupleDesc typeinfo)
+{
+	/* do nothing */
+}
+
+/*
+ * Clean up at end of an executor run
+ */
+static void
+tqueueShutdownReceiver(DestReceiver *self)
+{
+	/* do nothing */
+}
+
+/*
+ * Destroy receiver when done with it
+ */
+static void
+tqueueDestroyReceiver(DestReceiver *self)
+{
+	pfree(self);
+}
+
+/*
+ * Create a DestReceiver that writes tuples to a tuple queue.
+ */
+DestReceiver *
+CreateTupleQueueDestReceiver(void)
+{
+	TQueueDestReceiver *self;
+
+	self = (TQueueDestReceiver *) palloc0(sizeof(TQueueDestReceiver));
+
+	self->pub.receiveSlot = tqueueReceiveSlot;
+	self->pub.rStartup = tqueueStartupReceiver;
+	self->pub.rShutdown = tqueueShutdownReceiver;
+	self->pub.rDestroy = tqueueDestroyReceiver;
+	self->pub.mydest = DestTupleQueue;
+
+	/* private fields will be set by SetTupleQueueDestReceiverParams */
+
+	return (DestReceiver *) self;
+}
+
+/*
+ * Set parameters for a TupleQueueDestReceiver
+ */
+void
+SetTupleQueueDestReceiverParams(DestReceiver *self,
+								shm_mq_handle *handle)
+{
+	TQueueDestReceiver *myState = (TQueueDestReceiver *) self;
+
+	myState->handle = handle;
+}
+
+/*
+ * Create a tuple queue funnel.
+ */
+TupleQueueFunnel *
+CreateTupleQueueFunnel(void)
+{
+	TupleQueueFunnel *funnel = palloc0(sizeof(TupleQueueFunnel));
+
+	funnel->maxqueues = 8;
+	funnel->queue = palloc(funnel->maxqueues * sizeof(shm_mq_handle *));
+
+	return funnel;
+}
+
+/*
+ * Detach all tuple queues that belong to funnel.
+ */
+void
+TupleQueueFunnelShutdown(TupleQueueFunnel *funnel)
+{
+	if (funnel)
+	{
+		int		i;
+		shm_mq_handle *mqh;
+		shm_mq	   *mq;
+		for (i = 0; i < funnel->nqueues; i++)
+		{
+			mqh = funnel->queue[i];
+			mq = shm_mq_from_handle(mqh);
+			shm_mq_detach(mq);
+		}
+	}
+}
+
+/*
+ * Destroy a tuple queue funnel.
+ */
+void
+DestroyTupleQueueFunnel(TupleQueueFunnel *funnel)
+{
+	if (funnel)
+	{
+		pfree(funnel->queue);
+		pfree(funnel);
+	}
+}
+
+/*
+ * Remember the shared memory queue handle in funnel.
+ */
+void
+RegisterTupleQueueOnFunnel(TupleQueueFunnel *funnel, shm_mq_handle *handle)
+{
+	if (funnel->nqueues < funnel->maxqueues)
+	{
+		funnel->queue[funnel->nqueues++] = handle;
+		return;
+	}
+
+	if (funnel->nqueues >= funnel->maxqueues)
+	{
+		int newsize = funnel->nqueues * 2;
+
+		Assert(funnel->nqueues == funnel->maxqueues);
+
+		funnel->queue = repalloc(funnel->queue,
+								 newsize * sizeof(shm_mq_handle *));
+		funnel->maxqueues = newsize;
+	}
+
+	funnel->queue[funnel->nqueues++] = handle;
+}
+
+/*
+ * Fetch a tuple from a tuple queue funnel.
+ *
+ * We try to read from the queues in round-robin fashion so as to avoid
+ * the situation where some workers get their tuples read promptly while
+ * others are barely ever serviced.
+ *
+ * Even when nowait = false, we read from the individual queues in
+ * non-blocking mode.  Even when shm_mq_receive() returns SHM_MQ_WOULD_BLOCK,
+ * it can still accumulate bytes from a partially-read message, so doing it
+ * this way should outperform doing a blocking read on each queue in turn.
+ *
+ * The return value is NULL if there are no remaining queues or if
+ * nowait = true and no queue returned a tuple without blocking.  *done, if
+ * not NULL, is set to true when there are no remaining queues and false in
+ * any other case.
+ */
+HeapTuple
+TupleQueueFunnelNext(TupleQueueFunnel *funnel, bool nowait, bool *done)
+{
+	int	waitpos = funnel->nextqueue;
+
+	/* Corner case: called before adding any queues, or after all are gone. */
+	if (funnel->nqueues == 0)
+	{
+		if (done != NULL)
+			*done = true;
+		return NULL;
+	}
+
+	if (done != NULL)
+		*done = false;
+
+	for (;;)
+	{
+		shm_mq_handle *mqh = funnel->queue[funnel->nextqueue];
+		shm_mq_result result;
+		Size	nbytes;
+		void   *data;
+
+		/* Attempt to read a message. */
+		result = shm_mq_receive(mqh, &nbytes, &data, true);
+
+		/*
+		 * Normally, we advance funnel->nextqueue to the next queue at this
+		 * point, but if we're pointing to a queue that we've just discovered
+		 * is detached, then forget that queue and leave the pointer where it
+		 * is, unless the number of remaining queues has fallen to or below
+		 * the pointer, in which case wrap it around to the first queue.
+		 */
+		if (result != SHM_MQ_DETACHED)
+			funnel->nextqueue = (funnel->nextqueue + 1) % funnel->nqueues;
+		else
+		{
+			--funnel->nqueues;
+			if (funnel->nqueues == 0)
+			{
+				if (done != NULL)
+					*done = true;
+				return NULL;
+			}
+
+			memmove(&funnel->queue[funnel->nextqueue],
+					&funnel->queue[funnel->nextqueue + 1],
+					sizeof(shm_mq_handle *)
+						* (funnel->nqueues - funnel->nextqueue));
+
+			if (funnel->nextqueue >= funnel->nqueues)
+				funnel->nextqueue = 0;
+
+			if (funnel->nextqueue < waitpos)
+				--waitpos;
+
+			continue;
+		}
+
+		/* If we got a message, return it. */
+		if (result == SHM_MQ_SUCCESS)
+		{
+			HeapTupleData htup;
+
+			/*
+			 * The tuple data we just read from the queue is only valid
+			 * until we again attempt to read from it.  Copy the tuple into
+			 * a single palloc'd chunk as callers will expect.
+			 */
+			ItemPointerSetInvalid(&htup.t_self);
+			htup.t_tableOid = InvalidOid;
+			htup.t_len = nbytes;
+			htup.t_data = data;
+			return heap_copytuple(&htup);
+		}
+
+		/*
+		 * If we've visited all of the queues, then we should either give up
+		 * and return NULL (if we're in non-blocking mode) or wait for the
+		 * process latch to be set (otherwise).
+		 */
+		if (funnel->nextqueue == waitpos)
+		{
+			if (nowait)
+				return NULL;
+			WaitLatch(MyLatch, WL_LATCH_SET, 0);
+			CHECK_FOR_INTERRUPTS();
+			ResetLatch(MyLatch);
+		}
+	}
+}
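
The round-robin read in TupleQueueFunnelNext can be tried outside the backend with toy queues. The following is only a minimal sketch under stated assumptions: ToyQueue, ToyFunnel, toy_receive, toy_funnel_next and toy_funnel_demo are hypothetical names that do not exist in the patch, an exhausted toy queue simply reports itself as detached, and the would-block and latch-wait path of the real function is left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for shm_mq results; not the PostgreSQL API. */
typedef enum { TOY_SUCCESS, TOY_DETACHED } ToyResult;

/* A toy queue: a fixed array of ints consumed front to back. */
typedef struct
{
	const int  *items;
	int			nitems;
	int			pos;
} ToyQueue;

typedef struct
{
	int			nqueues;
	int			nextqueue;
	ToyQueue   *queue[8];
} ToyFunnel;

/* Non-blocking read: an exhausted queue reports itself as detached. */
static ToyResult
toy_receive(ToyQueue *q, int *out)
{
	if (q->pos >= q->nitems)
		return TOY_DETACHED;
	*out = q->items[q->pos++];
	return TOY_SUCCESS;
}

/* Return 1 and set *out if a value was read, 0 once every queue is gone. */
static int
toy_funnel_next(ToyFunnel *f, int *out)
{
	while (f->nqueues > 0)
	{
		ToyQueue   *q = f->queue[f->nextqueue];

		if (toy_receive(q, out) == TOY_DETACHED)
		{
			/* Forget this queue and slide the later ones down over it. */
			--f->nqueues;
			memmove(&f->queue[f->nextqueue], &f->queue[f->nextqueue + 1],
					sizeof(ToyQueue *) * (f->nqueues - f->nextqueue));
			if (f->nextqueue >= f->nqueues)
				f->nextqueue = 0;
			continue;
		}

		/* Advance so the next call starts at the following queue. */
		f->nextqueue = (f->nextqueue + 1) % f->nqueues;
		return 1;
	}
	return 0;
}

int			toy_demo_out[8];

/* Drain queues {1,2,3} and {10}; records the visit order in toy_demo_out. */
int
toy_funnel_demo(void)
{
	static const int a[] = {1, 2, 3};
	static const int b[] = {10};
	ToyQueue	qa = {a, 3, 0};
	ToyQueue	qb = {b, 1, 0};
	ToyFunnel	f = {2, 0, {&qa, &qb}};
	int			n = 0;
	int			v;

	while (n < 8 && toy_funnel_next(&f, &v))
		toy_demo_out[n++] = v;
	assert(f.nqueues == 0);
	return n;
}
```

Calling toy_funnel_demo() alternates between the two queues until the shorter one detaches, then keeps reading from the survivor. Removing the detached entry in place, as the patch does with memmove, keeps the remaining queues in their original order.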
diff --git a/src/backend/executor/tstoreReceiver.c b/src/backend/executor/tstoreReceiver.c
index c1fdeb7..b0862ae 100644
--- a/src/backend/executor/tstoreReceiver.c
+++ b/src/backend/executor/tstoreReceiver.c
@@ -37,8 +37,8 @@ typedef struct
 } TStoreState;
 
 
-static void tstoreReceiveSlot_notoast(TupleTableSlot *slot, DestReceiver *self);
-static void tstoreReceiveSlot_detoast(TupleTableSlot *slot, DestReceiver *self);
+static bool tstoreReceiveSlot_notoast(TupleTableSlot *slot, DestReceiver *self);
+static bool tstoreReceiveSlot_detoast(TupleTableSlot *slot, DestReceiver *self);
 
 
 /*
@@ -90,19 +90,21 @@ tstoreStartupReceiver(DestReceiver *self, int operation, TupleDesc typeinfo)
  * Receive a tuple from the executor and store it in the tuplestore.
  * This is for the easy case where we don't have to detoast.
  */
-static void
+static bool
 tstoreReceiveSlot_notoast(TupleTableSlot *slot, DestReceiver *self)
 {
 	TStoreState *myState = (TStoreState *) self;
 
 	tuplestore_puttupleslot(myState->tstore, slot);
+
+	return true;
 }
 
 /*
  * Receive a tuple from the executor and store it in the tuplestore.
  * This is for the case where we have to detoast any toasted values.
  */
-static void
+static bool
 tstoreReceiveSlot_detoast(TupleTableSlot *slot, DestReceiver *self)
 {
 	TStoreState *myState = (TStoreState *) self;
@@ -152,6 +154,8 @@ tstoreReceiveSlot_detoast(TupleTableSlot *slot, DestReceiver *self)
 	/* And release any temporary detoasted values */
 	for (i = 0; i < nfree; i++)
 		pfree(DatumGetPointer(myState->tofree[i]));
+
+	return true;
 }
 
 /*
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index d8c9a0e..3c0123a 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -355,6 +355,43 @@ _copySeqScan(const SeqScan *from)
 }
 
 /*
+ * _copyPartialSeqScan
+ */
+static PartialSeqScan *
+_copyPartialSeqScan(const PartialSeqScan *from)
+{
+	PartialSeqScan    *newnode = makeNode(PartialSeqScan);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopyScanFields((const Scan *) from, (Scan *) newnode);
+
+	return newnode;
+}
+
+/*
+ * _copyFunnel
+ */
+static Funnel *
+_copyFunnel(const Funnel *from)
+{
+	Funnel    *newnode = makeNode(Funnel);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopyScanFields((const Scan *) from, (Scan *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(num_workers);
+
+	return newnode;
+}
+
+/*
  * _copyIndexScan
  */
 static IndexScan *
@@ -4049,6 +4086,12 @@ copyObject(const void *from)
 		case T_SeqScan:
 			retval = _copySeqScan(from);
 			break;
+		case T_PartialSeqScan:
+			retval = _copyPartialSeqScan(from);
+			break;
+		case T_Funnel:
+			retval = _copyFunnel(from);
+			break;
 		case T_IndexScan:
 			retval = _copyIndexScan(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 1aa1f55..05d4b3c 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -440,6 +440,24 @@ _outSeqScan(StringInfo str, const SeqScan *node)
 }
 
 static void
+_outPartialSeqScan(StringInfo str, const PartialSeqScan *node)
+{
+	WRITE_NODE_TYPE("PARTIALSEQSCAN");
+
+	_outScanInfo(str, (const Scan *) node);
+}
+
+static void
+_outFunnel(StringInfo str, const Funnel *node)
+{
+	WRITE_NODE_TYPE("FUNNEL");
+
+	_outScanInfo(str, (const Scan *) node);
+
+	WRITE_UINT_FIELD(num_workers);
+}
+
+static void
 _outIndexScan(StringInfo str, const IndexScan *node)
 {
 	WRITE_NODE_TYPE("INDEXSCAN");
@@ -2898,6 +2916,12 @@ _outNode(StringInfo str, const void *obj)
 			case T_SeqScan:
 				_outSeqScan(str, obj);
 				break;
+			case T_PartialSeqScan:
+				_outPartialSeqScan(str, obj);
+				break;
+			case T_Funnel:
+				_outFunnel(str, obj);
+				break;
 			case T_IndexScan:
 				_outIndexScan(str, obj);
 				break;
diff --git a/src/backend/nodes/params.c b/src/backend/nodes/params.c
index fb803f8..0050195 100644
--- a/src/backend/nodes/params.c
+++ b/src/backend/nodes/params.c
@@ -16,9 +16,22 @@
 #include "postgres.h"
 
 #include "nodes/params.h"
+#include "storage/shmem.h"
 #include "utils/datum.h"
 #include "utils/lsyscache.h"
 
+/*
+ * For each bind parameter, we pass this structure, followed by the
+ * parameter's value unless it is NULL or pass-by-value.
+ */
+typedef struct SerializedParamExternData
+{
+	Datum		value;			/* pass-by-value datums are stored directly */
+	Size		length;			/* length of parameter value */
+	bool		isnull;			/* is it NULL? */
+	uint16		pflags;			/* flag bits, same as in original Param */
+	Oid			ptype;			/* parameter's datatype, or 0 */
+} SerializedParamExternData;
 
 /*
  * Copy a ParamListInfo structure.
@@ -73,3 +86,355 @@ copyParamList(ParamListInfo from)
 
 	return retval;
 }
+
+/*
+ * Estimate the amount of space required to serialize the bound
+ * parameters.
+ */
+Size
+EstimateBoundParametersSpace(ParamListInfo paramInfo)
+{
+	Size		size;
+	int			i;
+
+	/* Add space required for saving numParams */
+	size = sizeof(int);
+
+	if (paramInfo)
+	{
+		/* Add space required for saving the param data */
+		for (i = 0; i < paramInfo->numParams; i++)
+		{
+			/*
+			 * for each parameter, calculate the size of fixed part
+			 * of parameter (SerializedParamExternData) and length of
+			 * parameter value.
+			 */
+			ParamExternData *oprm;
+			int16		typLen;
+			bool		typByVal;
+			Size		length;
+
+			length = sizeof(SerializedParamExternData);
+
+			oprm = &paramInfo->params[i];
+
+			get_typlenbyval(oprm->ptype, &typLen, &typByVal);
+
+			/*
+			 * pass-by-value parameters are directly stored in
+			 * SerializedParamExternData, so no need of additional
+			 * space for them.
+			 */
+			if (!(typByVal || oprm->isnull))
+			{
+				length += datumGetSize(oprm->value, typByVal, typLen);
+				size = add_size(size, length);
+
+				/* Allow space for terminating zero-byte */
+				size = add_size(size, 1);
+			}
+			else
+				size = add_size(size, length);
+		}
+	}
+
+	return size;
+}
+
+/*
+ * Serialize the bind parameters into memory, beginning at start_address.
+ * maxsize should be at least as large as the value returned by
+ * EstimateBoundParametersSpace.
+ */
+void
+SerializeBoundParams(ParamListInfo paramInfo, Size maxsize, char *start_address)
+{
+	char	   *curptr;
+	SerializedParamExternData *retval;
+	int i;
+
+	/*
+	 * First, store the number of bind parameters.  If there are none,
+	 * there is nothing more to store.
+	 */
+	if (paramInfo && paramInfo->numParams > 0)
+		* (int *) start_address = paramInfo->numParams;
+	else
+	{
+		* (int *) start_address = 0;
+		return;
+	}
+	curptr = start_address + sizeof(int);
+
+
+	for (i = 0; i < paramInfo->numParams; i++)
+	{
+		ParamExternData *oprm;
+		int16		typLen;
+		bool		typByVal;
+		Size		datumlength, length;
+		const char	*s;
+
+		Assert (curptr <= start_address + maxsize);
+		retval = (SerializedParamExternData*) curptr;
+		oprm = &paramInfo->params[i];
+
+		retval->isnull = oprm->isnull;
+		retval->pflags = oprm->pflags;
+		retval->ptype = oprm->ptype;
+		retval->value = oprm->value;
+
+		curptr = curptr + sizeof(SerializedParamExternData);
+
+		if (retval->isnull)
+			continue;
+
+		get_typlenbyval(oprm->ptype, &typLen, &typByVal);
+
+		if (!typByVal)
+		{
+			datumlength = datumGetSize(oprm->value, typByVal, typLen);
+			s = (char *) DatumGetPointer(oprm->value);
+			memcpy(curptr, s, datumlength);
+			length = datumlength;
+			curptr[length] = '\0';
+			retval->length = length;
+			curptr += length + 1;
+		}
+	}
+}
+
+/*
+ * RestoreBoundParams
+ *		Restore bind parameters from the specified address.
+ *
+ * The params are palloc'd in CurrentMemoryContext.
+ */
+ParamListInfo
+RestoreBoundParams(char *start_address)
+{
+	ParamListInfo retval;
+	Size		size;
+	int			num_params,i;
+	char	   *curptr;
+
+	num_params = * (int *) start_address;
+
+	if (num_params <= 0)
+		return NULL;
+
+	size = offsetof(ParamListInfoData, params) +
+						num_params * sizeof(ParamExternData);
+	retval = (ParamListInfo) palloc(size);
+	retval->paramFetch = NULL;
+	retval->paramFetchArg = NULL;
+	retval->parserSetup = NULL;
+	retval->parserSetupArg = NULL;
+	retval->numParams = num_params;
+
+	curptr = start_address + sizeof(int);
+
+	for (i = 0; i < num_params; i++)
+	{
+		SerializedParamExternData *nprm;
+		char	*s;
+		int16		typLen;
+		bool		typByVal;
+
+		nprm = (SerializedParamExternData *) curptr;
+
+		/* copy the parameter info */
+		retval->params[i].isnull = nprm->isnull;
+		retval->params[i].pflags = nprm->pflags;
+		retval->params[i].ptype = nprm->ptype;
+		retval->params[i].value = nprm->value;
+
+		curptr = curptr + sizeof(SerializedParamExternData);
+
+		if (nprm->isnull)
+			continue;
+
+		get_typlenbyval(nprm->ptype, &typLen, &typByVal);
+
+		if (!typByVal)
+		{
+			s = palloc(nprm->length + 1);
+			memcpy(s, curptr, nprm->length + 1);
+			retval->params[i].value = PointerGetDatum(s);
+
+			curptr += nprm->length + 1;
+		}
+	}
+
+	return retval;
+}
+
+/*
+ * Estimate the amount of space required to serialize the PARAM_EXEC
+ * parameters.
+ */
+Size
+EstimateExecParametersSpace(List *serialized_param_exec_vals)
+{
+	Size		size;
+	ListCell	*lparam;
+
+	/*
+	 * Add space required for saving the number of PARAM_EXEC parameters
+	 * that need to be serialized.
+	 */
+	size = sizeof(int);
+
+	foreach(lparam, serialized_param_exec_vals)
+	{
+		int16		typLen;
+		bool		typByVal;
+		Size		length;
+		SerializedParamExecData* param_val = (SerializedParamExecData*) lfirst(lparam);
+
+		length = sizeof(SerializedParamExecData);
+
+		get_typlenbyval(param_val->ptype, &typLen, &typByVal);
+
+		/*
+		 * pass-by-value parameters are directly stored in
+		 * SerializedParamExecData, so no need of additional
+		 * space for them.
+		 */
+		if (!(typByVal || param_val->isnull))
+		{
+			length += datumGetSize(param_val->value, typByVal, typLen);
+			size = add_size(size, length);
+
+			/* Allow space for terminating zero-byte */
+			size = add_size(size, 1);
+		}
+		else
+			size = add_size(size, length);
+	}
+
+	return size;
+}
+
+/*
+ * Serialize the PARAM_EXEC parameters into the memory, beginning at
+ * start_address.  maxsize should be at least as large as the value
+ * returned by EstimateExecParametersSpace.
+ */
+void
+SerializeExecParams(List *serialized_param_exec_vals, Size maxsize,
+					char *start_address)
+{
+	char	   *curptr;
+	SerializedParamExecData *retval;
+	ListCell	*lparam;
+
+	/*
+	 * First, we store the number of PARAM_EXEC parameters that need to
+	 * be serialized.
+	 */
+	if (serialized_param_exec_vals)
+		* (int *) start_address = list_length(serialized_param_exec_vals);
+	else
+	{
+		* (int *) start_address = 0;
+		return;
+	}
+
+	curptr = start_address + sizeof(int);
+
+	foreach(lparam, serialized_param_exec_vals)
+	{
+		int16		typLen;
+		bool		typByVal;
+		Size		datumlength, length;
+		const char	*s;
+		SerializedParamExecData* param_val = (SerializedParamExecData*) lfirst(lparam);
+
+		retval = (SerializedParamExecData*) curptr;
+
+		retval->paramid = param_val->paramid;
+		retval->value = param_val->value;
+		retval->isnull = param_val->isnull;
+		retval->ptype = param_val->ptype;
+
+		curptr = curptr + sizeof(SerializedParamExecData);
+
+		if (retval->isnull)
+			continue;
+
+		get_typlenbyval(retval->ptype, &typLen, &typByVal);
+
+		if (!typByVal)
+		{
+			datumlength = datumGetSize(retval->value, typByVal, typLen);
+			s = (char *) DatumGetPointer(retval->value);
+			memcpy(curptr, s, datumlength);
+			length = datumlength;
+			curptr[length] = '\0';
+			retval->length = length;
+			curptr += length + 1;
+		}
+	}
+}
+
+/*
+ * RestoreExecParams
+ *		Restore PARAM_EXEC parameters from the specified address.
+ *
+ * The params are palloc'd in CurrentMemoryContext.
+ */
+List *
+RestoreExecParams(char *start_address)
+{
+	List			*lparamexecvals = NIL;
+	//Size			size;
+	int				num_params,i;
+	char			*curptr;
+
+	num_params = * (int *) start_address;
+
+	if (num_params <= 0)
+		return NULL;
+
+	curptr = start_address + sizeof(int);
+
+	for (i = 0; i < num_params; i++)
+	{
+		SerializedParamExecData *nprm;
+		SerializedParamExecData	*outparam;
+		char	*s;
+		int16		typLen;
+		bool		typByVal;
+
+		nprm = (SerializedParamExecData *) curptr;
+
+		outparam = palloc0(sizeof(SerializedParamExecData));
+
+		/* copy the parameter info */
+		outparam->isnull = nprm->isnull;
+		outparam->value = nprm->value;
+		outparam->paramid = nprm->paramid;
+		outparam->ptype = nprm->ptype;
+
+		curptr = curptr + sizeof(SerializedParamExecData);
+
+		/* NULL params carry no payload, but must still be returned */
+		if (!nprm->isnull)
+		{
+			get_typlenbyval(nprm->ptype, &typLen, &typByVal);
+			if (!typByVal)
+			{
+				s = palloc(nprm->length + 1);
+				memcpy(s, curptr, nprm->length + 1);
+				outparam->value = PointerGetDatum(s);
+				curptr += nprm->length + 1;
+			}
+		}
+
+		lparamexecvals = lappend(lparamexecvals, outparam);
+	}
+
+	return lparamexecvals;
+}
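
The estimate/serialize/restore triple above lays out a count, then for each parameter a fixed-size header followed by an optional variable-length payload. The following is a minimal standalone sketch of that layout, assuming NUL-terminated strings as the only payload type. ToyParamHeader, toy_estimate, toy_serialize, toy_restore and toy_roundtrip_demo are hypothetical names, not part of the patch. The sketch copies headers with memcpy rather than casting the cursor to a struct pointer, because a payload of arbitrary length can leave the cursor unaligned; whether that matters for the patch's direct casts depends on the platform.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical fixed-size header; it mirrors the idea, not the patch's struct. */
typedef struct
{
	int			id;
	int			isnull;
	size_t		length;			/* payload bytes following the header */
} ToyParamHeader;

/* Bytes needed to serialize n values; a NULL pointer stands for SQL NULL. */
size_t
toy_estimate(const char *const *vals, int n)
{
	size_t		size = sizeof(int);
	int			i;

	for (i = 0; i < n; i++)
	{
		size += sizeof(ToyParamHeader);
		if (vals[i] != NULL)
			size += strlen(vals[i]) + 1;	/* include terminating zero */
	}
	return size;
}

/* Write the count, then header plus payload for every value, into buf. */
void
toy_serialize(const char *const *vals, int n, char *buf)
{
	char	   *cur = buf;
	int			i;

	memcpy(cur, &n, sizeof(int));
	cur += sizeof(int);

	for (i = 0; i < n; i++)
	{
		ToyParamHeader h = {0};

		h.id = i;
		h.isnull = (vals[i] == NULL);
		h.length = h.isnull ? 0 : strlen(vals[i]) + 1;
		memcpy(cur, &h, sizeof(h));
		cur += sizeof(h);
		if (!h.isnull)
		{
			memcpy(cur, vals[i], h.length);
			cur += h.length;
		}
	}
}

/* Return a pointer into buf for entry idx, or NULL if that entry is null. */
const char *
toy_restore(const char *buf, int idx, int *isnull)
{
	const char *cur = buf;
	int			n;
	int			i;

	memcpy(&n, cur, sizeof(int));
	cur += sizeof(int);
	assert(idx >= 0 && idx < n);

	for (i = 0;; i++)
	{
		ToyParamHeader h;

		memcpy(&h, cur, sizeof(h));
		cur += sizeof(h);
		if (i == idx)
		{
			*isnull = h.isnull;
			return h.isnull ? NULL : cur;
		}
		cur += h.length;		/* skip this entry's payload */
	}
}

/* Round-trip three values, one of them NULL; returns 1 on success. */
int
toy_roundtrip_demo(void)
{
	const char *vals[3] = {"abc", NULL, "hello"};
	char		buf[256];
	int			isnull;
	const char *p;

	if (toy_estimate(vals, 3) > sizeof(buf))
		return 0;
	toy_serialize(vals, 3, buf);
	p = toy_restore(buf, 1, &isnull);
	if (p != NULL || !isnull)
		return 0;
	p = toy_restore(buf, 2, &isnull);
	return !isnull && strcmp(p, "hello") == 0;
}
```

A NULL entry still gets a header in this layout, so the reader sees every parameter and can tell a NULL apart from a missing one.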
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 563209c..f757e92 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1280,6 +1280,124 @@ _readRangeTblFunction(void)
 	READ_DONE();
 }
 
+/*
+ * _readPlanInvalItem
+ */
+static PlanInvalItem *
+_readPlanInvalItem(void)
+{
+	READ_LOCALS(PlanInvalItem);
+
+	READ_INT_FIELD(cacheId);
+	READ_UINT_FIELD(hashValue);
+
+	READ_DONE();
+}
+
+/*
+ * _readPlannedStmt
+ */
+static PlannedStmt *
+_readPlannedStmt(void)
+{
+	READ_LOCALS(PlannedStmt);
+
+	READ_ENUM_FIELD(commandType, CmdType);
+	READ_UINT_FIELD(queryId);
+	READ_BOOL_FIELD(hasReturning);
+	READ_BOOL_FIELD(hasModifyingCTE);
+	READ_BOOL_FIELD(canSetTag);
+	READ_BOOL_FIELD(transientPlan);
+	READ_NODE_FIELD(planTree);
+	READ_NODE_FIELD(rtable);
+	READ_NODE_FIELD(resultRelations);
+	READ_NODE_FIELD(utilityStmt);
+	READ_NODE_FIELD(subplans);
+	READ_BITMAPSET_FIELD(rewindPlanIDs);
+	READ_NODE_FIELD(rowMarks);
+	READ_NODE_FIELD(relationOids);
+	READ_NODE_FIELD(invalItems);
+	READ_INT_FIELD(nParamExec);
+	READ_BOOL_FIELD(hasRowSecurity);
+	READ_BOOL_FIELD(parallelModeNeeded);
+
+	READ_DONE();
+}
+
+/*
+ * _readPlan
+ */
+static Plan *
+_readPlan(void)
+{
+	READ_LOCALS(Plan);
+
+	READ_FLOAT_FIELD(startup_cost);
+	READ_FLOAT_FIELD(total_cost);
+	READ_FLOAT_FIELD(plan_rows);
+	READ_INT_FIELD(plan_width);
+	READ_NODE_FIELD(targetlist);
+	READ_NODE_FIELD(qual);
+	READ_NODE_FIELD(lefttree);
+	READ_NODE_FIELD(righttree);
+	READ_NODE_FIELD(initPlan);
+	READ_BITMAPSET_FIELD(extParam);
+	READ_BITMAPSET_FIELD(allParam);
+
+	READ_DONE();
+}
+
+/*
+ * _readScan
+ */
+static Scan *
+_readScan(void)
+{
+	Plan *local_plan;
+	READ_LOCALS(PartialSeqScan);
+
+	local_plan = _readPlan();
+	local_node->plan.startup_cost = local_plan->startup_cost;
+	local_node->plan.total_cost = local_plan->total_cost;
+	local_node->plan.plan_rows = local_plan->plan_rows;
+	local_node->plan.plan_width = local_plan->plan_width;
+	local_node->plan.targetlist = local_plan->targetlist;
+	local_node->plan.qual = local_plan->qual;
+	local_node->plan.lefttree = local_plan->lefttree;
+	local_node->plan.righttree = local_plan->righttree;
+	local_node->plan.initPlan = local_plan->initPlan;
+	local_node->plan.extParam = local_plan->extParam;
+	local_node->plan.allParam = local_plan->allParam;
+	READ_UINT_FIELD(scanrelid);
+
+	READ_DONE();
+}
+
+/*
+ * _readResult
+ */
+static Result *
+_readResult(void)
+{
+	Plan *local_plan;
+	READ_LOCALS(Result);
+
+	local_plan = _readPlan();
+	local_node->plan.startup_cost = local_plan->startup_cost;
+	local_node->plan.total_cost = local_plan->total_cost;
+	local_node->plan.plan_rows = local_plan->plan_rows;
+	local_node->plan.plan_width = local_plan->plan_width;
+	local_node->plan.targetlist = local_plan->targetlist;
+	local_node->plan.qual = local_plan->qual;
+	local_node->plan.lefttree = local_plan->lefttree;
+	local_node->plan.righttree = local_plan->righttree;
+	local_node->plan.initPlan = local_plan->initPlan;
+	local_node->plan.extParam = local_plan->extParam;
+	local_node->plan.allParam = local_plan->allParam;
+	READ_NODE_FIELD(resconstantqual);
+
+	READ_DONE();
+}
 
 /*
  * parseNodeString
@@ -1409,6 +1527,14 @@ parseNodeString(void)
 		return_value = _readNotifyStmt();
 	else if (MATCH("DECLARECURSOR", 13))
 		return_value = _readDeclareCursorStmt();
+	else if (MATCH("PLANINVALITEM", 13))
+		return_value = _readPlanInvalItem();
+	else if (MATCH("PLANNEDSTMT", 11))
+		return_value = _readPlannedStmt();
+	else if (MATCH("PARTIALSEQSCAN", 14))
+		return_value = _readScan();
+	else if (MATCH("RESULT", 6))
+		return_value = _readResult();
 	else
 	{
 		elog(ERROR, "badly formatted node string \"%.32s\"...", token);
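Review note on the new parseNodeString entries: as far as I understand the MATCH macro, it compares the token length first and then the bytes, so the numeric literal next to each name has to equal the length of that name. The sketch below mirrors that check outside PostgreSQL. `token_matches`, `classify_token` and the `NodeKind` enum are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <string.h>

typedef enum NodeKind
{
	NODE_UNKNOWN,
	NODE_PLANINVALITEM,
	NODE_PLANNEDSTMT,
	NODE_PARTIALSEQSCAN,
	NODE_RESULT
} NodeKind;

/* Length check first, then a byte comparison of exactly namelen bytes. */
static int
token_matches(const char *token, int length, const char *name, int namelen)
{
	return length == namelen && memcmp(token, name, (size_t) namelen) == 0;
}

/* Dispatch on the node label the way the added branches do. */
static NodeKind
classify_token(const char *token, int length)
{
	assert(token != NULL && length >= 0);

	if (token_matches(token, length, "PLANINVALITEM", 13))
		return NODE_PLANINVALITEM;
	else if (token_matches(token, length, "PLANNEDSTMT", 11))
		return NODE_PLANNEDSTMT;
	else if (token_matches(token, length, "PARTIALSEQSCAN", 14))
		return NODE_PARTIALSEQSCAN;
	else if (token_matches(token, length, "RESULT", 6))
		return NODE_RESULT;

	return NODE_UNKNOWN;
}
```

The four literals used in the patch (13, 11, 14 and 6) do equal the lengths of their labels, so the new branches are reachable.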
diff --git a/src/backend/optimizer/path/Makefile b/src/backend/optimizer/path/Makefile
index 6864a62..6e462b1 100644
--- a/src/backend/optimizer/path/Makefile
+++ b/src/backend/optimizer/path/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = allpaths.o clausesel.o costsize.o equivclass.o indxpath.o \
-       joinpath.o joinrels.o pathkeys.o tidpath.o
+       joinpath.o joinrels.o pathkeys.o parallelpath.o tidpath.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 58d78e6..528727c 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -410,6 +410,9 @@ set_plain_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
 	/* Consider sequential scan */
 	add_path(rel, create_seqscan_path(root, rel, required_outer));
 
+	/* Consider parallel scans */
+	create_parallelscan_paths(root, rel);
+
 	/* Consider index scans */
 	create_index_paths(root, rel);
 
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 1a0d358..874c272 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -11,6 +11,9 @@
  *	cpu_tuple_cost		Cost of typical CPU time to process a tuple
  *	cpu_index_tuple_cost  Cost of typical CPU time to process an index tuple
  *	cpu_operator_cost	Cost of CPU time to execute an operator or function
+ *	cpu_tuple_comm_cost	Cost of CPU time to pass a tuple from worker to master backend
+ *	parallel_setup_cost	Cost of setting up shared memory for parallelism
+ *	parallel_startup_cost	Cost of starting up parallel workers
  *
  * We expect that the kernel will typically do some amount of read-ahead
  * optimization; this in conjunction with seek costs means that seq_page_cost
@@ -101,11 +104,16 @@ double		random_page_cost = DEFAULT_RANDOM_PAGE_COST;
 double		cpu_tuple_cost = DEFAULT_CPU_TUPLE_COST;
 double		cpu_index_tuple_cost = DEFAULT_CPU_INDEX_TUPLE_COST;
 double		cpu_operator_cost = DEFAULT_CPU_OPERATOR_COST;
+double		cpu_tuple_comm_cost = DEFAULT_CPU_TUPLE_COMM_COST;
+double		parallel_setup_cost = DEFAULT_PARALLEL_SETUP_COST;
+double		parallel_startup_cost = DEFAULT_PARALLEL_STARTUP_COST;
 
 int			effective_cache_size = DEFAULT_EFFECTIVE_CACHE_SIZE;
 
 Cost		disable_cost = 1.0e10;
 
+int			parallel_seqscan_degree = 0;
+
 bool		enable_seqscan = true;
 bool		enable_indexscan = true;
 bool		enable_indexonlyscan = true;
@@ -220,6 +228,55 @@ cost_seqscan(Path *path, PlannerInfo *root,
 }
 
 /*
+ * cost_funnel
+ *	  Determines and returns the cost of scanning a relation in parallel.
+ *
+ * 'baserel' is the relation to be scanned
+ * 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
+ */
+void
+cost_funnel(FunnelPath *path, PlannerInfo *root,
+			RelOptInfo *baserel, ParamPathInfo *param_info,
+			int nWorkers)
+{
+	Cost		startup_cost = 0;
+	Cost		run_cost = 0;
+
+	/* Should only be applied to base relations */
+	Assert(baserel->relid > 0);
+	Assert(baserel->rtekind == RTE_RELATION);
+
+	/* Mark the path with the correct row estimate */
+	if (param_info)
+		path->path.rows = param_info->ppi_rows;
+	else
+		path->path.rows = baserel->rows;
+
+	startup_cost = path->subpath->startup_cost;
+
+	run_cost = path->subpath->total_cost - path->subpath->startup_cost;
+
+	/*
+	 * Runtime cost will be shared equally by all workers.
+	 * The assumption here is that disk access cost will also be
+	 * shared equally between workers, which is generally true
+	 * unless there are too many workers working on a relatively
+	 * small number of blocks.  If we come across any such case,
+	 * then we can think of changing the current cost model for
+	 * parallel sequential scan.
+	 */
+	run_cost = run_cost / (nWorkers + 1);
+
+	/* Parallel setup and communication cost. */
+	startup_cost += parallel_setup_cost;
+	startup_cost += parallel_startup_cost * nWorkers;
+	run_cost += cpu_tuple_comm_cost * baserel->tuples;
+
+	path->path.startup_cost = startup_cost;
+	path->path.total_cost = (startup_cost + run_cost);
+}
+
+/*
  * cost_index
  *	  Determines and returns the cost of scanning a relation using an index.
  *
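Review note on cost_funnel: the arithmetic is easier to check outside the planner. The sketch below reproduces the formula as written in the hunk above. The run cost of the subpath is divided by nWorkers + 1, while the per-tuple communication charge is applied to all tuples of the relation and is not divided. `FunnelCostParams`, `FunnelCost` and `funnel_cost` are hypothetical names. The constants used in the usage note are made up, since the DEFAULT_* values are not part of this section.

```c
#include <assert.h>

/* Hypothetical stand-ins for the three new cost variables in the patch. */
typedef struct FunnelCostParams
{
	double		cpu_tuple_comm_cost;
	double		parallel_setup_cost;
	double		parallel_startup_cost;
} FunnelCostParams;

typedef struct FunnelCost
{
	double		startup_cost;
	double		total_cost;
} FunnelCost;

/* Same arithmetic as cost_funnel, with plain doubles as inputs. */
static FunnelCost
funnel_cost(double sub_startup, double sub_total, double tuples,
			int nworkers, const FunnelCostParams *p)
{
	FunnelCost	c;
	double		run_cost;

	assert(nworkers >= 0);

	/* subpath run cost, split over nworkers + 1 */
	run_cost = (sub_total - sub_startup) / (nworkers + 1);

	/* every tuple of the relation is charged the communication cost */
	run_cost += p->cpu_tuple_comm_cost * tuples;

	c.startup_cost = sub_startup
		+ p->parallel_setup_cost
		+ p->parallel_startup_cost * nworkers;
	c.total_cost = c.startup_cost + run_cost;

	return c;
}
```

With made-up constants of 0.125 per tuple, 8 for setup and 16 per worker, a subpath costing 0..1000 over 10000 tuples and three workers yields a startup cost of 56 and a total cost of 1556. The communication charge alone (1250) exceeds the plain scan cost of 1000 in that example, so the per-tuple constant largely decides whether the funnel path can win.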
diff --git a/src/backend/optimizer/path/parallelpath.c b/src/backend/optimizer/path/parallelpath.c
new file mode 100644
index 0000000..949e79b
--- /dev/null
+++ b/src/backend/optimizer/path/parallelpath.c
@@ -0,0 +1,80 @@
+/*-------------------------------------------------------------------------
+ *
+ * parallelpath.c
+ *	  Routines to determine which conditions are usable for scanning
+ *	  a given relation, and create ParallelPaths accordingly.
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/optimizer/path/parallelpath.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heapam.h"
+#include "optimizer/cost.h"
+#include "optimizer/pathnode.h"
+#include "optimizer/paths.h"
+#include "parser/parsetree.h"
+#include "utils/rel.h"
+
+
+/*
+ * create_parallelscan_paths
+ *	  Create paths corresponding to parallel scans of the given rel.
+ *	  Currently we only support parallel sequential scan.
+ *
+ *	  Candidate paths are added to the rel's pathlist (using add_path).
+ */
+void
+create_parallelscan_paths(PlannerInfo *root, RelOptInfo *rel)
+{
+	int num_parallel_workers = 0;
+	Oid			reloid;
+	Relation	relation;
+	Path		*subpath;
+
+	/*
+	 * parallel scan is possible only if user has set
+	 * parallel_seqscan_degree to value greater than 0
+	 * and the query is parallel-safe.
+	 */
+	if (parallel_seqscan_degree <= 0 || !root->glob->parallelModeOK)
+		return;
+
+	reloid = planner_rt_fetch(rel->relid, root)->relid;
+
+	relation = heap_open(reloid, NoLock);
+
+	/*
+	 * Temporary relations can't be scanned by parallel workers as
+	 * they are visible only to local sessions.
+	 */
+	if (RelationUsesLocalBuffers(relation))
+	{
+		heap_close(relation, NoLock);
+		return;
+	}
+
+	heap_close(relation, NoLock);
+
+	/*
+	 * There should be at least one page to scan for each worker.
+	 */
+	if (parallel_seqscan_degree <= rel->pages)
+		num_parallel_workers = parallel_seqscan_degree;
+	else
+		num_parallel_workers = rel->pages;
+
+	/* Create the partial scan path which each worker needs to execute. */
+	subpath = create_partialseqscan_path(root, rel, NULL);
+
+	/* Create the parallel scan path which master needs to execute. */
+	add_path(rel, (Path *) create_funnel_path(root, rel, subpath,
+											  num_parallel_workers));
+}
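Review note on the worker count chosen in create_parallelscan_paths: the code takes the smaller of parallel_seqscan_degree and rel->pages. The sketch below isolates that rule. `choose_worker_count` is a hypothetical helper (the patch does this inline), and the page count is modelled as an unsigned int on the assumption that rel->pages is an unsigned block count.

```c
#include <assert.h>

/* Clamp the configured degree to the number of pages in the relation. */
static int
choose_worker_count(int degree, unsigned int rel_pages)
{
	/* the caller has already returned early for degree <= 0 */
	assert(degree > 0);

	if ((unsigned int) degree <= rel_pages)
		return degree;

	return (int) rel_pages;
}
```

For a relation with zero pages this returns zero, and the function as posted still goes on to build a funnel path with no workers. It may be worth returning early in that case, or confirming that the executor handles a zero-worker funnel.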
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index cb69c03..6dd43f3 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -58,6 +58,10 @@ static Material *create_material_plan(PlannerInfo *root, MaterialPath *best_path
 static Plan *create_unique_plan(PlannerInfo *root, UniquePath *best_path);
 static SeqScan *create_seqscan_plan(PlannerInfo *root, Path *best_path,
 					List *tlist, List *scan_clauses);
+static Scan *create_partialseqscan_plan(PlannerInfo *root, Path *best_path,
+							List *tlist, List *scan_clauses);
+static Scan *create_funnel_plan(PlannerInfo *root,
+								FunnelPath *best_path);
 static Scan *create_indexscan_plan(PlannerInfo *root, IndexPath *best_path,
 					  List *tlist, List *scan_clauses, bool indexonly);
 static BitmapHeapScan *create_bitmap_scan_plan(PlannerInfo *root,
@@ -100,6 +104,12 @@ static List *order_qual_clauses(PlannerInfo *root, List *clauses);
 static void copy_path_costsize(Plan *dest, Path *src);
 static void copy_plan_costsize(Plan *dest, Plan *src);
 static SeqScan *make_seqscan(List *qptlist, List *qpqual, Index scanrelid);
+static PartialSeqScan *make_partialseqscan(List *qptlist,
+										   List *qpqual,
+										   Index scanrelid);
+static Funnel *make_funnel(List *qptlist, List *qpqual,
+						   Index scanrelid, int nworkers,
+						   Plan *subplan);
 static IndexScan *make_indexscan(List *qptlist, List *qpqual, Index scanrelid,
 			   Oid indexid, List *indexqual, List *indexqualorig,
 			   List *indexorderby, List *indexorderbyorig,
@@ -228,6 +238,7 @@ create_plan_recurse(PlannerInfo *root, Path *best_path)
 	switch (best_path->pathtype)
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -267,6 +278,10 @@ create_plan_recurse(PlannerInfo *root, Path *best_path)
 			plan = create_unique_plan(root,
 									  (UniquePath *) best_path);
 			break;
+		case T_Funnel:
+			plan = (Plan *) create_funnel_plan(root,
+											   (FunnelPath *) best_path);
+			break;
 		default:
 			elog(ERROR, "unrecognized node type: %d",
 				 (int) best_path->pathtype);
@@ -343,6 +358,13 @@ create_scan_plan(PlannerInfo *root, Path *best_path)
 												scan_clauses);
 			break;
 
+		case T_PartialSeqScan:
+			plan = (Plan *) create_partialseqscan_plan(root,
+													   best_path,
+													   tlist,
+													   scan_clauses);
+			break;
+
 		case T_IndexScan:
 			plan = (Plan *) create_indexscan_plan(root,
 												  (IndexPath *) best_path,
@@ -546,6 +568,8 @@ disuse_physical_tlist(PlannerInfo *root, Plan *plan, Path *path)
 	switch (path->pathtype)
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
+		case T_Funnel:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -1133,6 +1157,107 @@ create_seqscan_plan(PlannerInfo *root, Path *best_path,
 }
 
 /*
+ * create_partialseqscan_plan
+ *
+ * Returns a partial seqscan plan for the base relation scanned by
+ * 'best_path' with restriction clauses 'scan_clauses' and targetlist
+ * 'tlist'.
+ */
+static Scan *
+create_partialseqscan_plan(PlannerInfo *root, Path *best_path,
+						   List *tlist, List *scan_clauses)
+{
+	Scan    *scan_plan;
+	Index		scan_relid = best_path->parent->relid;
+
+	/* it should be a base rel... */
+	Assert(scan_relid > 0);
+	Assert(best_path->parent->rtekind == RTE_RELATION);
+
+	/* Sort clauses into best execution order */
+	scan_clauses = order_qual_clauses(root, scan_clauses);
+
+	/* Reduce RestrictInfo list to bare expressions; ignore pseudoconstants */
+	scan_clauses = extract_actual_clauses(scan_clauses, false);
+
+	/* Replace any outer-relation variables with nestloop params */
+	if (best_path->param_info)
+	{
+		scan_clauses = (List *)
+			replace_nestloop_params(root, (Node *) scan_clauses);
+	}
+
+	scan_plan = (Scan *) make_partialseqscan(tlist,
+											 scan_clauses,
+											 scan_relid);
+
+	copy_path_costsize(&scan_plan->plan, best_path);
+
+	return scan_plan;
+}
+
+/*
+ * create_funnel_plan
+ *
+ * Returns a funnel plan for the base relation scanned by
+ * 'best_path'.
+ */
+static Scan *
+create_funnel_plan(PlannerInfo *root, FunnelPath *best_path)
+{
+	Scan    *scan_plan;
+	Plan	*subplan;
+	List	*tlist;
+	RelOptInfo *rel = best_path->path.parent;
+	Index	scan_relid = best_path->path.parent->relid;
+
+	/*
+	 * For table scans, rather than using the relation targetlist (which is
+	 * only those Vars actually needed by the query), we prefer to generate a
+	 * tlist containing all Vars in order.  This will allow the executor to
+	 * optimize away projection of the table tuples, if possible.  (Note that
+	 * planner.c may replace the tlist we generate here, forcing projection to
+	 * occur.)
+	 */
+	if (use_physical_tlist(root, rel))
+	{
+		tlist = build_physical_tlist(root, rel);
+		/* if fail because of dropped cols, use regular method */
+		if (tlist == NIL)
+			tlist = build_path_tlist(root, &best_path->path);
+	}
+	else
+	{
+		tlist = build_path_tlist(root, &best_path->path);
+	}
+
+	/* it should be a base rel... */
+	Assert(scan_relid > 0);
+	Assert(best_path->path.parent->rtekind == RTE_RELATION);
+
+	subplan = create_plan_recurse(root, best_path->subpath);
+
+	/*
+	 * The quals for the subplan and the top-level plan are the
+	 * same, as either all the quals are pushed to the subplan
+	 * (partialseqscan plan) or the parallel plan won't be
+	 * chosen.
+	 */
+	scan_plan = (Scan *) make_funnel(tlist,
+									 subplan->qual,
+									 scan_relid,
+									 best_path->num_workers,
+									 subplan);
+
+	copy_path_costsize(&scan_plan->plan, &best_path->path);
+
+	/* use parallel mode for parallel plans. */
+	root->glob->parallelModeNeeded = true;
+
+	return scan_plan;
+}
+
+/*
  * create_indexscan_plan
  *	  Returns an indexscan plan for the base relation scanned by 'best_path'
  *	  with restriction clauses 'scan_clauses' and targetlist 'tlist'.
@@ -3321,6 +3446,45 @@ make_seqscan(List *qptlist,
 	return node;
 }
 
+static PartialSeqScan *
+make_partialseqscan(List *qptlist,
+					List *qpqual,
+					Index scanrelid)
+{
+	PartialSeqScan *node = makeNode(PartialSeqScan);
+	Plan	   *plan = &node->plan;
+
+	/* cost should be inserted by caller */
+	plan->targetlist = qptlist;
+	plan->qual = qpqual;
+	plan->lefttree = NULL;
+	plan->righttree = NULL;
+	node->scanrelid = scanrelid;
+
+	return node;
+}
+
+static Funnel *
+make_funnel(List *qptlist,
+			List *qpqual,
+			Index scanrelid,
+			int nworkers,
+			Plan *subplan)
+{
+	Funnel *node = makeNode(Funnel);
+	Plan	   *plan = &node->scan.plan;
+
+	/* cost should be inserted by caller */
+	plan->targetlist = qptlist;
+	plan->qual = qpqual;
+	plan->lefttree = subplan;
+	plan->righttree = NULL;
+	node->scan.scanrelid = scanrelid;
+	node->num_workers = nworkers;
+
+	return node;
+}
+
 static IndexScan *
 make_indexscan(List *qptlist,
 			   List *qpqual,
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 1824e7b..d16fa09 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -275,6 +275,52 @@ standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
 	return result;
 }
 
+PlannedStmt	*
+create_parallel_worker_plannedstmt(PartialSeqScan *partialscan,
+								   List *rangetable,
+								   int num_exec_params)
+{
+	PlannedStmt	*result;
+	ListCell   *tlist;
+
+	/*
+	 * Avoid removing junk entries in worker as those are
+	 * required by upper nodes in master backend.
+	 */
+	foreach(tlist, partialscan->plan.targetlist)
+	{
+		TargetEntry *tle = (TargetEntry *) lfirst(tlist);
+
+		tle->resjunk = false;
+	}
+
+	/* build the PlannedStmt result */
+	result = makeNode(PlannedStmt);
+
+	result->commandType = CMD_SELECT;
+	result->queryId = 0;
+	result->hasReturning = false;
+	result->hasModifyingCTE = false;
+	result->canSetTag = true;
+	result->transientPlan = false;
+	result->planTree = (Plan*) partialscan;
+	result->rtable = rangetable;
+	result->resultRelations = NIL;
+	result->utilityStmt = NULL;
+	result->subplans = NIL;
+	result->rewindPlanIDs = NULL;
+	result->rowMarks = NIL;
+	result->nParamExec = num_exec_params;
+	/*
+	 * Don't bother to set parameters used for invalidation as
+	 * worker backend plans are not saved, so can't be invalidated.
+	 */
+	result->relationOids = NIL;
+	result->invalItems = NIL;
+	result->hasRowSecurity = false;
+
+	return result;
+}
 
 /*--------------------
  * subquery_planner
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 94b12ab..e26b248 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -435,6 +435,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
 			{
 				SeqScan    *splan = (SeqScan *) plan;
 
@@ -445,6 +446,26 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 					fix_scan_list(root, splan->plan.qual, rtoffset);
 			}
 			break;
+		case T_Funnel:
+			{
+				Funnel    *splan = (Funnel *) plan;
+
+				/*
+				 * target list for partial sequential scan (lefttree of funnel
+				 * plan) should be same as for funnel scan as both nodes need to
+				 * produce same projection.  We don't want to do this assignment
+				 * after fixing references as that will be done separately for
+				 * partial sequential scan node.
+				 */
+				splan->scan.plan.lefttree->targetlist = splan->scan.plan.targetlist;
+
+				splan->scan.scanrelid += rtoffset;
+				splan->scan.plan.targetlist =
+					fix_scan_list(root, splan->scan.plan.targetlist, rtoffset);
+				splan->scan.plan.qual =
+					fix_scan_list(root, splan->scan.plan.qual, rtoffset);
+			}
+			break;
 		case T_IndexScan:
 			{
 				IndexScan  *splan = (IndexScan *) plan;
@@ -2062,6 +2083,45 @@ fix_opfuncids_walker(Node *node, void *context)
 }
 
 /*
+ * fix_node_funcids
+ *		Set the opfuncid (procedure OID) in the OpExpr nodes
+ *		of a plan tree.
+ *
+ * We need it mainly to fix the opfuncid in the nodes of a plan tree
+ * after the worker backend has read the planned statement.
+ * Currently the set of nodes that can be executed by a worker
+ * backend is limited, so we can enhance this API based on its
+ * usage in future.
+ */
+void
+fix_node_funcids(Plan *node)
+{
+	/*
+	 * Do nothing when we get past a leaf of the tree.
+	 */
+	if (node == NULL)
+		return;
+
+	fix_opfuncids((Node*) node->qual);
+	fix_opfuncids((Node*) node->targetlist);
+
+	switch (nodeTag(node))
+	{
+		case T_Result:
+			fix_opfuncids((Node*) (((Result *)node)->resconstantqual));
+			break;
+		case T_PartialSeqScan:
+			break;
+		default:
+			elog(ERROR, "unrecognized node type: %d", (int) nodeTag(node));
+			break;
+	}
+
+	fix_node_funcids(node->lefttree);
+	fix_node_funcids(node->righttree);
+}
+
+/*
  * set_opfuncid
  *		Set the opfuncid (procedure OID) in an OpExpr node,
  *		if it hasn't been set already.
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index acfd0bc..f649639 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2167,6 +2167,8 @@ finalize_plan(PlannerInfo *root, Plan *plan, Bitmapset *valid_params,
 			break;
 
 		case T_SeqScan:
+		case T_PartialSeqScan:
+		case T_Funnel:
 			context.paramids = bms_add_members(context.paramids, scan_params);
 			break;
 
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index faca30b..0e5fd3a 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -706,6 +706,53 @@ create_seqscan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer)
 }
 
 /*
+ * create_partialseqscan_path
+ *	  Creates a path corresponding to a partial sequential scan, returning the
+ *	  pathnode.
+ */
+Path *
+create_partialseqscan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer)
+{
+	Path	   *pathnode = makeNode(Path);
+
+	pathnode->pathtype = T_PartialSeqScan;
+	pathnode->parent = rel;
+	pathnode->param_info = get_baserel_parampathinfo(root, rel,
+													 required_outer);
+	pathnode->pathkeys = NIL;	/* seqscan has unordered result */
+
+	cost_seqscan(pathnode, root, rel, pathnode->param_info);
+
+	return pathnode;
+}
+
+/*
+ * create_funnel_path
+ *
+ *	  Creates a path corresponding to a funnel scan, returning the
+ *	  pathnode.
+ */
+FunnelPath *
+create_funnel_path(PlannerInfo *root, RelOptInfo *rel,
+							Path* subpath, int nWorkers)
+{
+	FunnelPath	   *pathnode = makeNode(FunnelPath);
+
+	pathnode->path.pathtype = T_Funnel;
+	pathnode->path.parent = rel;
+	pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
+													 NULL);
+	pathnode->path.pathkeys = NIL;	/* seqscan has unordered result */
+
+	pathnode->subpath = subpath;
+	pathnode->num_workers = nWorkers;
+
+	cost_funnel(pathnode, root, rel, pathnode->path.param_info, nWorkers);
+
+	return pathnode;
+}
+
+/*
  * create_index_path
  *	  Creates a path node for an index scan.
  *
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 71c2321..f056bd5 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -12,7 +12,8 @@ subdir = src/backend/postmaster
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = autovacuum.o bgworker.o bgwriter.o checkpointer.o fork_process.o \
-	pgarch.o pgstat.o postmaster.o startup.o syslogger.o walwriter.o
+OBJS = autovacuum.o backendworker.o bgworker.o bgwriter.o checkpointer.o \
+	fork_process.o pgarch.o pgstat.o postmaster.o startup.o syslogger.o \
+	walwriter.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/backendworker.c b/src/backend/postmaster/backendworker.c
new file mode 100644
index 0000000..f4f6235
--- /dev/null
+++ b/src/backend/postmaster/backendworker.c
@@ -0,0 +1,471 @@
+/*-------------------------------------------------------------------------
+ *
+ * backendworker.c
+ *	  Support routines for setting up backend workers.
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/postmaster/backendworker.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/nodeFunnel.h"
+#include "optimizer/planmain.h"
+#include "optimizer/planner.h"
+#include "postmaster/backendworker.h"
+#include "tcop/tcopprot.h"
+
+
+#define PARALLEL_TUPLE_QUEUE_SIZE					65536
+
+static void ParallelQueryMain(dsm_segment *seg, shm_toc *toc);
+static void
+EstimateParallelSupportInfoSpace(ParallelContext *pcxt, ParamListInfo params,
+								 List *serialized_param_exec_vals,
+								 int instOptions, Size *params_size,
+								 Size *params_exec_size);
+static void
+StoreParallelSupportInfo(ParallelContext *pcxt, ParamListInfo params,
+						 List *serialized_param_exec_vals,
+						 int instOptions, Size params_size,
+						 Size params_exec_size,
+						 char **inst_options_space,
+						 char **buffer_usage_space);
+static void
+EstimatePartialSeqScanSpace(ParallelContext *pcxt, EState *estate,
+							char *plannedstmt_str, Size *plannedstmt_len,
+							Size *pscan_size);
+static void
+StorePartialSeqScan(ParallelContext *pcxt, EState *estate, Relation rel,
+					char *plannedstmt_str, Size plannedstmt_size,
+					Size pscan_size);
+static void EstimateResponseQueueSpace(ParallelContext *pcxt);
+static void
+StoreResponseQueue(ParallelContext *pcxt,
+				   shm_mq_handle ***responseqp);
+static void
+ExecParallelGetPlannedStmt(shm_toc *toc, PlannedStmt **plannedstmt);
+static void
+GetParallelSupportInfo(shm_toc *toc, ParamListInfo *params,
+					   List **serialized_param_exec_vals,
+					   int *inst_options, char **instrument,
+					   char **buffer_usage);
+static void
+SetupResponseQueue(dsm_segment *seg, shm_toc *toc, shm_mq **mq,
+				   shm_mq_handle **responseq);
+
+
+/*
+ * This is required for parallel plan execution to fetch the
+ * information from dsm.
+ */
+static shm_toc *parallel_shm_toc = NULL;
+
+/*
+ * EstimateParallelSupportInfoSpace
+ *
+ * Estimate the amount of space required to record information of
+ * bind parameters, PARAM_EXEC parameters and instrumentation
+ * information that need to be retrieved from parallel workers.
+ */
+void
+EstimateParallelSupportInfoSpace(ParallelContext *pcxt, ParamListInfo params,
+								 List *serialized_param_exec_vals,
+								 int instOptions, Size *params_size,
+								 Size *params_exec_size)
+{
+	*params_size = EstimateBoundParametersSpace(params);
+	shm_toc_estimate_chunk(&pcxt->estimator, *params_size);
+
+	*params_exec_size = EstimateExecParametersSpace(serialized_param_exec_vals);
+	shm_toc_estimate_chunk(&pcxt->estimator, *params_exec_size);
+
+	/*
+	 * We expect each worker to populate the BufferUsage structure
+	 * allocated by master backend and then master backend will aggregate
+	 * all the usage along with its own, so account it for each worker.
+	 */
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   sizeof(BufferUsage) * pcxt->nworkers);
+
+	/* account for instrumentation options. */
+	shm_toc_estimate_chunk(&pcxt->estimator, sizeof(int));
+
+	/*
+	 * We expect each worker to populate the instrumentation structure
+	 * allocated by master backend and then master backend will aggregate
+	 * all the information, so account it for each worker.
+	 */
+	if (instOptions)
+	{
+		shm_toc_estimate_chunk(&pcxt->estimator,
+							   sizeof(Instrumentation) * pcxt->nworkers);
+		/* keys for parallel support information. */
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+	}
+
+	/* keys for parallel support information. */
+	shm_toc_estimate_keys(&pcxt->estimator, 4);
+}
+
+/*
+ * StoreParallelSupportInfo
+ *
+ * Sets up the bind parameters, PARAM_EXEC parameters and instrumentation
+ * information required for parallel execution.
+ */
+void
+StoreParallelSupportInfo(ParallelContext *pcxt, ParamListInfo params,
+						 List *serialized_param_exec_vals,
+						 int instOptions, Size params_size,
+						 Size params_exec_size,
+						 char **inst_options_space,
+						 char **buffer_usage_space)
+{
+	char	*paramsdata;
+	char	*paramsexecdata;
+	int		*inst_options;
+
+	/*
+	 * Store the bind parameter list in dynamic shared memory.  This is
+	 * used for parameters in prepared queries.
+	 */
+	paramsdata = shm_toc_allocate(pcxt->toc, params_size);
+	SerializeBoundParams(params, params_size, paramsdata);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_PARAMS, paramsdata);
+
+	/*
+	 * Store the PARAM_EXEC parameter list in dynamic shared memory.  This
+	 * is used for evaluating plan->initPlan params.
+	 */
+	paramsexecdata = shm_toc_allocate(pcxt->toc, params_exec_size);
+	SerializeExecParams(serialized_param_exec_vals, params_exec_size, paramsexecdata);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_PARAMS_EXEC, paramsexecdata);
+
+	/*
+	 * Allocate space for BufferUsage information to be filled by
+	 * each worker.
+	 */
+	*buffer_usage_space =
+			shm_toc_allocate(pcxt->toc, sizeof(BufferUsage) * pcxt->nworkers);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFF_USAGE, *buffer_usage_space);
+
+	/* Store instrument options in dynamic shared memory. */
+	inst_options = shm_toc_allocate(pcxt->toc, sizeof(int));
+	*inst_options = instOptions;
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_INST_OPTIONS, inst_options);
+
+	/*
+	 * Allocate space for instrumentation information to be filled by
+	 * each worker.
+	 */
+	if (instOptions)
+	{
+		*inst_options_space =
+			shm_toc_allocate(pcxt->toc, sizeof(Instrumentation) * pcxt->nworkers);
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_INST_INFO, *inst_options_space);
+	}
+}
+
+/*
+ * EstimatePartialSeqScanSpace
+ *
+ * Estimate the amount of space required to record information of
+ * planned statement and parallel heap scan descriptor that need
+ * to be copied to parallel workers.
+ */
+void
+EstimatePartialSeqScanSpace(ParallelContext *pcxt, EState *estate,
+							char *plannedstmt_str, Size *plannedstmt_len,
+							Size *pscan_size)
+{
+	/* Estimate space for partial seq. scan specific contents. */
+	*plannedstmt_len = strlen(plannedstmt_str) + 1;
+	shm_toc_estimate_chunk(&pcxt->estimator, *plannedstmt_len);
+
+	*pscan_size = heap_parallelscan_estimate(estate->es_snapshot);
+	shm_toc_estimate_chunk(&pcxt->estimator, *pscan_size);
+
+	/* keys for parallel support information. */
+	shm_toc_estimate_keys(&pcxt->estimator, 2);
+}
+
+/*
+ * StorePartialSeqScan
+ *
+ * Sets up the planned statement and the parallel heap scan descriptor
+ * for a parallel sequential scan.
+ */
+void
+StorePartialSeqScan(ParallelContext *pcxt, EState *estate, Relation rel,
+					char *plannedstmt_str, Size plannedstmt_size,
+					Size pscan_size)
+{
+	char		*plannedstmtdata;
+	ParallelHeapScanDesc pscan;
+
+	/* Store the serialized planned statement in dynamic shared memory. */
+	plannedstmtdata = shm_toc_allocate(pcxt->toc, plannedstmt_size);
+	memcpy(plannedstmtdata, plannedstmt_str, plannedstmt_size);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_PLANNEDSTMT, plannedstmtdata);
+
+	/* Store parallel heap scan descriptor in dynamic shared memory. */
+	pscan = shm_toc_allocate(pcxt->toc, pscan_size);
+	heap_parallelscan_initialize(pscan, rel, estate->es_snapshot);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_SCAN, pscan);
+}
+
+/*
+ * EstimateResponseQueueSpace
+ *
+ * Estimate the amount of space required to record information of
+ * tuple queues that need to be established between parallel workers
+ * and master backend.
+ */
+void
+EstimateResponseQueueSpace(ParallelContext *pcxt)
+{
+	/* Estimate space for parallel seq. scan specific contents. */
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   (Size) PARALLEL_TUPLE_QUEUE_SIZE * pcxt->nworkers);
+
+	/* keys for response queue. */
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/*
+ * StoreResponseQueue
+ *
+ * Sets up the response queues that backend workers use to return
+ * tuples to the main backend.
+ */
+void
+StoreResponseQueue(ParallelContext *pcxt,
+				   shm_mq_handle ***responseqp)
+{
+	shm_mq		*mq;
+	char		*tuple_queue_space;
+	int			i;
+
+	/* Allocate memory for shared memory queue handles. */
+	*responseqp = (shm_mq_handle**) palloc(pcxt->nworkers * sizeof(shm_mq_handle*));
+
+	/*
+	 * Establish one message queue per worker in dynamic shared memory.
+	 * These queues should be used to transmit tuple data.
+	 */
+	tuple_queue_space =
+	   shm_toc_allocate(pcxt->toc, PARALLEL_TUPLE_QUEUE_SIZE * pcxt->nworkers);
+	for (i = 0; i < pcxt->nworkers; ++i)
+	{
+		mq = shm_mq_create(tuple_queue_space + i * PARALLEL_TUPLE_QUEUE_SIZE,
+						   (Size) PARALLEL_TUPLE_QUEUE_SIZE);
+		
+		shm_mq_set_receiver(mq, MyProc);
+
+		/*
+		 * Attach the queue before launching a worker, so that we'll automatically
+		 * detach the queue if we error out.  (Otherwise, the worker might sit
+		 * there trying to write the queue long after we've gone away.)
+		 */
+		(*responseqp)[i] = shm_mq_attach(mq, pcxt->seg, NULL);
+	}
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_TUPLE_QUEUE, tuple_queue_space);
+}
+
+/*
+ * InitializeParallelWorkers
+ *
+ *	Sets up the required infrastructure for backend workers to
+ *	perform execution and return results to the main backend.
+ */
+void
+InitializeParallelWorkers(Plan *plan, List *serialized_param_exec_vals,
+						  EState *estate, Relation rel,
+						  char **inst_options_space,
+						  char **buffer_usage_space,
+						  shm_mq_handle ***responseqp,
+						  ParallelContext **pcxtp,
+						  int nWorkers)
+{
+	Size		params_size, params_exec_size, pscan_size, plannedstmt_size;
+	char		*plannedstmt_str;
+	PlannedStmt	*plannedstmt;
+	ParallelContext *pcxt;
+
+	pcxt = CreateParallelContext(ParallelQueryMain, nWorkers);
+
+	plannedstmt = create_parallel_worker_plannedstmt((PartialSeqScan *)plan,
+													 estate->es_range_table,
+													 estate->es_plannedstmt->nParamExec);
+	plannedstmt_str = nodeToString(plannedstmt);
+
+	EstimatePartialSeqScanSpace(pcxt, estate, plannedstmt_str,
+								&plannedstmt_size, &pscan_size);
+	EstimateParallelSupportInfoSpace(pcxt, estate->es_param_list_info,
+									 serialized_param_exec_vals,
+									 estate->es_instrument, &params_size,
+									 &params_exec_size);
+	EstimateResponseQueueSpace(pcxt);
+
+	InitializeParallelDSM(pcxt);
+	
+	StorePartialSeqScan(pcxt, estate, rel, plannedstmt_str,
+						plannedstmt_size, pscan_size);
+	StoreParallelSupportInfo(pcxt, estate->es_param_list_info,
+							 serialized_param_exec_vals,
+							 estate->es_instrument,
+							 params_size,
+							 params_exec_size,
+							 inst_options_space,
+							 buffer_usage_space);
+	StoreResponseQueue(pcxt, responseqp);
+
+	/* Return results to caller. */
+	*pcxtp = pcxt;
+}
+
+/*
+ * GetParallelSupportInfo
+ *
+ * Look up based on keys in dynamic shared memory segment
+ * and get the bind parameters, PARAM_EXEC parameters and
+ * instrumentation information required to perform parallel
+ * operation.
+ */
+void
+GetParallelSupportInfo(shm_toc *toc, ParamListInfo *params,
+					   List **serialized_param_exec_vals,
+					   int *inst_options, char **instrument,
+					   char **buffer_usage)
+{
+	char		*paramsdata;
+	char		*paramsexecdata;
+	char		*inst_options_space;
+	char		*buffer_usage_space;
+	int			*instoptions;
+
+	paramsdata = shm_toc_lookup(toc, PARALLEL_KEY_PARAMS);
+	paramsexecdata = shm_toc_lookup(toc, PARALLEL_KEY_PARAMS_EXEC);
+	instoptions	= shm_toc_lookup(toc, PARALLEL_KEY_INST_OPTIONS);
+
+	*params = RestoreBoundParams(paramsdata);
+
+	*serialized_param_exec_vals = RestoreExecParams(paramsexecdata);
+
+	*inst_options = *instoptions;
+	if (*inst_options)
+	{
+		inst_options_space = shm_toc_lookup(toc, PARALLEL_KEY_INST_INFO);
+		*instrument = (inst_options_space +
+			ParallelWorkerNumber * sizeof(Instrumentation));
+	}
+
+	buffer_usage_space = shm_toc_lookup(toc, PARALLEL_KEY_BUFF_USAGE);
+	*buffer_usage = (buffer_usage_space +
+					 ParallelWorkerNumber * sizeof(BufferUsage));
+}
+
+/*
+ * ExecParallelGetPlannedStmt
+ *
+ * Look up based on keys in dynamic shared memory segment
+ * and get the planned statement required to perform
+ * parallel operation.
+ */
+void
+ExecParallelGetPlannedStmt(shm_toc *toc, PlannedStmt **plannedstmt)
+{
+	char		*plannedstmtdata;
+
+	plannedstmtdata = shm_toc_lookup(toc, PARALLEL_KEY_PLANNEDSTMT);
+
+	*plannedstmt = (PlannedStmt *) stringToNode(plannedstmtdata);
+
+	/* Fill in opfuncid values if missing */
+	fix_node_funcids((*plannedstmt)->planTree);
+}
+
+/*
+ * SetupResponseQueue
+ *
+ * Look up based on keys in dynamic shared memory segment
+ * and get the tuple queue information for a particular worker,
+ * attach to the queue and redirect all further responses from
+ * worker backend via that queue.
+ */
+void
+SetupResponseQueue(dsm_segment *seg, shm_toc *toc, shm_mq **mq,
+				   shm_mq_handle **responseq)
+{
+	char		*tuple_queue_space;
+
+	tuple_queue_space = shm_toc_lookup(toc, PARALLEL_KEY_TUPLE_QUEUE);
+	*mq = (shm_mq *) (tuple_queue_space +
+		ParallelWorkerNumber * PARALLEL_TUPLE_QUEUE_SIZE);
+
+	shm_mq_set_sender(*mq, MyProc);
+	*responseq = shm_mq_attach(*mq, seg, NULL);
+}
+
+/*
+ * GetParallelShmToc
+ */
+shm_toc *
+GetParallelShmToc(void)
+{
+	return parallel_shm_toc;
+}
+
+/*
+ * ParallelQueryMain
+ *
+ * Execute the operation to return the tuples or other information
+ * to parallelism driving node.
+ */
+void
+ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
+{
+	shm_mq			*mq;
+	shm_mq_handle	*responseq;
+	PlannedStmt		*plannedstmt;
+	ParamListInfo	params;
+	List			*serialized_param_exec_vals;
+	int				inst_options;
+	char			*instrument = NULL;
+	char			*buffer_usage = NULL;
+	ParallelStmt	*parallelstmt;
+
+	SetupResponseQueue(seg, toc, &mq, &responseq);
+
+	ExecParallelGetPlannedStmt(toc, &plannedstmt);
+	GetParallelSupportInfo(toc, &params, &serialized_param_exec_vals,
+						   &inst_options, &instrument, &buffer_usage);
+
+	parallelstmt = palloc(sizeof(ParallelStmt));
+
+	parallelstmt->plannedstmt = plannedstmt;
+	parallelstmt->params	= params;
+	parallelstmt->serialized_param_exec_vals = serialized_param_exec_vals;
+	parallelstmt->inst_options = inst_options;
+	parallelstmt->instrument = instrument;
+	parallelstmt->buffer_usage = buffer_usage;
+	parallelstmt->responseq = responseq;
+
+	parallel_shm_toc = toc;
+
+	/* Execute the worker command. */
+	exec_parallel_stmt(parallelstmt);
+
+	/*
+	 * Once we are done with sending tuples, detach from
+	 * shared memory message queue used to send tuples.
+	 */
+	shm_mq_detach(mq);
+}
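A note on the queue layout used by StoreResponseQueue() and SetupResponseQueue() above: both sides rely on the same arithmetic, namely one chunk of nworkers * PARALLEL_TUPLE_QUEUE_SIZE bytes, with worker i owning the slice that starts at offset i * PARALLEL_TUPLE_QUEUE_SIZE. The following is only a standalone sketch of that arithmetic, not code from the patch. DEMO_TUPLE_QUEUE_SIZE and the demo_* functions are hypothetical names, and the 64kB size is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for PARALLEL_TUPLE_QUEUE_SIZE.  The real value is
 * defined elsewhere in the patch; 64kB is only an assumption here.
 */
#define DEMO_TUPLE_QUEUE_SIZE ((size_t) 65536)

/* Total bytes the master reserves: one fixed-size queue per worker. */
static size_t
demo_queue_space(int nworkers)
{
	assert(nworkers >= 0);
	return DEMO_TUPLE_QUEUE_SIZE * (size_t) nworkers;
}

/*
 * Start address of one worker's queue inside the single shared chunk.
 * The master computes this for every i; each worker computes it once
 * for its own worker number.
 */
static char *
demo_queue_for_worker(char *base, int worker, int nworkers)
{
	assert(worker >= 0 && worker < nworkers);
	return base + (size_t) worker * DEMO_TUPLE_QUEUE_SIZE;
}
```

Because master and worker derive the address independently, a single TOC key for the whole chunk is enough and no per-worker key is needed.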
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index a9f20ac..8a759a4 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -103,6 +103,7 @@
 #include "miscadmin.h"
 #include "pg_getopt.h"
 #include "pgstat.h"
+#include "optimizer/cost.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/bgworker_internals.h"
 #include "postmaster/fork_process.h"
@@ -835,6 +836,12 @@ PostmasterMain(int argc, char *argv[])
 		ereport(ERROR,
 				(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"archive\", \"hot_standby\", or \"logical\"")));
 
+	if (parallel_seqscan_degree >= MaxConnections)
+	{
+		write_stderr("%s: parallel_seqscan_degree must be less than max_connections\n", progname);
+		ExitPostmaster(1);
+	}
+
 	/*
 	 * Other one-time internal sanity checks can go here, if they are fast.
 	 * (Put any slow processing further down, after postmaster.pid creation.)
diff --git a/src/backend/storage/ipc/shm_mq.c b/src/backend/storage/ipc/shm_mq.c
index d42a8d1..f640bb2 100644
--- a/src/backend/storage/ipc/shm_mq.c
+++ b/src/backend/storage/ipc/shm_mq.c
@@ -746,6 +746,15 @@ shm_mq_detach(shm_mq *mq)
 }
 
 /*
+ * Get the shm_mq from handle.
+ */
+shm_mq *
+shm_mq_from_handle(shm_mq_handle *mqh)
+{
+	return mqh->mqh_queue;
+}
+
+/*
  * Write bytes into a shared message queue.
  */
 static shm_mq_result
diff --git a/src/backend/tcop/dest.c b/src/backend/tcop/dest.c
index bcf3895..ba70bce 100644
--- a/src/backend/tcop/dest.c
+++ b/src/backend/tcop/dest.c
@@ -34,6 +34,7 @@
 #include "commands/createas.h"
 #include "commands/matview.h"
 #include "executor/functions.h"
+#include "executor/tqueue.h"
 #include "executor/tstoreReceiver.h"
 #include "libpq/libpq.h"
 #include "libpq/pqformat.h"
@@ -44,9 +45,10 @@
  *		dummy DestReceiver functions
  * ----------------
  */
-static void
+static bool
 donothingReceive(TupleTableSlot *slot, DestReceiver *self)
 {
+	return true;
 }
 
 static void
@@ -129,6 +131,9 @@ CreateDestReceiver(CommandDest dest)
 
 		case DestTransientRel:
 			return CreateTransientRelDestReceiver(InvalidOid);
+
+		case DestTupleQueue:
+			return CreateTupleQueueDestReceiver();
 	}
 
 	/* should never get here */
@@ -162,6 +167,7 @@ EndCommand(const char *commandTag, CommandDest dest)
 		case DestCopyOut:
 		case DestSQLFunction:
 		case DestTransientRel:
+		case DestTupleQueue:
 			break;
 	}
 }
@@ -204,6 +210,7 @@ NullCommand(CommandDest dest)
 		case DestCopyOut:
 		case DestSQLFunction:
 		case DestTransientRel:
+		case DestTupleQueue:
 			break;
 	}
 }
@@ -248,6 +255,7 @@ ReadyForQuery(CommandDest dest)
 		case DestCopyOut:
 		case DestSQLFunction:
 		case DestTransientRel:
+		case DestTupleQueue:
 			break;
 	}
 }
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 7c18298..516c391 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -42,6 +42,7 @@
 #include "catalog/pg_type.h"
 #include "commands/async.h"
 #include "commands/prepare.h"
+#include "executor/tqueue.h"
 #include "libpq/libpq.h"
 #include "libpq/pqformat.h"
 #include "libpq/pqsignal.h"
@@ -55,6 +56,7 @@
 #include "pg_getopt.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/postmaster.h"
+#include "postmaster/backendworker.h"
 #include "replication/slot.h"
 #include "replication/walsender.h"
 #include "rewrite/rewriteHandler.h"
@@ -1192,6 +1194,98 @@ exec_simple_query(const char *query_string)
 }
 
 /*
+ * exec_parallel_stmt
+ *
+ * Execute the plan for backend worker.
+ */
+void
+exec_parallel_stmt(ParallelStmt *parallelstmt)
+{
+	DestReceiver *receiver;
+	QueryDesc	*queryDesc;
+	MemoryContext oldcontext;
+	MemoryContext	plancontext;
+	BufferUsage bufusage_start;
+	BufferUsage bufusage_end = {0};
+
+	set_ps_display("SELECT", false);
+
+	/*
+	 * Unlike exec_simple_query(), in backend worker we won't allow
+	 * transaction control statements, so we can allow plancontext
+	 * to be created in TopTransaction context.
+	 */
+	plancontext = AllocSetContextCreate(CurrentMemoryContext,
+										"worker plan",
+										ALLOCSET_DEFAULT_MINSIZE,
+										ALLOCSET_DEFAULT_INITSIZE,
+										ALLOCSET_DEFAULT_MAXSIZE);
+
+	oldcontext = MemoryContextSwitchTo(plancontext);
+
+	if (parallelstmt->inst_options)
+		receiver = None_Receiver;
+	else
+	{
+		receiver = CreateDestReceiver(DestTupleQueue);
+		SetTupleQueueDestReceiverParams(receiver, parallelstmt->responseq);
+	}
+
+	/* Create a QueryDesc for the query */
+	queryDesc = CreateQueryDesc(parallelstmt->plannedstmt, "",
+								GetActiveSnapshot(), InvalidSnapshot,
+								receiver, parallelstmt->params,
+								parallelstmt->inst_options);
+
+	PushActiveSnapshot(queryDesc->snapshot);
+
+	/* call ExecutorStart to prepare the plan for execution */
+	ExecutorStart(queryDesc, 0);
+
+	PopulateParamExecParams(queryDesc, parallelstmt->serialized_param_exec_vals);
+
+	/*
+	 * Calculate the buffer usage for this statement run, it is required
+	 * by plugins to report the total usage for statement execution.
+	 */
+	bufusage_start = pgBufferUsage;
+
+	/* run the plan */
+	ExecutorRun(queryDesc, ForwardScanDirection, 0L);
+
+	BufferUsageAccumDiff(&bufusage_end,
+						 &pgBufferUsage, &bufusage_start);
+
+	/* run cleanup too */
+	ExecutorFinish(queryDesc);
+
+	/* copy buffer usage into shared memory. */
+	memcpy(parallelstmt->buffer_usage,
+		   &bufusage_end,
+		   sizeof(BufferUsage));
+
+	/*
+	 * copy instrumentation information into shared memory if requested
+	 * by master backend.
+	 */
+	if (parallelstmt->inst_options)
+		memcpy(parallelstmt->instrument,
+			   queryDesc->planstate->instrument,
+			   sizeof(Instrumentation));
+
+	ExecutorEnd(queryDesc);
+
+	PopActiveSnapshot();
+
+	FreeQueryDesc(queryDesc);
+
+	if (!parallelstmt->inst_options)
+		(*receiver->rDestroy) (receiver);
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
+/*
  * exec_parse_message
  *
  * Execute a "Parse" protocol message.
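For readers following the buffer usage bookkeeping in exec_parallel_stmt(): the worker snapshots pgBufferUsage before ExecutorRun() and then accumulates (current - start) into a zeroed struct, so only the usage of this statement run is copied to shared memory. Below is a reduced sketch of that "dst += add - sub" step. DemoBufferUsage and demo_usage_accum_diff are hypothetical names, and the struct carries only two counters for illustration; the real BufferUsage has more fields.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced, hypothetical stand-in for BufferUsage (two counters only). */
typedef struct DemoBufferUsage
{
	long		shared_blks_hit;
	long		shared_blks_read;
} DemoBufferUsage;

/* dst += add - sub, field by field. */
static void
demo_usage_accum_diff(DemoBufferUsage *dst,
					  const DemoBufferUsage *add,
					  const DemoBufferUsage *sub)
{
	assert(dst != NULL && add != NULL && sub != NULL);
	dst->shared_blks_hit += add->shared_blks_hit - sub->shared_blks_hit;
	dst->shared_blks_read += add->shared_blks_read - sub->shared_blks_read;
}
```

Starting from a zeroed destination, as the worker does with bufusage_end, the result is exactly the delta between the two snapshots.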
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 9c14e8a..f2fb638 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1121,7 +1121,13 @@ RunFromStore(Portal portal, ScanDirection direction, long count,
 			if (!ok)
 				break;
 
-			(*dest->receiveSlot) (slot, dest);
+			/*
+			 * If we are not able to send the tuple, then we assume that
+			 * destination has closed and we won't be able to send any more
+			 * tuples so we just end the loop.
+			 */
+			if (!((*dest->receiveSlot) (slot, dest)))
+				break;
 
 			ExecClearTuple(slot);
 
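The change above relies on receiveSlot now returning a bool: true means "tuple accepted, keep going", false means "destination is closed, stop sending". A minimal standalone sketch of that contract follows. DemoReceiver, demo_receive and demo_send_all are hypothetical names, and a plain int stands in for a tuple slot; this is an illustration of the pattern, not the patch's DestReceiver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical receiver that accepts tuples until its capacity is used up. */
typedef struct DemoReceiver
{
	int			capacity;	/* how many tuples the destination will take */
	int			received;	/* how many it has taken so far */
} DemoReceiver;

/* Returns false once the destination can take no more tuples. */
static bool
demo_receive(int tuple, DemoReceiver *self)
{
	(void) tuple;			/* payload is irrelevant for this sketch */
	if (self->received >= self->capacity)
		return false;
	self->received++;
	return true;
}

/* Send loop: stop at the first refusal, return the number accepted. */
static int
demo_send_all(const int *tuples, size_t ntuples, DemoReceiver *dest)
{
	int			sent = 0;

	assert(dest != NULL);
	for (size_t i = 0; i < ntuples; i++)
	{
		if (!demo_receive(tuples[i], dest))
			break;
		sent++;
	}
	return sent;
}
```

Receivers that can never be closed early, like donothingReceive in dest.c, simply return true unconditionally.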
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 8727ee3..0a10ebe 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -602,6 +602,8 @@ const char *const config_group_names[] =
 	gettext_noop("Statistics / Query and Index Statistics Collector"),
 	/* AUTOVACUUM */
 	gettext_noop("Autovacuum"),
+	/* PARALLEL_QUERY */
+	gettext_noop("Parallel Query"),
 	/* CLIENT_CONN */
 	gettext_noop("Client Connection Defaults"),
 	/* CLIENT_CONN_STATEMENT */
@@ -2551,6 +2553,16 @@ static struct config_int ConfigureNamesInt[] =
 	},
 
 	{
+		{"parallel_seqscan_degree", PGC_SUSET, PARALLEL_QUERY,
+			gettext_noop("Sets the maximum number of simultaneously running backend worker processes."),
+			NULL
+		},
+		&parallel_seqscan_degree,
+		0, 0, MAX_BACKENDS,
+		NULL, NULL, NULL
+	},
+
+	{
 		{"autovacuum_work_mem", PGC_SIGHUP, RESOURCES_MEM,
 			gettext_noop("Sets the maximum memory to be used by each autovacuum worker process."),
 			NULL,
@@ -2738,6 +2750,36 @@ static struct config_real ConfigureNamesReal[] =
 		DEFAULT_CPU_OPERATOR_COST, 0, DBL_MAX,
 		NULL, NULL, NULL
 	},
+	{
+		{"cpu_tuple_comm_cost", PGC_USERSET, QUERY_TUNING_COST,
+			gettext_noop("Sets the planner's estimate of the cost of "
+						 "passing each tuple (row) from worker to master backend."),
+			NULL
+		},
+		&cpu_tuple_comm_cost,
+		DEFAULT_CPU_TUPLE_COMM_COST, 0, DBL_MAX,
+		NULL, NULL, NULL
+	},
+	{
+		{"parallel_setup_cost", PGC_USERSET, QUERY_TUNING_COST,
+			gettext_noop("Sets the planner's estimate of the cost of "
+						 "setting up environment (shared memory) for parallelism."),
+			NULL
+		},
+		&parallel_setup_cost,
+		DEFAULT_PARALLEL_SETUP_COST, 0, DBL_MAX,
+		NULL, NULL, NULL
+	},
+	{
+		{"parallel_startup_cost", PGC_USERSET, QUERY_TUNING_COST,
+			gettext_noop("Sets the planner's estimate of the cost of "
+						 "starting parallel workers."),
+			NULL
+		},
+		&parallel_startup_cost,
+		DEFAULT_PARALLEL_STARTUP_COST, 0, DBL_MAX,
+		NULL, NULL, NULL
+	},
 
 	{
 		{"cursor_tuple_fraction", PGC_USERSET, QUERY_TUNING_OTHER,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 110983f..06c5969 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -291,6 +291,9 @@
 #cpu_tuple_cost = 0.01			# same scale as above
 #cpu_index_tuple_cost = 0.005		# same scale as above
 #cpu_operator_cost = 0.0025		# same scale as above
+#cpu_tuple_comm_cost = 0.1		# same scale as above
+#parallel_setup_cost = 0.0	# same scale as above
+#parallel_startup_cost = 0.0	# same scale as above
 #effective_cache_size = 4GB
 
 # - Genetic Query Optimizer -
@@ -501,6 +504,11 @@
 					# autovacuum, -1 means use
 					# vacuum_cost_limit
 
+#------------------------------------------------------------------------------
+# PARALLEL_QUERY PARAMETERS
+#------------------------------------------------------------------------------
+
+#parallel_seqscan_degree = 0		# max number of worker backend subprocesses
 
 #------------------------------------------------------------------------------
 # CLIENT CONNECTION DEFAULTS
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 888cce7..0a34b48 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -95,8 +95,9 @@ extern Relation heap_openrv_extended(const RangeVar *relation,
 
 #define heap_close(r,l)  relation_close(r,l)
 
-/* struct definition appears in relscan.h */
+/* struct definitions appear in relscan.h */
 typedef struct HeapScanDescData *HeapScanDesc;
+typedef struct ParallelHeapScanDescData *ParallelHeapScanDesc;
 
 /*
  * HeapScanIsValid
@@ -116,9 +117,15 @@ extern HeapScanDesc heap_beginscan_bm(Relation relation, Snapshot snapshot,
 extern void heap_setscanlimits(HeapScanDesc scan, BlockNumber startBlk,
 		   BlockNumber endBlk);
 extern void heap_rescan(HeapScanDesc scan, ScanKey key);
+extern void heap_parallel_rescan(ParallelHeapScanDesc pscan, HeapScanDesc scan);
 extern void heap_endscan(HeapScanDesc scan);
 extern HeapTuple heap_getnext(HeapScanDesc scan, ScanDirection direction);
 
+extern Size heap_parallelscan_estimate(Snapshot snapshot);
+extern void heap_parallelscan_initialize(ParallelHeapScanDesc target,
+							 Relation relation, Snapshot snapshot);
+extern HeapScanDesc heap_beginscan_parallel(Relation, ParallelHeapScanDesc);
+
 extern bool heap_fetch(Relation relation, Snapshot snapshot,
 		   HeapTuple tuple, Buffer *userbuf, bool keep_buf,
 		   Relation stats_relation);
diff --git a/src/include/access/printtup.h b/src/include/access/printtup.h
index 46c4148..92ec882 100644
--- a/src/include/access/printtup.h
+++ b/src/include/access/printtup.h
@@ -25,11 +25,11 @@ extern void SendRowDescriptionMessage(TupleDesc typeinfo, List *targetlist,
 
 extern void debugStartup(DestReceiver *self, int operation,
 			 TupleDesc typeinfo);
-extern void debugtup(TupleTableSlot *slot, DestReceiver *self);
+extern bool debugtup(TupleTableSlot *slot, DestReceiver *self);
 
 /* XXX these are really in executor/spi.c */
 extern void spi_dest_startup(DestReceiver *self, int operation,
 				 TupleDesc typeinfo);
-extern void spi_printtup(TupleTableSlot *slot, DestReceiver *self);
+extern bool spi_printtup(TupleTableSlot *slot, DestReceiver *self);
 
 #endif   /* PRINTTUP_H */
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 9bb6362..f459020 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -20,6 +20,15 @@
 #include "access/itup.h"
 #include "access/tupdesc.h"
 
+/* Struct for parallel scan setup */
+typedef struct ParallelHeapScanDescData
+{
+	Oid			phs_relid;
+	BlockNumber	phs_nblocks;
+	slock_t		phs_mutex;
+	BlockNumber phs_cblock;
+	char		phs_snapshot_data[FLEXIBLE_ARRAY_MEMBER];
+}	ParallelHeapScanDescData;
 
 typedef struct HeapScanDescData
 {
@@ -48,6 +57,7 @@ typedef struct HeapScanDescData
 	BlockNumber rs_cblock;		/* current block # in scan, if any */
 	Buffer		rs_cbuf;		/* current buffer in scan, if any */
 	/* NB: if rs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
+	ParallelHeapScanDesc rs_parallel; /* parallel scan information */
 
 	/* these fields only used in page-at-a-time mode and for bitmap scans */
 	int			rs_cindex;		/* current tuple's index in vistuples */
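The heapam.c side that consumes ParallelHeapScanDescData is not part of this excerpt, so the following is only a guess at how phs_mutex, phs_cblock and phs_nblocks could cooperate: each participant takes the lock, claims the next unclaimed block number, and advances the shared counter. DemoParallelScan and the demo_* functions are hypothetical names, and a C11 atomic_flag spinlock stands in for slock_t.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_INVALID_BLOCK UINT32_MAX

/* Hypothetical analogue of the shared scan descriptor. */
typedef struct DemoParallelScan
{
	uint32_t	nblocks;	/* like phs_nblocks: blocks in the relation */
	atomic_flag mutex;		/* like phs_mutex: protects cblock */
	uint32_t	cblock;		/* like phs_cblock: next block to hand out */
} DemoParallelScan;

static void
demo_scan_init(DemoParallelScan *scan, uint32_t nblocks)
{
	scan->nblocks = nblocks;
	scan->cblock = 0;
	atomic_flag_clear(&scan->mutex);
}

/* Claim the next block, or DEMO_INVALID_BLOCK when the scan is exhausted. */
static uint32_t
demo_scan_next_block(DemoParallelScan *scan)
{
	uint32_t	block = DEMO_INVALID_BLOCK;

	assert(scan != NULL);
	while (atomic_flag_test_and_set_explicit(&scan->mutex,
											 memory_order_acquire))
		;					/* spin until the lock is free */
	if (scan->cblock < scan->nblocks)
		block = scan->cblock++;
	atomic_flag_clear_explicit(&scan->mutex, memory_order_release);
	return block;
}
```

With this kind of scheme no block range has to be assigned up front; a slow worker simply claims fewer blocks than a fast one.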
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index c1e7477..aef04a2 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -271,6 +271,8 @@ extern TupleDesc ExecCleanTypeFromTL(List *targetList, bool hasoid);
 extern TupleDesc ExecTypeFromExprList(List *exprList);
 extern void ExecTypeSetColNames(TupleDesc typeInfo, List *namesList);
 extern void UpdateChangedParamSet(PlanState *node, Bitmapset *newchg);
+extern void PopulateParamExecParams(QueryDesc *queryDesc,
+						List *serialized_param_exec_vals);
 
 typedef struct TupOutputState
 {
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index c9a2129..0c7847d 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -69,5 +69,12 @@ extern Instrumentation *InstrAlloc(int n, int instrument_options);
 extern void InstrStartNode(Instrumentation *instr);
 extern void InstrStopNode(Instrumentation *instr, double nTuples);
 extern void InstrEndLoop(Instrumentation *instr);
+extern void InstrAggNode(Instrumentation *instr1, Instrumentation *instr2);
+extern void
+	InstrAggBufferUsage(BufferUsage *buffer_usage_dst, BufferUsage *buffer_usage_add);
+extern void BufferUsageAccumDiff(BufferUsage *dst,
+					 const BufferUsage *add,
+					 const BufferUsage *sub);
+extern void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
 
 #endif   /* INSTRUMENT_H */
diff --git a/src/include/executor/nodeFunnel.h b/src/include/executor/nodeFunnel.h
new file mode 100644
index 0000000..3af3a0e
--- /dev/null
+++ b/src/include/executor/nodeFunnel.h
@@ -0,0 +1,24 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeFunnel.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeFunnel.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEFUNNEL_H
+#define NODEFUNNEL_H
+
+#include "nodes/execnodes.h"
+
+extern FunnelState *ExecInitFunnel(Funnel *node, EState *estate, int eflags);
+extern TupleTableSlot *ExecFunnel(FunnelState *node);
+extern void ExecEndFunnel(FunnelState *node);
+extern void ExecReScanFunnel(FunnelState *node);
+
+#endif   /* NODEFUNNEL_H */
diff --git a/src/include/executor/nodePartialSeqscan.h b/src/include/executor/nodePartialSeqscan.h
new file mode 100644
index 0000000..cb05be7
--- /dev/null
+++ b/src/include/executor/nodePartialSeqscan.h
@@ -0,0 +1,24 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodePartialSeqscan.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodePartialSeqscan.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEPARTIALSEQSCAN_H
+#define NODEPARTIALSEQSCAN_H
+
+#include "nodes/execnodes.h"
+
+extern PartialSeqScanState *ExecInitPartialSeqScan(PartialSeqScan *node, EState *estate, int eflags);
+extern TupleTableSlot *ExecPartialSeqScan(PartialSeqScanState *node);
+extern void ExecEndPartialSeqScan(PartialSeqScanState *node);
+extern void ExecReScanPartialSeqScan(PartialSeqScanState *node);
+
+#endif   /* NODEPARTIALSEQSCAN_H */
diff --git a/src/include/executor/nodeSubplan.h b/src/include/executor/nodeSubplan.h
index 3732ad4..21c745e 100644
--- a/src/include/executor/nodeSubplan.h
+++ b/src/include/executor/nodeSubplan.h
@@ -24,4 +24,7 @@ extern void ExecReScanSetParamPlan(SubPlanState *node, PlanState *parent);
 
 extern void ExecSetParamPlan(SubPlanState *node, ExprContext *econtext);
 
+extern List *
+ExecAndFormSerializeParamExec(ExprContext *econtext, Bitmapset *params);
+
 #endif   /* NODESUBPLAN_H */
diff --git a/src/include/executor/tqueue.h b/src/include/executor/tqueue.h
new file mode 100644
index 0000000..d2ddb6e
--- /dev/null
+++ b/src/include/executor/tqueue.h
@@ -0,0 +1,35 @@
+/*-------------------------------------------------------------------------
+ *
+ * tqueue.h
+ *	  Use shm_mq to send & receive tuples between parallel backends
+ *
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/tqueue.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef TQUEUE_H
+#define TQUEUE_H
+
+#include "storage/shm_mq.h"
+#include "tcop/dest.h"
+
+/* Use this to send tuples to a shm_mq. */
+extern DestReceiver *CreateTupleQueueDestReceiver(void);
+extern void SetTupleQueueDestReceiverParams(DestReceiver *self,
+						shm_mq_handle *handle);
+
+/* Use these to receive tuples from a shm_mq. */
+typedef struct TupleQueueFunnel TupleQueueFunnel;
+extern TupleQueueFunnel *CreateTupleQueueFunnel(void);
+extern void TupleQueueFunnelShutdown(TupleQueueFunnel *funnel);
+extern void DestroyTupleQueueFunnel(TupleQueueFunnel *funnel);
+extern void RegisterTupleQueueOnFunnel(TupleQueueFunnel *, shm_mq_handle *);
+extern HeapTuple TupleQueueFunnelNext(TupleQueueFunnel *, bool nowait,
+					 bool *done);
+
+#endif   /* TQUEUE_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index ac75f86..cd79588 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -16,7 +16,9 @@
 
 #include "access/genam.h"
 #include "access/heapam.h"
+#include "access/parallel.h"
 #include "executor/instrument.h"
+#include "executor/tqueue.h"
 #include "nodes/params.h"
 #include "nodes/plannodes.h"
 #include "utils/reltrigger.h"
@@ -389,6 +391,18 @@ typedef struct EState
 	List	   *es_auxmodifytables;		/* List of secondary ModifyTableStates */
 
 	/*
+	 * This is required for parallel plan execution to fetch the
+	 * information from dsm.
+	 */
+	shm_toc		*toc;
+
+	/*
+	 * This is required to collect buffer usage stats from parallel
+	 * workers when requested by plugins.
+	 */
+	bool		total_time;	/* total time spent in ExecutorRun */
+
+	/*
 	 * this ExprContext is for per-output-tuple operations, such as constraint
 	 * checks and index-value computations.  It will be reset for each output
 	 * tuple.  Note that it will be created only if needed.
@@ -1016,6 +1030,11 @@ typedef struct PlanState
 	 * State for management of parameter-change-driven rescanning
 	 */
 	Bitmapset  *chgParam;		/* set of IDs of changed Params */
+	/*
+	 * This is required for parallel plan execution to fetch the
+	 * information from dsm.
+	 */
+	shm_toc			*toc;
 
 	/*
 	 * Other run-time state needed by most if not all node types.
@@ -1216,6 +1235,45 @@ typedef struct ScanState
 typedef ScanState SeqScanState;
 
 /*
+ * PartialSeqScanState extends ScanState by storing additional information
+ * related to scan.
+ */
+typedef struct PartialSeqScanState
+{
+	ScanState		ss;				/* its first field is NodeTag */
+	bool			scan_initialized; /* used to determine if the scan is initialized */
+} PartialSeqScanState;
+
+/*
+ * FunnelState extends ScanState by storing additional information
+ * related to parallel workers.
+ *		pcxt				parallel context for managing generic state information
+ *							required for parallelism.
+ *		responseq			shared memory queues to receive data from workers.
+ *		funnel				maintains the runtime information about queues used to
+ *							receive data from parallel workers.
+ *		inst_options_space	to accumulate instrumentation information from all
+ *							parallel workers.
+ *		buffer_usage_space	to accumulate buffer usage information from all
+ *							parallel workers.
+ *		fs_workersReady		indicates that workers are launched.
+ *		all_workers_done	indicates that all the data from workers has been received.
+ *		local_scan_done		indicates that local scan is completed.
+ */
+typedef struct FunnelState
+{
+	ScanState		ss;				/* its first field is NodeTag */
+	ParallelContext *pcxt;
+	shm_mq_handle	**responseq;
+	TupleQueueFunnel *funnel;
+	char			*inst_options_space;
+	char			*buffer_usage_space;
+	bool			fs_workersReady;
+	bool			all_workers_done;
+	bool			local_scan_done;
+} FunnelState;
+
+/*
  * These structs store information about index quals that don't have simple
  * constant right-hand sides.  See comments for ExecIndexBuildScanKeys()
  * for discussion.
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 38469ef..3f3d572 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -51,6 +51,8 @@ typedef enum NodeTag
 	T_BitmapOr,
 	T_Scan,
 	T_SeqScan,
+	T_PartialSeqScan,
+	T_Funnel,
 	T_IndexScan,
 	T_IndexOnlyScan,
 	T_BitmapIndexScan,
@@ -97,6 +99,8 @@ typedef enum NodeTag
 	T_BitmapOrState,
 	T_ScanState,
 	T_SeqScanState,
+	T_PartialSeqScanState,
+	T_FunnelState,
 	T_IndexScanState,
 	T_IndexOnlyScanState,
 	T_BitmapIndexScanState,
@@ -217,6 +221,7 @@ typedef enum NodeTag
 	T_IndexOptInfo,
 	T_ParamPathInfo,
 	T_Path,
+	T_FunnelPath,
 	T_IndexPath,
 	T_BitmapHeapPath,
 	T_BitmapAndPath,
diff --git a/src/include/nodes/params.h b/src/include/nodes/params.h
index a0f7dd0..21c6f7a 100644
--- a/src/include/nodes/params.h
+++ b/src/include/nodes/params.h
@@ -14,6 +14,8 @@
 #ifndef PARAMS_H
 #define PARAMS_H
 
+#include "nodes/pg_list.h"
+
 /* To avoid including a pile of parser headers, reference ParseState thus: */
 struct ParseState;
 
@@ -96,11 +98,47 @@ typedef struct ParamExecData
 {
 	void	   *execPlan;		/* should be "SubPlanState *" */
 	Datum		value;
+	/*
+	 * parameter's datatype, or 0.  This is required so that
+	 * datum value can be read and used for other purposes like
+	 * passing it to worker backend via shared memory.  This is
+	 * required only for evaluation of initPlan's, however for
+	 * consistency we set this for Subplan as well.  We left it
+	 * for other cases like CTE or RecursiveUnion cases where this
+	 * structure is not used for evaluation of subplans.
+	 */
+	Oid			ptype;
 	bool		isnull;
 } ParamExecData;
 
+/*
+ * This structure is used to pass PARAM_EXEC parameters to backend
+ * workers.  For each PARAM_EXEC parameter, pass this structure
+ * followed by value except for pass-by-value parameters.
+ */
+typedef struct SerializedParamExecData
+{
+	int			paramid;			/* parameter id of this param */
+	Size		length;			/* length of parameter value */
+	Oid			ptype;			/* parameter's datatype, or 0 */
+	Datum		value;
+	bool		isnull;
+} SerializedParamExecData;
+
 
 /* Functions found in src/backend/nodes/params.c */
 extern ParamListInfo copyParamList(ParamListInfo from);
 
+extern Size
+EstimateBoundParametersSpace(ParamListInfo params);
+extern void
+SerializeBoundParams(ParamListInfo params, Size maxsize, char *start_address);
+extern ParamListInfo RestoreBoundParams(char *start_address);
+extern Size
+EstimateExecParametersSpace(List *serialized_param_exec_vals);
+extern void
+SerializeExecParams(List *serialized_param_exec_vals, Size maxsize,
+					char *start_address);
+extern List *
+RestoreExecParams(char *start_address);
 #endif   /* PARAMS_H */
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 812b7cf..61b943b 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -20,10 +20,14 @@
 #ifndef PARSENODES_H
 #define PARSENODES_H
 
+#include "executor/instrument.h"
 #include "nodes/bitmapset.h"
 #include "nodes/lockoptions.h"
+#include "nodes/params.h"
+#include "nodes/plannodes.h"
 #include "nodes/primnodes.h"
 #include "nodes/value.h"
+#include "storage/shm_mq.h"
 
 /* Possible sources of a Query */
 typedef enum QuerySource
@@ -156,6 +160,17 @@ typedef struct Query
 								 * depends on to be semantically valid */
 } Query;
 
+/* worker statement required for parallel execution. */
+typedef struct ParallelStmt
+{
+	PlannedStmt		*plannedstmt;
+	ParamListInfo	params;
+	List			*serialized_param_exec_vals;
+	shm_mq_handle	*responseq;
+	int				inst_options;
+	char			*instrument;
+	char			*buffer_usage;
+} ParallelStmt;
 
 /****************************************************************************
  *	Supporting data structures for Parse Trees
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 5f0ea1c..7cdf632 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -281,6 +281,22 @@ typedef struct Scan
 typedef Scan SeqScan;
 
 /* ----------------
+ *		partial sequential scan node
+ * ----------------
+ */
+typedef SeqScan PartialSeqScan;
+
+/* ----------------
+ *		parallel sequential scan node
+ * ----------------
+ */
+typedef struct Funnel
+{
+	Scan		scan;
+	int			num_workers;
+} Funnel;
+
+/* ----------------
  *		index scan node
  *
  * indexqualorig is an implicitly-ANDed list of index qual expressions, each
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 32a5571..9689972 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -742,6 +742,13 @@ typedef struct Path
 	/* pathkeys is a List of PathKey nodes; see above */
 } Path;
 
+typedef struct FunnelPath
+{
+	Path		path;
+	Path	    *subpath;	/* path for each worker */
+	int			num_workers;
+} FunnelPath;
+
 /* Macro for extracting a path's parameterization relids; beware double eval */
 #define PATH_REQ_OUTER(path)  \
 	((path)->param_info ? (path)->param_info->ppi_req_outer : (Relids) NULL)
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 9c2000b..11f0409 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -26,6 +26,14 @@
 #define DEFAULT_CPU_TUPLE_COST	0.01
 #define DEFAULT_CPU_INDEX_TUPLE_COST 0.005
 #define DEFAULT_CPU_OPERATOR_COST  0.0025
+#define DEFAULT_CPU_TUPLE_COMM_COST 0.1
+/*
+ * XXX - We need some experiments to know what could be
+ * appropriate default values for parallel setup and startup
+ * cost.
+ */
+#define	DEFAULT_PARALLEL_SETUP_COST  0.0
+#define	DEFAULT_PARALLEL_STARTUP_COST  0.0
 
 #define DEFAULT_EFFECTIVE_CACHE_SIZE  524288	/* measured in pages */
 
@@ -48,8 +56,12 @@ extern PGDLLIMPORT double random_page_cost;
 extern PGDLLIMPORT double cpu_tuple_cost;
 extern PGDLLIMPORT double cpu_index_tuple_cost;
 extern PGDLLIMPORT double cpu_operator_cost;
+extern PGDLLIMPORT double cpu_tuple_comm_cost;
+extern PGDLLIMPORT double parallel_setup_cost;
+extern PGDLLIMPORT double parallel_startup_cost;
 extern PGDLLIMPORT int effective_cache_size;
 extern Cost disable_cost;
+extern int	parallel_seqscan_degree;
 extern bool enable_seqscan;
 extern bool enable_indexscan;
 extern bool enable_indexonlyscan;
@@ -68,6 +80,8 @@ extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,
 					double index_pages, PlannerInfo *root);
 extern void cost_seqscan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 			 ParamPathInfo *param_info);
+extern void cost_funnel(FunnelPath *path, PlannerInfo *root,
+				RelOptInfo *baserel, ParamPathInfo *param_info, int nWorkers);
 extern void cost_index(IndexPath *path, PlannerInfo *root,
 		   double loop_count);
 extern void cost_bitmap_heap_scan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 9923f0e..7873565 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -32,6 +32,11 @@ extern bool add_path_precheck(RelOptInfo *parent_rel,
 
 extern Path *create_seqscan_path(PlannerInfo *root, RelOptInfo *rel,
 					Relids required_outer);
+extern Path *
+create_partialseqscan_path(PlannerInfo *root, RelOptInfo *rel,
+					Relids required_outer);
+extern FunnelPath *create_funnel_path(PlannerInfo *root,
+						RelOptInfo *rel, Path *subpath, int nWorkers);
 extern IndexPath *create_index_path(PlannerInfo *root,
 				  IndexOptInfo *index,
 				  List *indexclauses,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 6cad92e..391d519 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -46,6 +46,13 @@ extern void debug_print_rel(PlannerInfo *root, RelOptInfo *rel);
 #endif
 
 /*
+ * parallelpath.c
+ *	  routines to generate parallel scan paths
+ */
+
+extern void create_parallelscan_paths(PlannerInfo *root, RelOptInfo *rel);
+
+/*
  * indxpath.c
  *	  routines to generate index paths
  */
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index fa72918..c38b1e0 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -131,6 +131,7 @@ extern bool query_is_distinct_for(Query *query, List *colnos, List *opids);
  */
 extern Plan *set_plan_references(PlannerInfo *root, Plan *plan);
 extern void fix_opfuncids(Node *node);
+extern void fix_node_funcids(Plan *node);
 extern void set_opfuncid(OpExpr *opexpr);
 extern void set_sa_opfuncid(ScalarArrayOpExpr *opexpr);
 extern void record_plan_function_dependency(PlannerInfo *root, Oid funcid);
diff --git a/src/include/optimizer/planner.h b/src/include/optimizer/planner.h
index b10a504..8c7ce75 100644
--- a/src/include/optimizer/planner.h
+++ b/src/include/optimizer/planner.h
@@ -14,6 +14,7 @@
 #ifndef PLANNER_H
 #define PLANNER_H
 
+#include "nodes/parsenodes.h"
 #include "nodes/plannodes.h"
 #include "nodes/relation.h"
 
@@ -29,6 +30,8 @@ extern PlannedStmt *planner(Query *parse, int cursorOptions,
 		ParamListInfo boundParams);
 extern PlannedStmt *standard_planner(Query *parse, int cursorOptions,
 				 ParamListInfo boundParams);
+extern PlannedStmt	*create_parallel_worker_plannedstmt(PartialSeqScan *partialscan,
+											List *rangetable, int num_exec_params);
 
 extern Plan *subquery_planner(PlannerGlobal *glob, Query *parse,
 				 PlannerInfo *parent_root,
diff --git a/src/include/postmaster/backendworker.h b/src/include/postmaster/backendworker.h
new file mode 100644
index 0000000..5ddd3c8
--- /dev/null
+++ b/src/include/postmaster/backendworker.h
@@ -0,0 +1,44 @@
+/*--------------------------------------------------------------------
+ * backendworker.h
+ *		POSTGRES backend workers interface
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/postmaster/backendworker.h
+ *--------------------------------------------------------------------
+ */
+#ifndef BACKENDWORKER_H
+#define BACKENDWORKER_H
+
+/*---------------------------------------------------------------------
+ * External module API.
+ *---------------------------------------------------------------------
+ */
+
+#include "libpq/pqmq.h"
+
+/* Table-of-contents constants for our dynamic shared memory segment. */
+#define	PARALLEL_KEY_PLANNEDSTMT	0
+#define	PARALLEL_KEY_PARAMS			1
+#define	PARALLEL_KEY_PARAMS_EXEC	2
+#define PARALLEL_KEY_BUFF_USAGE		3
+#define PARALLEL_KEY_INST_OPTIONS	4
+#define PARALLEL_KEY_INST_INFO		5
+#define PARALLEL_KEY_TUPLE_QUEUE	6
+#define PARALLEL_KEY_SCAN			7
+
+extern int	parallel_seqscan_degree;
+
+extern void InitializeParallelWorkers(Plan *plan,
+									  List *serialized_param_exec_vals,
+									  EState *estate, Relation rel,
+									  char **inst_options_space,
+									  char **buffer_usage_space,
+									  shm_mq_handle ***responseqp,
+									  ParallelContext **pcxtp,
+									  int nWorkers);
+extern shm_toc *GetParallelShmToc(void);
+
+#endif   /* BACKENDWORKER_H */
diff --git a/src/include/storage/shm_mq.h b/src/include/storage/shm_mq.h
index 085a8a7..f94ebb8 100644
--- a/src/include/storage/shm_mq.h
+++ b/src/include/storage/shm_mq.h
@@ -65,6 +65,9 @@ extern void shm_mq_set_handle(shm_mq_handle *, BackgroundWorkerHandle *);
 /* Break connection. */
 extern void shm_mq_detach(shm_mq *);
 
+/* Get the shm_mq from handle. */
+extern shm_mq *shm_mq_from_handle(shm_mq_handle *mqh);
+
 /* Send or receive messages. */
 extern shm_mq_result shm_mq_send(shm_mq_handle *mqh,
 			Size nbytes, const void *data, bool nowait);
diff --git a/src/include/tcop/dest.h b/src/include/tcop/dest.h
index 5bcca3f..ff99d2c 100644
--- a/src/include/tcop/dest.h
+++ b/src/include/tcop/dest.h
@@ -94,7 +94,8 @@ typedef enum
 	DestIntoRel,				/* results sent to relation (SELECT INTO) */
 	DestCopyOut,				/* results sent to COPY TO code */
 	DestSQLFunction,			/* results sent to SQL-language func mgr */
-	DestTransientRel			/* results sent to transient relation */
+	DestTransientRel,			/* results sent to transient relation */
+	DestTupleQueue				/* results sent to tuple queue */
 } CommandDest;
 
 /* ----------------
@@ -103,7 +104,9 @@ typedef enum
  *		pointers that the executor must call.
  *
  * Note: the receiveSlot routine must be passed a slot containing a TupleDesc
- * identical to the one given to the rStartup routine.
+ * identical to the one given to the rStartup routine.  It returns bool where
+ * a "true" value means "continue processing" and a "false" value means
+ * "stop early, just as if we'd reached the end of the scan".
  * ----------------
  */
 typedef struct _DestReceiver DestReceiver;
@@ -111,7 +114,7 @@ typedef struct _DestReceiver DestReceiver;
 struct _DestReceiver
 {
 	/* Called for each tuple to be output: */
-	void		(*receiveSlot) (TupleTableSlot *slot,
+	bool		(*receiveSlot) (TupleTableSlot *slot,
 											DestReceiver *self);
 	/* Per-executor-run initialization and shutdown: */
 	void		(*rStartup) (DestReceiver *self,
diff --git a/src/include/tcop/tcopprot.h b/src/include/tcop/tcopprot.h
index 96c5b8b..33211d6 100644
--- a/src/include/tcop/tcopprot.h
+++ b/src/include/tcop/tcopprot.h
@@ -84,5 +84,6 @@ extern void set_debug_options(int debug_flag,
 extern bool set_plan_disabling_options(const char *arg,
 						   GucContext context, GucSource source);
 extern const char *get_stats_option_name(const char *arg);
+extern void exec_parallel_stmt(ParallelStmt *parallelscan);
 
 #endif   /* TCOPPROT_H */
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index cf319af..38855e5 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -85,6 +85,7 @@ enum config_group
 	STATS_MONITORING,
 	STATS_COLLECTOR,
 	AUTOVACUUM,
+	PARALLEL_QUERY,
 	CLIENT_CONN,
 	CLIENT_CONN_STATEMENT,
 	CLIENT_CONN_LOCALE,
#268Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#267)
Re: Parallel Seq Scan

On Wed, Apr 22, 2015 at 8:48 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I have implemented this idea (note that I have to expose a new API
shm_mq_from_handle as TupleQueueFunnel stores shm_mq_handle* and
we need shm_mq* to call shm_mq_detach) and apart from this I have fixed
other problems reported on this thread:

1. Execution of initPlan by master backend and then pass the
required PARAM_EXEC parameter values to workers.
2. Avoid consuming dsm's by freeing the parallel context after
the last tuple is fetched.
3. Allow execution of Result node in worker backend as that can
be added as a gating filter on top of PartialSeqScan.
4. Merged parallel heap scan descriptor patch

To apply the patch, please follow below sequence:

HEAD Commit-Id: 4d930eee
parallel-mode-v9.patch [1]
assess-parallel-safety-v4.patch [2] (don't forget to run fixpgproc.pl in
the patch)
parallel_seqscan_v14.patch (Attached with this mail)

Thanks, this version looks like an improvement. However, I still see
some problems:

- I believe the separation of concerns between ExecFunnel() and
ExecEndFunnel() is not quite right. If the scan is shut down before
it runs to completion (e.g. because of LIMIT), then I think we'll call
ExecEndFunnel() before ExecFunnel() hits the TupIsNull(slot) path. I
think you probably need to create a static subroutine that is called
both as soon as TupIsNull(slot) and also from ExecEndFunnel(), in each
case cleaning up whatever resources remain.
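
A minimal sketch of the shared-cleanup idea, using a toy state struct rather than the patch's real funnel state. All names here (ToyFunnelState, toy_funnel_shutdown, and so on) are hypothetical; the point is only that the cleanup routine is idempotent, so it can run from the end-of-scan path and again from the end-node path without harm.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the funnel's executor state (hypothetical, not the patch's struct). */
typedef struct ToyFunnelState
{
	bool		workers_active;	/* are parallel resources still allocated? */
	int			tuples_left;	/* tuples the toy scan can still return */
	int			cleanup_calls;	/* how many times real cleanup work ran */
} ToyFunnelState;

/* Release parallel resources; safe to call more than once. */
static void
toy_funnel_shutdown(ToyFunnelState *node)
{
	assert(node != NULL);
	if (!node->workers_active)
		return;					/* already cleaned up, nothing to do */
	node->workers_active = false;
	node->cleanup_calls++;
}

/* Returns true if a tuple was produced, false at end of scan. */
static bool
toy_exec_funnel(ToyFunnelState *node)
{
	if (node->tuples_left > 0)
	{
		node->tuples_left--;
		return true;
	}
	toy_funnel_shutdown(node);	/* end of scan: clean up right away */
	return false;
}

/* Called at executor shutdown, possibly before the scan has finished. */
static void
toy_exec_end_funnel(ToyFunnelState *node)
{
	toy_funnel_shutdown(node);
}
```

With a LIMIT above the node, toy_exec_end_funnel() is the first caller to reach the cleanup; when the scan runs to completion, toy_exec_funnel() gets there first and the later call becomes a no-op.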

- InitializeParallelWorkers() still mixes together general parallel
executor concerns with concerns specific to parallel sequential scan
(e.g. EstimatePartialSeqScanSpace). We have to eliminate everything
that assumes that what's under a funnel will be, specifically, a
partial sequential scan. To make this work properly, I think we should
introduce a new function that recurses over the plan tree and invokes
some callback for each plan node. I think this could be modeled on
this code from ExplainNode(), beginning around line 1593:

    /* initPlan-s */
    if (planstate->initPlan)
        ExplainSubPlans(planstate->initPlan, ancestors, "InitPlan", es);

    /* lefttree */
    if (outerPlanState(planstate))
        ExplainNode(outerPlanState(planstate), ancestors,
                    "Outer", NULL, es);

    /* righttree */
    if (innerPlanState(planstate))
        ExplainNode(innerPlanState(planstate), ancestors,
                    "Inner", NULL, es);

    /* special child plans */
    switch (nodeTag(plan))
    {
        /* a bunch of special cases */
    }

    /* subPlan-s */
    if (planstate->subPlan)
        ExplainSubPlans(planstate->subPlan, ancestors, "SubPlan", es);

The new function would do the same sort of thing, but instead of
explaining each node, it would invoke a callback for each node.
Possibly explain.c could use it instead of having hard-coded logic.
Possibly it should use the same sort of return-true convention as
expression_tree_walker, query_tree_walker, and friends. So let's call
it planstate_tree_walker.
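
To make the proposal concrete, here is a rough sketch of such a walker over a toy node type. It is not the real PlanState, it ignores initPlans, subPlans and the special child lists, and the names (ToyPlanState, toy_planstate_tree_walker, the two callbacks) are invented for illustration. It only shows the return-true-to-stop convention; the real walkers leave recursion to the callback, which is simplified away here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy plan-state node with only a left and a right child (hypothetical). */
typedef struct ToyPlanState
{
	int			tag;			/* stands in for nodeTag() */
	struct ToyPlanState *lefttree;
	struct ToyPlanState *righttree;
} ToyPlanState;

typedef bool (*toy_walker_fn) (ToyPlanState *node, void *context);

/*
 * Call walker on node, then on its descendants.  A true return from the
 * callback aborts the walk and is propagated to the caller.
 */
static bool
toy_planstate_tree_walker(ToyPlanState *node, toy_walker_fn walker,
						  void *context)
{
	assert(walker != NULL);
	if (node == NULL)
		return false;
	if (walker(node, context))
		return true;
	if (toy_planstate_tree_walker(node->lefttree, walker, context))
		return true;
	return toy_planstate_tree_walker(node->righttree, walker, context);
}

/* Example callback: count every node visited; never stops the walk. */
static bool
toy_count_nodes(ToyPlanState *node, void *context)
{
	(void) node;
	(*(int *) context)++;
	return false;
}

/* Example callback: stop at the first node whose tag matches *context. */
static bool
toy_find_tag(ToyPlanState *node, void *context)
{
	return node->tag == *(int *) context;
}
```

Because the traversal lives in one place, the same walker can serve callbacks as different as counting nodes and searching for a particular node type.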

Now, instead of directly invoking logic specific to parallel
sequential scan, it should call planstate_tree_walker() on its
lefttree and pass a new function ExecParallelEstimate() as the
callback. That function ignores any node that's not parallel aware,
but when it sees a partial sequential scan (or, in the future, a
parallel bitmap scan, parallel sort, or what have you) it does the
appropriate estimation work. When ExecParallelEstimate() finishes, we
InitializeParallelDSM(). Then, we call planstate_tree_walker() on the
lefttree again, and this time we pass another new function
ExecParallelInitializeDSM(). Like the previous one, that ignores the
callbacks from non-parallel nodes, but if it hits a parallel node,
then it fills in the parallel bits (i.e. ParallelHeapScanDesc for a
partial sequential scan).
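
The two-pass flow can be sketched as follows, again on a toy tree. Everything here is an assumption made for illustration: ToyNode, the parallel_aware flag, and the byte offsets that stand in for chunks of the dynamic shared memory segment. The first pass only adds up sizes, the segment would be created between the passes, and the second pass hands each parallel-aware node its slice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy plan node (hypothetical); only parallel-aware nodes need shared state. */
typedef struct ToyNode
{
	bool		parallel_aware;	/* e.g. a partial sequential scan */
	size_t		shared_size;	/* bytes of shared state this node needs */
	size_t		shared_offset;	/* filled in by the second pass */
	struct ToyNode *lefttree;
	struct ToyNode *righttree;
} ToyNode;

/* Pass 1: add up the space needed by parallel-aware nodes. */
static void
toy_parallel_estimate(ToyNode *node, size_t *total)
{
	if (node == NULL)
		return;
	if (node->parallel_aware)
		*total += node->shared_size;
	toy_parallel_estimate(node->lefttree, total);
	toy_parallel_estimate(node->righttree, total);
}

/* Pass 2: give each parallel-aware node its slice of the segment. */
static void
toy_parallel_initialize(ToyNode *node, size_t *next_offset)
{
	if (node == NULL)
		return;
	if (node->parallel_aware)
	{
		node->shared_offset = *next_offset;
		*next_offset += node->shared_size;
	}
	toy_parallel_initialize(node->lefttree, next_offset);
	toy_parallel_initialize(node->righttree, next_offset);
}

/* Run both passes; returns the segment size computed by pass 1. */
static size_t
toy_setup_parallel(ToyNode *root)
{
	size_t		total = 0;
	size_t		used = 0;

	toy_parallel_estimate(root, &total);
	/* the shared memory segment of 'total' bytes would be created here */
	toy_parallel_initialize(root, &used);
	assert(used == total);		/* both passes must agree on the layout */
	return total;
}
```

Nodes that are not parallel aware fall through both passes untouched, so nothing in the setup code needs to know which node types sit under the funnel.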

- shm_mq_from_handle() is probably reasonable, but can we rename it
shm_mq_get_queue()?

- It's hard to believe this is right:

+       if (parallelstmt->inst_options)
+               receiver = None_Receiver;

Really? Flush the tuples if there are *any instrumentation options
whatsoever*? At the very least, that doesn't look too future-proof,
but I'm suspicious that it's outright incorrect.

- I think ParallelStmt probably shouldn't be defined in parsenodes.h.
That file is included in a lot of places, and adding all of those
extra #includes there doesn't seem like a good idea for modularity
reasons even if you don't care about partial rebuilds. Something that
includes a shm_mq obviously isn't a "parse" node in any meaningful
sense anyway.

- I don't think you need both setup cost and startup cost. Starting
up more workers isn't particularly more expensive than starting up
fewer of them, because most of the overhead is in waiting for them to
actually start, and if the number of workers is reasonable, then they'll
all be doing that in parallel with each other. I suggest removing
parallel_startup_cost and keeping parallel_setup_cost.

- In cost_funnel(), I don't think it's right to divide the run cost by
nWorkers + 1. Suppose we've got a plan that looks like this:

Funnel
  -> Hash Join
       -> Partial Seq Scan on a
       -> Hash
            -> Seq Scan on b

The sequential scan on b is going to get executed once per worker,
whereas the effort for the sequential scan on a is going to be divided
over all the workers. So the right way to cost this is as follows:

(a) The cost of the partial sequential scan on a is equal to the cost
of a regular sequential scan, plus a little bit of overhead to account
for communication via the ParallelHeapScanDesc, divided by the number
of workers + 1.
(b) The cost of the remaining nodes under the funnel works normally.
(c) The cost of the funnel is equal to the cost of the hash join plus
number of tuples multiplied by per-tuple communication overhead plus a
large fixed overhead reflecting the time it takes the workers to
start.
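
As a worked illustration of points (a) and (c), the arithmetic might look like the sketch below. The function names and the sample numbers are made up; this is not cost_funnel() or any function from the patch, only the formulas as stated above.

```c
#include <assert.h>

/*
 * (a) Partial seq scan: regular seq scan cost plus a small coordination
 * overhead, divided by the number of workers + 1 (the master scans too).
 */
static double
toy_partial_seqscan_cost(double seqscan_cost, double coord_overhead,
						 int nworkers)
{
	assert(nworkers >= 0);
	return (seqscan_cost + coord_overhead) / (nworkers + 1);
}

/*
 * (c) Funnel: cost of the plan below it, plus per-tuple communication
 * overhead, plus a large fixed overhead for starting the workers.
 */
static double
toy_funnel_cost(double subplan_cost, double ntuples,
				double per_tuple_comm_cost, double setup_cost)
{
	return subplan_cost + ntuples * per_tuple_comm_cost + setup_cost;
}
```

For example, with a regular scan cost of 1000, an overhead of 24 and 3 workers, the partial scan on a costs (1000 + 24) / 4 = 256. The scan on b keeps its full cost, say 200, because every worker runs it in full. If the join below the funnel then costs 456 in total and returns 100 tuples at 0.5 each with a fixed overhead of 1000, the funnel costs 456 + 50 + 1000 = 1506.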

- While create_parallelscan_paths() is quite right to limit the number
of workers to no more than the number of pages, it's pretty obvious
that in practice that's way too conservative. I suggest we get
significantly more aggressive about that, like limiting ourselves to
one worker per thousand pages. We don't really know exactly what the
costing factors should be here just yet, but we certainly know that
spinning up lots of workers to read a handful of pages each must be
dumb. And we can save a significant amount of planning time here by
not bothering to generate parallel paths for little tiny relations.
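
One possible shape for that limit, as a sketch only: the constant 1000 comes from the suggestion above, while the function name and signature are hypothetical and not taken from create_parallelscan_paths().

```c
#include <assert.h>

/* The "one worker per thousand pages" suggestion; the value is not tuned. */
#define TOY_PAGES_PER_WORKER 1000

/*
 * Number of workers to plan for: at most 'degree' (the configured
 * parallel_seqscan_degree), and at most one per thousand pages.  A result
 * of zero means no parallel path should be generated at all.
 */
static int
toy_parallel_workers(long rel_pages, int degree)
{
	long		by_size;

	assert(rel_pages >= 0 && degree >= 0);
	by_size = rel_pages / TOY_PAGES_PER_WORKER;
	return (by_size < degree) ? (int) by_size : degree;
}
```

A 500-page relation gets zero workers under this rule, so the planner can skip building parallel paths for it entirely.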

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#269Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#268)
Re: Parallel Seq Scan

On Thu, Apr 23, 2015 at 2:26 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Apr 22, 2015 at 8:48 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I have implemented this idea (note that I have to expose a new API
shm_mq_from_handle as TupleQueueFunnel stores shm_mq_handle* and
we need shm_mq* to call shm_mq_detach) and apart from this I have fixed
other problems reported on this thread:

1. Execution of initPlan by master backend and then pass the
required PARAM_EXEC parameter values to workers.
2. Avoid consuming dsm's by freeing the parallel context after
the last tuple is fetched.
3. Allow execution of Result node in worker backend as that can
be added as a gating filter on top of PartialSeqScan.
4. Merged parallel heap scan descriptor patch

To apply the patch, please follow below sequence:

HEAD Commit-Id: 4d930eee
parallel-mode-v9.patch [1]
assess-parallel-safety-v4.patch [2] (don't forget to run fixpgproc.pl in
the patch)
parallel_seqscan_v14.patch (Attached with this mail)

Thanks, this version looks like an improvement. However, I still see
some problems:

- I believe the separation of concerns between ExecFunnel() and
ExecEndFunnel() is not quite right. If the scan is shut down before
it runs to completion (e.g. because of LIMIT), then I think we'll call
ExecEndFunnel() before ExecFunnel() hits the TupIsNull(slot) path. I
think you probably need to create a static subroutine that is called
both as soon as TupIsNull(slot) and also from ExecEndFunnel(), in each
case cleaning up whatever resources remain.

Right, will fix as per suggestion.

- InitializeParallelWorkers() still mixes together general parallel
executor concerns with concerns specific to parallel sequential scan
(e.g. EstimatePartialSeqScanSpace).

Here we are doing 2 things: the first is for the planned statement and
the second is node specific, which in this case is the parallel heap scan
descriptor. So if I understand correctly, you want us to remove the second
one and have a recursive function achieve the same.

- shm_mq_from_handle() is probably reasonable, but can we rename it
shm_mq_get_queue()?

Okay, will change.

- It's hard to believe this is right:

+       if (parallelstmt->inst_options)
+               receiver = None_Receiver;

Really? Flush the tuples if there are *any instrumentation options
whatsoever*? At the very least, that doesn't look too future-proof,
but I'm suspicious that it's outright incorrect.

Instrumentation info is for the EXPLAIN statement, where we don't need
tuples, and the receiver is set the same way there as well; see
ExplainOnePlan(). What makes you feel this is incorrect?

- I think ParallelStmt probably shouldn't be defined in parsenodes.h.
That file is included in a lot of places, and adding all of those
extra #includes there doesn't seem like a good idea for modularity
reasons even if you don't care about partial rebuilds. Something that
includes a shm_mq obviously isn't a "parse" node in any meaningful
sense anyway.

How about tcop/tcopprot.h?

- I don't think you need both setup cost and startup cost. Starting
up more workers isn't particularly more expensive than starting up
fewer of them, because most of the overhead is in waiting for them to
actually start, and if the number of workers is reasonable, then they'll
all be doing that in parallel with each other. I suggest removing
parallel_startup_cost and keeping parallel_setup_cost.

There is some work (like creation of shm queues, launching of workers)
which is done in proportion to the number of workers during setup time. I
have kept 2 parameters to distinguish such work. I think you have a
point that the start of some or all workers could happen in parallel, but I
feel that it is still work proportional to the number of workers. For future
parallel operations such a parameter could also be useful where we need
to set up IPC between workers or some other stuff where work is proportional
to the number of workers.

- In cost_funnel(), I don't think it's right to divide the run cost by
nWorkers + 1. Suppose we've got a plan that looks like this:

Funnel
  -> Hash Join
       -> Partial Seq Scan on a
       -> Hash
            -> Seq Scan on b

The sequential scan on b is going to get executed once per worker,
whereas the effort for the sequential scan on a is going to be divided
over all the workers. So the right way to cost this is as follows:

(a) The cost of the partial sequential scan on a is equal to the cost
of a regular sequential scan, plus a little bit of overhead to account
for communication via the ParallelHeapScanDesc, divided by the number
of workers + 1.
(b) The cost of the remaining nodes under the funnel works normally.
(c) The cost of the funnel is equal to the cost of the hash join plus
number of tuples multiplied by per-tuple communication overhead plus a
large fixed overhead reflecting the time it takes the workers to
start.

IIUC, the change for this would be to remove the run-cost adjustment
(dividing the run cost by nWorkers + 1) from cost_funnel and make a
similar adjustment, as suggested by point (a), in the cost calculation
of the partial sequential scan. As of now, we don't do anything which can
move the Funnel node on top of a hash join, so I am not sure if you are
expecting any extra handling as part of point (b) or (c).

- While create_parallelscan_paths() is quite right to limit the number
of workers to no more than the number of pages, it's pretty obvious
that in practice that's way too conservative. I suggest we get
significantly more aggressive about that, like limiting ourselves to
one worker per thousand pages. We don't really know exactly what the
costing factors should be here just yet, but we certainly know that
spinning up lots of workers to read a handful of pages each must be
dumb. And we can save a significant amount of planning time here by
not bothering to generate parallel paths for little tiny relations.

Makes sense, will change.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#270Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#269)
Re: Parallel Seq Scan

On Fri, Apr 24, 2015 at 8:32 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

- InitializeParallelWorkers() still mixes together general parallel
executor concerns with concerns specific to parallel sequential scan
(e.g. EstimatePartialSeqScanSpace).

Here we are doing 2 things: the first is for the planned statement and
the second is node specific, which in this case is the parallel heap scan
descriptor. So if I understand correctly, you want us to remove the second
one and have a recursive function achieve the same.

Right.

- It's hard to believe this is right:

+       if (parallelstmt->inst_options)
+               receiver = None_Receiver;

Really? Flush the tuples if there are *any instrumentation options
whatsoever*? At the very least, that doesn't look too future-proof,
but I'm suspicious that it's outright incorrect.

Instrumentation info is for the EXPLAIN statement, where we don't need
tuples, and the receiver is set the same way there as well; see
ExplainOnePlan(). What makes you feel this is incorrect?

Well, for one thing, it's going to completely invalidate the result of
EXPLAIN. I mean, consider this:

Hash Join
  -> Parallel Seq Scan
  -> Hash
       -> Seq Scan

If you have the workers throw away the rows from the parallel seq scan
instead of sending them back to the master, the master won't join
those rows against the other table. And then the "actual" row counts,
timing, etc. will all be totally wrong. Worse, if the user is
EXPLAIN-ing a SELECT INTO command, the results will be totally wrong.

I don't think you can use ExplainOnePlan() as precedent for the theory
that explain_options != 0 means discard everything, because that
function does not do that. It bases the decision to throw away the
output on the fact that EXPLAIN was used, and throws it away unless an
IntoClause was also specified. It does this even if
instrument_options == 0. Meanwhile, auto_explain does NOT throw away
the output even if instrument_options != 0, nor should it! But even
if none of that were an issue, throwing away part of the results from
an internal plan tree is not the same thing as throwing away the final
result stream, and is dead wrong.

- I think ParallelStmt probably shouldn't be defined in parsenodes.h.
That file is included in a lot of places, and adding all of those
extra #includes there doesn't seem like a good idea for modularity
reasons even if you don't care about partial rebuilds. Something that
includes a shm_mq obviously isn't a "parse" node in any meaningful
sense anyway.

How about tcop/tcopprot.h?

The comment of that file is "prototypes for postgres.c".

Generally, unless there is some reason to do otherwise, the prototypes
for a .c file in src/backend go in a .h file with the same name in
src/include. I don't see why we should do differently here.
ParallelStmt should be defined and used in a file living in
src/backend/executor, and the header should have the same name and go
in src/include/executor.

- I don't think you need both setup cost and startup cost. Starting
up more workers isn't particularly more expensive than starting up
fewer of them, because most of the overhead is in waiting for them to
actually start, and if the number of workers is reasonable, then they'll
all be doing that in parallel with each other. I suggest removing
parallel_startup_cost and keeping parallel_setup_cost.

There is some work (like creation of shm queues, launching of workers)
which is done in proportion to the number of workers during setup time. I
have kept 2 parameters to distinguish such work. I think you have a
point that the start of some or all workers could happen in parallel, but I
feel that it is still work proportional to the number of workers. For future
parallel operations such a parameter could also be useful where we need
to set up IPC between workers or some other stuff where work is proportional
to the number of workers.

That's technically true, but the incremental work involved in
supporting a new worker is extremely small compared with worker startup
times. I'm guessing that the setup cost is going to be on the order
of hundreds of thousands or millions and the startup cost is going to
be on the order of tens or ones. Unless you can present some contrary
evidence, I think we should rip it out.

And I actually hope you *can't* present some contrary evidence.
Because if you can, then that might mean that we need to cost every
possible path from 0 up to N workers and let the costing machinery
decide which one is better. If you can't, then we can cost the
non-parallel path and the maximally-parallel path and be done. And
that would be much better, because it will be faster. Remember, just
because we cost a bunch of parallel paths doesn't mean that any of
them will actually be chosen. We need to avoid generating too much
additional planner work in cases where we don't end up deciding on
parallelism anyway.

- In cost_funnel(), I don't think it's right to divide the run cost by
nWorkers + 1. Suppose we've got a plan that looks like this:

Funnel
  -> Hash Join
       -> Partial Seq Scan on a
       -> Hash
            -> Seq Scan on b

The sequential scan on b is going to get executed once per worker,
whereas the effort for the sequential scan on a is going to be divided
over all the workers. So the right way to cost this is as follows:

(a) The cost of the partial sequential scan on a is equal to the cost
of a regular sequential scan, plus a little bit of overhead to account
for communication via the ParallelHeapScanDesc, divided by the number
of workers + 1.
(b) The cost of the remaining nodes under the funnel works normally.
(c) The cost of the funnel is equal to the cost of the hash join plus
number of tuples multiplied by per-tuple communication overhead plus a
large fixed overhead reflecting the time it takes the workers to
start.

IIUC, the change for this would be to remove the run-cost adjustment
(dividing the run cost by nWorkers + 1) from cost_funnel and make a
similar adjustment, as suggested by point (a), in the cost calculation
of the partial sequential scan.

Right.

As of now, we don't do anything which can
move Funnel node on top of hash join, so not sure if you are expecting
any extra handling as part of point (b) or (c).

But we will want to do that in the future, so we should set up the
costing correctly now.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#271Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#270)
Re: Parallel Seq Scan

On Tue, Apr 28, 2015 at 5:37 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Apr 24, 2015 at 8:32 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

- I believe the separation of concerns between ExecFunnel() and
ExecEndFunnel() is not quite right. If the scan is shut down before
it runs to completion (e.g. because of LIMIT), then I think we'll call
ExecEndFunnel() before ExecFunnel() hits the TupIsNull(slot) path. I
think you probably need to create a static subroutine that is called
both as soon as TupIsNull(slot) and also from ExecEndFunnel(), in each
case cleaning up whatever resources remain.

Right, will fix as per suggestion.

I observed one issue while working on this review comment. When we
try to destroy the parallel setup via ExecEndNode (because of the Limit
node, it could not be destroyed after consuming all tuples), the master
waits for the parallel workers to finish (WaitForParallelWorkersToFinish()),
while the parallel workers are waiting for the master backend to signal
them because their queue is full. I think in such a case the master backend
needs to inform the workers either when the scan is discontinued due to the
Limit node or while waiting for the parallel workers to finish.

- I don't think you need both setup cost and startup cost. Starting
up more workers isn't particularly more expensive than starting up
fewer of them, because most of the overhead is in waiting for them to
actually start, and if the number of workers is reasonable, then they'll
all be doing that in parallel with each other. I suggest removing
parallel_startup_cost and keeping parallel_setup_cost.

There is some work (like creation of shm queues, launching of workers)
which is done in proportion to the number of workers during setup time. I
have kept 2 parameters to distinguish such work. I think you have a
point that the start of some or all workers could happen in parallel, but I
feel that it is still work proportional to the number of workers. For future
parallel operations such a parameter could also be useful where we need
to set up IPC between workers or some other stuff where work is proportional
to the number of workers.

That's technically true, but the incremental work involved in
supporting a new worker is extremely small compared with worker startup
times. I'm guessing that the setup cost is going to be on the order
of hundreds of thousands or millions and the startup cost is going to
be on the order of tens or ones.

Can we safely estimate the cost of restoring parallel state (GUCs,
combo CID, transaction state, snapshot, etc.) in each worker as a setup
cost? There could be some work, like restoration of locks (acquiring all or
the relevant locks at the start of a parallel worker if we follow your
proposed design; even if we don't follow that, there could be some similar
work), which could be substantial and which we need to do for each worker.
If you think restoration of parallel state in each worker is a pretty
small amount of work, then what you say makes sense to me.

And I actually hope you *can't* present some contrary evidence.
Because if you can, then that might mean that we need to cost every
possible path from 0 up to N workers and let the costing machinery
decide which one is better.

Not necessarily, we can follow a rule that number of workers
that need to be used for any parallel statement are equal to degree of
parallelism (parallel_seqscan_degree) as set by user. I think we
need to do some split up of number workers when there are multiple
parallel operations in single statement (like sort and parallel scan).

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#272Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#271)
Re: Parallel Seq Scan

On Wed, May 6, 2015 at 7:55 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

- I believe the separation of concerns between ExecFunnel() and
ExecEndFunnel() is not quite right. If the scan is shut down before
it runs to completion (e.g. because of LIMIT), then I think we'll call
ExecEndFunnel() before ExecFunnel() hits the TupIsNull(slot) path. I
think you probably need to create a static subroutine that is called
both as soon as TupIsNull(slot) and also from ExecEndFunnel(), in each
case cleaning up whatever resources remain.

Right, will fix as per suggestion.

I observed one issue while working on this review comment. When we
try to destroy the parallel setup via ExecEndNode (as due to Limit
Node, it could not destroy after consuming all tuples), it waits for parallel
workers to finish (WaitForParallelWorkersToFinish()) and parallel workers
are waiting for master backend to signal them as their queue is full.
I think in such a case master backend needs to inform workers either when
the scan is discontinued due to limit node or while waiting for parallel
workers to finish.

Isn't this why TupleQueueFunnelShutdown() calls shm_mq_detach()?
That's supposed to unstick the workers; any impending or future writes
will just return SHM_MQ_DETACHED without waiting.
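
The unsticking behaviour described here can be illustrated with a toy, single-process model. This is only a sketch: the toy_mq type, its fixed capacity, and the toy_* function names are assumptions invented for illustration and are not the real shm_mq API. The only idea taken from the thread is that once the receiver has detached, a sender gets a "detached" result back instead of waiting on a full queue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy result codes, loosely modeled on the result named in the thread. */
typedef enum
{
    TOY_MQ_SUCCESS,
    TOY_MQ_WOULD_BLOCK,
    TOY_MQ_DETACHED
} toy_mq_result;

#define TOY_MQ_CAPACITY 4

/* Hypothetical bounded queue; a real queue would live in shared memory. */
typedef struct
{
    int     items[TOY_MQ_CAPACITY];
    size_t  used;
    bool    receiver_detached;
} toy_mq;

void
toy_mq_init(toy_mq *q)
{
    q->used = 0;
    q->receiver_detached = false;
}

/* Worker side: try to queue one tuple without waiting. */
toy_mq_result
toy_mq_send(toy_mq *q, int tuple)
{
    assert(q != NULL);

    /* Check for detach first, so a full queue can no longer stall us. */
    if (q->receiver_detached)
        return TOY_MQ_DETACHED;
    if (q->used == TOY_MQ_CAPACITY)
        return TOY_MQ_WOULD_BLOCK;  /* a real worker would sleep here */

    q->items[q->used++] = tuple;
    return TOY_MQ_SUCCESS;
}

/* Leader side: called when the scan is abandoned early (e.g. LIMIT). */
void
toy_mq_detach(toy_mq *q)
{
    q->receiver_detached = true;
}
```

In this model a worker stuck on a full queue sees TOY_MQ_DETACHED on its next attempt and can exit, which is what lets the leader's wait for workers complete.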

That's technically true, but the incremental work involved in
supporting a new worker is extremely small compared with worker startup
times. I'm guessing that the setup cost is going to be on the order
of hundreds of thousands or millions and the startup cost is going to
be on the order of tens or ones.

Can we safely estimate the cost of restoring parallel state (GUC's,
combo CID, transaction state, snapshot, etc.) in each worker as a setup
cost? There could be some work like restoration of locks (acquire all or
relevant locks at start of parallel worker, if we follow your proposed
design and even if we don't follow that there could be some similar
substantial work) which could be substantial and we need to do the same
for each worker.
If you think restoration of parallel state in each worker is a pretty
small work, then what you say makes sense to me.

Well, all the workers restore that state in parallel, so adding it up
across all workers doesn't really make sense. But anyway, no, I don't
think that's a big cost. I think the big cost is going to be the
operating system overhead of process creation. The new process will
incur lots of page faults as it populates its address space and
dirties pages marked copy-on-write. That's where I expect most of the
expense to be.

And I actually hope you *can't* present some contrary evidence.
Because if you can, then that might mean that we need to cost every
possible path from 0 up to N workers and let the costing machinery
decide which one is better.

Not necessarily, we can follow a rule that number of workers
that need to be used for any parallel statement are equal to degree of
parallelism (parallel_seqscan_degree) as set by user. I think we
need to do some split up of number workers when there are multiple
parallel operations in single statement (like sort and parallel scan).

Yeah. I'm hoping we will be able to use the same pool of workers for
multiple operations, but I realize that's a feature we haven't
designed yet.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#273Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#272)
Re: Parallel Seq Scan

On Wed, May 6, 2015 at 7:10 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, May 6, 2015 at 7:55 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

- I believe the separation of concerns between ExecFunnel() and
ExecEndFunnel() is not quite right. If the scan is shut down before
it runs to completion (e.g. because of LIMIT), then I think we'll call
ExecEndFunnel() before ExecFunnel() hits the TupIsNull(slot) path. I
think you probably need to create a static subroutine that is called
both as soon as TupIsNull(slot) and also from ExecEndFunnel(), in each
case cleaning up whatever resources remain.

Right, will fix as per suggestion.

I observed one issue while working on this review comment. When we
try to destroy the parallel setup via ExecEndNode (as due to Limit
Node, it could not destroy after consuming all tuples), it waits for parallel
workers to finish (WaitForParallelWorkersToFinish()) and parallel workers
are waiting for master backend to signal them as their queue is full.
I think in such a case master backend needs to inform workers either when
the scan is discontinued due to limit node or while waiting for parallel
workers to finish.

Isn't this why TupleQueueFunnelShutdown() calls shm_mq_detach()?
That's supposed to unstick the workers; any impending or future writes
will just return SHM_MQ_DETACHED without waiting.

Okay, that can work if we call it in ExecEndNode() before
WaitForParallelWorkersToFinish(), however what if we want to do something
like TupleQueueFunnelShutdown() when Limit node decides to stop processing
the outer node? We can traverse the whole plan tree and find the nodes where
parallel workers need to be stopped, but I don't think that's a good way to
handle it. If we don't want to stop workers from processing until
ExecutorEnd()--->ExecEndNode(), then it will lead to workers continuing till
that time and it won't be easy to get instrumentation/buffer usage information
from workers (workers fill such information for master backend after execution
is complete) as that is done before ExecutorEnd(). For Explain Analyze .., we
can ensure that workers are stopped before fetching that information from
Funnel node, but the same is not easy for buffer usage stats required by
plugins as that operates at ExecutorRun() and ExecutorFinish() level where
we don't have direct access to node level information. You can refer to
pgss_ExecutorEnd() where it completes the storage of stats information
before calling ExecutorEnd(). Offhand, I could not think of a good way to
do this, but one crude way could be to introduce a new API
(ParallelExecutorEnd()) for such plugins which needs to be called before
completing the stats accumulation. This API will call ExecEndPlan() if
parallelmodeNeeded flag is set and allow accumulation of stats
(InstrStartNode()/InstrStopNode()).

Well, all the workers restore that state in parallel, so adding it up
across all workers doesn't really make sense. But anyway, no, I don't
think that's a big cost. I think the big cost is going to be the
operating system overhead of process creation. The new process will
incur lots of page faults as it populates its address space and
dirties pages marked copy-on-write. That's where I expect most of the
expense to be.

Okay, will remove parallel_startup_cost from patch in next version.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#274Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#273)
Re: Parallel Seq Scan

On Thu, May 7, 2015 at 3:23 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I observed one issue while working on this review comment. When we
try to destroy the parallel setup via ExecEndNode (as due to Limit
Node, it could not destroy after consuming all tuples), it waits for parallel
workers to finish (WaitForParallelWorkersToFinish()) and parallel workers
are waiting for master backend to signal them as their queue is full.
I think in such a case master backend needs to inform workers either when
the scan is discontinued due to limit node or while waiting for parallel
workers to finish.

Isn't this why TupleQueueFunnelShutdown() calls shm_mq_detach()?
That's supposed to unstick the workers; any impending or future writes
will just return SHM_MQ_DETACHED without waiting.

Okay, that can work if we call it in ExecEndNode() before
WaitForParallelWorkersToFinish(), however what if we want to do something
like TupleQueueFunnelShutdown() when Limit node decides to stop processing
the outer node? We can traverse the whole plan tree and find the nodes where
parallel workers need to be stopped, but I don't think that's a good way to
handle it. If we don't want to stop workers from processing until
ExecutorEnd()--->ExecEndNode(), then it will lead to workers continuing till
that time and it won't be easy to get instrumentation/buffer usage information
from workers (workers fill such information for master backend after execution
is complete) as that is done before ExecutorEnd(). For Explain Analyze .., we
can ensure that workers are stopped before fetching that information from
Funnel node, but the same is not easy for buffer usage stats required by
plugins as that operates at ExecutorRun() and ExecutorFinish() level where
we don't have direct access to node level information. You can refer to
pgss_ExecutorEnd() where it completes the storage of stats information
before calling ExecutorEnd(). Offhand, I could not think of a good way to
do this, but one crude way could be to introduce a new API
(ParallelExecutorEnd()) for such plugins which needs to be called before
completing the stats accumulation. This API will call ExecEndPlan() if
parallelmodeNeeded flag is set and allow accumulation of stats
(InstrStartNode()/InstrStopNode()).

OK, so if I understand you here, the problem is what to do about an
"orphaned" worker. The Limit node just stops fetching from the lower
nodes, and those nodes don't get any clue that this has happened, so
their workers just sit there until the end of the query. Of course,
that happens already, but it doesn't usually hurt very much, because
the Limit node usually appears at or near the top of the plan.

It could matter, though. Suppose the Limit is for a subquery that has
a Sort somewhere (not immediately) beneath it. My guess is the Sort's
tuplestore will stick around until after the subquery finishes
executing for as long as the top-level query is executing, which in
theory could be a huge waste of resources. In practice, I guess
people don't really write queries that way. If they did, I think we'd
have already developed some general method for fixing this sort of
problem.

I think it might be better to try to solve this problem in a more
localized way. Can we arrange for planstate->instrumentation to point
directly into the DSM, instead of copying the data over later? That
seems like it might help, or perhaps there's another approach.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#275Haribabu Kommi
kommi.haribabu@gmail.com
In reply to: Amit Kapila (#267)
Re: Parallel Seq Scan

On Wed, Apr 22, 2015 at 10:48 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

parallel_seqscan_v14.patch (Attached with this mail)

This patch is not applying/working with the latest head after parallel
mode patch got committed.
Can you please rebase the patch?

Regards,
Hari Babu
Fujitsu Australia

#276Amit Kapila
amit.kapila16@gmail.com
In reply to: Haribabu Kommi (#275)
Re: Parallel Seq Scan

On Mon, May 18, 2015 at 6:28 AM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:

On Wed, Apr 22, 2015 at 10:48 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

parallel_seqscan_v14.patch (Attached with this mail)

This patch is not applying/working with the latest head after parallel
mode patch got committed.
Can you please rebase the patch?

Thanks for the reminder. I am planning to work on the remaining review
comments this week and will post a new version.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#277Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#274)
Re: Parallel Seq Scan

On Mon, May 11, 2015 at 3:00 AM, Robert Haas <robertmhaas@gmail.com> wrote:

I think it might be better to try to solve this problem in a more
localized way. Can we arrange for planstate->instrumentation to point
directly into the DSM, instead of copying the data over later?

Yes, we can do that but I am not sure we can do that for pgBufferUsage
which is a separate information we need to pass back to master backend.
One way could be to change pgBufferUsage to a pointer and then allocate
the memory for same at backend startup time and for parallel workers, it
should point to DSM. Do you see any simple way to handle it?

Another way could be that master backend waits for parallel workers to
finish before collecting the instrumentation information and buffer usage
stats. It seems to me that we need this information (stats) after execution
in master backend is over, so I think we can safely assume that it is okay
to finish the execution of parallel workers if they have not already
finished it.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#278Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#277)
Re: Parallel Seq Scan

On Tue, May 19, 2015 at 8:45 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, May 11, 2015 at 3:00 AM, Robert Haas <robertmhaas@gmail.com> wrote:

I think it might be better to try to solve this problem in a more
localized way. Can we arrange for planstate->instrumentation to point
directly into the DSM, instead of copying the data over later?

Yes, we can do that but I am not sure we can do that for pgBufferUsage
which is a separate information we need to pass back to master backend.
One way could be to change pgBufferUsage to a pointer and then allocate
the memory for same at backend startup time and for parallel workers, it
should point to DSM. Do you see any simple way to handle it?

No, that seems problematic.

Another way could be that master backend waits for parallel workers to
finish before collecting the instrumentation information and buffer usage
stats. It seems to me that we need this information (stats) after execution
in master backend is over, so I think we can safely assume that it is okay
to finish the execution of parallel workers if they have not already
finished it.

I'm not sure exactly where you plan to insert the wait.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#279Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#268)
1 attachment(s)
Re: Parallel Seq Scan

On Thu, Apr 23, 2015 at 2:26 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Apr 22, 2015 at 8:48 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I have implemented this idea (note that I have to expose a new API
shm_mq_from_handle as TupleQueueFunnel stores shm_mq_handle* and
we need shm_mq* to call shm_mq_detach) and apart from this I have fixed other
problems reported on this thread:

1. Execution of initPlan by master backend and then pass the
required PARAM_EXEC parameter values to workers.
2. Avoid consuming dsm's by freeing the parallel context after
the last tuple is fetched.
3. Allow execution of Result node in worker backend as that can
be added as a gating filter on top of PartialSeqScan.
4. Merged parallel heap scan descriptor patch

To apply the patch, please follow below sequence:

HEAD Commit-Id: 4d930eee
parallel-mode-v9.patch [1]
assess-parallel-safety-v4.patch [2] (don't forget to run fixpgproc.pl in
the patch)
parallel_seqscan_v14.patch (Attached with this mail)

Thanks, this version looks like an improvement. However, I still see
some problems:

- I believe the separation of concerns between ExecFunnel() and
ExecEndFunnel() is not quite right. If the scan is shut down before
it runs to completion (e.g. because of LIMIT), then I think we'll call
ExecEndFunnel() before ExecFunnel() hits the TupIsNull(slot) path. I
think you probably need to create a static subroutine that is called
both as soon as TupIsNull(slot) and also from ExecEndFunnel(), in each
case cleaning up whatever resources remain.

Okay, added new routine FinishParallelSetupAndAccumStats() which
will be called both from ExecEndFunnel() and when ExecFunnel() hits
the TupIsNull(slot) path. Apart from that the same routine is called
from some other paths like rescan and when we need to collect
statistics after execution is complete but still ExecEndFunnel() is
not called. This routine ensures that once it has collected the
stats of parallel workers and destroyed the parallel context, it will
do nothing on next execution unless the node is re-initialized.
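
The "do nothing on the next call" behaviour described above can be sketched as an idempotent cleanup routine. This is a minimal illustration only: toy_funnel_state, its fields, and toy_funnel_finish() are hypothetical stand-ins, not the patch's FinishParallelSetupAndAccumStats() or its data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NWORKERS 3

/* Hypothetical node state; the real node tracks a parallel context. */
typedef struct
{
    bool    parallel_active;              /* parallel setup still exists? */
    long    worker_tuples[TOY_NWORKERS];  /* per-worker counters */
    long    total_tuples;                 /* accumulated in the leader */
    int     cleanups_run;                 /* how many real cleanups happened */
} toy_funnel_state;

/*
 * Safe to call from every exit path: end of scan (null slot), node
 * shutdown, or rescan. Only the first call after (re)initialization
 * accumulates stats and tears down the parallel setup.
 */
void
toy_funnel_finish(toy_funnel_state *node)
{
    assert(node != NULL);

    if (!node->parallel_active)
        return;                 /* already cleaned up: do nothing */

    for (size_t i = 0; i < TOY_NWORKERS; i++)
        node->total_tuples += node->worker_tuples[i];

    node->parallel_active = false;
    node->cleanups_run++;
}
```

Guarding on a single flag means callers never need to know whether another path already ran the cleanup, and worker stats cannot be counted twice.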

- InitializeParallelWorkers() still mixes together general parallel
executor concerns with concerns specific to parallel sequential scan
(e.g. EstimatePartialSeqScanSpace). We have to eliminate everything
that assumes that what's under a funnel will be, specifically, a
partial sequential scan.

Okay, introduced the new function planstate_tree_walker(), so that
it can work for anything below funnel node.

- shm_mq_from_handle() is probably reasonable, but can we rename it
shm_mq_get_queue()?

Changed as per suggestion.

- It's hard to believe this is right:

+       if (parallelstmt->inst_options)
+               receiver = None_Receiver;

Really? Flush the tuples if there are *any instrumentation options
whatsoever*? At the very least, that doesn't look too future-proof,
but I'm suspicious that it's outright incorrect.

You are right, I have removed this part of code.

- I think ParallelStmt probably shouldn't be defined in parsenodes.h.
That file is included in a lot of places, and adding all of those
extra #includes there doesn't seem like a good idea for modularity
reasons even if you don't care about partial rebuilds. Something that
includes a shm_mq obviously isn't a "parse" node in any meaningful
sense anyway.

Changed postmaster/backendworkers.c to executor/execParallel.c
and moved ParallelStmt to executor/execParallel.h

- I don't think you need both setup cost and startup cost. Starting
up more workers isn't particularly more expensive than starting up
fewer of them, because most of the overhead is in waiting for them to
actually start, and if the number of workers is reasonable, then they'll
all be doing that in parallel with each other. I suggest removing
parallel_startup_cost and keeping parallel_setup_cost.

As per discussion, it makes sense to remove parallel_startup_cost.

- In cost_funnel(), I don't think it's right to divide the run cost by
nWorkers + 1. Suppose we've got a plan that looks like this:

Funnel
  -> Hash Join
       -> Partial Seq Scan on a
       -> Hash
            -> Seq Scan on b

The sequential scan on b is going to get executed once per worker,
whereas the effort for the sequential scan on a is going to be divided
over all the workers. So the right way to cost this is as follows:

(a) The cost of the partial sequential scan on a is equal to the cost
of a regular sequential scan, plus a little bit of overhead to account
for communication via the ParallelHeapScanDesc, divided by the number
of workers + 1.
(b) The cost of the remaining nodes under the funnel works normally.
(c) The cost of the funnel is equal to the cost of the hash join plus
number of tuples multiplied by per-tuple communication overhead plus a
large fixed overhead reflecting the time it takes the workers to
start.

Okay, changed as per suggestion.
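
The costing rule in points (a) to (c) above can be written out as two small functions. This is a sketch under stated assumptions: the function and parameter names are hypothetical, and the arithmetic follows only the prose in the review comment, not the patch's actual cost_funnel().

```c
#include <assert.h>

/*
 * (a) Partial seq scan: a regular seq scan plus a little overhead for
 * coordinating through the shared scan descriptor, divided by the
 * number of workers + 1 (the leader also scans).
 */
double
toy_partial_scan_cost(double seqscan_cost, double shared_desc_overhead,
                      int nworkers)
{
    assert(nworkers >= 0);
    return (seqscan_cost + shared_desc_overhead) / (nworkers + 1);
}

/*
 * (c) Funnel: cost of the subplan beneath it (costed normally, per (b)),
 * plus per-tuple communication overhead, plus a large fixed overhead
 * for the time it takes the workers to start.
 */
double
toy_funnel_cost(double subplan_cost, double ntuples,
                double per_tuple_comm_cost, double worker_start_overhead)
{
    return subplan_cost
        + ntuples * per_tuple_comm_cost
        + worker_start_overhead;
}
```

With a seq scan cost of 1000, an overhead of 8 and 3 workers, the partial scan costs (1000 + 8) / 4 = 252. The Seq Scan on b under the Hash is not divided, because every worker executes it in full.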

- While create_parallelscan_paths() is quite right to limit the number
of workers to no more than the number of pages, it's pretty obvious
that in practice that's way too conservative. I suggest we get
significantly more aggressive about that, like limiting ourselves to
one worker per thousand pages. We don't really know exactly what the
costing factors should be here just yet, but we certainly know that
spinning up lots of workers to read a handful of pages each must be
dumb. And we can save a significant amount of planning time here by
not bothering to generate parallel paths for little tiny relations.

Right, I have changed as per suggestion, but now it will only choose
the parallel path for bigger relations, so to test with smaller relations
one way is to reduce the cpu_tuple_comm_cost.
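
The "one worker per thousand pages" limit suggested above could look roughly like the following. The function name and the hard-coded threshold are assumptions for illustration; the review comment itself says the right costing factors are not yet known.

```c
#include <assert.h>

#define TOY_PAGES_PER_WORKER 1000u

/*
 * Number of workers to plan for: at most one per TOY_PAGES_PER_WORKER
 * pages, and never more than the configured degree of parallelism.
 * A result of 0 means no parallel path is worth generating.
 */
int
toy_parallel_workers(unsigned int rel_pages, int degree)
{
    unsigned int by_size = rel_pages / TOY_PAGES_PER_WORKER;

    assert(degree >= 0);
    return (by_size < (unsigned int) degree) ? (int) by_size : degree;
}
```

A relation under 1000 pages yields 0 workers, so the planner can skip building a parallel path for it, which is the planning-time saving mentioned in the comment.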

Note - You need to apply assess-parallel-safety-v5.patch (posted by
Robert on thread assessing parallel-safety) before this patch.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

parallel_seqscan_v15.patch (application/octet-stream)
diff --git a/src/backend/access/common/printtup.c b/src/backend/access/common/printtup.c
index baed981..639451a 100644
--- a/src/backend/access/common/printtup.c
+++ b/src/backend/access/common/printtup.c
@@ -26,9 +26,9 @@
 
 static void printtup_startup(DestReceiver *self, int operation,
 				 TupleDesc typeinfo);
-static void printtup(TupleTableSlot *slot, DestReceiver *self);
-static void printtup_20(TupleTableSlot *slot, DestReceiver *self);
-static void printtup_internal_20(TupleTableSlot *slot, DestReceiver *self);
+static bool printtup(TupleTableSlot *slot, DestReceiver *self);
+static bool printtup_20(TupleTableSlot *slot, DestReceiver *self);
+static bool printtup_internal_20(TupleTableSlot *slot, DestReceiver *self);
 static void printtup_shutdown(DestReceiver *self);
 static void printtup_destroy(DestReceiver *self);
 
@@ -299,7 +299,7 @@ printtup_prepare_info(DR_printtup *myState, TupleDesc typeinfo, int numAttrs)
  *		printtup --- print a tuple in protocol 3.0
  * ----------------
  */
-static void
+static bool
 printtup(TupleTableSlot *slot, DestReceiver *self)
 {
 	TupleDesc	typeinfo = slot->tts_tupleDescriptor;
@@ -376,13 +376,15 @@ printtup(TupleTableSlot *slot, DestReceiver *self)
 	/* Return to caller's context, and flush row's temporary memory */
 	MemoryContextSwitchTo(oldcontext);
 	MemoryContextReset(myState->tmpcontext);
+
+	return true;
 }
 
 /* ----------------
  *		printtup_20 --- print a tuple in protocol 2.0
  * ----------------
  */
-static void
+static bool
 printtup_20(TupleTableSlot *slot, DestReceiver *self)
 {
 	TupleDesc	typeinfo = slot->tts_tupleDescriptor;
@@ -452,6 +454,8 @@ printtup_20(TupleTableSlot *slot, DestReceiver *self)
 	/* Return to caller's context, and flush row's temporary memory */
 	MemoryContextSwitchTo(oldcontext);
 	MemoryContextReset(myState->tmpcontext);
+
+	return true;
 }
 
 /* ----------------
@@ -528,7 +532,7 @@ debugStartup(DestReceiver *self, int operation, TupleDesc typeinfo)
  *		debugtup - print one tuple for an interactive backend
  * ----------------
  */
-void
+bool
 debugtup(TupleTableSlot *slot, DestReceiver *self)
 {
 	TupleDesc	typeinfo = slot->tts_tupleDescriptor;
@@ -553,6 +557,8 @@ debugtup(TupleTableSlot *slot, DestReceiver *self)
 		printatt((unsigned) i + 1, typeinfo->attrs[i], value);
 	}
 	printf("\t----\n");
+
+	return true;
 }
 
 /* ----------------
@@ -564,7 +570,7 @@ debugtup(TupleTableSlot *slot, DestReceiver *self)
  * This is largely same as printtup_20, except we use binary formatting.
  * ----------------
  */
-static void
+static bool
 printtup_internal_20(TupleTableSlot *slot, DestReceiver *self)
 {
 	TupleDesc	typeinfo = slot->tts_tupleDescriptor;
@@ -636,4 +642,6 @@ printtup_internal_20(TupleTableSlot *slot, DestReceiver *self)
 	/* Return to caller's context, and flush row's temporary memory */
 	MemoryContextSwitchTo(oldcontext);
 	MemoryContextReset(myState->tmpcontext);
+
+	return true;
 }
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index cb86a4f..9324b7e 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -63,6 +63,7 @@
 #include "storage/predicate.h"
 #include "storage/procarray.h"
 #include "storage/smgr.h"
+#include "storage/spin.h"
 #include "storage/standby.h"
 #include "utils/datum.h"
 #include "utils/inval.h"
@@ -80,9 +81,11 @@ bool		synchronize_seqscans = true;
 static HeapScanDesc heap_beginscan_internal(Relation relation,
 						Snapshot snapshot,
 						int nkeys, ScanKey key,
+						ParallelHeapScanDesc parallel_scan,
 						bool allow_strat, bool allow_sync, bool allow_pagemode,
 						bool is_bitmapscan, bool is_samplescan,
 						bool temp_snap);
+static BlockNumber heap_parallelscan_nextpage(ParallelHeapScanDesc);
 static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
 					TransactionId xid, CommandId cid, int options);
 static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
@@ -223,7 +226,10 @@ initscan(HeapScanDesc scan, ScanKey key, bool is_rescan)
 	 * results for a non-MVCC snapshot, the caller must hold some higher-level
 	 * lock that ensures the interesting tuple(s) won't change.)
 	 */
-	scan->rs_nblocks = RelationGetNumberOfBlocks(scan->rs_rd);
+	if (scan->rs_parallel != NULL)
+		scan->rs_nblocks = scan->rs_parallel->phs_nblocks;
+	else
+		scan->rs_nblocks = RelationGetNumberOfBlocks(scan->rs_rd);
 
 	/*
 	 * If the table is large relative to NBuffers, use a bulk-read access
@@ -483,7 +489,18 @@ heapgettup(HeapScanDesc scan,
 				tuple->t_data = NULL;
 				return;
 			}
-			page = scan->rs_startblock; /* first page */
+			if (scan->rs_parallel != NULL)
+			{
+				page = heap_parallelscan_nextpage(scan->rs_parallel);
+				if (page >= scan->rs_nblocks)
+				{
+					Assert(!BufferIsValid(scan->rs_cbuf));
+					tuple->t_data = NULL;
+					return;
+				}
+			}
+			else
+				page = scan->rs_startblock; /* first page */
 			heapgetpage(scan, page);
 			lineoff = FirstOffsetNumber;		/* first offnum */
 			scan->rs_inited = true;
@@ -506,6 +523,9 @@ heapgettup(HeapScanDesc scan,
 	}
 	else if (backward)
 	{
+		/* backward parallel scan not supported */
+		Assert(scan->rs_parallel == NULL);
+
 		if (!scan->rs_inited)
 		{
 			/*
@@ -658,11 +678,19 @@ heapgettup(HeapScanDesc scan,
 		}
 		else
 		{
-			page++;
-			if (page >= scan->rs_nblocks)
-				page = 0;
-			finished = (page == scan->rs_startblock) ||
-				(scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
+			if (scan->rs_parallel != NULL)
+			{
+				page = heap_parallelscan_nextpage(scan->rs_parallel);
+				finished = (page >= scan->rs_nblocks);
+			}
+			else
+			{
+				page++;
+				if (page >= scan->rs_nblocks)
+					page = 0;
+				finished = (page == scan->rs_startblock) ||
+					(scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
+			}
 
 			/*
 			 * Report our new scan position for synchronization purposes. We
@@ -760,7 +788,18 @@ heapgettup_pagemode(HeapScanDesc scan,
 				tuple->t_data = NULL;
 				return;
 			}
-			page = scan->rs_startblock; /* first page */
+			if (scan->rs_parallel != NULL)
+			{
+				page = heap_parallelscan_nextpage(scan->rs_parallel);
+				if (page >= scan->rs_nblocks)
+				{
+					Assert(!BufferIsValid(scan->rs_cbuf));
+					tuple->t_data = NULL;
+					return;
+				}
+			}
+			else
+				page = scan->rs_startblock; /* first page */
 			heapgetpage(scan, page);
 			lineindex = 0;
 			scan->rs_inited = true;
@@ -780,6 +819,9 @@ heapgettup_pagemode(HeapScanDesc scan,
 	}
 	else if (backward)
 	{
+		/* backward parallel scan not supported */
+		Assert(scan->rs_parallel == NULL);
+
 		if (!scan->rs_inited)
 		{
 			/*
@@ -921,11 +963,19 @@ heapgettup_pagemode(HeapScanDesc scan,
 		}
 		else
 		{
-			page++;
-			if (page >= scan->rs_nblocks)
-				page = 0;
-			finished = (page == scan->rs_startblock) ||
-				(scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
+			if (scan->rs_parallel != NULL)
+			{
+				page = heap_parallelscan_nextpage(scan->rs_parallel);
+				finished = (page >= scan->rs_nblocks);
+			}
+			else
+			{
+				page++;
+				if (page >= scan->rs_nblocks)
+					page = 0;
+				finished = (page == scan->rs_startblock) ||
+					(scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
+			}
 
 			/*
 			 * Report our new scan position for synchronization purposes. We
@@ -1321,7 +1371,7 @@ HeapScanDesc
 heap_beginscan(Relation relation, Snapshot snapshot,
 			   int nkeys, ScanKey key)
 {
-	return heap_beginscan_internal(relation, snapshot, nkeys, key,
+	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
 								   true, true, true, false, false, false);
 }
 
@@ -1331,7 +1381,7 @@ heap_beginscan_catalog(Relation relation, int nkeys, ScanKey key)
 	Oid			relid = RelationGetRelid(relation);
 	Snapshot	snapshot = RegisterSnapshot(GetCatalogSnapshot(relid));
 
-	return heap_beginscan_internal(relation, snapshot, nkeys, key,
+	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
 								   true, true, true, false, false, true);
 }
 
@@ -1340,7 +1390,7 @@ heap_beginscan_strat(Relation relation, Snapshot snapshot,
 					 int nkeys, ScanKey key,
 					 bool allow_strat, bool allow_sync)
 {
-	return heap_beginscan_internal(relation, snapshot, nkeys, key,
+	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
 								   allow_strat, allow_sync, true,
 								   false, false, false);
 }
@@ -1349,7 +1399,7 @@ HeapScanDesc
 heap_beginscan_bm(Relation relation, Snapshot snapshot,
 				  int nkeys, ScanKey key)
 {
-	return heap_beginscan_internal(relation, snapshot, nkeys, key,
+	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
 								   false, false, true, true, false, false);
 }
 
@@ -1358,7 +1408,7 @@ heap_beginscan_sampling(Relation relation, Snapshot snapshot,
 						int nkeys, ScanKey key,
 						bool allow_strat, bool allow_pagemode)
 {
-	return heap_beginscan_internal(relation, snapshot, nkeys, key,
+	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
 								   allow_strat, false, allow_pagemode,
 								   false, true, false);
 }
@@ -1366,6 +1416,7 @@ heap_beginscan_sampling(Relation relation, Snapshot snapshot,
 static HeapScanDesc
 heap_beginscan_internal(Relation relation, Snapshot snapshot,
 						int nkeys, ScanKey key,
+						ParallelHeapScanDesc parallel_scan,
 						bool allow_strat, bool allow_sync, bool allow_pagemode,
 						bool is_bitmapscan, bool is_samplescan, bool temp_snap)
 {
@@ -1394,6 +1445,7 @@ heap_beginscan_internal(Relation relation, Snapshot snapshot,
 	scan->rs_allow_strat = allow_strat;
 	scan->rs_allow_sync = allow_sync;
 	scan->rs_temp_snap = temp_snap;
+	scan->rs_parallel = parallel_scan;
 
 	/*
 	 * we can use page-at-a-time mode if it's an MVCC-safe snapshot
@@ -1487,6 +1539,94 @@ heap_endscan(HeapScanDesc scan)
 }
 
 /* ----------------
+ *		heap_parallelscan_estimate - estimate storage for ParallelHeapScanDesc
+ *
+ *		Sadly, this doesn't reduce to a constant, because the size required
+ *		to serialize the snapshot can vary.
+ * ----------------
+ */
+Size
+heap_parallelscan_estimate(Snapshot snapshot)
+{
+	return add_size(offsetof(ParallelHeapScanDescData, phs_snapshot_data),
+					EstimateSnapshotSpace(snapshot));
+}
+
+/* ----------------
+ *		heap_parallelscan_initialize - initialize ParallelHeapScanDesc
+ *
+ *		Must allow as many bytes of shared memory as returned by
+ *		heap_parallelscan_estimate.  Call this just once in the leader
+ *		process; then, individual workers attach via heap_beginscan_parallel.
+ * ----------------
+ */
+void
+heap_parallelscan_initialize(ParallelHeapScanDesc target, Relation relation,
+							 Snapshot snapshot)
+{
+	target->phs_relid = RelationGetRelid(relation);
+	target->phs_nblocks = RelationGetNumberOfBlocks(relation);
+	SpinLockInit(&target->phs_mutex);
+	target->phs_cblock = 0;
+	SerializeSnapshot(snapshot, target->phs_snapshot_data);
+}
+/* ----------------
+ *		heap_parallelscan_nextpage - get the next page to scan
+ *
+ *		A return value larger than the number of blocks to be scanned
+ *		indicates end of scan.  Note, however, that other backends could still
+ *		be scanning if they grabbed a page to scan and aren't done with it yet.
+ * ----------------
+ */
+static BlockNumber
+heap_parallelscan_nextpage(ParallelHeapScanDesc parallel_scan)
+{
+	BlockNumber	page = InvalidBlockNumber;
+
+	/* we treat InvalidBlockNumber specially here to avoid overflow */
+	SpinLockAcquire(&parallel_scan->phs_mutex);
+	if (parallel_scan->phs_cblock != InvalidBlockNumber)
+		page = parallel_scan->phs_cblock++;
+	SpinLockRelease(&parallel_scan->phs_mutex);
+
+	return page;
+}
+
+/* ----------------
+ *		heap_beginscan_parallel - join a parallel scan
+ *
+ *		Caller must hold a suitable lock on the correct relation.
+ * ----------------
+ */
+HeapScanDesc
+heap_beginscan_parallel(Relation relation, ParallelHeapScanDesc parallel_scan)
+{
+	Snapshot		snapshot;
+
+	Assert(RelationGetRelid(relation) == parallel_scan->phs_relid);
+	snapshot = RestoreSnapshot(parallel_scan->phs_snapshot_data);
+	RegisterSnapshot(snapshot);
+
+	return heap_beginscan_internal(relation, snapshot, 0, NULL, parallel_scan,
+								   true, true, true, false, false, true);
+}
+
+/* ----------------
+ *		heap_parallel_rescan		- restart a parallel relation scan
+ * ----------------
+ */
+void
+heap_parallel_rescan(ParallelHeapScanDesc pscan,
+					 HeapScanDesc scan)
+{
+	if (pscan != NULL)
+		scan->rs_parallel = pscan;
+
+	heap_rescan(scan,			/* scan desc */
+				NULL);			/* new scan keys */
+}
+
+/* ----------------
  *		heap_getnext	- retrieve next tuple in scan
  *
  *		Fix to work with index relations.
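The block handout in heap_parallelscan_nextpage above can be sketched in isolation as follows. This is only an illustration under stated assumptions, not the patch's code: the names SharedScan, shared_scan_init, shared_scan_nextpage and shared_scan_done are hypothetical, a C11 atomic compare-and-swap loop stands in for the phs_mutex spinlock, and treating any value at or beyond nblocks as end of scan is an assumption of the sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define INVALID_BLOCK UINT32_MAX	/* stand-in for InvalidBlockNumber */

/* Hypothetical shared state; the patch keeps this in ParallelHeapScanDescData. */
typedef struct SharedScan
{
	uint32_t	nblocks;		/* blocks in the relation at scan start */
	_Atomic uint32_t cblock;	/* next block to hand out */
} SharedScan;

static void
shared_scan_init(SharedScan *scan, uint32_t nblocks)
{
	assert(nblocks != INVALID_BLOCK);
	scan->nblocks = nblocks;
	atomic_init(&scan->cblock, 0);
}

/*
 * Hand out the next block number.  The counter stops advancing once it
 * reaches INVALID_BLOCK, so it can never wrap around to zero.
 */
static uint32_t
shared_scan_nextpage(SharedScan *scan)
{
	uint32_t	cur = atomic_load(&scan->cblock);

	/* On failure the CAS reloads cur with the latest value, then we retry. */
	while (cur != INVALID_BLOCK &&
		   !atomic_compare_exchange_weak(&scan->cblock, &cur, cur + 1))
		;

	return cur;
}

/* Assumption of this sketch: any value at or past nblocks ends the scan. */
static int
shared_scan_done(const SharedScan *scan, uint32_t page)
{
	return page >= scan->nblocks;
}
```

Each participant would loop on shared_scan_nextpage until shared_scan_done reports true, so a fast worker simply claims more blocks than a slow one.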
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 18921c4..967672a 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -671,7 +671,7 @@ CREATE VIEW pg_replication_slots AS
             L.datoid,
             D.datname AS database,
             L.active,
-            L.active_pid,
+			L.active_pid,
             L.xmin,
             L.catalog_xmin,
             L.restart_lsn
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 3e14c53..aa10678 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -4394,7 +4394,7 @@ copy_dest_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
 /*
  * copy_dest_receive --- receive one tuple
  */
-static void
+static bool
 copy_dest_receive(TupleTableSlot *slot, DestReceiver *self)
 {
 	DR_copy    *myState = (DR_copy *) self;
@@ -4406,6 +4406,8 @@ copy_dest_receive(TupleTableSlot *slot, DestReceiver *self)
 	/* And send the data */
 	CopyOneRowTo(cstate, InvalidOid, slot->tts_values, slot->tts_isnull);
 	myState->processed++;
+
+	return true;
 }
 
 /*
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index e8f0d79..2eac70e 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -62,7 +62,7 @@ typedef struct
 static ObjectAddress CreateAsReladdr = {InvalidOid, InvalidOid, 0};
 
 static void intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo);
-static void intorel_receive(TupleTableSlot *slot, DestReceiver *self);
+static bool intorel_receive(TupleTableSlot *slot, DestReceiver *self);
 static void intorel_shutdown(DestReceiver *self);
 static void intorel_destroy(DestReceiver *self);
 
@@ -482,7 +482,7 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
 /*
  * intorel_receive --- receive one tuple
  */
-static void
+static bool
 intorel_receive(TupleTableSlot *slot, DestReceiver *self)
 {
 	DR_intorel *myState = (DR_intorel *) self;
@@ -507,6 +507,8 @@ intorel_receive(TupleTableSlot *slot, DestReceiver *self)
 				myState->bistate);
 
 	/* We know this is a newly created relation, so there are no indexes */
+
+	return true;
 }
 
 /*
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 9d47308..4b98aaa 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -20,6 +20,7 @@
 #include "commands/defrem.h"
 #include "commands/prepare.h"
 #include "executor/hashjoin.h"
+#include "executor/nodeFunnel.h"
 #include "foreign/fdwapi.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
@@ -728,6 +729,8 @@ ExplainPreScanNode(PlanState *planstate, Bitmapset **rels_used)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
+		case T_Funnel:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -933,6 +936,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_SeqScan:
 			pname = sname = "Seq Scan";
 			break;
+		case T_PartialSeqScan:
+			pname = sname = "Partial Seq Scan";
+			break;
+		case T_Funnel:
+			pname = sname = "Funnel";
+			break;
 		case T_IndexScan:
 			pname = sname = "Index Scan";
 			break;
@@ -1098,6 +1107,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
+		case T_Funnel:
 		case T_BitmapHeapScan:
 		case T_TidScan:
 		case T_SubqueryScan:
@@ -1245,6 +1256,16 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	}
 
 	/*
+	 * Aggregate the instrumentation information of all the backend
+	 * workers for the Funnel node.  Though we already accumulate this
+	 * information when the last tuple is fetched from the Funnel node,
+	 * this covers cases where we don't fetch all tuples from the node,
+	 * such as under a Limit node.
+	 */
+	if (es->analyze && nodeTag(plan) == T_Funnel)
+		FinishParallelSetupAndAccumStats((FunnelState *)planstate);
+
+	/*
 	 * We have to forcibly clean up the instrumentation state because we
 	 * haven't done ExecutorEnd yet.  This is pretty grotty ...
 	 *
@@ -1361,6 +1382,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 				show_tidbitmap_info((BitmapHeapScanState *) planstate, es);
 			break;
 		case T_SeqScan:
+		case T_PartialSeqScan:
 		case T_ValuesScan:
 		case T_CteScan:
 		case T_WorkTableScan:
@@ -1371,6 +1393,14 @@ ExplainNode(PlanState *planstate, List *ancestors,
 				show_instrumentation_count("Rows Removed by Filter", 1,
 										   planstate, es);
 			break;
+		case T_Funnel:
+			show_scan_qual(plan->qual, "Filter", planstate, ancestors, es);
+			if (plan->qual)
+				show_instrumentation_count("Rows Removed by Filter", 1,
+										   planstate, es);
+			ExplainPropertyInteger("Number of Workers",
+				((Funnel *) plan)->num_workers, es);
+			break;
 		case T_FunctionScan:
 			if (es->verbose)
 			{
@@ -2357,6 +2387,8 @@ ExplainTargetRel(Plan *plan, Index rti, ExplainState *es)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
+		case T_Funnel:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index eb16bb3..78f822b 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -56,7 +56,7 @@ typedef struct
 static int	matview_maintenance_depth = 0;
 
 static void transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo);
-static void transientrel_receive(TupleTableSlot *slot, DestReceiver *self);
+static bool transientrel_receive(TupleTableSlot *slot, DestReceiver *self);
 static void transientrel_shutdown(DestReceiver *self);
 static void transientrel_destroy(DestReceiver *self);
 static void refresh_matview_datafill(DestReceiver *dest, Query *query,
@@ -422,7 +422,7 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
 /*
  * transientrel_receive --- receive one tuple
  */
-static void
+static bool
 transientrel_receive(TupleTableSlot *slot, DestReceiver *self)
 {
 	DR_transientrel *myState = (DR_transientrel *) self;
@@ -441,6 +441,8 @@ transientrel_receive(TupleTableSlot *slot, DestReceiver *self)
 				myState->bistate);
 
 	/* We know this is a newly created relation, so there are no indexes */
+
+	return true;
 }
 
 /*
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 08cba6f..be1f47e 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -13,17 +13,17 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = execAmi.o execCurrent.o execGrouping.o execIndexing.o execJunk.o \
-       execMain.o execProcnode.o execQual.o execScan.o execTuples.o \
+       execMain.o execParallel.o execProcnode.o execQual.o execScan.o execTuples.o \
        execUtils.o functions.o instrument.o nodeAppend.o nodeAgg.o \
        nodeBitmapAnd.o nodeBitmapOr.o \
-       nodeBitmapHeapscan.o nodeBitmapIndexscan.o nodeCustom.o nodeHash.o \
-       nodeHashjoin.o nodeIndexscan.o nodeIndexonlyscan.o \
+       nodeBitmapHeapscan.o nodeBitmapIndexscan.o nodeCustom.o nodeFunnel.o \
+       nodeHash.o nodeHashjoin.o nodeIndexscan.o nodeIndexonlyscan.o \
        nodeLimit.o nodeLockRows.o \
        nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
-       nodeNestloop.o nodeFunctionscan.o nodeRecursiveunion.o nodeResult.o \
-       nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
+       nodeNestloop.o nodeFunctionscan.o nodePartialSeqscan.o nodeRecursiveunion.o \
+       nodeResult.o nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
        nodeValuesscan.o nodeCtescan.o nodeWorktablescan.o \
        nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
-       nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o spi.o
+       nodeForeignscan.o nodeWindowAgg.o tqueue.o tstoreReceiver.o spi.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 4948a26..7f9baa6 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -24,6 +24,7 @@
 #include "executor/nodeCustom.h"
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeFunctionscan.h"
+#include "executor/nodeFunnel.h"
 #include "executor/nodeGroup.h"
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
@@ -37,6 +38,7 @@
 #include "executor/nodeMergejoin.h"
 #include "executor/nodeModifyTable.h"
 #include "executor/nodeNestloop.h"
+#include "executor/nodePartialSeqscan.h"
 #include "executor/nodeRecursiveunion.h"
 #include "executor/nodeResult.h"
 #include "executor/nodeSamplescan.h"
@@ -160,6 +162,14 @@ ExecReScan(PlanState *node)
 			ExecReScanSampleScan((SampleScanState *) node);
 			break;
 
+		case T_PartialSeqScanState:
+			ExecReScanPartialSeqScan((PartialSeqScanState *) node);
+			break;
+
+		case T_FunnelState:
+			ExecReScanFunnel((FunnelState *) node);
+			break;
+
 		case T_IndexScanState:
 			ExecReScanIndexScan((IndexScanState *) node);
 			break;
@@ -463,6 +473,10 @@ ExecSupportsBackwardScan(Plan *node)
 		case T_CteScan:
 			return TargetListSupportsBackwardScan(node->targetlist);
 
+		case T_Funnel:
+		case T_PartialSeqScan:
+			return false;
+
 		case T_IndexScan:
 			return IndexSupportsBackwardScan(((IndexScan *) node)->indexid) &&
 				TargetListSupportsBackwardScan(node->targetlist);
diff --git a/src/backend/executor/execCurrent.c b/src/backend/executor/execCurrent.c
index bcd287f..7a44462 100644
--- a/src/backend/executor/execCurrent.c
+++ b/src/backend/executor/execCurrent.c
@@ -262,6 +262,8 @@ search_plan_tree(PlanState *node, Oid table_oid)
 			 */
 		case T_SeqScanState:
 		case T_SampleScanState:
+		case T_PartialSeqScanState:
+		case T_FunnelState:
 		case T_IndexScanState:
 		case T_IndexOnlyScanState:
 		case T_BitmapHeapScanState:
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index aefc9fa..a6417ef 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -45,9 +45,11 @@
 #include "commands/matview.h"
 #include "commands/trigger.h"
 #include "executor/execdebug.h"
+#include "executor/execParallel.h"
 #include "foreign/fdwapi.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
+#include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
 #include "parser/parsetree.h"
 #include "storage/bufmgr.h"
@@ -323,6 +325,9 @@ standard_ExecutorRun(QueryDesc *queryDesc,
 	operation = queryDesc->operation;
 	dest = queryDesc->dest;
 
+	/* inform executor to collect buffer usage stats from parallel workers. */
+	estate->total_time = queryDesc->totaltime ? 1 : 0;
+
 	/*
 	 * startup tuple receiver, if we will be emitting tuples
 	 */
@@ -354,7 +359,15 @@ standard_ExecutorRun(QueryDesc *queryDesc,
 		(*dest->rShutdown) (dest);
 
 	if (queryDesc->totaltime)
+	{
+		/*
+		 * Accumulate the stats from parallel workers before stopping
+		 * the node.
+		 */
+		(void) planstate_tree_walker((Node*) queryDesc->planstate,
+									 NULL, ExecParallelBufferUsageAccum, 0);
 		InstrStopNode(queryDesc->totaltime, estate->es_processed);
+	}
 
 	MemoryContextSwitchTo(oldcontext);
 }
@@ -1582,7 +1595,15 @@ ExecutePlan(EState *estate,
 		 * practice, this is probably always the case at this point.)
 		 */
 		if (sendTuples)
-			(*dest->receiveSlot) (slot, dest);
+		{
+			/*
+			 * If we are not able to send the tuple, we assume that the
+			 * destination has closed and we won't be able to send any more
+			 * tuples, so we just end the loop.
+			 */
+			if (!((*dest->receiveSlot) (slot, dest)))
+				break;
+		}
 
 		/*
 		 * Count tuples processed, if this is a SELECT.  (For other operation
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
new file mode 100644
index 0000000..33e83fe
--- /dev/null
+++ b/src/backend/executor/execParallel.c
@@ -0,0 +1,592 @@
+/*-------------------------------------------------------------------------
+ *
+ * execParallel.c
+ *	  Support routines for setting up backend workers for parallel execution.
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/execParallel.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execParallel.h"
+#include "executor/nodeFunnel.h"
+#include "executor/nodePartialSeqscan.h"
+#include "nodes/nodeFuncs.h"
+#include "optimizer/planmain.h"
+#include "optimizer/planner.h"
+#include "tcop/tcopprot.h"
+
+
+#define PARALLEL_TUPLE_QUEUE_SIZE					65536
+
+static void ParallelQueryMain(dsm_segment *seg, shm_toc *toc);
+static void
+EstimateParallelSupportInfoSpace(ParallelContext *pcxt, ParamListInfo params,
+								 List *serialized_param_exec_vals,
+								 int instOptions, Size *params_size,
+								 Size *params_exec_size);
+static void
+StoreParallelSupportInfo(ParallelContext *pcxt, ParamListInfo params,
+						 List *serialized_param_exec_vals,
+						 int instOptions, Size params_size,
+						 Size params_exec_size,
+						 char **inst_options_space,
+						 char **buffer_usage_space);
+static void
+EstimatePlannedStmtSpace(ParallelContext *pcxt, PlanState* planstate,
+						 char *plannedstmt_str, Size *plannedstmt_len,
+						 Size *pscan_size);
+static void
+StorePlannedStmt(ParallelContext *pcxt, PlanState* planstate,
+				 char *plannedstmt_str, Size plannedstmt_size,
+				 Size pscan_size);
+static void EstimateResponseQueueSpace(ParallelContext *pcxt);
+static void
+StoreResponseQueue(ParallelContext *pcxt,
+				   shm_mq_handle ***responseqp);
+static void
+ExecParallelGetPlannedStmt(shm_toc *toc, PlannedStmt **plannedstmt);
+static void
+GetParallelSupportInfo(shm_toc *toc, ParamListInfo *params,
+					   List **serialized_param_exec_vals,
+					   int *inst_options, char **instrument,
+					   char **buffer_usage);
+static void
+SetupResponseQueue(dsm_segment *seg, shm_toc *toc, shm_mq **mq,
+				   shm_mq_handle **responseq);
+
+
+/*
+ * This is required for parallel plan execution to fetch the
+ * information from dsm.
+ */
+static shm_toc *parallel_shm_toc = NULL;
+
+/*
+ * EstimateParallelSupportInfoSpace
+ *
+ * Estimate the amount of space required to record information of
+ * bind parameters, PARAM_EXEC parameters and instrumentation
+ * information that needs to be retrieved from parallel workers.
+ */
+void
+EstimateParallelSupportInfoSpace(ParallelContext *pcxt, ParamListInfo params,
+								 List *serialized_param_exec_vals,
+								 int instOptions, Size *params_size,
+								 Size *params_exec_size)
+{
+	*params_size = EstimateBoundParametersSpace(params);
+	shm_toc_estimate_chunk(&pcxt->estimator, *params_size);
+
+	*params_exec_size = EstimateExecParametersSpace(serialized_param_exec_vals);
+	shm_toc_estimate_chunk(&pcxt->estimator, *params_exec_size);
+
+	/*
+	 * We expect each worker to populate the BufferUsage structure
+	 * allocated by master backend and then master backend will aggregate
+	 * all the usage along with its own, so account for it per worker.
+	 */
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   sizeof(BufferUsage) * pcxt->nworkers);
+
+	/* account for instrumentation options. */
+	shm_toc_estimate_chunk(&pcxt->estimator, sizeof(int));
+
+	/*
+	 * We expect each worker to populate the instrumentation structure
+	 * allocated by master backend and then master backend will aggregate
+	 * all the information, so account for it per worker.
+	 */
+	if (instOptions)
+	{
+		shm_toc_estimate_chunk(&pcxt->estimator,
+							   sizeof(Instrumentation) * pcxt->nworkers);
+		/* keys for parallel support information. */
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+	}
+
+	/* keys for parallel support information. */
+	shm_toc_estimate_keys(&pcxt->estimator, 4);
+}
+
+/*
+ * StoreParallelSupportInfo
+ * 
+ * Sets up the bind parameters, PARAM_EXEC parameters and instrumentation
+ * information required for parallel execution.
+ */
+void
+StoreParallelSupportInfo(ParallelContext *pcxt, ParamListInfo params,
+						 List *serialized_param_exec_vals,
+						 int instOptions, Size params_size,
+						 Size params_exec_size,
+						 char **inst_options_space,
+						 char **buffer_usage_space)
+{
+	char	*paramsdata;
+	char	*paramsexecdata;
+	int		*inst_options;
+
+	/*
+	 * Store the bind parameter list in dynamic shared memory.  This is
+	 * used for parameters in prepared queries.
+	 */
+	paramsdata = shm_toc_allocate(pcxt->toc, params_size);
+	SerializeBoundParams(params, params_size, paramsdata);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_PARAMS, paramsdata);
+
+	/*
+	 * Store the PARAM_EXEC parameter list in dynamic shared memory.  This
+	 * is used for evaluating plan->initPlan params.
+	 */
+	paramsexecdata = shm_toc_allocate(pcxt->toc, params_exec_size);
+	SerializeExecParams(serialized_param_exec_vals, params_exec_size, paramsexecdata);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_PARAMS_EXEC, paramsexecdata);
+
+	/*
+	 * Allocate space for BufferUsage information to be filled by
+	 * each worker.
+	 */
+	*buffer_usage_space =
+			shm_toc_allocate(pcxt->toc, sizeof(BufferUsage) * pcxt->nworkers);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFF_USAGE, *buffer_usage_space);
+
+	/* Store instrument options in dynamic shared memory. */
+	inst_options = shm_toc_allocate(pcxt->toc, sizeof(int));
+	*inst_options = instOptions;
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_INST_OPTIONS, inst_options);
+
+	/*
+	 * Allocate space for instrumentation information to be filled by
+	 * each worker.
+	 */
+	if (instOptions)
+	{
+		*inst_options_space =
+			shm_toc_allocate(pcxt->toc, sizeof(Instrumentation) * pcxt->nworkers);
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_INST_INFO, *inst_options_space);
+	}
+}
+
+/*
+ * EstimatePlannedStmtSpace
+ *
+ * Estimate the amount of space required to record information of
+ * planned statement and parallel node specific information that needs
+ * to be copied to parallel workers.
+ */
+void
+EstimatePlannedStmtSpace(ParallelContext *pcxt, PlanState* planstate,
+						 char *plannedstmt_str, Size *plannedstmt_len,
+						 Size *pscan_size)
+{
+	/* Estimate space for planned statement. */
+	*plannedstmt_len = strlen(plannedstmt_str) + 1;
+	shm_toc_estimate_chunk(&pcxt->estimator, *plannedstmt_len);
+
+	/* keys for planned statement information. */
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+	(void) planstate_tree_walker((Node*)planstate, pcxt, ExecParallelEstimate,
+								 pscan_size);
+}
+
+/*
+ * StorePlannedStmt
+ * 
+ * Sets up the planned statement and node specific information.
+ */
+void
+StorePlannedStmt(ParallelContext *pcxt, PlanState* planstate,
+				 char *plannedstmt_str, Size plannedstmt_size,
+				 Size pscan_size)
+{
+	char		*plannedstmtdata;
+
+	/* Store planned statement in dynamic shared memory. */
+	plannedstmtdata = shm_toc_allocate(pcxt->toc, plannedstmt_size);
+	memcpy(plannedstmtdata, plannedstmt_str, plannedstmt_size);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_PLANNEDSTMT, plannedstmtdata);
+
+	(void) planstate_tree_walker((Node*)planstate, pcxt, ExecParallelInitializeDSM,
+								 &pscan_size);
+}
+
+/*
+ * EstimateResponseQueueSpace
+ *
+ * Estimate the amount of space required to record information of
+ * tuple queues that need to be established between parallel workers
+ * and master backend.
+ */
+void
+EstimateResponseQueueSpace(ParallelContext *pcxt)
+{
+	/* Estimate space for one tuple queue per worker. */
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   (Size) PARALLEL_TUPLE_QUEUE_SIZE * pcxt->nworkers);
+
+	/* keys for response queue. */
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/*
+ * StoreResponseQueue
+ *
+ * Sets up the response queues that backend workers use to
+ * return tuples to the main backend.
+ */
+void
+StoreResponseQueue(ParallelContext *pcxt,
+				   shm_mq_handle ***responseqp)
+{
+	shm_mq		*mq;
+	char		*tuple_queue_space;
+	int			i;
+
+	/* Allocate memory for shared memory queue handles. */
+	*responseqp = (shm_mq_handle**) palloc(pcxt->nworkers * sizeof(shm_mq_handle*));
+
+	/*
+	 * Establish one message queue per worker in dynamic shared memory.
+	 * These queues should be used to transmit tuple data.
+	 */
+	tuple_queue_space =
+	   shm_toc_allocate(pcxt->toc, PARALLEL_TUPLE_QUEUE_SIZE * pcxt->nworkers);
+	for (i = 0; i < pcxt->nworkers; ++i)
+	{
+		mq = shm_mq_create(tuple_queue_space + i * PARALLEL_TUPLE_QUEUE_SIZE,
+						   (Size) PARALLEL_TUPLE_QUEUE_SIZE);
+		
+		shm_mq_set_receiver(mq, MyProc);
+
+		/*
+		 * Attach the queue before launching a worker, so that we'll automatically
+		 * detach the queue if we error out.  (Otherwise, the worker might sit
+		 * there trying to write the queue long after we've gone away.)
+		 */
+		(*responseqp)[i] = shm_mq_attach(mq, pcxt->seg, NULL);
+	}
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_TUPLE_QUEUE, tuple_queue_space);
+}
+
+/*
+ *	ExecParallelEstimate
+ *
+ *		Estimate the amount of space required to record the information
+ *		of a parallel node that needs to be copied to parallel workers.
+ */
+bool
+ExecParallelEstimate(Node *node, ParallelContext *pcxt,
+					 Size *pscan_size)
+{
+	if (node == NULL)
+		return false;
+
+	switch (nodeTag(node))
+	{
+		case T_ResultState:
+			{
+				PlanState *planstate = ((ResultState*)node)->ps.lefttree;
+
+				return planstate_tree_walker((Node*)planstate, pcxt,
+											 ExecParallelEstimate, pscan_size);
+			}
+		case T_PartialSeqScanState:
+			{
+				EState		*estate = ((PartialSeqScanState*)node)->ss.ps.state;
+
+				*pscan_size = heap_parallelscan_estimate(estate->es_snapshot);
+				shm_toc_estimate_chunk(&pcxt->estimator, *pscan_size);
+
+				/* key for partial scan information. */
+				shm_toc_estimate_keys(&pcxt->estimator, 1);
+				return true;
+			}
+		default:
+			break;
+	}
+
+	return false;
+}
+
+/*
+ *	ExecParallelInitializeDSM
+ *
+ *		Store the information of parallel node in dsm.
+ */
+bool
+ExecParallelInitializeDSM(Node *node, ParallelContext *pcxt,
+						  Size *pscan_size)
+{
+	ParallelHeapScanDesc pscan;
+
+	if (node == NULL)
+		return false;
+
+	switch (nodeTag(node))
+	{
+		case T_ResultState:
+			{
+				PlanState *planstate = ((ResultState*)node)->ps.lefttree;
+
+				return planstate_tree_walker((Node*)planstate, pcxt,
+											 ExecParallelInitializeDSM, pscan_size);
+			}
+		case T_PartialSeqScanState:
+			{
+				EState	*estate = ((PartialSeqScanState*)node)->ss.ps.state;
+
+				/* Store parallel heap scan descriptor in dynamic shared memory. */
+				pscan = shm_toc_allocate(pcxt->toc, *pscan_size);
+				heap_parallelscan_initialize(pscan, ((PartialSeqScanState*)node)->ss.ss_currentRelation, estate->es_snapshot);
+				shm_toc_insert(pcxt->toc, PARALLEL_KEY_SCAN, pscan);
+				return true;
+			}
+		default:
+			break;
+	}
+
+	return false;
+}
+
+/*
+ * InitializeParallelWorkers
+ *
+ *	Sets up the required infrastructure for backend workers to
+ *	perform execution and return results to the main backend.
+ */
+void
+InitializeParallelWorkers(PlanState *planstate,
+						  List *serialized_param_exec_vals,
+						  EState *estate,
+						  char **inst_options_space,
+						  char **buffer_usage_space,
+						  shm_mq_handle ***responseqp,
+						  ParallelContext **pcxtp,
+						  int nWorkers)
+{
+	Size		params_size, params_exec_size, pscan_size, plannedstmt_size;
+	char		*plannedstmt_str;
+	PlannedStmt	*plannedstmt;
+	ParallelContext *pcxt;
+
+	pcxt = CreateParallelContext(ParallelQueryMain, nWorkers);
+
+	plannedstmt = create_parallel_worker_plannedstmt((PartialSeqScan *)planstate->plan,
+													 estate->es_range_table,
+													 estate->es_plannedstmt->nParamExec);
+	plannedstmt_str = nodeToString(plannedstmt);
+
+	EstimatePlannedStmtSpace(pcxt, planstate, plannedstmt_str,
+							 &plannedstmt_size, &pscan_size);
+	EstimateParallelSupportInfoSpace(pcxt, estate->es_param_list_info,
+									 serialized_param_exec_vals,
+									 estate->es_instrument, &params_size,
+									 &params_exec_size);
+	EstimateResponseQueueSpace(pcxt);
+
+	InitializeParallelDSM(pcxt);
+	
+	StorePlannedStmt(pcxt, planstate, plannedstmt_str,
+					 plannedstmt_size, pscan_size);
+	StoreParallelSupportInfo(pcxt, estate->es_param_list_info,
+							 serialized_param_exec_vals,
+							 estate->es_instrument,
+							 params_size,
+							 params_exec_size,
+							 inst_options_space,
+							 buffer_usage_space);
+	StoreResponseQueue(pcxt, responseqp);
+
+	/* Return results to caller. */
+	*pcxtp = pcxt;
+}
+
+/*
+ * GetParallelSupportInfo
+ *
+ * Look up based on keys in dynamic shared memory segment
+ * and get the bind parameters, PARAM_EXEC parameters and
+ * instrumentation information required to perform parallel
+ * operation.
+ */
+void
+GetParallelSupportInfo(shm_toc *toc, ParamListInfo *params,
+					   List **serialized_param_exec_vals,
+					   int *inst_options, char **instrument,
+					   char **buffer_usage)
+{
+	char		*paramsdata;
+	char		*paramsexecdata;
+	char		*inst_options_space;
+	char		*buffer_usage_space;
+	int			*instoptions;
+
+	if (params)
+	{
+		paramsdata = shm_toc_lookup(toc, PARALLEL_KEY_PARAMS);
+		*params = RestoreBoundParams(paramsdata);
+	}
+
+	if (serialized_param_exec_vals)
+	{
+		paramsexecdata = shm_toc_lookup(toc, PARALLEL_KEY_PARAMS_EXEC);
+		*serialized_param_exec_vals = RestoreExecParams(paramsexecdata);
+	}
+
+	if (inst_options)
+	{
+		instoptions	= shm_toc_lookup(toc, PARALLEL_KEY_INST_OPTIONS);
+		*inst_options = *instoptions;
+		if (*inst_options)
+		{
+			inst_options_space = shm_toc_lookup(toc, PARALLEL_KEY_INST_INFO);
+			*instrument = (inst_options_space +
+				ParallelWorkerNumber * sizeof(Instrumentation));
+		}
+	}
+
+	if (buffer_usage)
+	{
+		buffer_usage_space = shm_toc_lookup(toc, PARALLEL_KEY_BUFF_USAGE);
+		*buffer_usage = (buffer_usage_space +
+					 ParallelWorkerNumber * sizeof(BufferUsage));
+	}
+}
+
+/*
+ * ExecParallelGetPlannedStmt
+ *
+ * Look up based on keys in dynamic shared memory segment
+ * and get the planned statement required to perform
+ * parallel operation.
+ */
+void
+ExecParallelGetPlannedStmt(shm_toc *toc, PlannedStmt **plannedstmt)
+{
+	char		*plannedstmtdata;
+
+	plannedstmtdata = shm_toc_lookup(toc, PARALLEL_KEY_PLANNEDSTMT);
+
+	*plannedstmt = (PlannedStmt *) stringToNode(plannedstmtdata);
+
+	/* Fill in opfuncid values if missing */
+	fix_node_funcids((*plannedstmt)->planTree);
+}
+
+/*
+ * SetupResponseQueue
+ *
+ * Look up based on keys in dynamic shared memory segment
+ * and get the tuple queue information for a particular worker,
+ * attach to the queue and redirect all further responses from
+ * worker backend via that queue.
+ */
+void
+SetupResponseQueue(dsm_segment *seg, shm_toc *toc, shm_mq **mq,
+				   shm_mq_handle **responseq)
+{
+	char		*tuple_queue_space;
+
+	tuple_queue_space = shm_toc_lookup(toc, PARALLEL_KEY_TUPLE_QUEUE);
+	*mq = (shm_mq *) (tuple_queue_space +
+		ParallelWorkerNumber * PARALLEL_TUPLE_QUEUE_SIZE);
+
+	shm_mq_set_sender(*mq, MyProc);
+	*responseq = shm_mq_attach(*mq, seg, NULL);
+}
+
+/*
+ * GetParallelShmToc
+ */
+shm_toc *
+GetParallelShmToc(void)
+{
+	return parallel_shm_toc;
+}
+
+/*
+ * ParallelQueryMain
+ *
+ * Execute the operation and return the tuples or other information
+ * to the node driving parallelism.
+ */
+void
+ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
+{
+	shm_mq			*mq;
+	shm_mq_handle	*responseq;
+	PlannedStmt		*plannedstmt;
+	ParamListInfo	params;
+	List			*serialized_param_exec_vals;
+	int				inst_options;
+	char			*instrument = NULL;
+	char			*buffer_usage = NULL;
+	ParallelStmt	*parallelstmt;
+
+	SetupResponseQueue(seg, toc, &mq, &responseq);
+
+	ExecParallelGetPlannedStmt(toc, &plannedstmt);
+	GetParallelSupportInfo(toc, &params, &serialized_param_exec_vals,
+						   &inst_options, &instrument, &buffer_usage);
+
+	parallelstmt = palloc(sizeof(ParallelStmt));
+
+	parallelstmt->plannedstmt = plannedstmt;
+	parallelstmt->params	= params;
+	parallelstmt->serialized_param_exec_vals = serialized_param_exec_vals;
+	parallelstmt->inst_options = inst_options;
+	parallelstmt->instrument = instrument;
+	parallelstmt->buffer_usage = buffer_usage;
+	parallelstmt->responseq = responseq;
+
+	parallel_shm_toc = toc;
+
+	/* Execute the worker command. */
+	exec_parallel_stmt(parallelstmt);
+
+	/*
+	 * Once we are done with sending tuples, detach from
+	 * shared memory message queue used to send tuples.
+	 */
+	shm_mq_detach(mq);
+}
+
+/*
+ * ExecParallelBufferUsageAccum
+ *
+ * Recursively accumulate the stats for all the funnel nodes
+ * in a plan state tree.
+ */
+bool
+ExecParallelBufferUsageAccum(Node *node)
+{
+	if (node == NULL)
+		return false;
+
+	switch (nodeTag(node))
+	{
+		case T_FunnelState:
+			{
+				FinishParallelSetupAndAccumStats((FunnelState*)node);
+				return true;
+			}
+			break;
+		default:
+			break;
+	}
+
+	(void) planstate_tree_walker((Node*)((PlanState *)node)->lefttree, NULL,
+								 ExecParallelBufferUsageAccum, 0);
+	(void) planstate_tree_walker((Node*)((PlanState *)node)->righttree, NULL,
+								 ExecParallelBufferUsageAccum, 0);
+	return false;
+}
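StoreResponseQueue, SetupResponseQueue and GetParallelSupportInfo in execParallel.c all rely on the same layout: one chunk sized for nworkers slots is allocated once, and each worker finds its own slot by offsetting from the base with its worker number. A minimal standalone sketch of that arithmetic follows; worker_slot and total_queue_space are hypothetical helper names, and only the 65536-byte queue size is taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_SIZE 65536		/* same value as PARALLEL_TUPLE_QUEUE_SIZE */

/* Total bytes to reserve for one fixed-size tuple queue per worker. */
static size_t
total_queue_space(int nworkers)
{
	assert(nworkers >= 0);
	return (size_t) QUEUE_SIZE * (size_t) nworkers;
}

/*
 * Start of the slot owned by the given worker inside a chunk holding
 * nworkers slots of slot_size bytes each.  The same arithmetic applies to
 * the tuple queues and to the per-worker statistics arrays.
 */
static char *
worker_slot(char *base, int worker, int nworkers, size_t slot_size)
{
	assert(worker >= 0 && worker < nworkers);
	return base + (size_t) worker * slot_size;
}
```

Casting to size_t before multiplying keeps the product from being computed in int. The estimate path in the patch casts this way, while the allocation call in StoreResponseQueue does not.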
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 03c2feb..e24a439 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -100,6 +100,8 @@
 #include "executor/nodeMergejoin.h"
 #include "executor/nodeModifyTable.h"
 #include "executor/nodeNestloop.h"
+#include "executor/nodePartialSeqscan.h"
+#include "executor/nodeFunnel.h"
 #include "executor/nodeRecursiveunion.h"
 #include "executor/nodeResult.h"
 #include "executor/nodeSamplescan.h"
@@ -196,6 +198,16 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 													  estate, eflags);
 			break;
 
+		case T_PartialSeqScan:
+			result = (PlanState *) ExecInitPartialSeqScan((PartialSeqScan *) node,
+														  estate, eflags);
+			break;
+
+		case T_Funnel:
+			result = (PlanState *) ExecInitFunnel((Funnel *) node,
+												  estate, eflags);
+			break;
+
 		case T_IndexScan:
 			result = (PlanState *) ExecInitIndexScan((IndexScan *) node,
 													 estate, eflags);
@@ -416,6 +428,14 @@ ExecProcNode(PlanState *node)
 			result = ExecSampleScan((SampleScanState *) node);
 			break;
 
+		case T_PartialSeqScanState:
+			result = ExecPartialSeqScan((PartialSeqScanState *) node);
+			break;
+
+		case T_FunnelState:
+			result = ExecFunnel((FunnelState *) node);
+			break;
+
 		case T_IndexScanState:
 			result = ExecIndexScan((IndexScanState *) node);
 			break;
@@ -658,6 +678,14 @@ ExecEndNode(PlanState *node)
 			ExecEndSampleScan((SampleScanState *) node);
 			break;
 
+		case T_PartialSeqScanState:
+			ExecEndPartialSeqScan((PartialSeqScanState *) node);
+			break;
+
+		case T_FunnelState:
+			ExecEndFunnel((FunnelState *) node);
+			break;
+
 		case T_IndexScanState:
 			ExecEndIndexScan((IndexScanState *) node);
 			break;
diff --git a/src/backend/executor/execTuples.c b/src/backend/executor/execTuples.c
index a05d8b1..d5619bd 100644
--- a/src/backend/executor/execTuples.c
+++ b/src/backend/executor/execTuples.c
@@ -1313,7 +1313,7 @@ do_tup_output(TupOutputState *tstate, Datum *values, bool *isnull)
 	ExecStoreVirtualTuple(slot);
 
 	/* send the tuple to the receiver */
-	(*tstate->dest->receiveSlot) (slot, tstate->dest);
+	(void) (*tstate->dest->receiveSlot) (slot, tstate->dest);
 
 	/* clean up */
 	ExecClearTuple(slot);
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index 3963408..adf7439 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -974,3 +974,28 @@ ShutdownExprContext(ExprContext *econtext, bool isCommit)
 
 	MemoryContextSwitchTo(oldcontext);
 }
+
+/*
+ * Populate the values of PARAM_EXEC parameters.
+ *
+ * This is used by worker backends to fill in the values
+ * of PARAM_EXEC parameters after fetching the same from
+ * dynamic shared memory.  This needs to be called before
+ * ExecutorRun.
+ */
+void
+PopulateParamExecParams(QueryDesc *queryDesc,
+						List *serialized_param_exec_vals)
+{
+	ListCell	*lparam;
+
+	foreach(lparam, serialized_param_exec_vals)
+	{
+		SerializedParamExecData* param_val = (SerializedParamExecData*) lfirst(lparam);
+
+		queryDesc->estate->es_param_exec_vals[param_val->paramid].value =
+																param_val->value;
+		queryDesc->estate->es_param_exec_vals[param_val->paramid].isnull =
+																param_val->isnull;
+	}
+}
diff --git a/src/backend/executor/functions.c b/src/backend/executor/functions.c
index 812a610..863bd64 100644
--- a/src/backend/executor/functions.c
+++ b/src/backend/executor/functions.c
@@ -167,7 +167,7 @@ static Datum postquel_get_single_result(TupleTableSlot *slot,
 static void sql_exec_error_callback(void *arg);
 static void ShutdownSQLFunction(Datum arg);
 static void sqlfunction_startup(DestReceiver *self, int operation, TupleDesc typeinfo);
-static void sqlfunction_receive(TupleTableSlot *slot, DestReceiver *self);
+static bool sqlfunction_receive(TupleTableSlot *slot, DestReceiver *self);
 static void sqlfunction_shutdown(DestReceiver *self);
 static void sqlfunction_destroy(DestReceiver *self);
 
@@ -1903,7 +1903,7 @@ sqlfunction_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
 /*
  * sqlfunction_receive --- receive one tuple
  */
-static void
+static bool
 sqlfunction_receive(TupleTableSlot *slot, DestReceiver *self)
 {
 	DR_sqlfunction *myState = (DR_sqlfunction *) self;
@@ -1913,6 +1913,8 @@ sqlfunction_receive(TupleTableSlot *slot, DestReceiver *self)
 
 	/* Store the filtered tuple into the tuplestore */
 	tuplestore_puttupleslot(myState->tstore, slot);
+
+	return true;
 }
 
 /*
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index f5351eb..283a136 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -19,9 +19,6 @@
 
 BufferUsage pgBufferUsage;
 
-static void BufferUsageAccumDiff(BufferUsage *dst,
-					 const BufferUsage *add, const BufferUsage *sub);
-
 
 /* Allocate new instrumentation structure(s) */
 Instrumentation *
@@ -127,8 +124,30 @@ InstrEndLoop(Instrumentation *instr)
 	instr->tuplecount = 0;
 }
 
+/*
+ * Aggregate the instrumentation information.  This is used
+ * to aggregate the information of worker backends.  We only
+ * need to sum the buffer usage and tuple count statistics; for
+ * the timing-related statistics it is sufficient to have the
+ * master backend's information.
+ */
+void
+InstrAggNode(Instrumentation *instr1, Instrumentation *instr2)
+{
+	/* count the returned tuples */
+	instr1->tuplecount += instr2->tuplecount;
+
+	instr1->nfiltered1 += instr2->nfiltered1;
+	instr1->nfiltered2 += instr2->nfiltered2;
+
+	/* Add the worker's buffer usage to this node's totals */
+	if (instr1->need_bufusage)
+		BufferUsageAdd(&instr1->bufusage, &instr2->bufusage);
+
+}
+
 /* dst += add - sub */
-static void
+void
 BufferUsageAccumDiff(BufferUsage *dst,
 					 const BufferUsage *add,
 					 const BufferUsage *sub)
@@ -148,3 +167,21 @@ BufferUsageAccumDiff(BufferUsage *dst,
 	INSTR_TIME_ACCUM_DIFF(dst->blk_write_time,
 						  add->blk_write_time, sub->blk_write_time);
 }
+
+/* dst += add */
+void
+BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
+{
+	dst->shared_blks_hit += add->shared_blks_hit;
+	dst->shared_blks_read += add->shared_blks_read;
+	dst->shared_blks_dirtied += add->shared_blks_dirtied;
+	dst->shared_blks_written += add->shared_blks_written;
+	dst->local_blks_hit += add->local_blks_hit;
+	dst->local_blks_read += add->local_blks_read;
+	dst->local_blks_dirtied += add->local_blks_dirtied;
+	dst->local_blks_written += add->local_blks_written;
+	dst->temp_blks_read += add->temp_blks_read;
+	dst->temp_blks_written += add->temp_blks_written;
+	INSTR_TIME_ADD(dst->blk_read_time, add->blk_read_time);
+	INSTR_TIME_ADD(dst->blk_write_time, add->blk_write_time);
+}
diff --git a/src/backend/executor/nodeFunnel.c b/src/backend/executor/nodeFunnel.c
new file mode 100644
index 0000000..3c42f21
--- /dev/null
+++ b/src/backend/executor/nodeFunnel.c
@@ -0,0 +1,436 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeFunnel.c
+ *	  Support routines for parallel sequential scans of relations.
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeFunnel.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * INTERFACE ROUTINES
+ *		ExecFunnel				scans a relation.
+ *		ExecInitFunnel			creates and initializes a funnel node.
+ *		ExecEndFunnel			releases any storage allocated.
+ *		ExecReScanFunnel		rescans a relation
+ */
+#include "postgres.h"
+
+#include "access/relscan.h"
+#include "executor/execdebug.h"
+#include "executor/execParallel.h"
+#include "executor/nodeFunnel.h"
+#include "executor/nodeSubplan.h"
+#include "utils/rel.h"
+
+
+static TupleTableSlot *funnel_getnext(FunnelState *funnelstate);
+static void ExecAccumulateInstInfo(FunnelState *node);
+static void ExecAccumulateBufUsageInfo(FunnelState *node);
+
+/* ----------------------------------------------------------------
+ *						Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		InitFunnel
+ *
+ *		Set up parallel state information
+ * ----------------------------------------------------------------
+ */
+static void
+InitFunnel(FunnelState *node, EState *estate, int eflags)
+{
+	Relation	currentRelation;
+
+	/*
+	 * get the relation object id from the relid'th entry in the range table,
+	 * open that relation and acquire appropriate lock on it.
+	 */
+	currentRelation = ExecOpenScanRelation(estate,
+										   ((SeqScan *) node->ss.ps.plan)->scanrelid,
+										   eflags);
+
+	node->ss.ss_currentRelation = currentRelation;
+
+	/* and report the scan tuple slot's rowtype */
+	ExecAssignScanType(&node->ss, RelationGetDescr(currentRelation));
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitFunnel
+ * ----------------------------------------------------------------
+ */
+FunnelState *
+ExecInitFunnel(Funnel *node, EState *estate, int eflags)
+{
+	FunnelState *funnelstate;
+
+	/* Funnel node doesn't have innerPlan node. */
+	Assert(innerPlan(node) == NULL);
+
+	/*
+	 * create state structure
+	 */
+	funnelstate = makeNode(FunnelState);
+	funnelstate->ss.ps.plan = (Plan *) node;
+	funnelstate->ss.ps.state = estate;
+	funnelstate->fs_workersReady = false;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * create expression context for node
+	 */
+	ExecAssignExprContext(estate, &funnelstate->ss.ps);
+
+	/*
+	 * initialize child expressions
+	 */
+	funnelstate->ss.ps.targetlist = (List *)
+		ExecInitExpr((Expr *) node->scan.plan.targetlist,
+					 (PlanState *) funnelstate);
+	funnelstate->ss.ps.qual = (List *)
+		ExecInitExpr((Expr *) node->scan.plan.qual,
+					 (PlanState *) funnelstate);
+
+	/*
+	 * tuple table initialization
+	 */
+	ExecInitResultTupleSlot(estate, &funnelstate->ss.ps);
+	ExecInitScanTupleSlot(estate, &funnelstate->ss);
+
+	InitFunnel(funnelstate, estate, eflags);
+
+	/*
+	 * now initialize outer plan
+	 */
+	outerPlanState(funnelstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+
+	funnelstate->ss.ps.ps_TupFromTlist = false;
+
+	/*
+	 * Initialize result tuple type and projection info.
+	 */
+	ExecAssignResultTypeFromTL(&funnelstate->ss.ps);
+	ExecAssignScanProjectionInfo(&funnelstate->ss);
+
+	return funnelstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecFunnel(node)
+ *
+ *		Scans the relation via multiple workers and returns
+ *		the next qualifying tuple.
+ * ----------------------------------------------------------------
+ */
+TupleTableSlot *
+ExecFunnel(FunnelState *node)
+{
+	int			i;
+	TupleTableSlot *slot;
+
+	/*
+	 * Initialize the parallel context and workers on first execution.
+	 * We do this on first execution rather than during node initialization
+	 * because it needs to allocate a large dynamic shared memory segment,
+	 * so it is better to do it only if it is really needed.
+	 */
+	if (!node->pcxt)
+	{
+		EState		*estate = node->ss.ps.state;
+		ExprContext *econtext = node->ss.ps.ps_ExprContext;
+		bool		any_worker_launched = false;
+		List		*serialized_param_exec;
+
+		/*
+		 * Evaluate the InitPlan and pass the PARAM_EXEC params, so that
+		 * values can be shared with worker backends.  This differs from
+		 * the way InitPlans are evaluated elsewhere (lazy evaluation):
+		 * instead of sharing the InitPlan with all the workers and
+		 * letting them execute it, we pass the values, which can be
+		 * used directly by worker backends.
+		 */
+		serialized_param_exec = ExecAndFormSerializeParamExec(econtext,
+											node->ss.ps.plan->lefttree->allParam);
+
+		/* Initialize the workers required to execute funnel node. */
+		InitializeParallelWorkers(node->ss.ps.lefttree,
+								  serialized_param_exec,
+								  estate,
+								  &node->inst_options_space,
+								  &node->buffer_usage_space,
+								  &node->responseq,
+								  &node->pcxt,
+								  ((Funnel *)(node->ss.ps.plan))->num_workers);
+
+		outerPlanState(node)->toc = node->pcxt->toc;
+
+		/*
+		 * Register backend workers.  If the required number of workers is
+		 * not available, we perform the scan with the workers that are
+		 * available; if no workers are available at all, the funnel node
+		 * will just scan locally.
+		 */
+		LaunchParallelWorkers(node->pcxt);
+
+		node->funnel = CreateTupleQueueFunnel();
+
+		for (i = 0; i < node->pcxt->nworkers; ++i)
+		{
+			if (node->pcxt->worker[i].bgwhandle)
+			{
+				shm_mq_set_handle((node->responseq)[i], node->pcxt->worker[i].bgwhandle);
+				RegisterTupleQueueOnFunnel(node->funnel, (node->responseq)[i]);
+				any_worker_launched = true;
+			}
+		}
+
+		if (any_worker_launched)
+			node->fs_workersReady = true;
+	}
+
+	slot = funnel_getnext(node);
+
+	if (TupIsNull(slot))
+	{
+
+		/*
+		 * Destroy the parallel context once we have fetched all the
+		 * tuples.  This ensures that if the same statement needs Funnel
+		 * nodes for multiple parts of the plan, it won't accumulate a
+		 * lot of DSM segments, and the workers become available for
+		 * use by other parts of the statement.
+		 */
+		FinishParallelSetupAndAccumStats(node);
+	}
+	return slot;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndFunnel
+ *
+ *		frees any storage allocated through C routines.
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndFunnel(FunnelState *node)
+{
+	Relation	relation;
+
+	relation = node->ss.ss_currentRelation;
+
+	/*
+	 * Free the exprcontext
+	 */
+	ExecFreeExprContext(&node->ss.ps);
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+
+	/*
+	 * close the heap relation.
+	 */
+	ExecCloseScanRelation(relation);
+
+	ExecEndNode(outerPlanState(node));
+
+	FinishParallelSetupAndAccumStats(node);
+}
+
+/*
+ * funnel_getnext
+ *
+ *	Get the next tuple from shared memory queue.  This function
+ *	is responsible for fetching tuples from all the queues associated
+ *	with worker backends used in funnel scan; if there is no data
+ *	available from the queues or no worker is available, it fetches
+ *	the data from the local node.
+ */
+TupleTableSlot *
+funnel_getnext(FunnelState *funnelstate)
+{
+	PlanState		*outerPlan;
+	TupleTableSlot	*outerTupleSlot;
+	TupleTableSlot	*slot;
+	HeapTuple		tup;
+
+	if (funnelstate->ss.ps.ps_ProjInfo)
+		slot = funnelstate->ss.ps.ps_ProjInfo->pi_slot;
+	else
+		slot = funnelstate->ss.ss_ScanTupleSlot;
+
+	while ((!funnelstate->all_workers_done  && funnelstate->fs_workersReady) ||
+			!funnelstate->local_scan_done)
+	{
+		if (!funnelstate->all_workers_done && funnelstate->fs_workersReady)
+		{
+			/* wait only if local scan is done */
+			tup = TupleQueueFunnelNext(funnelstate->funnel,
+									   !funnelstate->local_scan_done,
+									   &funnelstate->all_workers_done);
+
+			if (HeapTupleIsValid(tup))
+			{
+				ExecStoreTuple(tup,		/* tuple to store */
+							   slot,	/* slot to store in */
+							   InvalidBuffer, /* buffer associated with this
+											   * tuple */
+							   true);	/* pfree this pointer if not from heap */
+
+				return slot;
+			}
+		}
+		if (!funnelstate->local_scan_done)
+		{
+			outerPlan = outerPlanState(funnelstate);
+
+			outerTupleSlot = ExecProcNode(outerPlan);
+
+			if (!TupIsNull(outerTupleSlot))
+				return outerTupleSlot;
+
+			funnelstate->local_scan_done = true;
+		}
+	}
+
+	return ExecClearTuple(slot);
+}
+
+/* ----------------------------------------------------------------
+ *		FinishParallelSetupAndAccumStats
+ *
+ *		Destroy the setup for parallel workers.  Collect all the
+ *		stats after workers are stopped, else some work done by
+ *		workers won't be accounted.
+ * ----------------------------------------------------------------
+ */
+void
+FinishParallelSetupAndAccumStats(FunnelState *node)
+{
+	if (node->pcxt)
+	{
+		/*
+		 * Ensure all workers have finished before destroying the parallel
+		 * context to ensure a clean exit.
+		 */
+		if (node->fs_workersReady)
+		{
+			TupleQueueFunnelShutdown(node->funnel);
+			WaitForParallelWorkersToFinish(node->pcxt);
+		}
+
+		/* destroy the tuple queue */
+		DestroyTupleQueueFunnel(node->funnel);
+		node->funnel = NULL;
+
+		/*
+		 * Aggregate the buffer usage stats from all workers.  This is
+		 * required by external modules like pg_stat_statements.
+		 */
+		ExecAccumulateBufUsageInfo(node);
+
+		/*
+		 * Aggregate instrumentation information of all the backend
+		 * workers for Funnel node.  This has to be done before we
+		 * destroy the parallel context.
+		 */
+		if (node->ss.ps.state->es_instrument)
+			ExecAccumulateInstInfo(node);
+
+		/* destroy parallel context. */
+		DestroyParallelContext(node->pcxt);
+		node->pcxt = NULL;
+
+		node->fs_workersReady = false;
+		node->all_workers_done = false;
+		node->local_scan_done = false;
+	}
+}
+
+/* ----------------------------------------------------------------
+ *		ExecAccumulateInstInfo
+ *
+ *		Accumulate instrumentation information of all the workers
+ * ----------------------------------------------------------------
+ */
+void ExecAccumulateInstInfo(FunnelState *node)
+{
+	int i;
+	Instrumentation *instrument_worker;
+	int nworkers;
+	char *inst_info_workers;
+
+	if (node->pcxt)
+	{
+		nworkers = node->pcxt->nworkers;
+		inst_info_workers = node->inst_options_space;
+		for (i = 0; i < nworkers; i++)
+		{
+			instrument_worker = (Instrumentation *)(inst_info_workers + (i * sizeof(Instrumentation)));
+			InstrAggNode(node->ss.ps.instrument, instrument_worker);
+		}
+	}
+}
+
+/* ----------------------------------------------------------------
+ *		ExecAccumulateBufUsageInfo
+ *
+ *		Accumulate buffer usage information of all the workers
+ * ----------------------------------------------------------------
+ */
+void ExecAccumulateBufUsageInfo(FunnelState *node)
+{
+	int i;
+	int nworkers;
+	BufferUsage *buffer_usage_worker;
+	char *buffer_usage;
+
+	if (node->pcxt)
+	{
+		nworkers = node->pcxt->nworkers;
+		buffer_usage = node->buffer_usage_space;
+
+		for (i = 0; i < nworkers; i++)
+		{
+			buffer_usage_worker = (BufferUsage *)(buffer_usage + (i * sizeof(BufferUsage)));
+			BufferUsageAdd(&pgBufferUsage, buffer_usage_worker);
+		}
+	}
+}
+
+/* ----------------------------------------------------------------
+ *						Join Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecReScanFunnel
+ *
+ *		Rescans a relation.
+ * ----------------------------------------------------------------
+ */
+void
+ExecReScanFunnel(FunnelState *node)
+{
+	/*
+	 * Re-initialize the parallel context and workers to perform
+	 * rescan of relation.  We want to gracefully shut down all the
+	 * workers so that they are able to propagate any error or other
+	 * information to the master backend before dying.
+	 */
+	FinishParallelSetupAndAccumStats(node);
+
+	ExecReScan(node->ss.ps.lefttree);
+}
diff --git a/src/backend/executor/nodeNestloop.c b/src/backend/executor/nodeNestloop.c
index e66bcda..c447062 100644
--- a/src/backend/executor/nodeNestloop.c
+++ b/src/backend/executor/nodeNestloop.c
@@ -144,6 +144,7 @@ ExecNestLoop(NestLoopState *node)
 			{
 				NestLoopParam *nlp = (NestLoopParam *) lfirst(lc);
 				int			paramno = nlp->paramno;
+				TupleDesc	tdesc = outerTupleSlot->tts_tupleDescriptor;
 				ParamExecData *prm;
 
 				prm = &(econtext->ecxt_param_exec_vals[paramno]);
@@ -154,6 +155,7 @@ ExecNestLoop(NestLoopState *node)
 				prm->value = slot_getattr(outerTupleSlot,
 										  nlp->paramval->varattno,
 										  &(prm->isnull));
+				prm->ptype = tdesc->attrs[nlp->paramval->varattno-1]->atttypid;
 				/* Flag parameter value as changed */
 				innerPlan->chgParam = bms_add_member(innerPlan->chgParam,
 													 paramno);
diff --git a/src/backend/executor/nodePartialSeqscan.c b/src/backend/executor/nodePartialSeqscan.c
new file mode 100644
index 0000000..09b7e07
--- /dev/null
+++ b/src/backend/executor/nodePartialSeqscan.c
@@ -0,0 +1,308 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodePartialSeqscan.c
+ *	  Support routines for parallel sequential scans of relations.
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodePartialSeqscan.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * INTERFACE ROUTINES
+ *		ExecPartialSeqScan				scans a relation.
+ *		PartialSeqNext					retrieve next tuple from heap.
+ *		ExecInitPartialSeqScan			creates and initializes a partial seqscan node.
+ *		ExecEndPartialSeqScan			releases any storage allocated.
+ */
+#include "postgres.h"
+
+#include "access/relscan.h"
+#include "executor/execdebug.h"
+#include "executor/execParallel.h"
+#include "executor/nodePartialSeqscan.h"
+#include "utils/rel.h"
+
+
+
+/* ----------------------------------------------------------------
+ *						Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		PartialSeqNext
+ *
+ *		This is a workhorse for ExecPartialSeqScan
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+PartialSeqNext(PartialSeqScanState *node)
+{
+	HeapTuple	tuple;
+	HeapScanDesc scandesc;
+	EState	   *estate;
+	ScanDirection direction;
+	TupleTableSlot *slot;
+
+	/*
+	 * get information from the estate and scan state
+	 */
+	scandesc = node->ss.ss_currentScanDesc;
+	estate = node->ss.ps.state;
+	direction = estate->es_direction;
+	slot = node->ss.ss_ScanTupleSlot;
+
+	/*
+	 * get the next tuple from the table
+	 */
+	tuple = heap_getnext(scandesc, direction);
+
+	/*
+	 * save the tuple and the buffer returned to us by the access methods in
+	 * our scan tuple slot and return the slot.  Note: we pass 'false' because
+	 * tuples returned by heap_getnext() are pointers onto disk pages and were
+	 * not created with palloc() and so should not be pfree()'d.  Note also
+	 * that ExecStoreTuple will increment the refcount of the buffer; the
+	 * refcount will not be dropped until the tuple table slot is cleared.
+	 */
+	if (tuple)
+		ExecStoreTuple(tuple,	/* tuple to store */
+					   slot,	/* slot to store in */
+					   scandesc->rs_cbuf,		/* buffer associated with this
+												 * tuple */
+					   false);	/* don't pfree this pointer */
+	else
+		ExecClearTuple(slot);
+
+	return slot;
+}
+
+/*
+ * PartialSeqRecheck -- access method routine to recheck a tuple in EvalPlanQual
+ */
+static bool
+PartialSeqRecheck(PartialSeqScanState *node, TupleTableSlot *slot)
+{
+	/*
+	 * Note that unlike IndexScan, PartialSeqScan never uses keys in
+	 * heap_beginscan (and this is very bad) - so, here we do not
+	 * check whether the keys are ok or not.
+	 */
+	return true;
+}
+
+/* ----------------------------------------------------------------
+ *		InitPartialScanRelation
+ *
+ *		Set up to access the scan relation.
+ * ----------------------------------------------------------------
+ */
+static void
+InitPartialScanRelation(PartialSeqScanState *node, EState *estate, int eflags)
+{
+	Relation	currentRelation;
+	shm_toc		*toc;
+
+	/*
+	 * get the relation object id from the relid'th entry in the range table,
+	 * open that relation and acquire appropriate lock on it.
+	 */
+	currentRelation = ExecOpenScanRelation(estate,
+										   ((Scan *) node->ss.ps.plan)->scanrelid,
+										   eflags);
+
+	/*
+	 * Parallel scan descriptor is initialized and stored in dynamic shared
+	 * memory segment by master backend and parallel workers retrieve it
+	 * from shared memory.  We set 'toc' (place to lookup parallel scan
+	 * descriptor) as retrieved by attaching to dsm for parallel workers
+	 * whereas master backend stores it directly in partial scan state node
+	 * after initializing workers.
+	 */
+	toc = GetParallelShmToc();
+	if (toc)
+		node->ss.ps.toc = toc;
+
+	node->ss.ss_currentRelation = currentRelation;
+
+	/* and report the scan tuple slot's rowtype */
+	ExecAssignScanType(&node->ss, RelationGetDescr(currentRelation));
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitPartialSeqScan
+ * ----------------------------------------------------------------
+ */
+PartialSeqScanState *
+ExecInitPartialSeqScan(PartialSeqScan *node, EState *estate, int eflags)
+{
+	PartialSeqScanState *scanstate;
+
+	/*
+	 * Once upon a time it was possible to have an outerPlan of a SeqScan, but
+	 * not any more.
+	 */
+	Assert(outerPlan(node) == NULL);
+	Assert(innerPlan(node) == NULL);
+
+	/*
+	 * create state structure
+	 */
+	scanstate = makeNode(PartialSeqScanState);
+	scanstate->ss.ps.plan = (Plan *) node;
+	scanstate->ss.ps.state = estate;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * create expression context for node
+	 */
+	ExecAssignExprContext(estate, &scanstate->ss.ps);
+
+	/*
+	 * initialize child expressions
+	 */
+	scanstate->ss.ps.targetlist = (List *)
+		ExecInitExpr((Expr *) node->plan.targetlist,
+					 (PlanState *) scanstate);
+	scanstate->ss.ps.qual = (List *)
+		ExecInitExpr((Expr *) node->plan.qual,
+					 (PlanState *) scanstate);
+
+	/*
+	 * tuple table initialization
+	 */
+	ExecInitResultTupleSlot(estate, &scanstate->ss.ps);
+	ExecInitScanTupleSlot(estate, &scanstate->ss);
+
+	/*
+	 * initialize scan relation
+	 */
+	InitPartialScanRelation(scanstate, estate, eflags);
+
+	scanstate->ss.ps.ps_TupFromTlist = false;
+
+	/*
+	 * Initialize result tuple type and projection info.
+	 */
+	ExecAssignResultTypeFromTL(&scanstate->ss.ps);
+	ExecAssignScanProjectionInfo(&scanstate->ss);
+
+	return scanstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecPartialSeqScan(node)
+ *
+ *		Scans the relation and returns the next qualifying tuple.
+ *		We call the ExecScan() routine and pass it the appropriate
+ *		access method functions.
+ * ----------------------------------------------------------------
+ */
+TupleTableSlot *
+ExecPartialSeqScan(PartialSeqScanState *node)
+{
+	/*
+	 * Initialize the scan on first execution.  Normally we initialize
+	 * it during the ExecutorStart phase; however, this node needs a
+	 * ParallelHeapScanDesc to initialize the scan, and that is
+	 * initialized by the Funnel node during the ExecutorRun phase.
+	 */
+	if (!node->scan_initialized)
+	{
+		ParallelHeapScanDesc pscan;
+
+		/*
+		 * Parallel scan descriptor is initialized and stored in dynamic shared
+		 * memory segment by master backend; parallel workers and the local
+		 * scan by master backend retrieve it from shared memory.  If the scan
+		 * descriptor is already available on first execution, then we need to
+		 * re-initialize for rescan.
+		 */
+		Assert(node->ss.ps.toc);
+
+		pscan = shm_toc_lookup(node->ss.ps.toc, PARALLEL_KEY_SCAN);
+
+		if (!node->ss.ss_currentScanDesc)
+		{
+			node->ss.ss_currentScanDesc =
+				heap_beginscan_parallel(node->ss.ss_currentRelation, pscan);
+		}
+		else
+		{
+			heap_parallel_rescan(pscan, node->ss.ss_currentScanDesc);
+		}
+
+		node->scan_initialized = true;
+	}
+
+	return ExecScan((ScanState *) node,
+					(ExecScanAccessMtd) PartialSeqNext,
+					(ExecScanRecheckMtd) PartialSeqRecheck);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndPartialSeqScan
+ *
+ *		frees any storage allocated through C routines.
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndPartialSeqScan(PartialSeqScanState *node)
+{
+	Relation	relation;
+	HeapScanDesc scanDesc;
+
+	/*
+	 * get information from node
+	 */
+	relation = node->ss.ss_currentRelation;
+	scanDesc = node->ss.ss_currentScanDesc;
+
+	/*
+	 * Free the exprcontext
+	 */
+	ExecFreeExprContext(&node->ss.ps);
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+
+	/*
+	 * close heap scan
+	 */
+	if (scanDesc)
+		heap_endscan(scanDesc);
+
+	/*
+	 * close the heap relation.
+	 */
+	ExecCloseScanRelation(relation);
+}
+
+/* ----------------------------------------------------------------
+ *						Join Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecReScanPartialSeqScan
+ *
+ *		Rescans the relation.
+ * ----------------------------------------------------------------
+ */
+void
+ExecReScanPartialSeqScan(PartialSeqScanState *node)
+{
+	if (node->scan_initialized)
+		node->scan_initialized = false;
+
+	ExecScanReScan((ScanState *) node);
+}
diff --git a/src/backend/executor/nodeResult.c b/src/backend/executor/nodeResult.c
index 8d3dde0..b348bfd 100644
--- a/src/backend/executor/nodeResult.c
+++ b/src/backend/executor/nodeResult.c
@@ -75,6 +75,13 @@ ExecResult(ResultState *node)
 	econtext = node->ps.ps_ExprContext;
 
 	/*
+	 * Result node can be added as a gating node on top of PartialSeqScan
+	 * node, so we need to percolate toc information to the outer node.
+	 */
+	if (node->ps.toc)
+		outerPlanState(node)->toc = node->ps.toc;
+
+	/*
 	 * check constant qualifications like (2 > 1), if not already done
 	 */
 	if (node->rs_checkqual)
diff --git a/src/backend/executor/nodeSubplan.c b/src/backend/executor/nodeSubplan.c
index 9eb4d63..6afd55a 100644
--- a/src/backend/executor/nodeSubplan.c
+++ b/src/backend/executor/nodeSubplan.c
@@ -30,11 +30,14 @@
 #include <math.h>
 
 #include "access/htup_details.h"
+#include "catalog/pg_type.h"
 #include "executor/executor.h"
 #include "executor/nodeSubplan.h"
 #include "nodes/makefuncs.h"
+#include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
 #include "utils/array.h"
+#include "utils/datum.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 
@@ -281,12 +284,14 @@ ExecScanSubPlan(SubPlanState *node,
 	forboth(l, subplan->parParam, pvar, node->args)
 	{
 		int			paramid = lfirst_int(l);
+		ExprState	*exprstate = (ExprState *) lfirst(pvar);
 		ParamExecData *prm = &(econtext->ecxt_param_exec_vals[paramid]);
 
-		prm->value = ExecEvalExprSwitchContext((ExprState *) lfirst(pvar),
+		prm->value = ExecEvalExprSwitchContext(exprstate,
 											   econtext,
 											   &(prm->isnull),
 											   NULL);
+		prm->ptype = exprType((Node *) exprstate->expr);
 		planstate->chgParam = bms_add_member(planstate->chgParam, paramid);
 	}
 
@@ -399,6 +404,7 @@ ExecScanSubPlan(SubPlanState *node,
 			prmdata = &(econtext->ecxt_param_exec_vals[paramid]);
 			Assert(prmdata->execPlan == NULL);
 			prmdata->value = slot_getattr(slot, col, &(prmdata->isnull));
+			prmdata->ptype = tdesc->attrs[col-1]->atttypid;
 			col++;
 		}
 
@@ -551,6 +557,7 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
 		 !TupIsNull(slot);
 		 slot = ExecProcNode(planstate))
 	{
+		TupleDesc	tdesc = slot->tts_tupleDescriptor;
 		int			col = 1;
 		ListCell   *plst;
 		bool		isnew;
@@ -568,6 +575,7 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
 			Assert(prmdata->execPlan == NULL);
 			prmdata->value = slot_getattr(slot, col,
 										  &(prmdata->isnull));
+			prmdata->ptype = tdesc->attrs[col-1]->atttypid;
 			col++;
 		}
 		slot = ExecProject(node->projRight, NULL);
@@ -954,6 +962,7 @@ ExecSetParamPlan(SubPlanState *node, ExprContext *econtext)
 	ListCell   *l;
 	bool		found = false;
 	ArrayBuildStateAny *astate = NULL;
+	Oid			ptype;
 
 	if (subLinkType == ANY_SUBLINK ||
 		subLinkType == ALL_SUBLINK)
@@ -961,6 +970,8 @@ ExecSetParamPlan(SubPlanState *node, ExprContext *econtext)
 	if (subLinkType == CTE_SUBLINK)
 		elog(ERROR, "CTE subplans should not be executed via ExecSetParamPlan");
 
+	ptype = exprType((Node *) node->xprstate.expr);
+
 	/* Initialize ArrayBuildStateAny in caller's context, if needed */
 	if (subLinkType == ARRAY_SUBLINK)
 		astate = initArrayResultAny(subplan->firstColType,
@@ -983,12 +994,14 @@ ExecSetParamPlan(SubPlanState *node, ExprContext *econtext)
 	forboth(l, subplan->parParam, pvar, node->args)
 	{
 		int			paramid = lfirst_int(l);
+		ExprState	*exprstate = (ExprState *) lfirst(pvar);
 		ParamExecData *prm = &(econtext->ecxt_param_exec_vals[paramid]);
 
-		prm->value = ExecEvalExprSwitchContext((ExprState *) lfirst(pvar),
+		prm->value = ExecEvalExprSwitchContext(exprstate,
 											   econtext,
 											   &(prm->isnull),
 											   NULL);
+		prm->ptype = exprType((Node *) exprstate->expr);
 		planstate->chgParam = bms_add_member(planstate->chgParam, paramid);
 	}
 
@@ -1011,6 +1024,7 @@ ExecSetParamPlan(SubPlanState *node, ExprContext *econtext)
 
 			prm->execPlan = NULL;
 			prm->value = BoolGetDatum(true);
+			prm->ptype = ptype;
 			prm->isnull = false;
 			found = true;
 			break;
@@ -1062,6 +1076,7 @@ ExecSetParamPlan(SubPlanState *node, ExprContext *econtext)
 			prm->execPlan = NULL;
 			prm->value = heap_getattr(node->curTuple, i, tdesc,
 									  &(prm->isnull));
+			prm->ptype = tdesc->attrs[i-1]->atttypid;
 			i++;
 		}
 	}
@@ -1084,6 +1099,7 @@ ExecSetParamPlan(SubPlanState *node, ExprContext *econtext)
 											true);
 		prm->execPlan = NULL;
 		prm->value = node->curArray;
+		prm->ptype = ptype;
 		prm->isnull = false;
 	}
 	else if (!found)
@@ -1096,6 +1112,7 @@ ExecSetParamPlan(SubPlanState *node, ExprContext *econtext)
 
 			prm->execPlan = NULL;
 			prm->value = BoolGetDatum(false);
+			prm->ptype = ptype;
 			prm->isnull = false;
 		}
 		else
@@ -1108,6 +1125,7 @@ ExecSetParamPlan(SubPlanState *node, ExprContext *econtext)
 
 				prm->execPlan = NULL;
 				prm->value = (Datum) 0;
+				prm->ptype = VOIDOID;
 				prm->isnull = true;
 			}
 		}
@@ -1238,3 +1256,47 @@ ExecAlternativeSubPlan(AlternativeSubPlanState *node,
 					   isNull,
 					   isDone);
 }
+
+/*
+ * ExecAndFormSerializeParamExec
+ *
+ * Execute the subplan stored in each PARAM_EXEC param if it has not been
+ * executed yet, and form the serialized structure required for passing it
+ * to a worker backend.
+ */
+List *
+ExecAndFormSerializeParamExec(ExprContext *econtext, Bitmapset *params)
+{
+	List	*lparam = NIL;
+	SerializedParamExecData *sparamdata;
+	ParamExecData *prm;
+	int		paramid;
+
+	paramid = -1;
+	while ((paramid = bms_next_member(params, paramid)) >= 0)
+	{
+		/*
+		 * PARAM_EXEC params (internal executor parameters) are stored in the
+		 * ecxt_param_exec_vals array, and can be accessed by array index.
+		 */
+		sparamdata = palloc0(sizeof(SerializedParamExecData));
+
+		prm = &(econtext->ecxt_param_exec_vals[paramid]);
+		if (prm->execPlan != NULL)
+		{
+			/* Parameter not evaluated yet, so go do it */
+			ExecSetParamPlan(prm->execPlan, econtext);
+			/* ExecSetParamPlan should have processed this param... */
+			Assert(prm->execPlan == NULL);
+		}
+
+		sparamdata->paramid	= paramid;
+		sparamdata->ptype = prm->ptype;
+		sparamdata->value = prm->value;
+		sparamdata->isnull = prm->isnull;
+
+		lparam = lappend(lparam, sparamdata);
+	}
+
+	return lparam;
+}
diff --git a/src/backend/executor/spi.c b/src/backend/executor/spi.c
index 472de41..0f7906b 100644
--- a/src/backend/executor/spi.c
+++ b/src/backend/executor/spi.c
@@ -1774,7 +1774,7 @@ spi_dest_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
  *		store tuple retrieved by Executor into SPITupleTable
  *		of current SPI procedure
  */
-void
+bool
 spi_printtup(TupleTableSlot *slot, DestReceiver *self)
 {
 	SPITupleTable *tuptable;
@@ -1808,6 +1808,8 @@ spi_printtup(TupleTableSlot *slot, DestReceiver *self)
 	(tuptable->free)--;
 
 	MemoryContextSwitchTo(oldcxt);
+
+	return true;
 }
 
 /*
diff --git a/src/backend/executor/tqueue.c b/src/backend/executor/tqueue.c
new file mode 100644
index 0000000..39acda7
--- /dev/null
+++ b/src/backend/executor/tqueue.c
@@ -0,0 +1,304 @@
+/*-------------------------------------------------------------------------
+ *
+ * tqueue.c
+ *	  Use shm_mq to send & receive tuples between parallel backends
+ *
+ * A DestReceiver of type DestTupleQueue, which is a TQueueDestReceiver
+ * under the hood, writes tuples from the executor to a shm_mq.
+ *
+ * A TupleQueueFunnel helps manage the process of reading tuples from
+ * one or more shm_mq objects being used as tuple queues.
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/tqueue.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/tqueue.h"
+#include "miscadmin.h"
+
+typedef struct
+{
+	DestReceiver pub;
+	shm_mq_handle *handle;
+} TQueueDestReceiver;
+
+struct TupleQueueFunnel
+{
+	int		nqueues;
+	int		maxqueues;
+	int		nextqueue;
+	shm_mq_handle **queue;
+};
+
+/*
+ * Receive a tuple.
+ */
+static bool
+tqueueReceiveSlot(TupleTableSlot *slot, DestReceiver *self)
+{
+	TQueueDestReceiver *tqueue = (TQueueDestReceiver *) self;
+	HeapTuple	tuple;
+	shm_mq_result	result;
+
+	tuple = ExecMaterializeSlot(slot);
+	result = shm_mq_send(tqueue->handle, tuple->t_len, tuple->t_data, false);
+
+	if (result == SHM_MQ_DETACHED)
+		return false;
+	else if (result != SHM_MQ_SUCCESS)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("unable to send tuples")));
+
+	return true;
+}
+
+/*
+ * Prepare to receive tuples from executor.
+ */
+static void
+tqueueStartupReceiver(DestReceiver *self, int operation, TupleDesc typeinfo)
+{
+	/* do nothing */
+}
+
+/*
+ * Clean up at end of an executor run
+ */
+static void
+tqueueShutdownReceiver(DestReceiver *self)
+{
+	/* do nothing */
+}
+
+/*
+ * Destroy receiver when done with it
+ */
+static void
+tqueueDestroyReceiver(DestReceiver *self)
+{
+	pfree(self);
+}
+
+/*
+ * Create a DestReceiver that writes tuples to a tuple queue.
+ */
+DestReceiver *
+CreateTupleQueueDestReceiver(void)
+{
+	TQueueDestReceiver *self;
+
+	self = (TQueueDestReceiver *) palloc0(sizeof(TQueueDestReceiver));
+
+	self->pub.receiveSlot = tqueueReceiveSlot;
+	self->pub.rStartup = tqueueStartupReceiver;
+	self->pub.rShutdown = tqueueShutdownReceiver;
+	self->pub.rDestroy = tqueueDestroyReceiver;
+	self->pub.mydest = DestTupleQueue;
+
+	/* private fields will be set by SetTupleQueueDestReceiverParams */
+
+	return (DestReceiver *) self;
+}
+
+/*
+ * Set parameters for a TupleQueueDestReceiver
+ */
+void
+SetTupleQueueDestReceiverParams(DestReceiver *self,
+								shm_mq_handle *handle)
+{
+	TQueueDestReceiver *myState = (TQueueDestReceiver *) self;
+
+	myState->handle = handle;
+}
+
+/*
+ * Create a tuple queue funnel.
+ */
+TupleQueueFunnel *
+CreateTupleQueueFunnel(void)
+{
+	TupleQueueFunnel *funnel = palloc0(sizeof(TupleQueueFunnel));
+
+	funnel->maxqueues = 8;
+	funnel->queue = palloc(funnel->maxqueues * sizeof(shm_mq_handle *));
+
+	return funnel;
+}
+
+/*
+ * Detach all tuple queues that belong to funnel.
+ */
+void
+TupleQueueFunnelShutdown(TupleQueueFunnel *funnel)
+{
+	if (funnel)
+	{
+		int		i;
+		shm_mq_handle *mqh;
+		shm_mq	   *mq;
+		for (i = 0; i < funnel->nqueues; i++)
+		{
+			mqh = funnel->queue[i];
+			mq = shm_mq_get_queue(mqh);
+			shm_mq_detach(mq);
+		}
+	}
+}
+
+/*
+ * Destroy a tuple queue funnel.
+ */
+void
+DestroyTupleQueueFunnel(TupleQueueFunnel *funnel)
+{
+	if (funnel)
+	{
+		pfree(funnel->queue);
+		pfree(funnel);
+	}
+}
+
+/*
+ * Remember the shared memory queue handle in funnel.
+ */
+void
+RegisterTupleQueueOnFunnel(TupleQueueFunnel *funnel, shm_mq_handle *handle)
+{
+	if (funnel->nqueues < funnel->maxqueues)
+	{
+		funnel->queue[funnel->nqueues++] = handle;
+		return;
+	}
+
+	if (funnel->nqueues >= funnel->maxqueues)
+	{
+		int newsize = funnel->nqueues * 2;
+
+		Assert(funnel->nqueues == funnel->maxqueues);
+
+		funnel->queue = repalloc(funnel->queue,
+								 newsize * sizeof(shm_mq_handle *));
+		funnel->maxqueues = newsize;
+	}
+
+	funnel->queue[funnel->nqueues++] = handle;
+}
+
+/*
+ * Fetch a tuple from a tuple queue funnel.
+ *
+ * We try to read from the queues in round-robin fashion so as to avoid
+ * the situation where some workers get their tuples read promptly while
+ * others are barely ever serviced.
+ *
+ * Even when nowait = false, we read from the individual queues in
+ * non-blocking mode.  Even when shm_mq_receive() returns SHM_MQ_WOULD_BLOCK,
+ * it can still accumulate bytes from a partially-read message, so doing it
+ * this way should outperform doing a blocking read on each queue in turn.
+ *
+ * The return value is NULL if there are no remaining queues or if
+ * nowait = true and no queue returned a tuple without blocking.  *done, if
+ * not NULL, is set to true when there are no remaining queues and false in
+ * any other case.
+ */
+HeapTuple
+TupleQueueFunnelNext(TupleQueueFunnel *funnel, bool nowait, bool *done)
+{
+	int	waitpos = funnel->nextqueue;
+
+	/* Corner case: called before adding any queues, or after all are gone. */
+	if (funnel->nqueues == 0)
+	{
+		if (done != NULL)
+			*done = true;
+		return NULL;
+	}
+
+	if (done != NULL)
+		*done = false;
+
+	for (;;)
+	{
+		shm_mq_handle *mqh = funnel->queue[funnel->nextqueue];
+		shm_mq_result result;
+		Size	nbytes;
+		void   *data;
+
+		/* Attempt to read a message. */
+		result = shm_mq_receive(mqh, &nbytes, &data, true);
+
+		/*
+		 * Normally, we advance funnel->nextqueue to the next queue at this
+		 * point, but if we're pointing to a queue that we've just discovered
+		 * is detached, then forget that queue and leave the pointer where it
+		 * is until the number of remaining queues falls below that pointer and
+		 * at that point make the pointer point to the first queue.
+		 */
+		if (result != SHM_MQ_DETACHED)
+			funnel->nextqueue = (funnel->nextqueue + 1) % funnel->nqueues;
+		else
+		{
+			--funnel->nqueues;
+			if (funnel->nqueues == 0)
+			{
+				if (done != NULL)
+					*done = true;
+				return NULL;
+			}
+
+			memmove(&funnel->queue[funnel->nextqueue],
+					&funnel->queue[funnel->nextqueue + 1],
+					sizeof(shm_mq_handle *)
+						* (funnel->nqueues - funnel->nextqueue));
+
+			if (funnel->nextqueue >= funnel->nqueues)
+				funnel->nextqueue = 0;
+
+			if (funnel->nextqueue < waitpos)
+				--waitpos;
+
+			continue;
+		}
+
+		/* If we got a message, return it. */
+		if (result == SHM_MQ_SUCCESS)
+		{
+			HeapTupleData htup;
+
+			/*
+			 * The tuple data we just read from the queue is only valid
+			 * until we again attempt to read from it.  Copy the tuple into
+			 * a single palloc'd chunk as callers will expect.
+			 */
+			ItemPointerSetInvalid(&htup.t_self);
+			htup.t_tableOid = InvalidOid;
+			htup.t_len = nbytes;
+			htup.t_data = data;
+			return heap_copytuple(&htup);
+		}
+
+		/*
+		 * If we've visited all of the queues, then we should either give up
+		 * and return NULL (if we're in non-blocking mode) or wait for the
+		 * process latch to be set (otherwise).
+		 */
+		if (funnel->nextqueue == waitpos)
+		{
+			if (nowait)
+				return NULL;
+			WaitLatch(MyLatch, WL_LATCH_SET, 0);
+			CHECK_FOR_INTERRUPTS();
+			ResetLatch(MyLatch);
+		}
+	}
+}
diff --git a/src/backend/executor/tstoreReceiver.c b/src/backend/executor/tstoreReceiver.c
index c1fdeb7..b0862ae 100644
--- a/src/backend/executor/tstoreReceiver.c
+++ b/src/backend/executor/tstoreReceiver.c
@@ -37,8 +37,8 @@ typedef struct
 } TStoreState;
 
 
-static void tstoreReceiveSlot_notoast(TupleTableSlot *slot, DestReceiver *self);
-static void tstoreReceiveSlot_detoast(TupleTableSlot *slot, DestReceiver *self);
+static bool tstoreReceiveSlot_notoast(TupleTableSlot *slot, DestReceiver *self);
+static bool tstoreReceiveSlot_detoast(TupleTableSlot *slot, DestReceiver *self);
 
 
 /*
@@ -90,19 +90,21 @@ tstoreStartupReceiver(DestReceiver *self, int operation, TupleDesc typeinfo)
  * Receive a tuple from the executor and store it in the tuplestore.
  * This is for the easy case where we don't have to detoast.
  */
-static void
+static bool
 tstoreReceiveSlot_notoast(TupleTableSlot *slot, DestReceiver *self)
 {
 	TStoreState *myState = (TStoreState *) self;
 
 	tuplestore_puttupleslot(myState->tstore, slot);
+
+	return true;
 }
 
 /*
  * Receive a tuple from the executor and store it in the tuplestore.
  * This is for the case where we have to detoast any toasted values.
  */
-static void
+static bool
 tstoreReceiveSlot_detoast(TupleTableSlot *slot, DestReceiver *self)
 {
 	TStoreState *myState = (TStoreState *) self;
@@ -152,6 +154,8 @@ tstoreReceiveSlot_detoast(TupleTableSlot *slot, DestReceiver *self)
 	/* And release any temporary detoasted values */
 	for (i = 0; i < nfree; i++)
 		pfree(DatumGetPointer(myState->tofree[i]));
+
+	return true;
 }
 
 /*
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index ab03888..a200eca 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -362,6 +362,43 @@ _copySeqScan(const SeqScan *from)
 }
 
 /*
+ * _copyPartialSeqScan
+ */
+static PartialSeqScan *
+_copyPartialSeqScan(const SeqScan *from)
+{
+	PartialSeqScan    *newnode = makeNode(PartialSeqScan);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopyScanFields((const Scan *) from, (Scan *) newnode);
+
+	return newnode;
+}
+
+/*
+ * _copyFunnel
+ */
+static Funnel *
+_copyFunnel(const Funnel *from)
+{
+	Funnel    *newnode = makeNode(Funnel);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopyScanFields((const Scan *) from, (Scan *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(num_workers);
+
+	return newnode;
+}
+
+/*
  * _copyIndexScan
  */
 static IndexScan *
@@ -4227,6 +4264,12 @@ copyObject(const void *from)
 		case T_SeqScan:
 			retval = _copySeqScan(from);
 			break;
+		case T_PartialSeqScan:
+			retval = _copyPartialSeqScan(from);
+			break;
+		case T_Funnel:
+			retval = _copyFunnel(from);
+			break;
 		case T_IndexScan:
 			retval = _copyIndexScan(from);
 			break;
diff --git a/src/backend/nodes/nodeFuncs.c b/src/backend/nodes/nodeFuncs.c
index 4176393..8944195 100644
--- a/src/backend/nodes/nodeFuncs.c
+++ b/src/backend/nodes/nodeFuncs.c
@@ -3395,3 +3395,25 @@ raw_expression_tree_walker(Node *node,
 	}
 	return false;
 }
+
+/*
+ * planstate_tree_walker
+ *
+ * This routine will invoke walker on the node passed.  This is a useful
+ * way of starting the recursion when the walker's normal change of state
+ * is not appropriate for the outermost PlanState node.
+ */
+bool
+planstate_tree_walker(Node *node,
+					  ParallelContext *pcxt,
+					  bool (*walker) (),
+					  void *context)
+{
+	if (node == NULL)
+		return false;
+
+	/* Guard against stack overflow due to overly complex plan */
+	check_stack_depth();
+
+	return walker(node, pcxt, context);
+}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 01ae278..429d017 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -447,6 +447,24 @@ _outSeqScan(StringInfo str, const SeqScan *node)
 }
 
 static void
+_outPartialSeqScan(StringInfo str, const SeqScan *node)
+{
+	WRITE_NODE_TYPE("PARTIALSEQSCAN");
+
+	_outScanInfo(str, (const Scan *) node);
+}
+
+static void
+_outFunnel(StringInfo str, const Funnel *node)
+{
+	WRITE_NODE_TYPE("FUNNEL");
+
+	_outScanInfo(str, (const Scan *) node);
+
+	WRITE_UINT_FIELD(num_workers);
+}
+
+static void
 _outIndexScan(StringInfo str, const IndexScan *node)
 {
 	WRITE_NODE_TYPE("INDEXSCAN");
@@ -3005,6 +3023,12 @@ _outNode(StringInfo str, const void *obj)
 			case T_SeqScan:
 				_outSeqScan(str, obj);
 				break;
+			case T_PartialSeqScan:
+				_outPartialSeqScan(str, obj);
+				break;
+			case T_Funnel:
+				_outFunnel(str, obj);
+				break;
 			case T_IndexScan:
 				_outIndexScan(str, obj);
 				break;
diff --git a/src/backend/nodes/params.c b/src/backend/nodes/params.c
index fb803f8..0050195 100644
--- a/src/backend/nodes/params.c
+++ b/src/backend/nodes/params.c
@@ -16,9 +16,22 @@
 #include "postgres.h"
 
 #include "nodes/params.h"
+#include "storage/shmem.h"
 #include "utils/datum.h"
 #include "utils/lsyscache.h"
 
+/*
+ * For each bind parameter, pass this structure followed by the value,
+ * except for pass-by-value parameters, whose value is stored inline.
+ */
+typedef struct SerializedParamExternData
+{
+	Datum		value;			/* pass-by-val are directly stored */
+	Size		length;			/* length of parameter value */
+	bool		isnull;			/* is it NULL? */
+	uint16		pflags;			/* flag bits, same as in original Param */
+	Oid			ptype;			/* parameter's datatype, or 0 */
+} SerializedParamExternData;
 
 /*
  * Copy a ParamListInfo structure.
@@ -73,3 +86,355 @@ copyParamList(ParamListInfo from)
 
 	return retval;
 }
+
+/*
+ * Estimate the amount of space required to serialize the bound
+ * parameters.
+ */
+Size
+EstimateBoundParametersSpace(ParamListInfo paramInfo)
+{
+	Size		size;
+	int			i;
+
+	/* Add space required for saving numParams */
+	size = sizeof(int);
+
+	if (paramInfo)
+	{
+		/* Add space required for saving the param data */
+		for (i = 0; i < paramInfo->numParams; i++)
+		{
+			/*
+			 * for each parameter, calculate the size of fixed part
+			 * of parameter (SerializedParamExternData) and length of
+			 * parameter value.
+			 */
+			ParamExternData *oprm;
+			int16		typLen;
+			bool		typByVal;
+			Size		length;
+
+			length = sizeof(SerializedParamExternData);
+
+			oprm = &paramInfo->params[i];
+
+			get_typlenbyval(oprm->ptype, &typLen, &typByVal);
+
+			/*
+			 * pass-by-value parameters are directly stored in
+			 * SerializedParamExternData, so no need of additional
+			 * space for them.
+			 */
+			if (!(typByVal || oprm->isnull))
+			{
+				length += datumGetSize(oprm->value, typByVal, typLen);
+				size = add_size(size, length);
+
+				/* Allow space for terminating zero-byte */
+				size = add_size(size, 1);
+			}
+			else
+				size = add_size(size, length);
+		}
+	}
+
+	return size;
+}
+
+/*
+ * Serialize the bind parameters into the memory, beginning at start_address.
+ * maxsize should be at least as large as the value returned by
+ * EstimateBoundParametersSpace.
+ */
+void
+SerializeBoundParams(ParamListInfo paramInfo, Size maxsize, char *start_address)
+{
+	char	   *curptr;
+	SerializedParamExternData *retval;
+	int i;
+
+	/*
+	 * First, we store the number of bind parameters, if there is
+	 * no bind parameter then no need to store any more information.
+	 */
+	if (paramInfo && paramInfo->numParams > 0)
+		* (int *) start_address = paramInfo->numParams;
+	else
+	{
+		* (int *) start_address = 0;
+		return;
+	}
+	curptr = start_address + sizeof(int);
+
+
+	for (i = 0; i < paramInfo->numParams; i++)
+	{
+		ParamExternData *oprm;
+		int16		typLen;
+		bool		typByVal;
+		Size		datumlength, length;
+		const char	*s;
+
+		Assert (curptr <= start_address + maxsize);
+		retval = (SerializedParamExternData*) curptr;
+		oprm = &paramInfo->params[i];
+
+		retval->isnull = oprm->isnull;
+		retval->pflags = oprm->pflags;
+		retval->ptype = oprm->ptype;
+		retval->value = oprm->value;
+
+		curptr = curptr + sizeof(SerializedParamExternData);
+
+		if (retval->isnull)
+			continue;
+
+		get_typlenbyval(oprm->ptype, &typLen, &typByVal);
+
+		if (!typByVal)
+		{
+			datumlength = datumGetSize(oprm->value, typByVal, typLen);
+			s = (char *) DatumGetPointer(oprm->value);
+			memcpy(curptr, s, datumlength);
+			length = datumlength;
+			curptr[length] = '\0';
+			retval->length = length;
+			curptr += length + 1;
+		}
+	}
+}
+
+/*
+ * RestoreBoundParams
+ *		Restore bind parameters from the specified address.
+ *
+ * The params are palloc'd in CurrentMemoryContext.
+ */
+ParamListInfo
+RestoreBoundParams(char *start_address)
+{
+	ParamListInfo retval;
+	Size		size;
+	int			num_params,i;
+	char	   *curptr;
+
+	num_params = * (int *) start_address;
+
+	if (num_params <= 0)
+		return NULL;
+
+	size = offsetof(ParamListInfoData, params) +
+						num_params * sizeof(ParamExternData);
+	retval = (ParamListInfo) palloc(size);
+	retval->paramFetch = NULL;
+	retval->paramFetchArg = NULL;
+	retval->parserSetup = NULL;
+	retval->parserSetupArg = NULL;
+	retval->numParams = num_params;
+
+	curptr = start_address + sizeof(int);
+
+	for (i = 0; i < num_params; i++)
+	{
+		SerializedParamExternData *nprm;
+		char	*s;
+		int16		typLen;
+		bool		typByVal;
+
+		nprm = (SerializedParamExternData *) curptr;
+
+		/* copy the parameter info */
+		retval->params[i].isnull = nprm->isnull;
+		retval->params[i].pflags = nprm->pflags;
+		retval->params[i].ptype = nprm->ptype;
+		retval->params[i].value = nprm->value;
+
+		curptr = curptr + sizeof(SerializedParamExternData);
+
+		if (nprm->isnull)
+			continue;
+
+		get_typlenbyval(nprm->ptype, &typLen, &typByVal);
+
+		if (!typByVal)
+		{
+			s = palloc(nprm->length + 1);
+			memcpy(s, curptr, nprm->length + 1);
+			retval->params[i].value = CStringGetDatum(s);
+
+			curptr += nprm->length + 1;
+		}
+	}
+
+	return retval;
+}
+
+/*
+ * Estimate the amount of space required to serialize the PARAM_EXEC
+ * parameters.
+ */
+Size
+EstimateExecParametersSpace(List *serialized_param_exec_vals)
+{
+	Size		size;
+	ListCell	*lparam;
+
+	/*
+	 * Add space required for saving the number of PARAM_EXEC parameters
+	 * that need to be serialized.
+	 */
+	size = sizeof(int);
+
+	foreach(lparam, serialized_param_exec_vals)
+	{
+		int16		typLen;
+		bool		typByVal;
+		Size		length;
+		SerializedParamExecData* param_val = (SerializedParamExecData*) lfirst(lparam);
+
+		length = sizeof(SerializedParamExecData);
+
+		get_typlenbyval(param_val->ptype, &typLen, &typByVal);
+
+		/*
+		 * pass-by-value parameters are directly stored in
+		 * SerializedParamExecData, so no need of additional
+		 * space for them.
+		 */
+		if (!(typByVal || param_val->isnull))
+		{
+			length += datumGetSize(param_val->value, typByVal, typLen);
+			size = add_size(size, length);
+
+			/* Allow space for terminating zero-byte */
+			size = add_size(size, 1);
+		}
+		else
+			size = add_size(size, length);
+	}
+
+	return size;
+}
+
+/*
+ * Serialize the PARAM_EXEC parameters into the memory, beginning at
+ * start_address.  maxsize should be at least as large as the value
+ * returned by EstimateExecParametersSpace.
+ */
+void
+SerializeExecParams(List *serialized_param_exec_vals, Size maxsize,
+					char *start_address)
+{
+	char	   *curptr;
+	SerializedParamExecData *retval;
+	ListCell	*lparam;
+
+	/*
+	 * First, we store the number of PARAM_EXEC parameters that need to
+	 * be serialized.
+	 */
+	if (serialized_param_exec_vals)
+		* (int *) start_address = list_length(serialized_param_exec_vals);
+	else
+	{
+		* (int *) start_address = 0;
+		return;
+	}
+
+	curptr = start_address + sizeof(int);
+
+	foreach(lparam, serialized_param_exec_vals)
+	{
+		int16		typLen;
+		bool		typByVal;
+		Size		datumlength, length;
+		const char	*s;
+		SerializedParamExecData* param_val = (SerializedParamExecData*) lfirst(lparam);
+
+		retval = (SerializedParamExecData*) curptr;
+
+		retval->paramid	= param_val->paramid;
+		retval->value = param_val->value;
+		retval->isnull = param_val->isnull;
+		retval->ptype = param_val->ptype;
+
+		curptr = curptr + sizeof(SerializedParamExecData);
+
+		if (retval->isnull)
+			continue;
+
+		get_typlenbyval(retval->ptype, &typLen, &typByVal);
+
+		if (!typByVal)
+		{
+			datumlength = datumGetSize(retval->value, typByVal, typLen);
+			s = (char *) DatumGetPointer(retval->value);
+			memcpy(curptr, s, datumlength);
+			length = datumlength;
+			curptr[length] = '\0';
+			retval->length = length;
+			curptr += length + 1;
+		}
+	}
+}
+
+/*
+ * RestoreExecParams
+ *		Restore PARAM_EXEC parameters from the specified address.
+ *
+ * The params are palloc'd in CurrentMemoryContext.
+ */
+List *
+RestoreExecParams(char *start_address)
+{
+	List			*lparamexecvals = NIL;
+	//Size			size;
+	int				num_params,i;
+	char			*curptr;
+
+	num_params = * (int *) start_address;
+
+	if (num_params <= 0)
+		return NULL;
+
+	curptr = start_address + sizeof(int);
+
+	for (i = 0; i < num_params; i++)
+	{
+		SerializedParamExecData *nprm;
+		SerializedParamExecData	*outparam;
+		char	*s;
+		int16		typLen;
+		bool		typByVal;
+
+		nprm = (SerializedParamExecData *) curptr;
+
+		outparam = palloc0(sizeof(SerializedParamExecData));
+
+		/* copy the parameter info */
+		outparam->isnull = nprm->isnull;
+		outparam->value = nprm->value;
+		outparam->paramid = nprm->paramid;
+
+		curptr = curptr + sizeof(SerializedParamExecData);
+
+		if (nprm->isnull)
+			continue;
+
+		get_typlenbyval(nprm->ptype, &typLen, &typByVal);
+
+		if (!typByVal)
+		{
+			s = palloc(nprm->length + 1);
+			memcpy(s, curptr, nprm->length + 1);
+			outparam->value = CStringGetDatum(s);
+
+			curptr += nprm->length + 1;
+		}
+
+		lparamexecvals = lappend(lparamexecvals, outparam);
+	}
+
+	return lparamexecvals;
+}
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index f5a40fb..4749efe 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -29,6 +29,7 @@
 #include <math.h>
 
 #include "nodes/parsenodes.h"
+#include "nodes/plannodes.h"
 #include "nodes/readfuncs.h"
 
 
@@ -1391,6 +1392,125 @@ _readRangeTblFunction(void)
 	READ_DONE();
 }
 
+/*
+ * _readPlanInvalItem
+ */
+static PlanInvalItem *
+_readPlanInvalItem(void)
+{
+	READ_LOCALS(PlanInvalItem);
+
+	READ_INT_FIELD(cacheId);
+	READ_UINT_FIELD(hashValue);
+
+	READ_DONE();
+}
+
+/*
+ * _readPlannedStmt
+ */
+static PlannedStmt *
+_readPlannedStmt(void)
+{
+	READ_LOCALS(PlannedStmt);
+
+	READ_ENUM_FIELD(commandType, CmdType);
+	READ_UINT_FIELD(queryId);
+	READ_BOOL_FIELD(hasReturning);
+	READ_BOOL_FIELD(hasModifyingCTE);
+	READ_BOOL_FIELD(isUpsert);
+	READ_BOOL_FIELD(canSetTag);
+	READ_BOOL_FIELD(transientPlan);
+	READ_NODE_FIELD(planTree);
+	READ_NODE_FIELD(rtable);
+	READ_NODE_FIELD(resultRelations);
+	READ_NODE_FIELD(utilityStmt);
+	READ_NODE_FIELD(subplans);
+	READ_BITMAPSET_FIELD(rewindPlanIDs);
+	READ_NODE_FIELD(rowMarks);
+	READ_NODE_FIELD(relationOids);
+	READ_NODE_FIELD(invalItems);
+	READ_INT_FIELD(nParamExec);
+	READ_BOOL_FIELD(hasRowSecurity);
+	READ_BOOL_FIELD(parallelModeNeeded);
+
+	READ_DONE();
+}
+
+/*
+ * _readPlan
+ */
+static Plan *
+_readPlan(void)
+{
+	READ_LOCALS(Plan);
+
+	READ_FLOAT_FIELD(startup_cost);
+	READ_FLOAT_FIELD(total_cost);
+	READ_FLOAT_FIELD(plan_rows);
+	READ_INT_FIELD(plan_width);
+	READ_NODE_FIELD(targetlist);
+	READ_NODE_FIELD(qual);
+	READ_NODE_FIELD(lefttree);
+	READ_NODE_FIELD(righttree);
+	READ_NODE_FIELD(initPlan);
+	READ_BITMAPSET_FIELD(extParam);
+	READ_BITMAPSET_FIELD(allParam);
+
+	READ_DONE();
+}
+
+/*
+ * _readScan
+ */
+static Scan *
+_readScan(void)
+{
+	Plan *local_plan;
+	READ_LOCALS(PartialSeqScan);
+
+	local_plan = _readPlan();
+	local_node->plan.startup_cost = local_plan->startup_cost;
+	local_node->plan.total_cost = local_plan->total_cost;
+	local_node->plan.plan_rows = local_plan->plan_rows;
+	local_node->plan.plan_width = local_plan->plan_width;
+	local_node->plan.targetlist = local_plan->targetlist;
+	local_node->plan.qual = local_plan->qual;
+	local_node->plan.lefttree = local_plan->lefttree;
+	local_node->plan.righttree = local_plan->righttree;
+	local_node->plan.initPlan = local_plan->initPlan;
+	local_node->plan.extParam = local_plan->extParam;
+	local_node->plan.allParam = local_plan->allParam;
+	READ_UINT_FIELD(scanrelid);
+
+	READ_DONE();
+}
+
+/*
+ * _readResult
+ */
+static Result *
+_readResult(void)
+{
+	Plan *local_plan;
+	READ_LOCALS(Result);
+
+	local_plan = _readPlan();
+	local_node->plan.startup_cost = local_plan->startup_cost;
+	local_node->plan.total_cost = local_plan->total_cost;
+	local_node->plan.plan_rows = local_plan->plan_rows;
+	local_node->plan.plan_width = local_plan->plan_width;
+	local_node->plan.targetlist = local_plan->targetlist;
+	local_node->plan.qual = local_plan->qual;
+	local_node->plan.lefttree = local_plan->lefttree;
+	local_node->plan.righttree = local_plan->righttree;
+	local_node->plan.initPlan = local_plan->initPlan;
+	local_node->plan.extParam = local_plan->extParam;
+	local_node->plan.allParam = local_plan->allParam;
+	READ_NODE_FIELD(resconstantqual);
+
+	READ_DONE();
+}
 
 /*
  * parseNodeString
@@ -1532,6 +1652,14 @@ parseNodeString(void)
 		return_value = _readNotifyStmt();
 	else if (MATCH("DECLARECURSOR", 13))
 		return_value = _readDeclareCursorStmt();
+	else if (MATCH("PLANINVALITEM", 13))
+		return_value = _readPlanInvalItem();
+	else if (MATCH("PLANNEDSTMT", 11))
+		return_value = _readPlannedStmt();
+	else if (MATCH("PARTIALSEQSCAN", 14))
+		return_value = _readScan();
+	else if (MATCH("RESULT", 6))
+		return_value = _readResult();
 	else
 	{
 		elog(ERROR, "badly formatted node string \"%.32s\"...", token);
diff --git a/src/backend/optimizer/path/Makefile b/src/backend/optimizer/path/Makefile
index 6864a62..6e462b1 100644
--- a/src/backend/optimizer/path/Makefile
+++ b/src/backend/optimizer/path/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = allpaths.o clausesel.o costsize.o equivclass.o indxpath.o \
-       joinpath.o joinrels.o pathkeys.o tidpath.o
+       joinpath.o joinrels.o pathkeys.o parallelpath.o tidpath.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 1fd8763..6f45da6 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -424,6 +424,9 @@ set_plain_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
 	/* Consider sequential scan */
 	add_path(rel, create_seqscan_path(root, rel, required_outer));
 
+	/* Consider parallel scans */
+	create_parallelscan_paths(root, rel);
+
 	/* Consider index scans */
 	create_index_paths(root, rel);
 
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index c2b2b76..dec2853 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -11,6 +11,8 @@
  *	cpu_tuple_cost		Cost of typical CPU time to process a tuple
  *	cpu_index_tuple_cost  Cost of typical CPU time to process an index tuple
  *	cpu_operator_cost	Cost of CPU time to execute an operator or function
+ *	cpu_tuple_comm_cost	Cost of CPU time to pass a tuple from worker to master backend
+ *	parallel_setup_cost	Cost of setting up shared memory for parallelism
  *
  * We expect that the kernel will typically do some amount of read-ahead
  * optimization; this in conjunction with seek costs means that seq_page_cost
@@ -101,11 +103,15 @@ double		random_page_cost = DEFAULT_RANDOM_PAGE_COST;
 double		cpu_tuple_cost = DEFAULT_CPU_TUPLE_COST;
 double		cpu_index_tuple_cost = DEFAULT_CPU_INDEX_TUPLE_COST;
 double		cpu_operator_cost = DEFAULT_CPU_OPERATOR_COST;
+double		cpu_tuple_comm_cost = DEFAULT_CPU_TUPLE_COMM_COST;
+double		parallel_setup_cost = DEFAULT_PARALLEL_SETUP_COST;
 
 int			effective_cache_size = DEFAULT_EFFECTIVE_CACHE_SIZE;
 
 Cost		disable_cost = 1.0e10;
 
+int	parallel_seqscan_degree = 0;
+
 bool		enable_seqscan = true;
 bool		enable_indexscan = true;
 bool		enable_indexonlyscan = true;
@@ -287,6 +293,86 @@ cost_samplescan(Path *path, PlannerInfo *root, RelOptInfo *baserel)
 }
 
 /*
+ * cost_patialseqscan
+ *	  Determines and returns the cost of scanning a relation partially.
+ *
+ * 'baserel' is the relation to be scanned
+ * 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
+ * 'nworkers' is the number of workers among which the work will be
+ *			distributed
+ */
+void
+cost_patialseqscan(Path *path, PlannerInfo *root,
+				   RelOptInfo *baserel, ParamPathInfo *param_info,
+				   int nworkers)
+{
+	Cost		startup_cost = 0;
+	Cost		run_cost = 0;
+
+	cost_seqscan(path, root, baserel, param_info);
+
+	startup_cost = path->startup_cost;
+
+	run_cost = path->total_cost - startup_cost;
+
+	/*
+	 * Account for small cost for communication related to scan
+	 * via the ParallelHeapScanDesc.
+	 */
+	run_cost += 0.01;
+
+	/*
+	 * Runtime cost will be equally shared by all workers.
+	 * The assumption here is that disk access cost will also be
+	 * equally shared between workers, which is generally true
+	 * unless there are too many workers working on a relatively
+	 * small number of blocks.  If we come across any such case,
+	 * then we can think of changing the current cost model for
+	 * partial sequential scan.
+	 */
+	run_cost = run_cost / (nworkers + 1);
+
+	path->startup_cost = startup_cost;
+	path->total_cost = startup_cost + run_cost;
+}
+
+/*
+ * cost_funnel
+ *	  Determines and returns the cost of funnel path.
+ *
+ * 'baserel' is the relation to be scanned
+ * 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
+ */
+void
+cost_funnel(FunnelPath *path, PlannerInfo *root,
+			RelOptInfo *baserel, ParamPathInfo *param_info)
+{
+	Cost		startup_cost = 0;
+	Cost		run_cost = 0;
+
+	/* Should only be applied to base relations */
+	Assert(baserel->relid > 0);
+	Assert(baserel->rtekind == RTE_RELATION);
+
+	/* Mark the path with the correct row estimate */
+	if (param_info)
+		path->path.rows = param_info->ppi_rows;
+	else
+		path->path.rows = baserel->rows;
+
+	startup_cost = path->subpath->startup_cost;
+
+	run_cost = path->subpath->total_cost - path->subpath->startup_cost;
+
+	/* Parallel setup and communication cost. */
+	startup_cost += parallel_setup_cost;
+	run_cost += cpu_tuple_comm_cost * baserel->tuples;
+
+	path->path.startup_cost = startup_cost;
+	path->path.total_cost = (startup_cost + run_cost);
+}
+
+/*
  * cost_index
  *	  Determines and returns the cost of scanning a relation using an index.
  *
diff --git a/src/backend/optimizer/path/parallelpath.c b/src/backend/optimizer/path/parallelpath.c
new file mode 100644
index 0000000..bc71737
--- /dev/null
+++ b/src/backend/optimizer/path/parallelpath.c
@@ -0,0 +1,89 @@
+/*-------------------------------------------------------------------------
+ *
+ * parallelpath.c
+ *	  Routines to determine which conditions are usable for scanning
+ *	  a given relation, and create ParallelPaths accordingly.
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/optimizer/path/parallelpath.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heapam.h"
+#include "optimizer/cost.h"
+#include "optimizer/pathnode.h"
+#include "optimizer/paths.h"
+#include "parser/parsetree.h"
+#include "utils/rel.h"
+
+
+/*
+ * create_parallelscan_paths
+ *	  Create paths corresponding to parallel scans of the given rel.
+ *	  Currently we only support parallel sequential scan.
+ *
+ *	  Candidate paths are added to the rel's pathlist (using add_path).
+ */
+void
+create_parallelscan_paths(PlannerInfo *root, RelOptInfo *rel)
+{
+	int			num_parallel_workers = 0;
+	int			estimated_parallel_workers = 0;
+	Oid			reloid;
+	Relation	relation;
+	Path		*subpath;
+
+	/*
+	 * parallel scan is possible only if user has set
+	 * parallel_seqscan_degree to value greater than 0
+	 * and the query is parallel-safe.
+	 */
+	if (parallel_seqscan_degree <= 0 || !root->glob->parallelModeOK)
+		return;
+
+	/*
+	 * There should be at least a thousand pages to scan for each worker.
+	 * This number is somewhat arbitrary; however, we don't want to
+	 * spawn workers to scan smaller relations as that will be costly.
+	 */
+	estimated_parallel_workers = rel->pages / 1000;
+	
+	if (estimated_parallel_workers <= 0)
+		return;
+
+	reloid = planner_rt_fetch(rel->relid, root)->relid;
+
+	relation = heap_open(reloid, NoLock);
+
+	/*
+	 * Temporary relations can't be scanned by parallel workers as
+	 * they are visible only to local sessions.
+	 */
+	if (RelationUsesLocalBuffers(relation))
+	{
+		heap_close(relation, NoLock);
+		return;
+	}
+
+	heap_close(relation, NoLock);
+
+	if (parallel_seqscan_degree <= estimated_parallel_workers)
+		num_parallel_workers = parallel_seqscan_degree;
+	else
+		num_parallel_workers = estimated_parallel_workers;
+
+	/* Create the partial scan path which each worker needs to execute. */
+	subpath = create_partialseqscan_path(root, rel, false,
+										 num_parallel_workers);
+
+	/* Create the parallel scan path which master needs to execute. */
+	add_path(rel, (Path *) create_funnel_path(root, rel, subpath,
+											  num_parallel_workers));
+}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index b47ef46..4a57716 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -60,6 +60,10 @@ static SeqScan *create_seqscan_plan(PlannerInfo *root, Path *best_path,
 					List *tlist, List *scan_clauses);
 static SampleScan *create_samplescan_plan(PlannerInfo *root, Path *best_path,
 					List *tlist, List *scan_clauses);
+static Scan *create_partialseqscan_plan(PlannerInfo *root, Path *best_path,
+							List *tlist, List *scan_clauses);
+static Scan *create_funnel_plan(PlannerInfo *root,
+								FunnelPath *best_path);
 static Scan *create_indexscan_plan(PlannerInfo *root, IndexPath *best_path,
 					  List *tlist, List *scan_clauses, bool indexonly);
 static BitmapHeapScan *create_bitmap_scan_plan(PlannerInfo *root,
@@ -103,6 +107,11 @@ static void copy_path_costsize(Plan *dest, Path *src);
 static void copy_plan_costsize(Plan *dest, Plan *src);
 static SeqScan *make_seqscan(List *qptlist, List *qpqual, Index scanrelid);
 static SampleScan *make_samplescan(List *qptlist, List *qpqual, Index scanrelid);
+static PartialSeqScan *make_partialseqscan(List *qptlist, List *qpqual,
+									Index scanrelid);
+static Funnel *make_funnel(List *qptlist, List *qpqual,
+					Index scanrelid, int nworkers,
+					Plan *subplan);
 static IndexScan *make_indexscan(List *qptlist, List *qpqual, Index scanrelid,
 			   Oid indexid, List *indexqual, List *indexqualorig,
 			   List *indexorderby, List *indexorderbyorig,
@@ -233,6 +242,7 @@ create_plan_recurse(PlannerInfo *root, Path *best_path)
 	{
 		case T_SeqScan:
 		case T_SampleScan:
+		case T_PartialSeqScan:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -272,6 +282,10 @@ create_plan_recurse(PlannerInfo *root, Path *best_path)
 			plan = create_unique_plan(root,
 									  (UniquePath *) best_path);
 			break;
+		case T_Funnel:
+			plan = (Plan *) create_funnel_plan(root,
+											   (FunnelPath *) best_path);
+			break;
 		default:
 			elog(ERROR, "unrecognized node type: %d",
 				 (int) best_path->pathtype);
@@ -355,6 +369,13 @@ create_scan_plan(PlannerInfo *root, Path *best_path)
 												   scan_clauses);
 			break;
 
+		case T_PartialSeqScan:
+			plan = (Plan *) create_partialseqscan_plan(root,
+													   best_path,
+													   tlist,
+													   scan_clauses);
+			break;
+
 		case T_IndexScan:
 			plan = (Plan *) create_indexscan_plan(root,
 												  (IndexPath *) best_path,
@@ -559,6 +580,8 @@ disuse_physical_tlist(PlannerInfo *root, Plan *plan, Path *path)
 	{
 		case T_SeqScan:
 		case T_SampleScan:
+		case T_PartialSeqScan:
+		case T_Funnel:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -1186,6 +1209,107 @@ create_samplescan_plan(PlannerInfo *root, Path *best_path,
 }
 
 /*
+ * create_partialseqscan_plan
+ *
+ * Returns a partial seqscan plan for the base relation scanned by
+ * 'best_path' with restriction clauses 'scan_clauses' and targetlist
+ * 'tlist'.
+ */
+static Scan *
+create_partialseqscan_plan(PlannerInfo *root, Path *best_path,
+						   List *tlist, List *scan_clauses)
+{
+	Scan    *scan_plan;
+	Index		scan_relid = best_path->parent->relid;
+
+	/* it should be a base rel... */
+	Assert(scan_relid > 0);
+	Assert(best_path->parent->rtekind == RTE_RELATION);
+
+	/* Sort clauses into best execution order */
+	scan_clauses = order_qual_clauses(root, scan_clauses);
+
+	/* Reduce RestrictInfo list to bare expressions; ignore pseudoconstants */
+	scan_clauses = extract_actual_clauses(scan_clauses, false);
+
+	/* Replace any outer-relation variables with nestloop params */
+	if (best_path->param_info)
+	{
+		scan_clauses = (List *)
+			replace_nestloop_params(root, (Node *) scan_clauses);
+	}
+
+	scan_plan = (Scan *) make_partialseqscan(tlist,
+											 scan_clauses,
+											 scan_relid);
+
+	copy_path_costsize(&scan_plan->plan, best_path);
+
+	return scan_plan;
+}
+
+/*
+ * create_funnel_plan
+ *
+ * Returns a funnel plan for the base relation scanned by
+ * 'best_path'.
+ */
+static Scan *
+create_funnel_plan(PlannerInfo *root, FunnelPath *best_path)
+{
+	Scan    *scan_plan;
+	Plan	*subplan;
+	List	*tlist;
+	RelOptInfo *rel = best_path->path.parent;
+	Index	scan_relid = best_path->path.parent->relid;
+
+	/*
+	 * For table scans, rather than using the relation targetlist (which is
+	 * only those Vars actually needed by the query), we prefer to generate a
+	 * tlist containing all Vars in order.  This will allow the executor to
+	 * optimize away projection of the table tuples, if possible.  (Note that
+	 * planner.c may replace the tlist we generate here, forcing projection to
+	 * occur.)
+	 */
+	if (use_physical_tlist(root, rel))
+	{
+		tlist = build_physical_tlist(root, rel);
+		/* if fail because of dropped cols, use regular method */
+		if (tlist == NIL)
+			tlist = build_path_tlist(root, &best_path->path);
+	}
+	else
+	{
+		tlist = build_path_tlist(root, &best_path->path);
+	}
+
+	/* it should be a base rel... */
+	Assert(scan_relid > 0);
+	Assert(best_path->path.parent->rtekind == RTE_RELATION);
+
+	subplan = create_plan_recurse(root, best_path->subpath);
+
+	/*
+	 * The quals for the subplan and the top-level plan are the
+	 * same, as either all the quals are pushed to the subplan
+	 * (partialseqscan plan) or the parallel plan won't be
+	 * chosen.
+	 */
+	scan_plan = (Scan *) make_funnel(tlist,
+									 subplan->qual,
+									 scan_relid,
+									 best_path->num_workers,
+									 subplan);
+
+	copy_path_costsize(&scan_plan->plan, &best_path->path);
+
+	/* use parallel mode for parallel plans. */
+	root->glob->parallelModeNeeded = true;
+
+	return scan_plan;
+}
+
+/*
  * create_indexscan_plan
  *	  Returns an indexscan plan for the base relation scanned by 'best_path'
  *	  with restriction clauses 'scan_clauses' and targetlist 'tlist'.
@@ -3441,6 +3565,45 @@ make_samplescan(List *qptlist,
 	return node;
 }
 
+static PartialSeqScan *
+make_partialseqscan(List *qptlist,
+					List *qpqual,
+					Index scanrelid)
+{
+	PartialSeqScan *node = makeNode(PartialSeqScan);
+	Plan	   *plan = &node->plan;
+
+	/* cost should be inserted by caller */
+	plan->targetlist = qptlist;
+	plan->qual = qpqual;
+	plan->lefttree = NULL;
+	plan->righttree = NULL;
+	node->scanrelid = scanrelid;
+
+	return node;
+}
+
+static Funnel *
+make_funnel(List *qptlist,
+			List *qpqual,
+			Index scanrelid,
+			int nworkers,
+			Plan *subplan)
+{
+	Funnel *node = makeNode(Funnel);
+	Plan	   *plan = &node->scan.plan;
+
+	/* cost should be inserted by caller */
+	plan->targetlist = qptlist;
+	plan->qual = qpqual;
+	plan->lefttree = subplan;
+	plan->righttree = NULL;
+	node->scan.scanrelid = scanrelid;
+	node->num_workers = nworkers;
+
+	return node;
+}
+
 static IndexScan *
 make_indexscan(List *qptlist,
 			   List *qpqual,
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 70c2fcf..1cc2cdf 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -295,6 +295,52 @@ standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
 	return result;
 }
 
+PlannedStmt	*
+create_parallel_worker_plannedstmt(PartialSeqScan *partialscan,
+								   List *rangetable,
+								   int num_exec_params)
+{
+	PlannedStmt	*result;
+	ListCell   *tlist;
+
+	/*
+	 * Avoid removing junk entries in worker as those are
+	 * required by upper nodes in master backend.
+	 */
+	foreach(tlist, partialscan->plan.targetlist)
+	{
+		TargetEntry *tle = (TargetEntry *) lfirst(tlist);
+
+		tle->resjunk = false;
+	}
+
+	/* build the PlannedStmt result */
+	result = makeNode(PlannedStmt);
+
+	result->commandType = CMD_SELECT;
+	result->queryId = 0;
+	result->hasReturning = 0;
+	result->hasModifyingCTE = 0;
+	result->canSetTag = 1;
+	result->transientPlan = 0;
+	result->planTree = (Plan*) partialscan;
+	result->rtable = rangetable;
+	result->resultRelations = NIL;
+	result->utilityStmt = NULL;
+	result->subplans = NIL;
+	result->rewindPlanIDs = NULL;
+	result->rowMarks = NIL;
+	result->nParamExec = num_exec_params;
+	/*
+	 * Don't bother to set parameters used for invalidation as
+	 * worker backend plans are not saved, so can't be invalidated.
+	 */
+	result->relationOids = NIL;
+	result->invalItems = NIL;
+	result->hasRowSecurity = false;
+
+	return result;
+}
 
 /*--------------------
  * subquery_planner
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 90e13e4..c611e30 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -440,6 +440,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
 			{
 				SeqScan    *splan = (SeqScan *) plan;
 
@@ -461,6 +462,26 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 					fix_scan_list(root, splan->plan.qual, rtoffset);
 			}
 			break;
+		case T_Funnel:
+			{
+				Funnel    *splan = (Funnel *) plan;
+
+				/*
+				 * The target list for the partial sequence scan (lefttree of
+				 * the funnel plan) should be the same as for the funnel scan,
+				 * as both nodes need to produce the same projection.  We don't
+				 * want to do this assignment after fixing references, as that
+				 * will be done separately for the partial sequence scan node.
+				 */
+				splan->scan.plan.lefttree->targetlist = splan->scan.plan.targetlist;
+
+				splan->scan.scanrelid += rtoffset;
+				splan->scan.plan.targetlist =
+					fix_scan_list(root, splan->scan.plan.targetlist, rtoffset);
+				splan->scan.plan.qual =
+					fix_scan_list(root, splan->scan.plan.qual, rtoffset);
+			}
+			break;
 		case T_IndexScan:
 			{
 				IndexScan  *splan = (IndexScan *) plan;
@@ -2251,6 +2272,45 @@ fix_opfuncids_walker(Node *node, void *context)
 }
 
 /*
+ * fix_node_funcids
+ *		Set the opfuncid (procedure OID) in an OpExpr node,
+ *		for plan tree.
+ *
+ * We need it mainly to fix the opfuncid in nodes of plantree
+ * after reading the planned statement by worker backend.
+ * Currently the set of nodes that can be executed by a
+ * worker backend is limited, so we can enhance this API based
+ * on its usage in future.
+ */
+void
+fix_node_funcids(Plan *node)
+{
+	/*
+	 * do nothing when we get to the end of a leaf on tree.
+	 */
+	if (node == NULL)
+		return;
+
+	fix_opfuncids((Node*) node->qual);
+	fix_opfuncids((Node*) node->targetlist);
+
+	switch (nodeTag(node))
+	{
+		case T_Result:
+			fix_opfuncids((Node*) (((Result *)node)->resconstantqual));
+			break;
+		case T_PartialSeqScan:
+			break;
+		default:
+			elog(ERROR, "unrecognized node type: %d", (int) nodeTag(node));
+			break;
+	}
+
+	fix_node_funcids(node->lefttree);
+	fix_node_funcids(node->righttree);
+}
+
+/*
  * set_opfuncid
  *		Set the opfuncid (procedure OID) in an OpExpr node,
  *		if it hasn't been set already.
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index f80abb4..0fe3ac3 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2217,6 +2217,8 @@ finalize_plan(PlannerInfo *root, Plan *plan, Bitmapset *valid_params,
 
 		case T_SeqScan:
 		case T_SampleScan:
+		case T_PartialSeqScan:
+		case T_Funnel:
 			context.paramids = bms_add_members(context.paramids, scan_params);
 			break;
 
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 3fe2712..5098e70 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -726,6 +726,54 @@ create_samplescan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer
 }
 
 /*
+ * create_partialseqscan_path
+ *	  Creates a path corresponding to a partial sequential scan, returning the
+ *	  pathnode.
+ */
+Path *
+create_partialseqscan_path(PlannerInfo *root, RelOptInfo *rel,
+						   Relids required_outer, int nworkers)
+{
+	Path	   *pathnode = makeNode(Path);
+
+	pathnode->pathtype = T_PartialSeqScan;
+	pathnode->parent = rel;
+	pathnode->param_info = get_baserel_parampathinfo(root, rel,
+													 false);
+	pathnode->pathkeys = NIL;	/* partialseqscan has unordered result */
+
+	cost_patialseqscan(pathnode, root, rel, pathnode->param_info, nworkers);
+
+	return pathnode;
+}
+
+/*
+ * create_funnel_path
+ *
+ *	  Creates a path corresponding to a funnel scan, returning the
+ *	  pathnode.
+ */
+FunnelPath *
+create_funnel_path(PlannerInfo *root, RelOptInfo *rel,
+				   Path* subpath, int nworkers)
+{
+	FunnelPath	   *pathnode = makeNode(FunnelPath);
+
+	pathnode->path.pathtype = T_Funnel;
+	pathnode->path.parent = rel;
+	pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
+													 false);
+	pathnode->path.pathkeys = NIL;	/* seqscan has unordered result */
+
+	pathnode->subpath = subpath;
+	pathnode->num_workers = nworkers;
+
+	cost_funnel(pathnode, root, rel, pathnode->path.param_info);
+
+	return pathnode;
+}
+
+/*
  * create_index_path
  *	  Creates a path node for an index scan.
  *
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 71c2321..4aec92a 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -13,6 +13,7 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = autovacuum.o bgworker.o bgwriter.o checkpointer.o fork_process.o \
-	pgarch.o pgstat.o postmaster.o startup.o syslogger.o walwriter.o
+	pgarch.o pgstat.o postmaster.o startup.o syslogger.o \
+	walwriter.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 87f5430..c0f09ab 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -103,6 +103,7 @@
 #include "miscadmin.h"
 #include "pg_getopt.h"
 #include "pgstat.h"
+#include "optimizer/cost.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/bgworker_internals.h"
 #include "postmaster/fork_process.h"
@@ -835,6 +836,12 @@ PostmasterMain(int argc, char *argv[])
 		ereport(ERROR,
 				(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"archive\", \"hot_standby\", or \"logical\"")));
 
+	if (parallel_seqscan_degree >= MaxConnections)
+	{
+		write_stderr("%s: parallel_seqscan_degree must be less than max_connections\n", progname);
+		ExitPostmaster(1);
+	}
+
 	/*
 	 * Other one-time internal sanity checks can go here, if they are fast.
 	 * (Put any slow processing further down, after postmaster.pid creation.)
diff --git a/src/backend/storage/ipc/shm_mq.c b/src/backend/storage/ipc/shm_mq.c
index daca634..5f0d367 100644
--- a/src/backend/storage/ipc/shm_mq.c
+++ b/src/backend/storage/ipc/shm_mq.c
@@ -746,6 +746,15 @@ shm_mq_detach(shm_mq *mq)
 }
 
 /*
+ * Get the shm_mq from handle.
+ */
+shm_mq *
+shm_mq_get_queue(shm_mq_handle *mqh)
+{
+	return mqh->mqh_queue;
+}
+
+/*
  * Write bytes into a shared message queue.
  */
 static shm_mq_result
diff --git a/src/backend/tcop/dest.c b/src/backend/tcop/dest.c
index bcf3895..57014ee 100644
--- a/src/backend/tcop/dest.c
+++ b/src/backend/tcop/dest.c
@@ -34,6 +34,7 @@
 #include "commands/createas.h"
 #include "commands/matview.h"
 #include "executor/functions.h"
+#include "executor/tqueue.h"
 #include "executor/tstoreReceiver.h"
 #include "libpq/libpq.h"
 #include "libpq/pqformat.h"
@@ -44,9 +45,10 @@
  *		dummy DestReceiver functions
  * ----------------
  */
-static void
+static bool
 donothingReceive(TupleTableSlot *slot, DestReceiver *self)
 {
+	return true;
 }
 
 static void
@@ -129,6 +131,9 @@ CreateDestReceiver(CommandDest dest)
 
 		case DestTransientRel:
 			return CreateTransientRelDestReceiver(InvalidOid);
+
+		case DestTupleQueue:
+			return CreateTupleQueueDestReceiver();
 	}
 
 	/* should never get here */
@@ -162,6 +167,7 @@ EndCommand(const char *commandTag, CommandDest dest)
 		case DestCopyOut:
 		case DestSQLFunction:
 		case DestTransientRel:
+		case DestTupleQueue:
 			break;
 	}
 }
@@ -204,6 +210,7 @@ NullCommand(CommandDest dest)
 		case DestCopyOut:
 		case DestSQLFunction:
 		case DestTransientRel:
+		case DestTupleQueue:
 			break;
 	}
 }
@@ -248,6 +255,7 @@ ReadyForQuery(CommandDest dest)
 		case DestCopyOut:
 		case DestSQLFunction:
 		case DestTransientRel:
+		case DestTupleQueue:
 			break;
 	}
 }
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 7c18298..89938e2 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -42,6 +42,8 @@
 #include "catalog/pg_type.h"
 #include "commands/async.h"
 #include "commands/prepare.h"
+#include "executor/execParallel.h"
+#include "executor/tqueue.h"
 #include "libpq/libpq.h"
 #include "libpq/pqformat.h"
 #include "libpq/pqsignal.h"
@@ -1192,6 +1194,94 @@ exec_simple_query(const char *query_string)
 }
 
 /*
+ * exec_parallel_stmt
+ *
+ * Execute the plan for backend worker.
+ */
+void
+exec_parallel_stmt(ParallelStmt *parallelstmt)
+{
+	DestReceiver *receiver;
+	QueryDesc	*queryDesc;
+	MemoryContext oldcontext;
+	MemoryContext	plancontext;
+	BufferUsage bufusage_start;
+	BufferUsage bufusage_end = {0};
+
+	set_ps_display("SELECT", false);
+
+	/*
+	 * Unlike exec_simple_query(), in backend worker we won't allow
+	 * transaction control statements, so we can allow plancontext
+	 * to be created in TopTransaction context.
+	 */
+	plancontext = AllocSetContextCreate(CurrentMemoryContext,
+										"worker plan",
+										ALLOCSET_DEFAULT_MINSIZE,
+										ALLOCSET_DEFAULT_INITSIZE,
+										ALLOCSET_DEFAULT_MAXSIZE);
+
+	oldcontext = MemoryContextSwitchTo(plancontext);
+
+	receiver = CreateDestReceiver(DestTupleQueue);
+	SetTupleQueueDestReceiverParams(receiver, parallelstmt->responseq);
+
+	/* Create a QueryDesc for the query */
+	queryDesc = CreateQueryDesc(parallelstmt->plannedstmt, "",
+								GetActiveSnapshot(), InvalidSnapshot,
+								receiver, parallelstmt->params,
+								parallelstmt->inst_options);
+
+	PushActiveSnapshot(queryDesc->snapshot);
+
+	/* call ExecutorStart to prepare the plan for execution */
+	ExecutorStart(queryDesc, 0);
+
+	PopulateParamExecParams(queryDesc, parallelstmt->serialized_param_exec_vals);
+
+	bufusage_start = pgBufferUsage;
+
+	/* run the plan */
+	ExecutorRun(queryDesc, ForwardScanDirection, 0L);
+
+	/*
+	 * Calculate the buffer usage for this statement run; it is required
+	 * by plugins like pg_stat_statements to report the total usage for
+	 * statement execution.
+	 */
+	BufferUsageAccumDiff(&bufusage_end,
+						 &pgBufferUsage, &bufusage_start);
+
+	/* run cleanup too */
+	ExecutorFinish(queryDesc);
+
+	/* copy buffer usage into shared memory. */
+	memcpy(parallelstmt->buffer_usage,
+		   &bufusage_end,
+		   sizeof(BufferUsage));
+
+	/*
+	 * copy instrumentation information into shared memory if requested
+	 * by master backend.
+	 */
+	if (parallelstmt->inst_options)
+		memcpy(parallelstmt->instrument,
+			   queryDesc->planstate->instrument,
+			   sizeof(Instrumentation));
+
+	ExecutorEnd(queryDesc);
+
+	PopActiveSnapshot();
+
+	FreeQueryDesc(queryDesc);
+
+	if (!parallelstmt->inst_options)
+		(*receiver->rDestroy) (receiver);
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
+/*
  * exec_parse_message
  *
  * Execute a "Parse" protocol message.
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index bcffd85..50c717b 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1127,7 +1127,13 @@ RunFromStore(Portal portal, ScanDirection direction, long count,
 			if (!ok)
 				break;
 
-			(*dest->receiveSlot) (slot, dest);
+			/*
+			 * If we are not able to send the tuple, then we assume that
+			 * destination has closed and we won't be able to send any more
+			 * tuples so we just end the loop.
+			 */
+			if (!((*dest->receiveSlot) (slot, dest)))
+				break;
 
 			ExecClearTuple(slot);
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 3038d7c..6d9924b 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -603,6 +603,8 @@ const char *const config_group_names[] =
 	gettext_noop("Statistics / Query and Index Statistics Collector"),
 	/* AUTOVACUUM */
 	gettext_noop("Autovacuum"),
+	/* PARALLEL_QUERY */
+	gettext_noop("Parallel Query"),
 	/* CLIENT_CONN */
 	gettext_noop("Client Connection Defaults"),
 	/* CLIENT_CONN_STATEMENT */
@@ -2542,6 +2544,16 @@ static struct config_int ConfigureNamesInt[] =
 	},
 
 	{
+		{"parallel_seqscan_degree", PGC_SUSET, PARALLEL_QUERY,
+			gettext_noop("Sets the maximum number of simultaneously running backend worker processes."),
+			NULL
+		},
+		&parallel_seqscan_degree,
+		0, 0, MAX_BACKENDS,
+		NULL, NULL, NULL
+	},
+
+	{
 		{"autovacuum_work_mem", PGC_SIGHUP, RESOURCES_MEM,
 			gettext_noop("Sets the maximum memory to be used by each autovacuum worker process."),
 			NULL,
@@ -2729,6 +2741,26 @@ static struct config_real ConfigureNamesReal[] =
 		DEFAULT_CPU_OPERATOR_COST, 0, DBL_MAX,
 		NULL, NULL, NULL
 	},
+	{
+		{"cpu_tuple_comm_cost", PGC_USERSET, QUERY_TUNING_COST,
+			gettext_noop("Sets the planner's estimate of the cost of "
+						 "passing each tuple (row) from worker to master backend."),
+			NULL
+		},
+		&cpu_tuple_comm_cost,
+		DEFAULT_CPU_TUPLE_COMM_COST, 0, DBL_MAX,
+		NULL, NULL, NULL
+	},
+	{
+		{"parallel_setup_cost", PGC_USERSET, QUERY_TUNING_COST,
+			gettext_noop("Sets the planner's estimate of the cost of "
+						 "setting up environment (shared memory) for parallelism."),
+			NULL
+		},
+		&parallel_setup_cost,
+		DEFAULT_PARALLEL_SETUP_COST, 0, DBL_MAX,
+		NULL, NULL, NULL
+	},
 
 	{
 		{"cursor_tuple_fraction", PGC_USERSET, QUERY_TUNING_OTHER,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 06dfc06..32ff938 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -291,6 +291,8 @@
 #cpu_tuple_cost = 0.01			# same scale as above
 #cpu_index_tuple_cost = 0.005		# same scale as above
 #cpu_operator_cost = 0.0025		# same scale as above
+#cpu_tuple_comm_cost = 0.1		# same scale as above
+#parallel_setup_cost = 0.0	# same scale as above
 #effective_cache_size = 4GB
 
 # - Genetic Query Optimizer -
@@ -501,6 +503,11 @@
 					# autovacuum, -1 means use
 					# vacuum_cost_limit
 
+#------------------------------------------------------------------------------
+# PARALLEL_QUERY PARAMETERS
+#------------------------------------------------------------------------------
+
+#parallel_seqscan_degree = 0		# max number of worker backend subprocesses
 
 #------------------------------------------------------------------------------
 # CLIENT CONNECTION DEFAULTS
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index eec7c95..4700241 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -96,8 +96,9 @@ extern Relation heap_openrv_extended(const RangeVar *relation,
 
 #define heap_close(r,l)  relation_close(r,l)
 
-/* struct definition appears in relscan.h */
+/* struct definitions appear in relscan.h */
 typedef struct HeapScanDescData *HeapScanDesc;
+typedef struct ParallelHeapScanDescData *ParallelHeapScanDesc;
 
 /*
  * HeapScanIsValid
@@ -121,9 +122,15 @@ extern void heap_setscanlimits(HeapScanDesc scan, BlockNumber startBlk,
 		   BlockNumber endBlk);
 extern void heapgetpage(HeapScanDesc scan, BlockNumber page);
 extern void heap_rescan(HeapScanDesc scan, ScanKey key);
+extern void heap_parallel_rescan(ParallelHeapScanDesc pscan, HeapScanDesc scan);
 extern void heap_endscan(HeapScanDesc scan);
 extern HeapTuple heap_getnext(HeapScanDesc scan, ScanDirection direction);
 
+extern Size heap_parallelscan_estimate(Snapshot snapshot);
+extern void heap_parallelscan_initialize(ParallelHeapScanDesc target,
+							 Relation relation, Snapshot snapshot);
+extern HeapScanDesc heap_beginscan_parallel(Relation, ParallelHeapScanDesc);
+
 extern bool heap_fetch(Relation relation, Snapshot snapshot,
 		   HeapTuple tuple, Buffer *userbuf, bool keep_buf,
 		   Relation stats_relation);
diff --git a/src/include/access/printtup.h b/src/include/access/printtup.h
index 46c4148..92ec882 100644
--- a/src/include/access/printtup.h
+++ b/src/include/access/printtup.h
@@ -25,11 +25,11 @@ extern void SendRowDescriptionMessage(TupleDesc typeinfo, List *targetlist,
 
 extern void debugStartup(DestReceiver *self, int operation,
 			 TupleDesc typeinfo);
-extern void debugtup(TupleTableSlot *slot, DestReceiver *self);
+extern bool debugtup(TupleTableSlot *slot, DestReceiver *self);
 
 /* XXX these are really in executor/spi.c */
 extern void spi_dest_startup(DestReceiver *self, int operation,
 				 TupleDesc typeinfo);
-extern void spi_printtup(TupleTableSlot *slot, DestReceiver *self);
+extern bool spi_printtup(TupleTableSlot *slot, DestReceiver *self);
 
 #endif   /* PRINTTUP_H */
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 1b9b299..f28e9f9 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -20,6 +20,15 @@
 #include "access/itup.h"
 #include "access/tupdesc.h"
 
+/* Struct for parallel scan setup */
+typedef struct ParallelHeapScanDescData
+{
+	Oid			phs_relid;
+	BlockNumber	phs_nblocks;
+	slock_t		phs_mutex;
+	BlockNumber phs_cblock;
+	char		phs_snapshot_data[FLEXIBLE_ARRAY_MEMBER];
+}	ParallelHeapScanDescData;
 
 typedef struct HeapScanDescData
 {
@@ -49,6 +58,7 @@ typedef struct HeapScanDescData
 	BlockNumber rs_cblock;		/* current block # in scan, if any */
 	Buffer		rs_cbuf;		/* current buffer in scan, if any */
 	/* NB: if rs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
+	ParallelHeapScanDesc rs_parallel; /* parallel scan information */
 
 	/* these fields only used in page-at-a-time mode and for bitmap scans */
 	int			rs_cindex;		/* current tuple's index in vistuples */
diff --git a/src/include/executor/execParallel.h b/src/include/executor/execParallel.h
new file mode 100644
index 0000000..73006a8
--- /dev/null
+++ b/src/include/executor/execParallel.h
@@ -0,0 +1,65 @@
+/*--------------------------------------------------------------------
+ * execParallel.h
+ *		POSTGRES backend workers interface
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/executor/execParallel.h
+ *--------------------------------------------------------------------
+ */
+#ifndef EXECPARALLEL_H
+#define EXECPARALLEL_H
+
+/*---------------------------------------------------------------------
+ * External module API.
+ *---------------------------------------------------------------------
+ */
+
+#include "libpq/pqmq.h"
+#include "nodes/execnodes.h"
+#include "nodes/parsenodes.h"
+#include "nodes/plannodes.h"
+
+/* Table-of-contents constants for our dynamic shared memory segment. */
+#define	PARALLEL_KEY_PLANNEDSTMT	0
+#define	PARALLEL_KEY_PARAMS			1
+#define	PARALLEL_KEY_PARAMS_EXEC	2
+#define PARALLEL_KEY_BUFF_USAGE		3
+#define PARALLEL_KEY_INST_OPTIONS	4
+#define PARALLEL_KEY_INST_INFO		5
+#define PARALLEL_KEY_TUPLE_QUEUE	6
+#define PARALLEL_KEY_SCAN			7
+
+extern int	parallel_seqscan_degree;
+
+/* worker statement required for parallel execution. */
+typedef struct ParallelStmt
+{
+	PlannedStmt		*plannedstmt;
+	ParamListInfo	params;
+	List			*serialized_param_exec_vals;
+	shm_mq_handle	*responseq;
+	int				inst_options;
+	char			*instrument;
+	char			*buffer_usage;
+} ParallelStmt;
+
+extern void InitializeParallelWorkers(PlanState *planstate,
+									  List *serialized_param_exec_vals,
+									  EState *estate,
+									  char **inst_options_space,
+									  char **buffer_usage_space,
+									  shm_mq_handle ***responseqp,
+									  ParallelContext **pcxtp,
+									  int nWorkers);
+extern shm_toc *GetParallelShmToc(void);
+extern bool ExecParallelEstimate(Node *node, ParallelContext *pcxt,
+								 Size *pscan_size);
+extern bool ExecParallelInitializeDSM(Node *node, ParallelContext *pcxt,
+									  Size *pscan_size);
+extern bool ExecParallelBufferUsageAccum(Node *node);
+extern void ExecAssociateBufferStatsToDSM(BufferUsage *buf_usage,
+							  ParallelStmt *parallel_stmt);
+#endif   /* EXECPARALLEL_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index e60ab9f..c3e4e7f 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -273,6 +273,8 @@ extern TupleDesc ExecCleanTypeFromTL(List *targetList, bool hasoid);
 extern TupleDesc ExecTypeFromExprList(List *exprList);
 extern void ExecTypeSetColNames(TupleDesc typeInfo, List *namesList);
 extern void UpdateChangedParamSet(PlanState *node, Bitmapset *newchg);
+extern void PopulateParamExecParams(QueryDesc *queryDesc,
+						List *serialized_param_exec_vals);
 
 typedef struct TupOutputState
 {
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index c9a2129..0c7847d 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -69,5 +69,12 @@ extern Instrumentation *InstrAlloc(int n, int instrument_options);
 extern void InstrStartNode(Instrumentation *instr);
 extern void InstrStopNode(Instrumentation *instr, double nTuples);
 extern void InstrEndLoop(Instrumentation *instr);
+extern void InstrAggNode(Instrumentation *instr1, Instrumentation *instr2);
+extern void
+	InstrAggBufferUsage(BufferUsage *buffer_usage_dst, BufferUsage *buffer_usage_add);
+extern void BufferUsageAccumDiff(BufferUsage *dst,
+					 const BufferUsage *add,
+					 const BufferUsage *sub);
+extern void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
 
 #endif   /* INSTRUMENT_H */
diff --git a/src/include/executor/nodeFunnel.h b/src/include/executor/nodeFunnel.h
new file mode 100644
index 0000000..27d0b3d
--- /dev/null
+++ b/src/include/executor/nodeFunnel.h
@@ -0,0 +1,25 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeFunnel.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeFunnel.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEFUNNEL_H
+#define NODEFUNNEL_H
+
+#include "nodes/execnodes.h"
+
+extern FunnelState *ExecInitFunnel(Funnel *node, EState *estate, int eflags);
+extern TupleTableSlot *ExecFunnel(FunnelState *node);
+extern void ExecEndFunnel(FunnelState *node);
+extern void FinishParallelSetupAndAccumStats(FunnelState *node);
+extern void ExecReScanFunnel(FunnelState *node);
+
+#endif   /* NODEFUNNEL_H */
diff --git a/src/include/executor/nodePartialSeqscan.h b/src/include/executor/nodePartialSeqscan.h
new file mode 100644
index 0000000..47b8f73
--- /dev/null
+++ b/src/include/executor/nodePartialSeqscan.h
@@ -0,0 +1,25 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodePartialSeqscan.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodePartialSeqscan.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEPARTIALSEQSCAN_H
+#define NODEPARTIALSEQSCAN_H
+
+#include "nodes/execnodes.h"
+
+extern PartialSeqScanState *ExecInitPartialSeqScan(PartialSeqScan *node,
+											EState *estate, int eflags);
+extern TupleTableSlot *ExecPartialSeqScan(PartialSeqScanState *node);
+extern void ExecEndPartialSeqScan(PartialSeqScanState *node);
+extern void ExecReScanPartialSeqScan(PartialSeqScanState *node);
+
+#endif   /* NODEPARTIALSEQSCAN_H */
diff --git a/src/include/executor/nodeSubplan.h b/src/include/executor/nodeSubplan.h
index 3732ad4..21c745e 100644
--- a/src/include/executor/nodeSubplan.h
+++ b/src/include/executor/nodeSubplan.h
@@ -24,4 +24,7 @@ extern void ExecReScanSetParamPlan(SubPlanState *node, PlanState *parent);
 
 extern void ExecSetParamPlan(SubPlanState *node, ExprContext *econtext);
 
+extern List *
+ExecAndFormSerializeParamExec(ExprContext *econtext, Bitmapset *params);
+
 #endif   /* NODESUBPLAN_H */
diff --git a/src/include/executor/tqueue.h b/src/include/executor/tqueue.h
new file mode 100644
index 0000000..d2ddb6e
--- /dev/null
+++ b/src/include/executor/tqueue.h
@@ -0,0 +1,35 @@
+/*-------------------------------------------------------------------------
+ *
+ * tqueue.h
+ *	  Use shm_mq to send & receive tuples between parallel backends
+ *
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/tqueue.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef TQUEUE_H
+#define TQUEUE_H
+
+#include "storage/shm_mq.h"
+#include "tcop/dest.h"
+
+/* Use this to send tuples to a shm_mq. */
+extern DestReceiver *CreateTupleQueueDestReceiver(void);
+extern void SetTupleQueueDestReceiverParams(DestReceiver *self,
+						shm_mq_handle *handle);
+
+/* Use these to receive tuples from a shm_mq. */
+typedef struct TupleQueueFunnel TupleQueueFunnel;
+extern TupleQueueFunnel *CreateTupleQueueFunnel(void);
+extern void TupleQueueFunnelShutdown(TupleQueueFunnel *funnel);
+extern void DestroyTupleQueueFunnel(TupleQueueFunnel *funnel);
+extern void RegisterTupleQueueOnFunnel(TupleQueueFunnel *, shm_mq_handle *);
+extern HeapTuple TupleQueueFunnelNext(TupleQueueFunnel *, bool nowait,
+					 bool *done);
+
+#endif   /* TQUEUE_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 0a92cc4..09ffb08 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -16,7 +16,9 @@
 
 #include "access/genam.h"
 #include "access/heapam.h"
+#include "access/parallel.h"
 #include "executor/instrument.h"
+#include "executor/tqueue.h"
 #include "lib/pairingheap.h"
 #include "nodes/params.h"
 #include "nodes/plannodes.h"
@@ -401,6 +403,18 @@ typedef struct EState
 	List	   *es_auxmodifytables;		/* List of secondary ModifyTableStates */
 
 	/*
+	 * This is required for parallel plan execution to fetch the
+	 * information from dsm.
+	 */
+	shm_toc		*toc;
+
+	/*
+	 * This is required to collect buffer usage stats from parallel
+	 * workers when requested by plugins.
+	 */
+	bool		total_time;	/* total time spent in ExecutorRun */
+
+	/*
 	 * this ExprContext is for per-output-tuple operations, such as constraint
 	 * checks and index-value computations.  It will be reset for each output
 	 * tuple.  Note that it will be created only if needed.
@@ -1050,6 +1064,11 @@ typedef struct PlanState
 	 * State for management of parameter-change-driven rescanning
 	 */
 	Bitmapset  *chgParam;		/* set of IDs of changed Params */
+	/*
+	 * This is required for parallel plan execution to fetch the
+	 * information from dsm.
+	 */
+	shm_toc			*toc;
 
 	/*
 	 * Other run-time state needed by most if not all node types.
@@ -1264,6 +1283,45 @@ typedef struct SampleScanState
 } SampleScanState;
 
 /*
+ * PartialSeqScanState extends ScanState by storing additional information
+ * related to scan.
+ */
+typedef struct PartialSeqScanState
+{
+	ScanState		ss;				/* its first field is NodeTag */
+	bool			scan_initialized; /* used to determine if the scan is initialized */
+} PartialSeqScanState;
+
+/*
+ * FunnelState extends ScanState by storing additional information
+ * related to parallel workers.
+ *		pcxt				parallel context for managing generic state information
+ *							required for parallelism.
+ *		responseq			shared memory queues to receive data from workers.
+ *		funnel				maintains the runtime information about queues used to
+ *							receive data from parallel workers.
+ *		inst_options_space	to accumulate instrumentation information from all
+ *							parallel workers.
+ *		buffer_usage_space	to accumulate buffer usage information from all
+ *							parallel workers.
+ *		fs_workersReady		indicates that workers are launched.
+ *		all_workers_done	indicates that all the data from workers has been received.
+ *		local_scan_done		indicates that local scan is completed.
+ */
+typedef struct FunnelState
+{
+	ScanState		ss;				/* its first field is NodeTag */
+	ParallelContext *pcxt;
+	shm_mq_handle	**responseq;
+	TupleQueueFunnel *funnel;
+	char			*inst_options_space;
+	char			*buffer_usage_space;
+	bool			fs_workersReady;
+	bool			all_workers_done;
+	bool			local_scan_done;
+} FunnelState;
+
+/*
  * These structs store information about index quals that don't have simple
  * constant right-hand sides.  See comments for ExecIndexBuildScanKeys()
  * for discussion.
diff --git a/src/include/nodes/nodeFuncs.h b/src/include/nodes/nodeFuncs.h
index 7b1b1d6..df00d3d 100644
--- a/src/include/nodes/nodeFuncs.h
+++ b/src/include/nodes/nodeFuncs.h
@@ -13,6 +13,7 @@
 #ifndef NODEFUNCS_H
 #define NODEFUNCS_H
 
+#include "access/parallel.h"
 #include "nodes/parsenodes.h"
 
 
@@ -63,4 +64,7 @@ extern Node *query_or_expression_tree_mutator(Node *node, Node *(*mutator) (),
 extern bool raw_expression_tree_walker(Node *node, bool (*walker) (),
 												   void *context);
 
+extern bool planstate_tree_walker(Node *node, ParallelContext *pcxt,
+					  bool (*walker) (), void *context);
+
 #endif   /* NODEFUNCS_H */
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 669a0af..9eb344f 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -51,6 +51,8 @@ typedef enum NodeTag
 	T_BitmapOr,
 	T_Scan,
 	T_SeqScan,
+	T_PartialSeqScan,
+	T_Funnel,
 	T_IndexScan,
 	T_IndexOnlyScan,
 	T_BitmapIndexScan,
@@ -99,6 +101,8 @@ typedef enum NodeTag
 	T_ScanState,
 	T_SeqScanState,
 	T_SampleScanState,
+	T_PartialSeqScanState,
+	T_FunnelState,
 	T_IndexScanState,
 	T_IndexOnlyScanState,
 	T_BitmapIndexScanState,
@@ -223,6 +227,7 @@ typedef enum NodeTag
 	T_IndexOptInfo,
 	T_ParamPathInfo,
 	T_Path,
+	T_FunnelPath,
 	T_IndexPath,
 	T_BitmapHeapPath,
 	T_BitmapAndPath,
diff --git a/src/include/nodes/params.h b/src/include/nodes/params.h
index a0f7dd0..21c6f7a 100644
--- a/src/include/nodes/params.h
+++ b/src/include/nodes/params.h
@@ -14,6 +14,8 @@
 #ifndef PARAMS_H
 #define PARAMS_H
 
+#include "nodes/pg_list.h"
+
 /* To avoid including a pile of parser headers, reference ParseState thus: */
 struct ParseState;
 
@@ -96,11 +98,47 @@ typedef struct ParamExecData
 {
 	void	   *execPlan;		/* should be "SubPlanState *" */
 	Datum		value;
+	/*
+	 * parameter's datatype, or 0.  This is required so that
+	 * datum value can be read and used for other purposes like
+	 * passing it to worker backend via shared memory.  This is
+	 * required only for evaluation of initPlan's, however for
+	 * consistency we set this for Subplan as well.  We left it
+	 * for other cases like CTE or RecursiveUnion cases where this
+	 * structure is not used for evaluation of subplans.
+	 */
+	Oid			ptype;
 	bool		isnull;
 } ParamExecData;
 
+/*
+ * This structure is used to pass PARAM_EXEC parameters to backend
+ * workers.  For each PARAM_EXEC parameter, pass this structure
+ * followed by value except for pass-by-value parameters.
+ */
+typedef struct SerializedParamExecData
+{
+	int			paramid;			/* parameter id of this param */
+	Size		length;			/* length of parameter value */
+	Oid			ptype;			/* parameter's datatype, or 0 */
+	Datum		value;
+	bool		isnull;
+} SerializedParamExecData;
+
 
 /* Functions found in src/backend/nodes/params.c */
 extern ParamListInfo copyParamList(ParamListInfo from);
 
+extern Size
+EstimateBoundParametersSpace(ParamListInfo params);
+extern void
+SerializeBoundParams(ParamListInfo params, Size maxsize, char *start_address);
+extern ParamListInfo RestoreBoundParams(char *start_address);
+extern Size
+EstimateExecParametersSpace(List *serialized_param_exec_vals);
+extern void
+SerializeExecParams(List *serialized_param_exec_vals, Size maxsize,
+					char *start_address);
+List *
+RestoreExecParams(char *start_address);
 #endif   /* PARAMS_H */
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index ff45838..9cb1853 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -294,6 +294,22 @@ typedef Scan SeqScan;
 typedef Scan SampleScan;
 
 /* ----------------
+ *		partial sequential scan node
+ * ----------------
+ */
+typedef SeqScan PartialSeqScan;
+
+/* ----------------
+ *		parallel sequential scan node
+ * ----------------
+ */
+typedef struct Funnel
+{
+	Scan		scan;
+	int			num_workers;
+} Funnel;
+
+/* ----------------
  *		index scan node
  *
  * indexqualorig is an implicitly-ANDed list of index qual expressions, each
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index c652213..1eea6d3 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -752,6 +752,13 @@ typedef struct Path
 	/* pathkeys is a List of PathKey nodes; see above */
 } Path;
 
+typedef struct FunnelPath
+{
+	Path		path;
+	Path	    *subpath;	/* path for each worker */
+	int			num_workers;
+} FunnelPath;
+
 /* Macro for extracting a path's parameterization relids; beware double eval */
 #define PATH_REQ_OUTER(path)  \
 	((path)->param_info ? (path)->param_info->ppi_req_outer : (Relids) NULL)
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 24003ae..a1c9f59 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -26,6 +26,13 @@
 #define DEFAULT_CPU_TUPLE_COST	0.01
 #define DEFAULT_CPU_INDEX_TUPLE_COST 0.005
 #define DEFAULT_CPU_OPERATOR_COST  0.0025
+#define DEFAULT_CPU_TUPLE_COMM_COST 0.1
+/*
+ * XXX - We need some experiments to know what could be
+ * appropriate default values for parallel setup and startup
+ * cost.
+ */
+#define	DEFAULT_PARALLEL_SETUP_COST  0.0
 
 #define DEFAULT_EFFECTIVE_CACHE_SIZE  524288	/* measured in pages */
 
@@ -48,8 +55,11 @@ extern PGDLLIMPORT double random_page_cost;
 extern PGDLLIMPORT double cpu_tuple_cost;
 extern PGDLLIMPORT double cpu_index_tuple_cost;
 extern PGDLLIMPORT double cpu_operator_cost;
+extern PGDLLIMPORT double cpu_tuple_comm_cost;
+extern PGDLLIMPORT double parallel_setup_cost;
 extern PGDLLIMPORT int effective_cache_size;
 extern Cost disable_cost;
+extern int	parallel_seqscan_degree;
 extern bool enable_seqscan;
 extern bool enable_indexscan;
 extern bool enable_indexonlyscan;
@@ -69,6 +79,11 @@ extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,
 extern void cost_seqscan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 			 ParamPathInfo *param_info);
 extern void cost_samplescan(Path *path, PlannerInfo *root, RelOptInfo *baserel);
+extern void cost_patialseqscan(Path *path, PlannerInfo *root,
+						RelOptInfo *baserel, ParamPathInfo *param_info,
+						int nworkers);
+extern void cost_funnel(FunnelPath *path, PlannerInfo *root,
+					RelOptInfo *baserel, ParamPathInfo *param_info);
 extern void cost_index(IndexPath *path, PlannerInfo *root,
 		   double loop_count);
 extern void cost_bitmap_heap_scan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 89c8ded..e84f925 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -34,6 +34,10 @@ extern Path *create_seqscan_path(PlannerInfo *root, RelOptInfo *rel,
 					Relids required_outer);
 extern Path *create_samplescan_path(PlannerInfo *root, RelOptInfo *rel,
 									Relids required_outer);
+extern Path *create_partialseqscan_path(PlannerInfo *root, RelOptInfo *rel,
+					Relids required_outer, int nworkers);
+extern FunnelPath *create_funnel_path(PlannerInfo *root,
+					RelOptInfo *rel, Path *subpath, int nworkers);
 extern IndexPath *create_index_path(PlannerInfo *root,
 				  IndexOptInfo *index,
 				  List *indexclauses,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 3e2378a..bd8eb67 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -55,6 +55,13 @@ extern void debug_print_rel(PlannerInfo *root, RelOptInfo *rel);
 #endif
 
 /*
+ * parallelpath.c
+ *	  routines to generate parallel scan paths
+ */
+
+extern void create_parallelscan_paths(PlannerInfo *root, RelOptInfo *rel);
+
+/*
  * indxpath.c
  *	  routines to generate index paths
  */
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index 52b077a..67a8582 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -133,6 +133,7 @@ extern bool query_is_distinct_for(Query *query, List *colnos, List *opids);
  */
 extern Plan *set_plan_references(PlannerInfo *root, Plan *plan);
 extern void fix_opfuncids(Node *node);
+extern void fix_node_funcids(Plan *node);
 extern void set_opfuncid(OpExpr *opexpr);
 extern void set_sa_opfuncid(ScalarArrayOpExpr *opexpr);
 extern void record_plan_function_dependency(PlannerInfo *root, Oid funcid);
diff --git a/src/include/optimizer/planner.h b/src/include/optimizer/planner.h
index b10a504..8c7ce75 100644
--- a/src/include/optimizer/planner.h
+++ b/src/include/optimizer/planner.h
@@ -14,6 +14,7 @@
 #ifndef PLANNER_H
 #define PLANNER_H
 
+#include "nodes/parsenodes.h"
 #include "nodes/plannodes.h"
 #include "nodes/relation.h"
 
@@ -29,6 +30,8 @@ extern PlannedStmt *planner(Query *parse, int cursorOptions,
 		ParamListInfo boundParams);
 extern PlannedStmt *standard_planner(Query *parse, int cursorOptions,
 				 ParamListInfo boundParams);
+extern PlannedStmt	*create_parallel_worker_plannedstmt(PartialSeqScan *partialscan,
+											List *rangetable, int num_exec_params);
 
 extern Plan *subquery_planner(PlannerGlobal *glob, Query *parse,
 				 PlannerInfo *parent_root,
diff --git a/src/include/storage/shm_mq.h b/src/include/storage/shm_mq.h
index 085a8a7..74e288c 100644
--- a/src/include/storage/shm_mq.h
+++ b/src/include/storage/shm_mq.h
@@ -65,6 +65,9 @@ extern void shm_mq_set_handle(shm_mq_handle *, BackgroundWorkerHandle *);
 /* Break connection. */
 extern void shm_mq_detach(shm_mq *);
 
+/* Get the shm_mq from handle. */
+extern shm_mq *shm_mq_get_queue(shm_mq_handle *mqh);
+
 /* Send or receive messages. */
 extern shm_mq_result shm_mq_send(shm_mq_handle *mqh,
 			Size nbytes, const void *data, bool nowait);
diff --git a/src/include/tcop/dest.h b/src/include/tcop/dest.h
index 5bcca3f..91acd60 100644
--- a/src/include/tcop/dest.h
+++ b/src/include/tcop/dest.h
@@ -94,7 +94,8 @@ typedef enum
 	DestIntoRel,				/* results sent to relation (SELECT INTO) */
 	DestCopyOut,				/* results sent to COPY TO code */
 	DestSQLFunction,			/* results sent to SQL-language func mgr */
-	DestTransientRel			/* results sent to transient relation */
+	DestTransientRel,			/* results sent to transient relation */
+	DestTupleQueue				/* results sent to tuple queue */
 } CommandDest;
 
 /* ----------------
@@ -103,7 +104,9 @@ typedef enum
  *		pointers that the executor must call.
  *
  * Note: the receiveSlot routine must be passed a slot containing a TupleDesc
- * identical to the one given to the rStartup routine.
+ * identical to the one given to the rStartup routine.  It returns bool where
+ * a "true" value means "continue processing" and a "false" value means
+ * "stop early, just as if we'd reached the end of the scan".
  * ----------------
  */
 typedef struct _DestReceiver DestReceiver;
@@ -111,7 +114,7 @@ typedef struct _DestReceiver DestReceiver;
 struct _DestReceiver
 {
 	/* Called for each tuple to be output: */
-	void		(*receiveSlot) (TupleTableSlot *slot,
+	bool		(*receiveSlot) (TupleTableSlot *slot,
 											DestReceiver *self);
 	/* Per-executor-run initialization and shutdown: */
 	void		(*rStartup) (DestReceiver *self,
diff --git a/src/include/tcop/tcopprot.h b/src/include/tcop/tcopprot.h
index 96c5b8b..6f319c1 100644
--- a/src/include/tcop/tcopprot.h
+++ b/src/include/tcop/tcopprot.h
@@ -19,6 +19,7 @@
 #ifndef TCOPPROT_H
 #define TCOPPROT_H
 
+#include "executor/execParallel.h"
 #include "nodes/params.h"
 #include "nodes/parsenodes.h"
 #include "nodes/plannodes.h"
@@ -84,5 +85,6 @@ extern void set_debug_options(int debug_flag,
 extern bool set_plan_disabling_options(const char *arg,
 						   GucContext context, GucSource source);
 extern const char *get_stats_option_name(const char *arg);
+extern void exec_parallel_stmt(ParallelStmt *parallelscan);
 
 #endif   /* TCOPPROT_H */
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index c0f9cb9..38b91f8 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -85,6 +85,7 @@ enum config_group
 	STATS_MONITORING,
 	STATS_COLLECTOR,
 	AUTOVACUUM,
+	PARALLEL_QUERY,
 	CLIENT_CONN,
 	CLIENT_CONN_STATEMENT,
 	CLIENT_CONN_LOCALE,
#280Jeff Davis
pgsql@j-davis.com
In reply to: Amit Kapila (#279)
Re: Parallel Seq Scan

[Jumping in without catching up on entire thread. Please let me know
if these questions have already been covered.]

1. Can you change the name to something like ParallelHeapScan?
Parallel Sequential is a contradiction. (I know this is bikeshedding
and I won't protest further if you keep the name.)

2. Where is the speedup coming from? How much of it is CPU and IO
overlapping (i.e. not leaving disk or CPU idle while the other is
working), and how much from the CPU parallelism? I know this is
difficult to answer rigorously, but it would be nice to have some
breakdown even if for a specific machine.

Regards,
Jeff Davis

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#281Amit Kapila
amit.kapila16@gmail.com
In reply to: Jeff Davis (#280)
Re: Parallel Seq Scan

On Tue, Jun 30, 2015 at 4:00 AM, Jeff Davis <pgsql@j-davis.com> wrote:

[Jumping in without catching up on entire thread.

No problem.

Please let me know
if these questions have already been covered.]

1. Can you change the name to something like ParallelHeapScan?
Parallel Sequential is a contradiction. (I know this is bikeshedding
and I won't protest further if you keep the name.)

What exactly are you asking to rename?
The patch has two nodes, Funnel and PartialSeqScan. Funnel got its
name because it is quite generic and can be used in multiple ways
(other than a plain parallel sequential scan); the other node is named
PartialSeqScan because it performs part of a sequential scan.
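
As an illustration of the Funnel idea described above (one node collecting tuples that several partial scans produce), here is a minimal, self-contained sketch. It is not the patch's TupleQueueFunnel code: ToyQueue, ToyFunnel and toy_funnel_next are hypothetical names, plain integers stand in for tuples, and in-memory arrays stand in for the shm_mq queues. It only shows one plausible way to drain several worker queues round-robin.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one worker's tuple queue (not shm_mq). */
typedef struct ToyQueue
{
	const int  *items;		/* "tuples" produced by one worker */
	size_t		nitems;		/* number of items in the queue */
	size_t		next;		/* index of the next unread item */
} ToyQueue;

/* Hypothetical stand-in for the funnel: remembers which queue to try next. */
typedef struct ToyFunnel
{
	ToyQueue   *queues;
	size_t		nqueues;
	size_t		cursor;		/* queue to visit on the next call */
} ToyFunnel;

/*
 * Fetch the next item, visiting queues round-robin and skipping drained
 * ones.  Returns false once every queue is exhausted.
 */
bool
toy_funnel_next(ToyFunnel *f, int *out)
{
	size_t		tried;

	assert(f != NULL && out != NULL);

	for (tried = 0; tried < f->nqueues; tried++)
	{
		ToyQueue   *q = &f->queues[f->cursor];

		/* Advance the cursor so no single worker is favoured. */
		f->cursor = (f->cursor + 1) % f->nqueues;

		if (q->next < q->nitems)
		{
			*out = q->items[q->next++];
			return true;
		}
	}
	return false;
}
```

A caller would loop on toy_funnel_next() until it returns false. A real funnel would also have to cope with queues that are temporarily empty but not yet finished, which this sketch leaves out.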

2. Where is the speedup coming from? How much of it is CPU and IO
overlapping (i.e. not leaving disk or CPU idle while the other is
working), and how much from the CPU parallelism? I know this is
difficult to answer rigorously, but it would be nice to have some
breakdown even if for a specific machine.

Yes, you are right. We have done quite a bit of testing with this patch
(on the available hardware, with different approaches) to see how much
difference it makes for IO and CPU. With respect to IO, we found
that it doesn't help much [1], though it helps when the data is cached,
and there are really good benefits in terms of CPU [2].

In terms of completeness, I think we should add some documentation
for this patch. One way is to describe the execution mechanism in
src/backend/access/transam/README.parallel and then explain the
new configuration knobs in the documentation (.sgml files). We could
also have a separate page in the documentation under the Server
Programming section (Parallel Query -> Parallel Scan;
Parallel Scan Examples; ...)

Another thing to consider at this stage is whether we need to
break up this patch and, if so, how to split it into multiple patches
so that the review is easier to complete. I could see it being
split into 2 or 3 patches:
a. Infrastructure for parallel execution, like some of the stuff in
execparallel.c, heapam.c, tqueue.c, etc. and all other generic
(non-node-specific) code.
b. Node-specific (Funnel and PartialSeqScan) code for the optimizer
and executor.
c. Documentation

Suggestions?

[1]: /messages/by-id/CAA4eK1JHCmN2X1LjQ4bOmLApt+btOuid5Vqqk5G6dDFV69iyHg@mail.gmail.com
[2]: Refer to slides 14-15 of the PGCon presentation; I can repost the data here if required.
https://www.pgcon.org/2015/schedule/events/785.en.html

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#282Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Amit Kapila (#281)
Re: Parallel Seq Scan

On 2015-07-01 PM 02:37, Amit Kapila wrote:

In terms of completeness, I think we should add some documentation
for this patch, one way is to update about the execution mechanism in
src/backend/access/transam/README.parallel and then explain about
new configuration knobs in documentation (.sgml files). Also we
can have a separate page in itself in documentation under Server
Programming Section (Parallel Query -> Parallel Scan;
Parallel Scan Examples; ...)

Another thing to think about this patch at this stage do we need to
breakup this patch and if yes, how to break it up into multiple patches,
so that it can be easier to complete the review. I could see that it
can be splitted into 2 or 3 patches.
a. Infrastructure for parallel execution, like some of the stuff in
execparallel.c, heapam.c,tqueue.c, etc and all other generic
(non-nodes specific) code.
b. Nodes (Funnel and PartialSeqScan) specific code for optimiser
and executor.
c. Documentation

Suggestions?

A src/backend/executor/README.parallel?

Thanks,
Amit

#283Jeff Davis
pgsql@j-davis.com
In reply to: Amit Kapila (#281)
Re: Parallel Seq Scan

On Wed, 2015-07-01 at 11:07 +0530, Amit Kapila wrote:

For what you are asking to change name for?

There are still some places, at least in the comments, that call it a
parallel sequential scan.

a. Infrastructure for parallel execution, like some of the stuff in
execparallel.c, heapam.c,tqueue.c, etc and all other generic
(non-nodes specific) code.

Did you consider passing tuples through the tqueue by reference rather
than copying? The page should be pinned by the worker process, but
perhaps that's a bad assumption to make?

Regards,
Jeff Davis

#284Kouhei Kaigai
kaigai@ak.jp.nec.com
In reply to: Jeff Davis (#283)
Re: Parallel Seq Scan

a. Infrastructure for parallel execution, like some of the stuff in
execparallel.c, heapam.c,tqueue.c, etc and all other generic
(non-nodes specific) code.

Did you consider passing tuples through the tqueue by reference rather
than copying? The page should be pinned by the worker process, but
perhaps that's a bad assumption to make?

Could the upcoming PartialAggregate/FinalAggregate be a solution to this problem?
More or less, the Funnel node, which runs on a single core, has to process the
massive number of tuples that are fetched in parallel.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

-----Original Message-----
From: Jeff Davis [mailto:pgsql@j-davis.com]
Sent: Wednesday, July 01, 2015 4:51 PM
To: Amit Kapila
Cc: Robert Haas; Haribabu Kommi; Andres Freund; Kaigai Kouhei(海外 浩平); Amit
Langote; Amit Langote; Fabrízio Mello; Thom Brown; Stephen Frost; pgsql-hackers
Subject: Re: [HACKERS] Parallel Seq Scan

On Wed, 2015-07-01 at 11:07 +0530, Amit Kapila wrote:

For what you are asking to change name for?

There are still some places, at least in the comments, that call it a
parallel sequential scan.

a. Infrastructure for parallel execution, like some of the stuff in
execparallel.c, heapam.c,tqueue.c, etc and all other generic
(non-nodes specific) code.

Did you consider passing tuples through the tqueue by reference rather
than copying? The page should be pinned by the worker process, but
perhaps that's a bad assumption to make?

Regards,
Jeff Davis

#285Amit Kapila
amit.kapila16@gmail.com
In reply to: Jeff Davis (#283)
Re: Parallel Seq Scan

On Wed, Jul 1, 2015 at 1:21 PM, Jeff Davis <pgsql@j-davis.com> wrote:

On Wed, 2015-07-01 at 11:07 +0530, Amit Kapila wrote:

For what you are asking to change name for?

There are still some places, at least in the comments, that call it a
parallel sequential scan.

In the initial version of the patch there was only one node, a parallel
seqscan node, and the occurrences you are seeing are leftovers. I will
change them in the next patch.

a. Infrastructure for parallel execution, like some of the stuff in
execparallel.c, heapam.c,tqueue.c, etc and all other generic
(non-nodes specific) code.

Did you consider passing tuples through the tqueue by reference rather
than copying? The page should be pinned by the worker process, but
perhaps that's a bad assumption to make?

Yes, IIRC there was some discussion about this, and I haven't used that
approach for the reason you mentioned. It doesn't seem sane to hold the
pin on a page for a long time (we would need to retain the pin until the
master backend processes that tuple).
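
To make the trade-off concrete, here is a small sketch of the copy-based approach. It is an illustration only, not code from the patch: ToyPage, ToySlot and toy_send_by_copy are hypothetical names, and a plain counter stands in for a buffer pin. The point is that the pin is held only while the tuple bytes are copied, so the consumer can read the queued tuple later without the page staying pinned.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_TUPLE_LEN 16

/* Hypothetical stand-in for a shared buffer page with a pin count. */
typedef struct ToyPage
{
	int			pin_count;
	char		tuple[TOY_TUPLE_LEN];
} ToyPage;

/* Hypothetical queue slot that owns its own copy of the tuple bytes. */
typedef struct ToySlot
{
	char		data[TOY_TUPLE_LEN];
} ToySlot;

/*
 * Worker side: pin the page, copy the tuple into the queue slot, unpin.
 * The pin is held only for the duration of the copy.
 */
void
toy_send_by_copy(ToyPage *page, ToySlot *slot)
{
	assert(page != NULL && slot != NULL);

	page->pin_count++;			/* "pin" the page while reading it */
	memcpy(slot->data, page->tuple, TOY_TUPLE_LEN);
	page->pin_count--;			/* safe to unpin: the slot has its own copy */
}
```

Passing a pointer into the page instead would avoid the memcpy, but the pin could then not be dropped until the receiving backend had finished with the tuple, which is the long-held pin described above.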

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#286Gavin Flower
GavinFlower@archidevsys.co.nz
In reply to: Amit Kapila (#281)
Re: Parallel Seq Scan

On 01/07/15 17:37, Amit Kapila wrote:

On Tue, Jun 30, 2015 at 4:00 AM, Jeff Davis <pgsql@j-davis.com
<mailto:pgsql@j-davis.com>> wrote:

[Jumping in without catching up on entire thread.

[...]

.

2. Where is the speedup coming from? How much of it is CPU and IO
overlapping (i.e. not leaving disk or CPU idle while the other is
working), and how much from the CPU parallelism? I know this is
difficult to answer rigorously, but it would be nice to have some
breakdown even if for a specific machine.

Yes, you are right and we have done quite some testing (on the hardware
available) with this patch (with different approaches) to see how much
difference it creates for IO and CPU, with respect to IO we have found
that it doesn't help much [1], though it helps when the data is cached
and there are really good benefits in terms of CPU [2].

[...]

I assume your answer refers to a table on one spindle of spinning rust.

QUESTIONS:

1. what about I/O using an SSD?

2. what if the table is in a RAID array (of various types), would
having the table spread over multiple spindles help?

Cheers,
Gavin

#287Amit Kapila
amit.kapila16@gmail.com
In reply to: Gavin Flower (#286)
1 attachment(s)
Re: Parallel Seq Scan

On Thu, Jul 2, 2015 at 1:47 AM, Gavin Flower <GavinFlower@archidevsys.co.nz>
wrote:

On 01/07/15 17:37, Amit Kapila wrote:

Yes, you are right, and we have done quite some testing (on the hardware
available) with this patch (with different approaches) to see how much
difference it makes for IO and CPU. With respect to IO, we have found
that it doesn't help much [1], though it helps when the data is cached,
and there are really good benefits in terms of CPU [2].

[...]

I assume your answer refers to a table on one spindle of spinning rust.

QUESTIONS:

1. what about I/O using an SSD?

2. what if the table is in a RAID array (of various types), would
having the table spread over multiple spindles help?

I think it would be helpful to get numbers on more types of machines;
please feel free to test and share the data if you have access to such
machines.

Attached is the rebased version of the patch.
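
For readers skimming the patch, the block-allocation scheme in heap_parallelscan_nextpage can be sketched roughly as follows. This is a standalone, single-threaded simulation, not the patch code: SharedScan, next_block and simulate_scan are hypothetical names, and the spinlock the patch takes around the counter is omitted because the simulated workers simply take turns.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define INVALID_BLOCK UINT32_MAX	/* stand-in for InvalidBlockNumber */

/* Hypothetical stand-in for the shared scan state. */
typedef struct SharedScan
{
	uint32_t	nblocks;		/* blocks in the relation at scan start */
	uint32_t	cblock;			/* next block to hand out */
} SharedScan;

/* Hand out the next block number; the patch does this under a spinlock. */
static uint32_t
next_block(SharedScan *scan)
{
	uint32_t	page = INVALID_BLOCK;

	/* stop incrementing at INVALID_BLOCK so the counter cannot wrap */
	if (scan->cblock != INVALID_BLOCK)
		page = scan->cblock++;
	return page;
}

/*
 * Let nworkers take turns asking for blocks until none is left.
 * claimed[b] counts how often block b was handed out, per_worker[w]
 * counts the blocks given to worker w.  Returns the total handed out.
 */
static uint32_t
simulate_scan(SharedScan *scan, int nworkers,
			  uint32_t *claimed, uint32_t *per_worker)
{
	uint32_t	scanned = 0;
	bool		progress = true;

	assert(nworkers > 0);
	while (progress)
	{
		progress = false;
		for (int w = 0; w < nworkers; w++)
		{
			uint32_t	page = next_block(scan);

			if (page >= scan->nblocks)
				continue;		/* this worker sees end of scan */
			claimed[page]++;
			per_worker[w]++;
			scanned++;
			progress = true;
		}
	}
	return scanned;
}
```

In this simulation, 10 blocks and 3 workers result in every block being handed out exactly once, with the workers receiving 4, 3 and 3 blocks. In the real scan the split depends on how fast each worker processes its pages.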

Note - You need to first apply the assess-parallel-safety patch which you
can find at:
/messages/by-id/CAA4eK1JjsfE_dOsHTr_z1P_cBKi_X4C4X3d7Nv=VWX9fs7qdJA@mail.gmail.com

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

parallel_seqscan_v16.patchapplication/octet-stream; name=parallel_seqscan_v16.patchDownload
diff --git a/src/backend/access/common/printtup.c b/src/backend/access/common/printtup.c
index baed981..639451a 100644
--- a/src/backend/access/common/printtup.c
+++ b/src/backend/access/common/printtup.c
@@ -26,9 +26,9 @@
 
 static void printtup_startup(DestReceiver *self, int operation,
 				 TupleDesc typeinfo);
-static void printtup(TupleTableSlot *slot, DestReceiver *self);
-static void printtup_20(TupleTableSlot *slot, DestReceiver *self);
-static void printtup_internal_20(TupleTableSlot *slot, DestReceiver *self);
+static bool printtup(TupleTableSlot *slot, DestReceiver *self);
+static bool printtup_20(TupleTableSlot *slot, DestReceiver *self);
+static bool printtup_internal_20(TupleTableSlot *slot, DestReceiver *self);
 static void printtup_shutdown(DestReceiver *self);
 static void printtup_destroy(DestReceiver *self);
 
@@ -299,7 +299,7 @@ printtup_prepare_info(DR_printtup *myState, TupleDesc typeinfo, int numAttrs)
  *		printtup --- print a tuple in protocol 3.0
  * ----------------
  */
-static void
+static bool
 printtup(TupleTableSlot *slot, DestReceiver *self)
 {
 	TupleDesc	typeinfo = slot->tts_tupleDescriptor;
@@ -376,13 +376,15 @@ printtup(TupleTableSlot *slot, DestReceiver *self)
 	/* Return to caller's context, and flush row's temporary memory */
 	MemoryContextSwitchTo(oldcontext);
 	MemoryContextReset(myState->tmpcontext);
+
+	return true;
 }
 
 /* ----------------
  *		printtup_20 --- print a tuple in protocol 2.0
  * ----------------
  */
-static void
+static bool
 printtup_20(TupleTableSlot *slot, DestReceiver *self)
 {
 	TupleDesc	typeinfo = slot->tts_tupleDescriptor;
@@ -452,6 +454,8 @@ printtup_20(TupleTableSlot *slot, DestReceiver *self)
 	/* Return to caller's context, and flush row's temporary memory */
 	MemoryContextSwitchTo(oldcontext);
 	MemoryContextReset(myState->tmpcontext);
+
+	return true;
 }
 
 /* ----------------
@@ -528,7 +532,7 @@ debugStartup(DestReceiver *self, int operation, TupleDesc typeinfo)
  *		debugtup - print one tuple for an interactive backend
  * ----------------
  */
-void
+bool
 debugtup(TupleTableSlot *slot, DestReceiver *self)
 {
 	TupleDesc	typeinfo = slot->tts_tupleDescriptor;
@@ -553,6 +557,8 @@ debugtup(TupleTableSlot *slot, DestReceiver *self)
 		printatt((unsigned) i + 1, typeinfo->attrs[i], value);
 	}
 	printf("\t----\n");
+
+	return true;
 }
 
 /* ----------------
@@ -564,7 +570,7 @@ debugtup(TupleTableSlot *slot, DestReceiver *self)
  * This is largely same as printtup_20, except we use binary formatting.
  * ----------------
  */
-static void
+static bool
 printtup_internal_20(TupleTableSlot *slot, DestReceiver *self)
 {
 	TupleDesc	typeinfo = slot->tts_tupleDescriptor;
@@ -636,4 +642,6 @@ printtup_internal_20(TupleTableSlot *slot, DestReceiver *self)
 	/* Return to caller's context, and flush row's temporary memory */
 	MemoryContextSwitchTo(oldcontext);
 	MemoryContextReset(myState->tmpcontext);
+
+	return true;
 }
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 86a2e6b..f5242a4 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -63,6 +63,7 @@
 #include "storage/predicate.h"
 #include "storage/procarray.h"
 #include "storage/smgr.h"
+#include "storage/spin.h"
 #include "storage/standby.h"
 #include "utils/datum.h"
 #include "utils/inval.h"
@@ -80,9 +81,11 @@ bool		synchronize_seqscans = true;
 static HeapScanDesc heap_beginscan_internal(Relation relation,
 						Snapshot snapshot,
 						int nkeys, ScanKey key,
+						ParallelHeapScanDesc parallel_scan,
 					  bool allow_strat, bool allow_sync, bool allow_pagemode,
 						bool is_bitmapscan, bool is_samplescan,
 						bool temp_snap);
+static BlockNumber heap_parallelscan_nextpage(ParallelHeapScanDesc);
 static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
 					TransactionId xid, CommandId cid, int options);
 static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
@@ -223,7 +226,10 @@ initscan(HeapScanDesc scan, ScanKey key, bool is_rescan)
 	 * results for a non-MVCC snapshot, the caller must hold some higher-level
 	 * lock that ensures the interesting tuple(s) won't change.)
 	 */
-	scan->rs_nblocks = RelationGetNumberOfBlocks(scan->rs_rd);
+	if (scan->rs_parallel != NULL)
+		scan->rs_nblocks = scan->rs_parallel->phs_nblocks;
+	else
+		scan->rs_nblocks = RelationGetNumberOfBlocks(scan->rs_rd);
 
 	/*
 	 * If the table is large relative to NBuffers, use a bulk-read access
@@ -483,7 +489,18 @@ heapgettup(HeapScanDesc scan,
 				tuple->t_data = NULL;
 				return;
 			}
-			page = scan->rs_startblock; /* first page */
+			if (scan->rs_parallel != NULL)
+			{
+				page = heap_parallelscan_nextpage(scan->rs_parallel);
+				if (page >= scan->rs_nblocks)
+				{
+					Assert(!BufferIsValid(scan->rs_cbuf));
+					tuple->t_data = NULL;
+					return;
+				}
+			}
+			else
+				page = scan->rs_startblock; /* first page */
 			heapgetpage(scan, page);
 			lineoff = FirstOffsetNumber;		/* first offnum */
 			scan->rs_inited = true;
@@ -506,6 +523,9 @@ heapgettup(HeapScanDesc scan,
 	}
 	else if (backward)
 	{
+		/* backward parallel scan not supported */
+		Assert(scan->rs_parallel == NULL);
+
 		if (!scan->rs_inited)
 		{
 			/*
@@ -658,11 +678,19 @@ heapgettup(HeapScanDesc scan,
 		}
 		else
 		{
-			page++;
-			if (page >= scan->rs_nblocks)
-				page = 0;
-			finished = (page == scan->rs_startblock) ||
-				(scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
+			if (scan->rs_parallel != NULL)
+			{
+				page = heap_parallelscan_nextpage(scan->rs_parallel);
+				finished = (page >= scan->rs_nblocks);
+			}
+			else
+			{
+				page++;
+				if (page >= scan->rs_nblocks)
+					page = 0;
+				finished = (page == scan->rs_startblock) ||
+					(scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
+			}
 
 			/*
 			 * Report our new scan position for synchronization purposes. We
@@ -760,7 +788,18 @@ heapgettup_pagemode(HeapScanDesc scan,
 				tuple->t_data = NULL;
 				return;
 			}
-			page = scan->rs_startblock; /* first page */
+			if (scan->rs_parallel != NULL)
+			{
+				page = heap_parallelscan_nextpage(scan->rs_parallel);
+				if (page >= scan->rs_nblocks)
+				{
+					Assert(!BufferIsValid(scan->rs_cbuf));
+					tuple->t_data = NULL;
+					return;
+				}
+			}
+			else
+				page = scan->rs_startblock; /* first page */
 			heapgetpage(scan, page);
 			lineindex = 0;
 			scan->rs_inited = true;
@@ -780,6 +819,9 @@ heapgettup_pagemode(HeapScanDesc scan,
 	}
 	else if (backward)
 	{
+		/* backward parallel scan not supported */
+		Assert(scan->rs_parallel == NULL);
+
 		if (!scan->rs_inited)
 		{
 			/*
@@ -921,11 +963,19 @@ heapgettup_pagemode(HeapScanDesc scan,
 		}
 		else
 		{
-			page++;
-			if (page >= scan->rs_nblocks)
-				page = 0;
-			finished = (page == scan->rs_startblock) ||
-				(scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
+			if (scan->rs_parallel != NULL)
+			{
+				page = heap_parallelscan_nextpage(scan->rs_parallel);
+				finished = (page >= scan->rs_nblocks);
+			}
+			else
+			{
+				page++;
+				if (page >= scan->rs_nblocks)
+					page = 0;
+				finished = (page == scan->rs_startblock) ||
+					(scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
+			}
 
 			/*
 			 * Report our new scan position for synchronization purposes. We
@@ -1321,7 +1371,7 @@ HeapScanDesc
 heap_beginscan(Relation relation, Snapshot snapshot,
 			   int nkeys, ScanKey key)
 {
-	return heap_beginscan_internal(relation, snapshot, nkeys, key,
+	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
 								   true, true, true, false, false, false);
 }
 
@@ -1331,7 +1381,7 @@ heap_beginscan_catalog(Relation relation, int nkeys, ScanKey key)
 	Oid			relid = RelationGetRelid(relation);
 	Snapshot	snapshot = RegisterSnapshot(GetCatalogSnapshot(relid));
 
-	return heap_beginscan_internal(relation, snapshot, nkeys, key,
+	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
 								   true, true, true, false, false, true);
 }
 
@@ -1340,7 +1390,7 @@ heap_beginscan_strat(Relation relation, Snapshot snapshot,
 					 int nkeys, ScanKey key,
 					 bool allow_strat, bool allow_sync)
 {
-	return heap_beginscan_internal(relation, snapshot, nkeys, key,
+	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
 								   allow_strat, allow_sync, true,
 								   false, false, false);
 }
@@ -1349,7 +1399,7 @@ HeapScanDesc
 heap_beginscan_bm(Relation relation, Snapshot snapshot,
 				  int nkeys, ScanKey key)
 {
-	return heap_beginscan_internal(relation, snapshot, nkeys, key,
+	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
 								   false, false, true, true, false, false);
 }
 
@@ -1358,7 +1408,7 @@ heap_beginscan_sampling(Relation relation, Snapshot snapshot,
 						int nkeys, ScanKey key,
 						bool allow_strat, bool allow_pagemode)
 {
-	return heap_beginscan_internal(relation, snapshot, nkeys, key,
+	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
 								   allow_strat, false, allow_pagemode,
 								   false, true, false);
 }
@@ -1366,6 +1416,7 @@ heap_beginscan_sampling(Relation relation, Snapshot snapshot,
 static HeapScanDesc
 heap_beginscan_internal(Relation relation, Snapshot snapshot,
 						int nkeys, ScanKey key,
+						ParallelHeapScanDesc parallel_scan,
 					  bool allow_strat, bool allow_sync, bool allow_pagemode,
 					  bool is_bitmapscan, bool is_samplescan, bool temp_snap)
 {
@@ -1394,6 +1445,7 @@ heap_beginscan_internal(Relation relation, Snapshot snapshot,
 	scan->rs_allow_strat = allow_strat;
 	scan->rs_allow_sync = allow_sync;
 	scan->rs_temp_snap = temp_snap;
+	scan->rs_parallel = parallel_scan;
 
 	/*
 	 * we can use page-at-a-time mode if it's an MVCC-safe snapshot
@@ -1487,6 +1539,94 @@ heap_endscan(HeapScanDesc scan)
 }
 
 /* ----------------
+ *		heap_parallelscan_estimate - estimate storage for ParallelHeapScanDesc
+ *
+ *		Sadly, this doesn't reduce to a constant, because the size required
+ *		to serialize the snapshot can vary.
+ * ----------------
+ */
+Size
+heap_parallelscan_estimate(Snapshot snapshot)
+{
+	return add_size(offsetof(ParallelHeapScanDescData, phs_snapshot_data),
+					EstimateSnapshotSpace(snapshot));
+}
+
+/* ----------------
+ *		heap_parallelscan_initialize - initialize ParallelHeapScanDesc
+ *
+ *		Must allow as many bytes of shared memory as returned by
+ *		heap_parallelscan_estimate.  Call this just once in the leader
+ *		process; then, individual workers attach via heap_beginscan_parallel.
+ * ----------------
+ */
+void
+heap_parallelscan_initialize(ParallelHeapScanDesc target, Relation relation,
+							 Snapshot snapshot)
+{
+	target->phs_relid = RelationGetRelid(relation);
+	target->phs_nblocks = RelationGetNumberOfBlocks(relation);
+	SpinLockInit(&target->phs_mutex);
+	target->phs_cblock = 0;
+	SerializeSnapshot(snapshot, target->phs_snapshot_data);
+}
+/* ----------------
+ *		heap_parallelscan_nextpage - get the next page to scan
+ *
+ *		A return value greater than or equal to the number of blocks to be
+ *		scanned indicates end of scan.  Note, however, that other backends could
+ *		still be scanning if they grabbed a page and aren't done with it yet.
+ * ----------------
+ */
+static BlockNumber
+heap_parallelscan_nextpage(ParallelHeapScanDesc parallel_scan)
+{
+	BlockNumber	page = InvalidBlockNumber;
+
+	/* we treat InvalidBlockNumber specially here to avoid overflow */
+	SpinLockAcquire(&parallel_scan->phs_mutex);
+	if (parallel_scan->phs_cblock != InvalidBlockNumber)
+		page = parallel_scan->phs_cblock++;
+	SpinLockRelease(&parallel_scan->phs_mutex);
+
+	return page;
+}
+
+/* ----------------
+ *		heap_beginscan_parallel - join a parallel scan
+ *
+ *		Caller must hold a suitable lock on the correct relation.
+ * ----------------
+ */
+HeapScanDesc
+heap_beginscan_parallel(Relation relation, ParallelHeapScanDesc parallel_scan)
+{
+	Snapshot		snapshot;
+
+	Assert(RelationGetRelid(relation) == parallel_scan->phs_relid);
+	snapshot = RestoreSnapshot(parallel_scan->phs_snapshot_data);
+	RegisterSnapshot(snapshot);
+
+	return heap_beginscan_internal(relation, snapshot, 0, NULL, parallel_scan,
+								   true, true, true, false, false, true);
+}
+
+/* ----------------
+ *		heap_parallel_rescan		- restart a parallel relation scan
+ * ----------------
+ */
+void
+heap_parallel_rescan(ParallelHeapScanDesc pscan,
+					 HeapScanDesc scan)
+{
+	if (pscan != NULL)
+		scan->rs_parallel = pscan;
+
+	heap_rescan(scan,			/* scan desc */
+				NULL);			/* new scan keys */
+}
+
+/* ----------------
  *		heap_getnext	- retrieve next tuple in scan
  *
  *		Fix to work with index relations.
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 8904676..47063c7 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -4395,7 +4395,7 @@ copy_dest_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
 /*
  * copy_dest_receive --- receive one tuple
  */
-static void
+static bool
 copy_dest_receive(TupleTableSlot *slot, DestReceiver *self)
 {
 	DR_copy    *myState = (DR_copy *) self;
@@ -4407,6 +4407,8 @@ copy_dest_receive(TupleTableSlot *slot, DestReceiver *self)
 	/* And send the data */
 	CopyOneRowTo(cstate, InvalidOid, slot->tts_values, slot->tts_isnull);
 	myState->processed++;
+
+	return true;
 }
 
 /*
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 41183f6..418b0f6 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -62,7 +62,7 @@ typedef struct
 static ObjectAddress CreateAsReladdr = {InvalidOid, InvalidOid, 0};
 
 static void intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo);
-static void intorel_receive(TupleTableSlot *slot, DestReceiver *self);
+static bool intorel_receive(TupleTableSlot *slot, DestReceiver *self);
 static void intorel_shutdown(DestReceiver *self);
 static void intorel_destroy(DestReceiver *self);
 
@@ -482,7 +482,7 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
 /*
  * intorel_receive --- receive one tuple
  */
-static void
+static bool
 intorel_receive(TupleTableSlot *slot, DestReceiver *self)
 {
 	DR_intorel *myState = (DR_intorel *) self;
@@ -507,6 +507,8 @@ intorel_receive(TupleTableSlot *slot, DestReceiver *self)
 				myState->bistate);
 
 	/* We know this is a newly created relation, so there are no indexes */
+
+	return true;
 }
 
 /*
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 2b930f7..9d6b663 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -20,6 +20,7 @@
 #include "commands/defrem.h"
 #include "commands/prepare.h"
 #include "executor/hashjoin.h"
+#include "executor/nodeFunnel.h"
 #include "foreign/fdwapi.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
@@ -730,6 +731,8 @@ ExplainPreScanNode(PlanState *planstate, Bitmapset **rels_used)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
+		case T_Funnel:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -935,6 +938,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_SeqScan:
 			pname = sname = "Seq Scan";
 			break;
+		case T_PartialSeqScan:
+			pname = sname = "Partial Seq Scan";
+			break;
+		case T_Funnel:
+			pname = sname = "Funnel";
+			break;
 		case T_IndexScan:
 			pname = sname = "Index Scan";
 			break;
@@ -1101,6 +1110,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
+		case T_Funnel:
 		case T_BitmapHeapScan:
 		case T_TidScan:
 		case T_SubqueryScan:
@@ -1248,6 +1259,16 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	}
 
 	/*
+	 * Aggregate instrumentation information of all the backend
+	 * workers for Funnel node.  Though we already accumulate this
+	 * information when last tuple is fetched from Funnel node, this
+	 * is to cover cases when we don't fetch all tuples from a node
+	 * such as for Limit node.
+	 */
+	if (es->analyze && nodeTag(plan) == T_Funnel)
+		FinishParallelSetupAndAccumStats((FunnelState *)planstate);
+
+	/*
 	 * We have to forcibly clean up the instrumentation state because we
 	 * haven't done ExecutorEnd yet.  This is pretty grotty ...
 	 *
@@ -1364,6 +1385,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 				show_tidbitmap_info((BitmapHeapScanState *) planstate, es);
 			break;
 		case T_SeqScan:
+		case T_PartialSeqScan:
 		case T_ValuesScan:
 		case T_CteScan:
 		case T_WorkTableScan:
@@ -1374,6 +1396,14 @@ ExplainNode(PlanState *planstate, List *ancestors,
 				show_instrumentation_count("Rows Removed by Filter", 1,
 										   planstate, es);
 			break;
+		case T_Funnel:
+			show_scan_qual(plan->qual, "Filter", planstate, ancestors, es);
+			if (plan->qual)
+				show_instrumentation_count("Rows Removed by Filter", 1,
+										   planstate, es);
+			ExplainPropertyInteger("Number of Workers",
+				((Funnel *) plan)->num_workers, es);
+			break;
 		case T_FunctionScan:
 			if (es->verbose)
 			{
@@ -2366,6 +2396,8 @@ ExplainTargetRel(Plan *plan, Index rti, ExplainState *es)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
+		case T_Funnel:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 5492e59..750a59c 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -56,7 +56,7 @@ typedef struct
 static int	matview_maintenance_depth = 0;
 
 static void transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo);
-static void transientrel_receive(TupleTableSlot *slot, DestReceiver *self);
+static bool transientrel_receive(TupleTableSlot *slot, DestReceiver *self);
 static void transientrel_shutdown(DestReceiver *self);
 static void transientrel_destroy(DestReceiver *self);
 static void refresh_matview_datafill(DestReceiver *dest, Query *query,
@@ -422,7 +422,7 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
 /*
  * transientrel_receive --- receive one tuple
  */
-static void
+static bool
 transientrel_receive(TupleTableSlot *slot, DestReceiver *self)
 {
 	DR_transientrel *myState = (DR_transientrel *) self;
@@ -441,6 +441,8 @@ transientrel_receive(TupleTableSlot *slot, DestReceiver *self)
 				myState->bistate);
 
 	/* We know this is a newly created relation, so there are no indexes */
+
+	return true;
 }
 
 /*
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 08cba6f..be1f47e 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -13,17 +13,17 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = execAmi.o execCurrent.o execGrouping.o execIndexing.o execJunk.o \
-       execMain.o execProcnode.o execQual.o execScan.o execTuples.o \
+       execMain.o execParallel.o execProcnode.o execQual.o execScan.o execTuples.o \
        execUtils.o functions.o instrument.o nodeAppend.o nodeAgg.o \
        nodeBitmapAnd.o nodeBitmapOr.o \
-       nodeBitmapHeapscan.o nodeBitmapIndexscan.o nodeCustom.o nodeHash.o \
-       nodeHashjoin.o nodeIndexscan.o nodeIndexonlyscan.o \
+       nodeBitmapHeapscan.o nodeBitmapIndexscan.o nodeCustom.o nodeFunnel.o \
+       nodeHash.o nodeHashjoin.o nodeIndexscan.o nodeIndexonlyscan.o \
        nodeLimit.o nodeLockRows.o \
        nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
-       nodeNestloop.o nodeFunctionscan.o nodeRecursiveunion.o nodeResult.o \
-       nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
+       nodeNestloop.o nodeFunctionscan.o nodePartialSeqscan.o nodeRecursiveunion.o \
+       nodeResult.o nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
        nodeValuesscan.o nodeCtescan.o nodeWorktablescan.o \
        nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
-       nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o spi.o
+       nodeForeignscan.o nodeWindowAgg.o tqueue.o tstoreReceiver.o spi.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 04073d3..233e584 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -24,6 +24,7 @@
 #include "executor/nodeCustom.h"
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeFunctionscan.h"
+#include "executor/nodeFunnel.h"
 #include "executor/nodeGroup.h"
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
@@ -37,6 +38,7 @@
 #include "executor/nodeMergejoin.h"
 #include "executor/nodeModifyTable.h"
 #include "executor/nodeNestloop.h"
+#include "executor/nodePartialSeqscan.h"
 #include "executor/nodeRecursiveunion.h"
 #include "executor/nodeResult.h"
 #include "executor/nodeSamplescan.h"
@@ -160,6 +162,14 @@ ExecReScan(PlanState *node)
 			ExecReScanSampleScan((SampleScanState *) node);
 			break;
 
+		case T_PartialSeqScanState:
+			ExecReScanPartialSeqScan((PartialSeqScanState *) node);
+			break;
+
+		case T_FunnelState:
+			ExecReScanFunnel((FunnelState *) node);
+			break;
+
 		case T_IndexScanState:
 			ExecReScanIndexScan((IndexScanState *) node);
 			break;
@@ -463,6 +473,10 @@ ExecSupportsBackwardScan(Plan *node)
 		case T_CteScan:
 			return TargetListSupportsBackwardScan(node->targetlist);
 
+		case T_Funnel:
+		case T_PartialSeqScan:
+			return false;
+
 		case T_IndexScan:
 			return IndexSupportsBackwardScan(((IndexScan *) node)->indexid) &&
 				TargetListSupportsBackwardScan(node->targetlist);
diff --git a/src/backend/executor/execCurrent.c b/src/backend/executor/execCurrent.c
index bcd287f..7a44462 100644
--- a/src/backend/executor/execCurrent.c
+++ b/src/backend/executor/execCurrent.c
@@ -262,6 +262,8 @@ search_plan_tree(PlanState *node, Oid table_oid)
 			 */
 		case T_SeqScanState:
 		case T_SampleScanState:
+		case T_PartialSeqScanState:
+		case T_FunnelState:
 		case T_IndexScanState:
 		case T_IndexOnlyScanState:
 		case T_BitmapHeapScanState:
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 2e23cc7..e252727 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -45,9 +45,11 @@
 #include "commands/matview.h"
 #include "commands/trigger.h"
 #include "executor/execdebug.h"
+#include "executor/execParallel.h"
 #include "foreign/fdwapi.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
+#include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
 #include "parser/parsetree.h"
 #include "storage/bufmgr.h"
@@ -323,6 +325,9 @@ standard_ExecutorRun(QueryDesc *queryDesc,
 	operation = queryDesc->operation;
 	dest = queryDesc->dest;
 
+	/* inform executor to collect buffer usage stats from parallel workers. */
+	estate->total_time = queryDesc->totaltime ? 1 : 0;
+
 	/*
 	 * startup tuple receiver, if we will be emitting tuples
 	 */
@@ -354,7 +359,15 @@ standard_ExecutorRun(QueryDesc *queryDesc,
 		(*dest->rShutdown) (dest);
 
 	if (queryDesc->totaltime)
+	{
+		/*
+		 * Accumulate the stats by parallel workers before stopping the
+		 * node.
+		 */
+		(void) planstate_tree_walker((Node*) queryDesc->planstate,
+									 NULL, ExecParallelBufferUsageAccum, 0);
 		InstrStopNode(queryDesc->totaltime, estate->es_processed);
+	}
 
 	MemoryContextSwitchTo(oldcontext);
 }
@@ -1581,7 +1594,15 @@ ExecutePlan(EState *estate,
 		 * practice, this is probably always the case at this point.)
 		 */
 		if (sendTuples)
-			(*dest->receiveSlot) (slot, dest);
+		{
+			/*
+			 * If we are not able to send the tuple, then we assume that
+			 * destination has closed and we won't be able to send any more
+			 * tuples so we just end the loop.
+			 */
+			if (!((*dest->receiveSlot) (slot, dest)))
+				break;
+		}
 
 		/*
 		 * Count tuples processed, if this is a SELECT.  (For other operation
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
new file mode 100644
index 0000000..33e83fe
--- /dev/null
+++ b/src/backend/executor/execParallel.c
@@ -0,0 +1,592 @@
+/*-------------------------------------------------------------------------
+ *
+ * execParallel.c
+ *	  Support routines for setting up backend workers for parallel execution.
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/execParallel.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execParallel.h"
+#include "executor/nodeFunnel.h"
+#include "executor/nodePartialSeqscan.h"
+#include "nodes/nodeFuncs.h"
+#include "optimizer/planmain.h"
+#include "optimizer/planner.h"
+#include "tcop/tcopprot.h"
+
+
+#define PARALLEL_TUPLE_QUEUE_SIZE					65536
+
+static void ParallelQueryMain(dsm_segment *seg, shm_toc *toc);
+static void
+EstimateParallelSupportInfoSpace(ParallelContext *pcxt, ParamListInfo params,
+								 List *serialized_param_exec_vals,
+								 int instOptions, Size *params_size,
+								 Size *params_exec_size);
+static void
+StoreParallelSupportInfo(ParallelContext *pcxt, ParamListInfo params,
+						 List *serialized_param_exec_vals,
+						 int instOptions, Size params_size,
+						 Size params_exec_size,
+						 char **inst_options_space,
+						 char **buffer_usage_space);
+static void
+EstimatePlannedStmtSpace(ParallelContext *pcxt, PlanState* planstate,
+						 char *plannedstmt_str, Size *plannedstmt_len,
+						 Size *pscan_size);
+static void
+StorePlannedStmt(ParallelContext *pcxt, PlanState* planstate,
+				 char *plannedstmt_str, Size plannedstmt_size,
+				 Size pscan_size);
+static void EstimateResponseQueueSpace(ParallelContext *pcxt);
+static void
+StoreResponseQueue(ParallelContext *pcxt,
+				   shm_mq_handle ***responseqp);
+static void
+ExecParallelGetPlannedStmt(shm_toc *toc, PlannedStmt **plannedstmt);
+static void
+GetParallelSupportInfo(shm_toc *toc, ParamListInfo *params,
+					   List **serialized_param_exec_vals,
+					   int *inst_options, char **instrument,
+					   char **buffer_usage);
+static void
+SetupResponseQueue(dsm_segment *seg, shm_toc *toc, shm_mq **mq,
+				   shm_mq_handle **responseq);
+
+
+/*
+ * This is required for parallel plan execution to fetch the
+ * information from dsm.
+ */
+static shm_toc *parallel_shm_toc = NULL;
+
+/*
+ * EstimateParallelSupportInfoSpace
+ *
+ * Estimate the amount of space required to record information of
+ * bind parameters, PARAM_EXEC parameters and instrumentation
+ * information that need to be retrieved from parallel workers.
+ */
+void
+EstimateParallelSupportInfoSpace(ParallelContext *pcxt, ParamListInfo params,
+								 List *serialized_param_exec_vals,
+								 int instOptions, Size *params_size,
+								 Size *params_exec_size)
+{
+	*params_size = EstimateBoundParametersSpace(params);
+	shm_toc_estimate_chunk(&pcxt->estimator, *params_size);
+
+	*params_exec_size = EstimateExecParametersSpace(serialized_param_exec_vals);
+	shm_toc_estimate_chunk(&pcxt->estimator, *params_exec_size);
+
+	/*
+	 * We expect each worker to populate the BufferUsage structure
+	 * allocated by master backend and then master backend will aggregate
+	 * all the usage along with its own, so account it for each worker.
+	 */
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   sizeof(BufferUsage) * pcxt->nworkers);
+
+	/* account for instrumentation options. */
+	shm_toc_estimate_chunk(&pcxt->estimator, sizeof(int));
+
+	/*
+	 * We expect each worker to populate the instrumentation structure
+	 * allocated by master backend and then master backend will aggregate
+	 * all the information, so account it for each worker.
+	 */
+	if (instOptions)
+	{
+		shm_toc_estimate_chunk(&pcxt->estimator,
+							   sizeof(Instrumentation) * pcxt->nworkers);
+		/* keys for parallel support information. */
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+	}
+
+	/* keys for parallel support information. */
+	shm_toc_estimate_keys(&pcxt->estimator, 4);
+}
+
+/*
+ * StoreParallelSupportInfo
+ * 
+ * Sets up the bind parameters, PARAM_EXEC parameters and instrumentation
+ * information required for parallel execution.
+ */
+void
+StoreParallelSupportInfo(ParallelContext *pcxt, ParamListInfo params,
+						 List *serialized_param_exec_vals,
+						 int instOptions, Size params_size,
+						 Size params_exec_size,
+						 char **inst_options_space,
+						 char **buffer_usage_space)
+{
+	char	*paramsdata;
+	char	*paramsexecdata;
+	int		*inst_options;
+
+	/*
+	 * Store bind parameter's list in dynamic shared memory.  This is
+	 * used for parameters in prepared query.
+	 */
+	paramsdata = shm_toc_allocate(pcxt->toc, params_size);
+	SerializeBoundParams(params, params_size, paramsdata);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_PARAMS, paramsdata);
+
+	/*
+	 * Store PARAM_EXEC parameters list in dynamic shared memory.  This is
+	 * used for evaluating plan->initPlan params.
+	 */
+	paramsexecdata = shm_toc_allocate(pcxt->toc, params_exec_size);
+	SerializeExecParams(serialized_param_exec_vals, params_exec_size, paramsexecdata);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_PARAMS_EXEC, paramsexecdata);
+
+	/*
+	 * Allocate space for BufferUsage information to be filled by
+	 * each worker.
+	 */
+	*buffer_usage_space =
+			shm_toc_allocate(pcxt->toc, sizeof(BufferUsage) * pcxt->nworkers);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFF_USAGE, *buffer_usage_space);
+
+	/* Store instrument options in dynamic shared memory. */
+	inst_options = shm_toc_allocate(pcxt->toc, sizeof(int));
+	*inst_options = instOptions;
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_INST_OPTIONS, inst_options);
+
+	/*
+	 * Allocate space for instrumentation information to be filled by
+	 * each worker.
+	 */
+	if (instOptions)
+	{
+		*inst_options_space =
+			shm_toc_allocate(pcxt->toc, sizeof(Instrumentation) * pcxt->nworkers);
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_INST_INFO, *inst_options_space);
+	}
+}
+
+/*
+ * EstimatePlannedStmtSpace
+ *
+ * Estimate the amount of space required to record information of
+ * planned statement and parallel node specific information that need
+ * to be copied to parallel workers.
+ */
+void
+EstimatePlannedStmtSpace(ParallelContext *pcxt, PlanState* planstate,
+						 char *plannedstmt_str, Size *plannedstmt_len,
+						 Size *pscan_size)
+{
+	/* Estimate space for planned statement. */
+	*plannedstmt_len = strlen(plannedstmt_str) + 1;
+	shm_toc_estimate_chunk(&pcxt->estimator, *plannedstmt_len);
+
+	/* keys for planned statement information. */
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+	(void) planstate_tree_walker((Node*)planstate, pcxt, ExecParallelEstimate,
+								 pscan_size);
+}
+
+/*
+ * StorePlannedStmt
+ *
+ * Sets up the planned statement and node-specific information.
+ */
+void
+StorePlannedStmt(ParallelContext *pcxt, PlanState* planstate,
+				 char *plannedstmt_str, Size plannedstmt_size,
+				 Size pscan_size)
+{
+	char		*plannedstmtdata;
+
+	/* Store planned statement in dynamic shared memory. */
+	plannedstmtdata = shm_toc_allocate(pcxt->toc, plannedstmt_size);
+	memcpy(plannedstmtdata, plannedstmt_str, plannedstmt_size);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_PLANNEDSTMT, plannedstmtdata);
+
+	(void) planstate_tree_walker((Node*)planstate, pcxt, ExecParallelInitializeDSM,
+								 &pscan_size);
+}
+
+/*
+ * EstimateResponseQueueSpace
+ *
+ * Estimate the amount of space required for the tuple queues that
+ * need to be established between the parallel workers and the
+ * master backend.
+ */
+void
+EstimateResponseQueueSpace(ParallelContext *pcxt)
+{
+	/* Estimate space for the per-worker tuple queues. */
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   (Size) PARALLEL_TUPLE_QUEUE_SIZE * pcxt->nworkers);
+
+	/* keys for response queue. */
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/*
+ * StoreResponseQueue
+ *
+ * Sets up the response queues that the backend workers use to
+ * return tuples to the main backend.
+ */
+void
+StoreResponseQueue(ParallelContext *pcxt,
+				   shm_mq_handle ***responseqp)
+{
+	shm_mq		*mq;
+	char		*tuple_queue_space;
+	int			i;
+
+	/* Allocate memory for shared memory queue handles. */
+	*responseqp = (shm_mq_handle**) palloc(pcxt->nworkers * sizeof(shm_mq_handle*));
+
+	/*
+	 * Establish one message queue per worker in dynamic shared memory.
+	 * These queues should be used to transmit tuple data.
+	 */
+	tuple_queue_space =
+	   shm_toc_allocate(pcxt->toc, PARALLEL_TUPLE_QUEUE_SIZE * pcxt->nworkers);
+	for (i = 0; i < pcxt->nworkers; ++i)
+	{
+		mq = shm_mq_create(tuple_queue_space + i * PARALLEL_TUPLE_QUEUE_SIZE,
+						   (Size) PARALLEL_TUPLE_QUEUE_SIZE);
+		
+		shm_mq_set_receiver(mq, MyProc);
+
+		/*
+		 * Attach the queue before launching a worker, so that we'll automatically
+		 * detach the queue if we error out.  (Otherwise, the worker might sit
+		 * there trying to write the queue long after we've gone away.)
+		 */
+		(*responseqp)[i] = shm_mq_attach(mq, pcxt->seg, NULL);
+	}
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_TUPLE_QUEUE, tuple_queue_space);
+}
+
+/*
+ *	ExecParallelEstimate
+ *
+ *		Estimate the amount of space required to record the information
+ *		of a parallel node that needs to be copied to parallel workers.
+ */
+bool
+ExecParallelEstimate(Node *node, ParallelContext *pcxt,
+					 Size *pscan_size)
+{
+	if (node == NULL)
+		return false;
+
+	switch (nodeTag(node))
+	{
+		case T_ResultState:
+			{
+				PlanState *planstate = ((ResultState*)node)->ps.lefttree;
+
+				return planstate_tree_walker((Node*)planstate, pcxt,
+											 ExecParallelEstimate, pscan_size);
+			}
+		case T_PartialSeqScanState:
+			{
+				EState		*estate = ((PartialSeqScanState*)node)->ss.ps.state;
+
+				*pscan_size = heap_parallelscan_estimate(estate->es_snapshot);
+				shm_toc_estimate_chunk(&pcxt->estimator, *pscan_size);
+
+				/* key for partial scan information. */
+				shm_toc_estimate_keys(&pcxt->estimator, 1);
+				return true;
+			}
+		default:
+			break;
+	}
+
+	return false;
+}
+
+/*
+ *	ExecParallelInitializeDSM
+ *
+ *		Store the information of parallel node in dsm.
+ */
+bool
+ExecParallelInitializeDSM(Node *node, ParallelContext *pcxt,
+						  Size *pscan_size)
+{
+	ParallelHeapScanDesc pscan;
+
+	if (node == NULL)
+		return false;
+
+	switch (nodeTag(node))
+	{
+		case T_ResultState:
+			{
+				PlanState *planstate = ((ResultState*)node)->ps.lefttree;
+
+				return planstate_tree_walker((Node*)planstate, pcxt,
+											 ExecParallelInitializeDSM, pscan_size);
+			}
+		case T_PartialSeqScanState:
+			{
+				EState	*estate = ((PartialSeqScanState*)node)->ss.ps.state;
+
+				/* Store parallel heap scan descriptor in dynamic shared memory. */
+				pscan = shm_toc_allocate(pcxt->toc, *pscan_size);
+				heap_parallelscan_initialize(pscan, ((PartialSeqScanState*)node)->ss.ss_currentRelation, estate->es_snapshot);
+				shm_toc_insert(pcxt->toc, PARALLEL_KEY_SCAN, pscan);
+				return true;
+			}
+		default:
+			break;
+	}
+
+	return false;
+}
+
+/*
+ * InitializeParallelWorkers
+ *
+ *	Sets up the required infrastructure for backend workers to
+ *	perform execution and return results to the main backend.
+ */
+void
+InitializeParallelWorkers(PlanState *planstate,
+						  List *serialized_param_exec_vals,
+						  EState *estate,
+						  char **inst_options_space,
+						  char **buffer_usage_space,
+						  shm_mq_handle ***responseqp,
+						  ParallelContext **pcxtp,
+						  int nWorkers)
+{
+	Size		params_size, params_exec_size, pscan_size, plannedstmt_size;
+	char		*plannedstmt_str;
+	PlannedStmt	*plannedstmt;
+	ParallelContext *pcxt;
+
+	pcxt = CreateParallelContext(ParallelQueryMain, nWorkers);
+
+	plannedstmt = create_parallel_worker_plannedstmt((PartialSeqScan *)planstate->plan,
+													 estate->es_range_table,
+													 estate->es_plannedstmt->nParamExec);
+	plannedstmt_str = nodeToString(plannedstmt);
+
+	EstimatePlannedStmtSpace(pcxt, planstate, plannedstmt_str,
+							 &plannedstmt_size, &pscan_size);
+	EstimateParallelSupportInfoSpace(pcxt, estate->es_param_list_info,
+									 serialized_param_exec_vals,
+									 estate->es_instrument, &params_size,
+									 &params_exec_size);
+	EstimateResponseQueueSpace(pcxt);
+
+	InitializeParallelDSM(pcxt);
+	
+	StorePlannedStmt(pcxt, planstate, plannedstmt_str,
+					 plannedstmt_size, pscan_size);
+	StoreParallelSupportInfo(pcxt, estate->es_param_list_info,
+							 serialized_param_exec_vals,
+							 estate->es_instrument,
+							 params_size,
+							 params_exec_size,
+							 inst_options_space,
+							 buffer_usage_space);
+	StoreResponseQueue(pcxt, responseqp);
+
+	/* Return results to caller. */
+	*pcxtp = pcxt;
+}
+
+/*
+ * GetParallelSupportInfo
+ *
+ * Look up based on keys in dynamic shared memory segment
+ * and get the bind parameters, PARAM_EXEC parameters and
+ * instrumentation information required to perform parallel
+ * operation.
+ */
+void
+GetParallelSupportInfo(shm_toc *toc, ParamListInfo *params,
+					   List **serialized_param_exec_vals,
+					   int *inst_options, char **instrument,
+					   char **buffer_usage)
+{
+	char		*paramsdata;
+	char		*paramsexecdata;
+	char		*inst_options_space;
+	char		*buffer_usage_space;
+	int			*instoptions;
+
+	if (params)
+	{
+		paramsdata = shm_toc_lookup(toc, PARALLEL_KEY_PARAMS);
+		*params = RestoreBoundParams(paramsdata);
+	}
+
+	if (serialized_param_exec_vals)
+	{
+		paramsexecdata = shm_toc_lookup(toc, PARALLEL_KEY_PARAMS_EXEC);
+		*serialized_param_exec_vals = RestoreExecParams(paramsexecdata);
+	}
+
+	if (inst_options)
+	{
+		instoptions	= shm_toc_lookup(toc, PARALLEL_KEY_INST_OPTIONS);
+		*inst_options = *instoptions;
+		if (*inst_options)
+		{
+			inst_options_space = shm_toc_lookup(toc, PARALLEL_KEY_INST_INFO);
+			*instrument = (inst_options_space +
+				ParallelWorkerNumber * sizeof(Instrumentation));
+		}
+	}
+
+	if (buffer_usage)
+	{
+		buffer_usage_space = shm_toc_lookup(toc, PARALLEL_KEY_BUFF_USAGE);
+		*buffer_usage = (buffer_usage_space +
+					 ParallelWorkerNumber * sizeof(BufferUsage));
+	}
+}
+
+/*
+ * ExecParallelGetPlannedStmt
+ *
+ * Look up based on keys in dynamic shared memory segment
+ * and get the planned statement required to perform
+ * parallel operation.
+ */
+void
+ExecParallelGetPlannedStmt(shm_toc *toc, PlannedStmt **plannedstmt)
+{
+	char		*plannedstmtdata;
+
+	plannedstmtdata = shm_toc_lookup(toc, PARALLEL_KEY_PLANNEDSTMT);
+
+	*plannedstmt = (PlannedStmt *) stringToNode(plannedstmtdata);
+
+	/* Fill in opfuncid values if missing */
+	fix_node_funcids((*plannedstmt)->planTree);
+}
+
+/*
+ * SetupResponseQueue
+ *
+ * Look up based on keys in dynamic shared memory segment
+ * and get the tuple queue information for a particular worker,
+ * attach to the queue and redirect all further responses from
+ * the worker backend via that queue.
+ */
+void
+SetupResponseQueue(dsm_segment *seg, shm_toc *toc, shm_mq **mq,
+				   shm_mq_handle **responseq)
+{
+	char		*tuple_queue_space;
+
+	tuple_queue_space = shm_toc_lookup(toc, PARALLEL_KEY_TUPLE_QUEUE);
+	*mq = (shm_mq *) (tuple_queue_space +
+		ParallelWorkerNumber * PARALLEL_TUPLE_QUEUE_SIZE);
+
+	shm_mq_set_sender(*mq, MyProc);
+	*responseq = shm_mq_attach(*mq, seg, NULL);
+}
+
+/*
+ * GetParallelShmToc
+ */
+shm_toc *
+GetParallelShmToc(void)
+{
+	return parallel_shm_toc;
+}
+
+/*
+ * ParallelQueryMain
+ *
+ * Execute the operation and return the tuples or other information
+ * to the node driving the parallelism.
+ */
+void
+ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
+{
+	shm_mq			*mq;
+	shm_mq_handle	*responseq;
+	PlannedStmt		*plannedstmt;
+	ParamListInfo	params;
+	List			*serialized_param_exec_vals;
+	int				inst_options;
+	char			*instrument = NULL;
+	char			*buffer_usage = NULL;
+	ParallelStmt	*parallelstmt;
+
+	SetupResponseQueue(seg, toc, &mq, &responseq);
+
+	ExecParallelGetPlannedStmt(toc, &plannedstmt);
+	GetParallelSupportInfo(toc, &params, &serialized_param_exec_vals,
+						   &inst_options, &instrument, &buffer_usage);
+
+	parallelstmt = palloc(sizeof(ParallelStmt));
+
+	parallelstmt->plannedstmt = plannedstmt;
+	parallelstmt->params	= params;
+	parallelstmt->serialized_param_exec_vals = serialized_param_exec_vals;
+	parallelstmt->inst_options = inst_options;
+	parallelstmt->instrument = instrument;
+	parallelstmt->buffer_usage = buffer_usage;
+	parallelstmt->responseq = responseq;
+
+	parallel_shm_toc = toc;
+
+	/* Execute the worker command. */
+	exec_parallel_stmt(parallelstmt);
+
+	/*
+	 * Once we are done with sending tuples, detach from
+	 * shared memory message queue used to send tuples.
+	 */
+	shm_mq_detach(mq);
+}
+
+/*
+ * ExecParallelBufferUsageAccum
+ *
+ * Recursively accumulate the stats for all the funnel nodes
+ * in a plan state tree.
+ */
+bool
+ExecParallelBufferUsageAccum(Node *node)
+{
+	if (node == NULL)
+		return false;
+
+	switch (nodeTag(node))
+	{
+		case T_FunnelState:
+			{
+				FinishParallelSetupAndAccumStats((FunnelState*)node);
+				return true;
+			}
+			break;
+		default:
+			break;
+	}
+
+	(void) planstate_tree_walker((Node*)((PlanState *)node)->lefttree, NULL,
+								 ExecParallelBufferUsageAccum, 0);
+	(void) planstate_tree_walker((Node*)((PlanState *)node)->righttree, NULL,
+								 ExecParallelBufferUsageAccum, 0);
+	return false;
+}
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 03c2feb..e24a439 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -100,6 +100,8 @@
 #include "executor/nodeMergejoin.h"
 #include "executor/nodeModifyTable.h"
 #include "executor/nodeNestloop.h"
+#include "executor/nodePartialSeqscan.h"
+#include "executor/nodeFunnel.h"
 #include "executor/nodeRecursiveunion.h"
 #include "executor/nodeResult.h"
 #include "executor/nodeSamplescan.h"
@@ -196,6 +198,16 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 													  estate, eflags);
 			break;
 
+		case T_PartialSeqScan:
+			result = (PlanState *) ExecInitPartialSeqScan((PartialSeqScan *) node,
+														  estate, eflags);
+			break;
+
+		case T_Funnel:
+			result = (PlanState *) ExecInitFunnel((Funnel *) node,
+												  estate, eflags);
+			break;
+
 		case T_IndexScan:
 			result = (PlanState *) ExecInitIndexScan((IndexScan *) node,
 													 estate, eflags);
@@ -416,6 +428,14 @@ ExecProcNode(PlanState *node)
 			result = ExecSampleScan((SampleScanState *) node);
 			break;
 
+		case T_PartialSeqScanState:
+			result = ExecPartialSeqScan((PartialSeqScanState *) node);
+			break;
+
+		case T_FunnelState:
+			result = ExecFunnel((FunnelState *) node);
+			break;
+
 		case T_IndexScanState:
 			result = ExecIndexScan((IndexScanState *) node);
 			break;
@@ -658,6 +678,14 @@ ExecEndNode(PlanState *node)
 			ExecEndSampleScan((SampleScanState *) node);
 			break;
 
+		case T_PartialSeqScanState:
+			ExecEndPartialSeqScan((PartialSeqScanState *) node);
+			break;
+
+		case T_FunnelState:
+			ExecEndFunnel((FunnelState *) node);
+			break;
+
 		case T_IndexScanState:
 			ExecEndIndexScan((IndexScanState *) node);
 			break;
diff --git a/src/backend/executor/execTuples.c b/src/backend/executor/execTuples.c
index a05d8b1..d5619bd 100644
--- a/src/backend/executor/execTuples.c
+++ b/src/backend/executor/execTuples.c
@@ -1313,7 +1313,7 @@ do_tup_output(TupOutputState *tstate, Datum *values, bool *isnull)
 	ExecStoreVirtualTuple(slot);
 
 	/* send the tuple to the receiver */
-	(*tstate->dest->receiveSlot) (slot, tstate->dest);
+	(void) (*tstate->dest->receiveSlot) (slot, tstate->dest);
 
 	/* clean up */
 	ExecClearTuple(slot);
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index 3c611b9..27ca0fa 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -976,3 +976,28 @@ ShutdownExprContext(ExprContext *econtext, bool isCommit)
 
 	MemoryContextSwitchTo(oldcontext);
 }
+
+/*
+ * Populate the values of PARAM_EXEC parameters.
+ *
+ * This is used by worker backends to fill in the values
+ * of PARAM_EXEC parameters after fetching them from
+ * dynamic shared memory.  This needs to be called before
+ * ExecutorRun.
+ */
+void
+PopulateParamExecParams(QueryDesc *queryDesc,
+						List *serialized_param_exec_vals)
+{
+	ListCell	*lparam;
+
+	foreach(lparam, serialized_param_exec_vals)
+	{
+		SerializedParamExecData* param_val = (SerializedParamExecData*) lfirst(lparam);
+
+		queryDesc->estate->es_param_exec_vals[param_val->paramid].value =
+																param_val->value;
+		queryDesc->estate->es_param_exec_vals[param_val->paramid].isnull =
+																param_val->isnull;
+	}
+}
diff --git a/src/backend/executor/functions.c b/src/backend/executor/functions.c
index 812a610..863bd64 100644
--- a/src/backend/executor/functions.c
+++ b/src/backend/executor/functions.c
@@ -167,7 +167,7 @@ static Datum postquel_get_single_result(TupleTableSlot *slot,
 static void sql_exec_error_callback(void *arg);
 static void ShutdownSQLFunction(Datum arg);
 static void sqlfunction_startup(DestReceiver *self, int operation, TupleDesc typeinfo);
-static void sqlfunction_receive(TupleTableSlot *slot, DestReceiver *self);
+static bool sqlfunction_receive(TupleTableSlot *slot, DestReceiver *self);
 static void sqlfunction_shutdown(DestReceiver *self);
 static void sqlfunction_destroy(DestReceiver *self);
 
@@ -1903,7 +1903,7 @@ sqlfunction_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
 /*
  * sqlfunction_receive --- receive one tuple
  */
-static void
+static bool
 sqlfunction_receive(TupleTableSlot *slot, DestReceiver *self)
 {
 	DR_sqlfunction *myState = (DR_sqlfunction *) self;
@@ -1913,6 +1913,8 @@ sqlfunction_receive(TupleTableSlot *slot, DestReceiver *self)
 
 	/* Store the filtered tuple into the tuplestore */
 	tuplestore_puttupleslot(myState->tstore, slot);
+
+	return true;
 }
 
 /*
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index f5351eb..283a136 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -19,9 +19,6 @@
 
 BufferUsage pgBufferUsage;
 
-static void BufferUsageAccumDiff(BufferUsage *dst,
-					 const BufferUsage *add, const BufferUsage *sub);
-
 
 /* Allocate new instrumentation structure(s) */
 Instrumentation *
@@ -127,8 +124,30 @@ InstrEndLoop(Instrumentation *instr)
 	instr->tuplecount = 0;
 }
 
+/*
+ * Aggregate the instrumentation information.  This is used
+ * to aggregate the information of worker backends.  We only
+ * need to sum the buffer usage and tuple count statistics; for
+ * the timing-related statistics it is sufficient to have the
+ * master backend's information.
+ */
+void
+InstrAggNode(Instrumentation *instr1, Instrumentation *instr2)
+{
+	/* count the returned tuples */
+	instr1->tuplecount += instr2->tuplecount;
+
+	instr1->nfiltered1 += instr2->nfiltered1;
+	instr1->nfiltered2 += instr2->nfiltered2;
+
+	/* Add the worker's buffer usage to the node's totals */
+	if (instr1->need_bufusage)
+		BufferUsageAdd(&instr1->bufusage, &instr2->bufusage);
+
+}
+
 /* dst += add - sub */
-static void
+void
 BufferUsageAccumDiff(BufferUsage *dst,
 					 const BufferUsage *add,
 					 const BufferUsage *sub)
@@ -148,3 +167,21 @@ BufferUsageAccumDiff(BufferUsage *dst,
 	INSTR_TIME_ACCUM_DIFF(dst->blk_write_time,
 						  add->blk_write_time, sub->blk_write_time);
 }
+
+/* dst += add */
+void
+BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
+{
+	dst->shared_blks_hit += add->shared_blks_hit;
+	dst->shared_blks_read += add->shared_blks_read;
+	dst->shared_blks_dirtied += add->shared_blks_dirtied;
+	dst->shared_blks_written += add->shared_blks_written;
+	dst->local_blks_hit += add->local_blks_hit;
+	dst->local_blks_read += add->local_blks_read;
+	dst->local_blks_dirtied += add->local_blks_dirtied;
+	dst->local_blks_written += add->local_blks_written;
+	dst->temp_blks_read += add->temp_blks_read;
+	dst->temp_blks_written += add->temp_blks_written;
+	INSTR_TIME_ADD(dst->blk_read_time, add->blk_read_time);
+	INSTR_TIME_ADD(dst->blk_write_time, add->blk_write_time);
+}
diff --git a/src/backend/executor/nodeFunnel.c b/src/backend/executor/nodeFunnel.c
new file mode 100644
index 0000000..3c42f21
--- /dev/null
+++ b/src/backend/executor/nodeFunnel.c
@@ -0,0 +1,436 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeFunnel.c
+ *	  Support routines for parallel sequential scans of relations.
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeFunnel.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * INTERFACE ROUTINES
+ *		ExecFunnel				scans a relation.
+ *		ExecInitFunnel			creates and initializes a funnel node.
+ *		ExecEndFunnel			releases any storage allocated.
+ *		ExecReScanFunnel		rescans a relation
+ */
+#include "postgres.h"
+
+#include "access/relscan.h"
+#include "executor/execdebug.h"
+#include "executor/execParallel.h"
+#include "executor/nodeFunnel.h"
+#include "executor/nodeSubplan.h"
+#include "utils/rel.h"
+
+
+static TupleTableSlot *funnel_getnext(FunnelState *funnelstate);
+static void ExecAccumulateInstInfo(FunnelState *node);
+static void ExecAccumulateBufUsageInfo(FunnelState *node);
+
+/* ----------------------------------------------------------------
+ *						Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		InitFunnel
+ *
+ *		Set up parallel state information
+ * ----------------------------------------------------------------
+ */
+static void
+InitFunnel(FunnelState *node, EState *estate, int eflags)
+{
+	Relation	currentRelation;
+
+	/*
+	 * get the relation object id from the relid'th entry in the range table,
+	 * open that relation and acquire appropriate lock on it.
+	 */
+	currentRelation = ExecOpenScanRelation(estate,
+										   ((SeqScan *) node->ss.ps.plan)->scanrelid,
+										   eflags);
+
+	node->ss.ss_currentRelation = currentRelation;
+
+	/* and report the scan tuple slot's rowtype */
+	ExecAssignScanType(&node->ss, RelationGetDescr(currentRelation));
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitFunnel
+ * ----------------------------------------------------------------
+ */
+FunnelState *
+ExecInitFunnel(Funnel *node, EState *estate, int eflags)
+{
+	FunnelState *funnelstate;
+
+	/* Funnel node doesn't have an innerPlan node. */
+	Assert(innerPlan(node) == NULL);
+
+	/*
+	 * create state structure
+	 */
+	funnelstate = makeNode(FunnelState);
+	funnelstate->ss.ps.plan = (Plan *) node;
+	funnelstate->ss.ps.state = estate;
+	funnelstate->fs_workersReady = false;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * create expression context for node
+	 */
+	ExecAssignExprContext(estate, &funnelstate->ss.ps);
+
+	/*
+	 * initialize child expressions
+	 */
+	funnelstate->ss.ps.targetlist = (List *)
+		ExecInitExpr((Expr *) node->scan.plan.targetlist,
+					 (PlanState *) funnelstate);
+	funnelstate->ss.ps.qual = (List *)
+		ExecInitExpr((Expr *) node->scan.plan.qual,
+					 (PlanState *) funnelstate);
+
+	/*
+	 * tuple table initialization
+	 */
+	ExecInitResultTupleSlot(estate, &funnelstate->ss.ps);
+	ExecInitScanTupleSlot(estate, &funnelstate->ss);
+
+	InitFunnel(funnelstate, estate, eflags);
+
+	/*
+	 * now initialize outer plan
+	 */
+	outerPlanState(funnelstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+
+	funnelstate->ss.ps.ps_TupFromTlist = false;
+
+	/*
+	 * Initialize result tuple type and projection info.
+	 */
+	ExecAssignResultTypeFromTL(&funnelstate->ss.ps);
+	ExecAssignScanProjectionInfo(&funnelstate->ss);
+
+	return funnelstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecFunnel(node)
+ *
+ *		Scans the relation via multiple workers and returns
+ *		the next qualifying tuple.
+ * ----------------------------------------------------------------
+ */
+TupleTableSlot *
+ExecFunnel(FunnelState *node)
+{
+	int			i;
+	TupleTableSlot *slot;
+
+	/*
+	 * Initialize the parallel context and workers on first execution.
+	 * We do this on first execution rather than during node initialization,
+	 * as it needs to allocate a large dynamic shared memory segment, so it
+	 * is better to do it only if it is really needed.
+	 */
+	if (!node->pcxt)
+	{
+		EState		*estate = node->ss.ps.state;
+		ExprContext *econtext = node->ss.ps.ps_ExprContext;
+		bool		any_worker_launched = false;
+		List		*serialized_param_exec;
+
+		/*
+		 * Evaluate the InitPlan and pass the PARAM_EXEC params, so that
+		 * values can be shared with the worker backends.  This is different
+		 * from the way InitPlans are evaluated elsewhere (lazy evaluation):
+		 * instead of sharing the InitPlan with all the workers and letting
+		 * them execute it, we pass the values, which can be used directly
+		 * by the worker backends.
+		 */
+		serialized_param_exec = ExecAndFormSerializeParamExec(econtext,
+											node->ss.ps.plan->lefttree->allParam);
+
+		/* Initialize the workers required to execute funnel node. */
+		InitializeParallelWorkers(node->ss.ps.lefttree,
+								  serialized_param_exec,
+								  estate,
+								  &node->inst_options_space,
+								  &node->buffer_usage_space,
+								  &node->responseq,
+								  &node->pcxt,
+								  ((Funnel *)(node->ss.ps.plan))->num_workers);
+
+		outerPlanState(node)->toc = node->pcxt->toc;
+
+		/*
+		 * Register backend workers.  If the required number of workers is
+		 * not available, we perform the scan with the workers that are
+		 * available; if no workers are available at all, the funnel node
+		 * will just scan locally.
+		 */
+		LaunchParallelWorkers(node->pcxt);
+
+		node->funnel = CreateTupleQueueFunnel();
+
+		for (i = 0; i < node->pcxt->nworkers; ++i)
+		{
+			if (node->pcxt->worker[i].bgwhandle)
+			{
+				shm_mq_set_handle((node->responseq)[i], node->pcxt->worker[i].bgwhandle);
+				RegisterTupleQueueOnFunnel(node->funnel, (node->responseq)[i]);
+				any_worker_launched = true;
+			}
+		}
+
+		if (any_worker_launched)
+			node->fs_workersReady = true;
+	}
+	
+	slot = funnel_getnext(node);
+	
+	if (TupIsNull(slot))
+	{
+
+		/*
+		 * Destroy the parallel context once we have finished fetching
+		 * all the tuples.  This ensures that if the same statement
+		 * needs Funnel nodes for multiple parts of the statement,
+		 * it won't accumulate a lot of DSM segments, and the workers
+		 * can be made available for use by other parts of the
+		 * statement.
+		FinishParallelSetupAndAccumStats(node);
+	}
+	return slot;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndFunnel
+ *
+ *		frees any storage allocated through C routines.
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndFunnel(FunnelState *node)
+{
+	Relation	relation;
+
+	relation = node->ss.ss_currentRelation;
+
+	/*
+	 * Free the exprcontext
+	 */
+	ExecFreeExprContext(&node->ss.ps);
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+
+	/*
+	 * close the heap relation.
+	 */
+	ExecCloseScanRelation(relation);
+
+	ExecEndNode(outerPlanState(node));
+
+	FinishParallelSetupAndAccumStats(node);
+}
+
+/*
+ * funnel_getnext
+ *
+ *	Get the next tuple.  This function is responsible for fetching
+ *	tuples from all the queues associated with the worker backends
+ *	used in the funnel scan; if no data is available from the queues
+ *	or no worker is available, it fetches the data from the local
+ *	node instead.
+ */
+TupleTableSlot *
+funnel_getnext(FunnelState *funnelstate)
+{
+	PlanState		*outerPlan;
+	TupleTableSlot	*outerTupleSlot;
+	TupleTableSlot	*slot;
+	HeapTuple		tup;
+
+	if (funnelstate->ss.ps.ps_ProjInfo)
+		slot = funnelstate->ss.ps.ps_ProjInfo->pi_slot;
+	else
+		slot = funnelstate->ss.ss_ScanTupleSlot;
+
+	while ((!funnelstate->all_workers_done  && funnelstate->fs_workersReady) ||
+			!funnelstate->local_scan_done)
+	{
+		if (!funnelstate->all_workers_done && funnelstate->fs_workersReady)
+		{
+			/* wait only if local scan is done */
+			tup = TupleQueueFunnelNext(funnelstate->funnel,
+									   !funnelstate->local_scan_done,
+									   &funnelstate->all_workers_done);
+
+			if (HeapTupleIsValid(tup))
+			{
+				ExecStoreTuple(tup,		/* tuple to store */
+							   slot,	/* slot to store in */
+							   InvalidBuffer, /* buffer associated with this
+											   * tuple */
+							   true);	/* pfree this pointer if not from heap */
+
+				return slot;
+			}
+		}
+		if (!funnelstate->local_scan_done)
+		{
+			outerPlan = outerPlanState(funnelstate);
+
+			outerTupleSlot = ExecProcNode(outerPlan);
+
+			if (!TupIsNull(outerTupleSlot))
+				return outerTupleSlot;
+
+			funnelstate->local_scan_done = true;
+		}
+	}
+
+	return ExecClearTuple(slot);
+}
+
+/* ----------------------------------------------------------------
+ *		FinishParallelSetupAndAccumStats
+ *
+ *		Destroy the setup for parallel workers.  Collect all the
+ *		stats after workers are stopped, else some work done by
+ *		workers won't be accounted.
+ * ----------------------------------------------------------------
+ */
+void
+FinishParallelSetupAndAccumStats(FunnelState *node)
+{
+	if (node->pcxt)
+	{
+		/*
+		 * Ensure all workers have finished before destroying the parallel
+		 * context to ensure a clean exit.
+		 */
+		if (node->fs_workersReady)
+		{
+			TupleQueueFunnelShutdown(node->funnel);
+			WaitForParallelWorkersToFinish(node->pcxt);
+		}
+
+		/* destroy the tuple queue */
+		DestroyTupleQueueFunnel(node->funnel);
+		node->funnel = NULL;
+
+		/*
+		 * Aggregate the buffer usage stats from all workers.  This is
+		 * required by external modules like pg_stat_statements.
+		 */
+		ExecAccumulateBufUsageInfo(node);
+
+		/*
+		 * Aggregate instrumentation information of all the backend
+		 * workers for Funnel node.  This has to be done before we
+		 * destroy the parallel context.
+		 */
+		if (node->ss.ps.state->es_instrument)
+			ExecAccumulateInstInfo(node);
+
+		/* destroy parallel context. */
+		DestroyParallelContext(node->pcxt);
+		node->pcxt = NULL;
+
+		node->fs_workersReady = false;
+		node->all_workers_done = false;
+		node->local_scan_done = false;
+	}
+}
+
+/* ----------------------------------------------------------------
+ *		ExecAccumulateInstInfo
+ *
+ *		Accumulate instrumentation information of all the workers
+ * ----------------------------------------------------------------
+ */
+void ExecAccumulateInstInfo(FunnelState *node)
+{
+	int i;
+	Instrumentation *instrument_worker;
+	int nworkers;
+	char *inst_info_workers;
+	
+	if (node->pcxt)
+	{
+		nworkers = node->pcxt->nworkers;
+		inst_info_workers = node->inst_options_space;
+		for (i = 0; i < nworkers; i++)
+		{
+			instrument_worker = (Instrumentation *)(inst_info_workers + (i * sizeof(Instrumentation)));
+			InstrAggNode(node->ss.ps.instrument, instrument_worker);
+		}
+	}
+}
+
+/* ----------------------------------------------------------------
+ *		ExecAccumulateBufUsageInfo
+ *
+ *		Accumulate buffer usage information of all the workers
+ * ----------------------------------------------------------------
+ */
+void ExecAccumulateBufUsageInfo(FunnelState *node)
+{
+	int i;
+	int nworkers;
+	BufferUsage *buffer_usage_worker;
+	char *buffer_usage;
+
+	if (node->pcxt)
+	{
+		nworkers = node->pcxt->nworkers;
+		buffer_usage = node->buffer_usage_space;
+
+		for (i = 0; i < nworkers; i++)
+		{
+			buffer_usage_worker = (BufferUsage *)(buffer_usage + (i * sizeof(BufferUsage)));
+			BufferUsageAdd(&pgBufferUsage, buffer_usage_worker);
+		}
+	}
+}
+
+/* ----------------------------------------------------------------
+ *						Join Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecReScanFunnel
+ *
+ *		Rescans a relation.
+ * ----------------------------------------------------------------
+ */
+void
+ExecReScanFunnel(FunnelState *node)
+{
+	/*
+	 * Re-initialize the parallel context and workers to perform
+	 * rescan of relation.  We want to gracefully shut down all the
+	 * workers so that they are able to propagate any error or other
+	 * information to the master backend before dying.
+	 */
+	FinishParallelSetupAndAccumStats(node);
+
+	ExecReScan(node->ss.ps.lefttree);
+}
diff --git a/src/backend/executor/nodeNestloop.c b/src/backend/executor/nodeNestloop.c
index e66bcda..c447062 100644
--- a/src/backend/executor/nodeNestloop.c
+++ b/src/backend/executor/nodeNestloop.c
@@ -144,6 +144,7 @@ ExecNestLoop(NestLoopState *node)
 			{
 				NestLoopParam *nlp = (NestLoopParam *) lfirst(lc);
 				int			paramno = nlp->paramno;
+				TupleDesc	tdesc = outerTupleSlot->tts_tupleDescriptor;
 				ParamExecData *prm;
 
 				prm = &(econtext->ecxt_param_exec_vals[paramno]);
@@ -154,6 +155,7 @@ ExecNestLoop(NestLoopState *node)
 				prm->value = slot_getattr(outerTupleSlot,
 										  nlp->paramval->varattno,
 										  &(prm->isnull));
+				prm->ptype = tdesc->attrs[nlp->paramval->varattno-1]->atttypid;
 				/* Flag parameter value as changed */
 				innerPlan->chgParam = bms_add_member(innerPlan->chgParam,
 													 paramno);
diff --git a/src/backend/executor/nodePartialSeqscan.c b/src/backend/executor/nodePartialSeqscan.c
new file mode 100644
index 0000000..09b7e07
--- /dev/null
+++ b/src/backend/executor/nodePartialSeqscan.c
@@ -0,0 +1,308 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodePartialSeqscan.c
+ *	  Support routines for parallel sequential scans of relations.
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodePartialSeqscan.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * INTERFACE ROUTINES
+ *		ExecPartialSeqScan				scans a relation.
+ *		PartialSeqNext					retrieves the next tuple from the heap.
+ *		ExecInitPartialSeqScan			creates and initializes a partial seqscan node.
+ *		ExecEndPartialSeqScan			releases any storage allocated.
+ */
+#include "postgres.h"
+
+#include "access/relscan.h"
+#include "executor/execdebug.h"
+#include "executor/execParallel.h"
+#include "executor/nodePartialSeqscan.h"
+#include "utils/rel.h"
+
+
+
+/* ----------------------------------------------------------------
+ *						Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		PartialSeqNext
+ *
+ *		This is a workhorse for ExecPartialSeqScan
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+PartialSeqNext(PartialSeqScanState *node)
+{
+	HeapTuple	tuple;
+	HeapScanDesc scandesc;
+	EState	   *estate;
+	ScanDirection direction;
+	TupleTableSlot *slot;
+
+	/*
+	 * get information from the estate and scan state
+	 */
+	scandesc = node->ss.ss_currentScanDesc;
+	estate = node->ss.ps.state;
+	direction = estate->es_direction;
+	slot = node->ss.ss_ScanTupleSlot;
+
+	/*
+	 * get the next tuple from the table
+	 */
+	tuple = heap_getnext(scandesc, direction);
+
+	/*
+	 * save the tuple and the buffer returned to us by the access methods in
+	 * our scan tuple slot and return the slot.  Note: we pass 'false' because
+	 * tuples returned by heap_getnext() are pointers onto disk pages and were
+	 * not created with palloc() and so should not be pfree()'d.  Note also
+	 * that ExecStoreTuple will increment the refcount of the buffer; the
+	 * refcount will not be dropped until the tuple table slot is cleared.
+	 */
+	if (tuple)
+		ExecStoreTuple(tuple,	/* tuple to store */
+					   slot,	/* slot to store in */
+					   scandesc->rs_cbuf,		/* buffer associated with this
+												 * tuple */
+					   false);	/* don't pfree this pointer */
+	else
+		ExecClearTuple(slot);
+
+	return slot;
+}
+
+/*
+ * PartialSeqRecheck -- access method routine to recheck a tuple in EvalPlanQual
+ */
+static bool
+PartialSeqRecheck(PartialSeqScanState *node, TupleTableSlot *slot)
+{
+	/*
+	 * Note that unlike IndexScan, PartialSeqScan never uses keys in
+	 * heap_beginscan (and this is very bad) - so, here we do not
+	 * check whether the keys are ok or not.
+	 */
+	return true;
+}
+
+/* ----------------------------------------------------------------
+ *		InitPartialScanRelation
+ *
+ *		Set up to access the scan relation.
+ * ----------------------------------------------------------------
+ */
+static void
+InitPartialScanRelation(PartialSeqScanState *node, EState *estate, int eflags)
+{
+	Relation	currentRelation;
+	shm_toc		*toc;
+
+	/*
+	 * get the relation object id from the relid'th entry in the range table,
+	 * open that relation and acquire appropriate lock on it.
+	 */
+	currentRelation = ExecOpenScanRelation(estate,
+										   ((Scan *) node->ss.ps.plan)->scanrelid,
+										   eflags);
+
+	/*
+	 * Parallel scan descriptor is initialized and stored in dynamic shared
+	 * memory segment by master backend and parallel workers retrieve it
+	 * from shared memory.  We set 'toc' (place to lookup parallel scan
+	 * descriptor) as retrieved by attaching to dsm for parallel workers
+	 * whereas master backend stores it directly in partial scan state node
+	 * after initializing workers.
+	 */
+	toc = GetParallelShmToc();
+	if (toc)
+		node->ss.ps.toc = toc;
+
+	node->ss.ss_currentRelation = currentRelation;
+
+	/* and report the scan tuple slot's rowtype */
+	ExecAssignScanType(&node->ss, RelationGetDescr(currentRelation));
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitPartialSeqScan
+ * ----------------------------------------------------------------
+ */
+PartialSeqScanState *
+ExecInitPartialSeqScan(PartialSeqScan *node, EState *estate, int eflags)
+{
+	PartialSeqScanState *scanstate;
+
+	/*
+	 * Once upon a time it was possible to have an outerPlan of a SeqScan, but
+	 * not any more.
+	 */
+	Assert(outerPlan(node) == NULL);
+	Assert(innerPlan(node) == NULL);
+
+	/*
+	 * create state structure
+	 */
+	scanstate = makeNode(PartialSeqScanState);
+	scanstate->ss.ps.plan = (Plan *) node;
+	scanstate->ss.ps.state = estate;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * create expression context for node
+	 */
+	ExecAssignExprContext(estate, &scanstate->ss.ps);
+
+	/*
+	 * initialize child expressions
+	 */
+	scanstate->ss.ps.targetlist = (List *)
+		ExecInitExpr((Expr *) node->plan.targetlist,
+					 (PlanState *) scanstate);
+	scanstate->ss.ps.qual = (List *)
+		ExecInitExpr((Expr *) node->plan.qual,
+					 (PlanState *) scanstate);
+
+	/*
+	 * tuple table initialization
+	 */
+	ExecInitResultTupleSlot(estate, &scanstate->ss.ps);
+	ExecInitScanTupleSlot(estate, &scanstate->ss);
+
+	/*
+	 * initialize scan relation
+	 */
+	InitPartialScanRelation(scanstate, estate, eflags);
+
+	scanstate->ss.ps.ps_TupFromTlist = false;
+
+	/*
+	 * Initialize result tuple type and projection info.
+	 */
+	ExecAssignResultTypeFromTL(&scanstate->ss.ps);
+	ExecAssignScanProjectionInfo(&scanstate->ss);
+
+	return scanstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecPartialSeqScan(node)
+ *
+ *		Scans the relation and returns the next qualifying tuple.
+ *		We call the ExecScan() routine and pass it the appropriate
+ *		access method functions.
+ * ----------------------------------------------------------------
+ */
+TupleTableSlot *
+ExecPartialSeqScan(PartialSeqScanState *node)
+{
+	/*
+	 * Initialize the scan on first execution.  Normally we initialize
+	 * it during ExecutorStart phase; however, this node needs a
+	 * ParallelHeapScanDesc to initialize the scan, and that is
+	 * initialized by the Funnel node during ExecutorRun phase.
+	 */
+	if (!node->scan_initialized)
+	{
+		ParallelHeapScanDesc pscan;
+
+		/*
+		 * Parallel scan descriptor is initialized and stored in dynamic shared
+		 * memory segment by master backend, parallel workers and local scan by
+		 * master backend retrieve it from shared memory.  If the scan descriptor
+		 * is available on first execution, then we need to re-initialize for
+		 * rescan.
+		 */
+		Assert(node->ss.ps.toc);
+	
+		pscan = shm_toc_lookup(node->ss.ps.toc, PARALLEL_KEY_SCAN);
+
+		if (!node->ss.ss_currentScanDesc)
+		{
+			node->ss.ss_currentScanDesc =
+				heap_beginscan_parallel(node->ss.ss_currentRelation, pscan);
+		}
+		else
+		{
+			heap_parallel_rescan(pscan, node->ss.ss_currentScanDesc);
+		}
+
+		node->scan_initialized = true;
+	}
+
+	return ExecScan((ScanState *) node,
+					(ExecScanAccessMtd) PartialSeqNext,
+					(ExecScanRecheckMtd) PartialSeqRecheck);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndPartialSeqScan
+ *
+ *		frees any storage allocated through C routines.
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndPartialSeqScan(PartialSeqScanState *node)
+{
+	Relation	relation;
+	HeapScanDesc scanDesc;
+
+	/*
+	 * get information from node
+	 */
+	relation = node->ss.ss_currentRelation;
+	scanDesc = node->ss.ss_currentScanDesc;
+
+	/*
+	 * Free the exprcontext
+	 */
+	ExecFreeExprContext(&node->ss.ps);
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+
+	/*
+	 * close heap scan
+	 */
+	if (scanDesc)
+		heap_endscan(scanDesc);
+
+	/*
+	 * close the heap relation.
+	 */
+	ExecCloseScanRelation(relation);
+}
+
+/* ----------------------------------------------------------------
+ *						Join Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecReScanPartialSeqScan
+ *
+ *		Rescans the relation.
+ * ----------------------------------------------------------------
+ */
+void
+ExecReScanPartialSeqScan(PartialSeqScanState *node)
+{
+	if (node->scan_initialized)
+		node->scan_initialized = false;
+
+	ExecScanReScan((ScanState *) node);
+}
diff --git a/src/backend/executor/nodeResult.c b/src/backend/executor/nodeResult.c
index 8d3dde0..b348bfd 100644
--- a/src/backend/executor/nodeResult.c
+++ b/src/backend/executor/nodeResult.c
@@ -75,6 +75,13 @@ ExecResult(ResultState *node)
 	econtext = node->ps.ps_ExprContext;
 
 	/*
+	 * Result node can be added as a gating node on top of PartialSeqScan
+	 * node, so we need to pass the toc information down to the outer node.
+	 */
+	if (node->ps.toc)
+		outerPlanState(node)->toc = node->ps.toc;
+
+	/*
 	 * check constant qualifications like (2 > 1), if not already done
 	 */
 	if (node->rs_checkqual)
diff --git a/src/backend/executor/nodeSubplan.c b/src/backend/executor/nodeSubplan.c
index 9eb4d63..6afd55a 100644
--- a/src/backend/executor/nodeSubplan.c
+++ b/src/backend/executor/nodeSubplan.c
@@ -30,11 +30,14 @@
 #include <math.h>
 
 #include "access/htup_details.h"
+#include "catalog/pg_type.h"
 #include "executor/executor.h"
 #include "executor/nodeSubplan.h"
 #include "nodes/makefuncs.h"
+#include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
 #include "utils/array.h"
+#include "utils/datum.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 
@@ -281,12 +284,14 @@ ExecScanSubPlan(SubPlanState *node,
 	forboth(l, subplan->parParam, pvar, node->args)
 	{
 		int			paramid = lfirst_int(l);
+		ExprState	*exprstate = (ExprState *) lfirst(pvar);
 		ParamExecData *prm = &(econtext->ecxt_param_exec_vals[paramid]);
 
-		prm->value = ExecEvalExprSwitchContext((ExprState *) lfirst(pvar),
+		prm->value = ExecEvalExprSwitchContext(exprstate,
 											   econtext,
 											   &(prm->isnull),
 											   NULL);
+		prm->ptype = exprType((Node *) exprstate->expr);
 		planstate->chgParam = bms_add_member(planstate->chgParam, paramid);
 	}
 
@@ -399,6 +404,7 @@ ExecScanSubPlan(SubPlanState *node,
 			prmdata = &(econtext->ecxt_param_exec_vals[paramid]);
 			Assert(prmdata->execPlan == NULL);
 			prmdata->value = slot_getattr(slot, col, &(prmdata->isnull));
+			prmdata->ptype = tdesc->attrs[col-1]->atttypid;
 			col++;
 		}
 
@@ -551,6 +557,7 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
 		 !TupIsNull(slot);
 		 slot = ExecProcNode(planstate))
 	{
+		TupleDesc	tdesc = slot->tts_tupleDescriptor;
 		int			col = 1;
 		ListCell   *plst;
 		bool		isnew;
@@ -568,6 +575,7 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
 			Assert(prmdata->execPlan == NULL);
 			prmdata->value = slot_getattr(slot, col,
 										  &(prmdata->isnull));
+			prmdata->ptype = tdesc->attrs[col-1]->atttypid;
 			col++;
 		}
 		slot = ExecProject(node->projRight, NULL);
@@ -954,6 +962,7 @@ ExecSetParamPlan(SubPlanState *node, ExprContext *econtext)
 	ListCell   *l;
 	bool		found = false;
 	ArrayBuildStateAny *astate = NULL;
+	Oid			ptype;
 
 	if (subLinkType == ANY_SUBLINK ||
 		subLinkType == ALL_SUBLINK)
@@ -961,6 +970,8 @@ ExecSetParamPlan(SubPlanState *node, ExprContext *econtext)
 	if (subLinkType == CTE_SUBLINK)
 		elog(ERROR, "CTE subplans should not be executed via ExecSetParamPlan");
 
+	ptype = exprType((Node *) node->xprstate.expr);
+
 	/* Initialize ArrayBuildStateAny in caller's context, if needed */
 	if (subLinkType == ARRAY_SUBLINK)
 		astate = initArrayResultAny(subplan->firstColType,
@@ -983,12 +994,14 @@ ExecSetParamPlan(SubPlanState *node, ExprContext *econtext)
 	forboth(l, subplan->parParam, pvar, node->args)
 	{
 		int			paramid = lfirst_int(l);
+		ExprState	*exprstate = (ExprState *) lfirst(pvar);
 		ParamExecData *prm = &(econtext->ecxt_param_exec_vals[paramid]);
 
-		prm->value = ExecEvalExprSwitchContext((ExprState *) lfirst(pvar),
+		prm->value = ExecEvalExprSwitchContext(exprstate,
 											   econtext,
 											   &(prm->isnull),
 											   NULL);
+		prm->ptype = exprType((Node *) exprstate->expr);
 		planstate->chgParam = bms_add_member(planstate->chgParam, paramid);
 	}
 
@@ -1011,6 +1024,7 @@ ExecSetParamPlan(SubPlanState *node, ExprContext *econtext)
 
 			prm->execPlan = NULL;
 			prm->value = BoolGetDatum(true);
+			prm->ptype = ptype;
 			prm->isnull = false;
 			found = true;
 			break;
@@ -1062,6 +1076,7 @@ ExecSetParamPlan(SubPlanState *node, ExprContext *econtext)
 			prm->execPlan = NULL;
 			prm->value = heap_getattr(node->curTuple, i, tdesc,
 									  &(prm->isnull));
+			prm->ptype = tdesc->attrs[i-1]->atttypid;
 			i++;
 		}
 	}
@@ -1084,6 +1099,7 @@ ExecSetParamPlan(SubPlanState *node, ExprContext *econtext)
 											true);
 		prm->execPlan = NULL;
 		prm->value = node->curArray;
+		prm->ptype = ptype;
 		prm->isnull = false;
 	}
 	else if (!found)
@@ -1096,6 +1112,7 @@ ExecSetParamPlan(SubPlanState *node, ExprContext *econtext)
 
 			prm->execPlan = NULL;
 			prm->value = BoolGetDatum(false);
+			prm->ptype = ptype;
 			prm->isnull = false;
 		}
 		else
@@ -1108,6 +1125,7 @@ ExecSetParamPlan(SubPlanState *node, ExprContext *econtext)
 
 				prm->execPlan = NULL;
 				prm->value = (Datum) 0;
+				prm->ptype = VOIDOID;
 				prm->isnull = true;
 			}
 		}
@@ -1238,3 +1256,47 @@ ExecAlternativeSubPlan(AlternativeSubPlanState *node,
 					   isNull,
 					   isDone);
 }
+
+/*
+ * ExecAndFormSerializeParamExec
+ *
+ * Execute the subplan stored in each PARAM_EXEC param if it has not been
+ * executed yet, and form the serialized structure required for passing
+ * the param values to a worker backend.
+ */
+List *
+ExecAndFormSerializeParamExec(ExprContext *econtext, Bitmapset *params)
+{
+	List	*lparam = NIL;
+	SerializedParamExecData *sparamdata;
+	ParamExecData *prm;
+	int		paramid;
+
+	paramid = -1;
+	while ((paramid = bms_next_member(params, paramid)) >= 0)
+	{
+		/*
+		 * PARAM_EXEC params (internal executor parameters) are stored in the
+		 * ecxt_param_exec_vals array, and can be accessed by array index.
+		 */
+		sparamdata = palloc0(sizeof(SerializedParamExecData));
+
+		prm = &(econtext->ecxt_param_exec_vals[paramid]);
+		if (prm->execPlan != NULL)
+		{
+			/* Parameter not evaluated yet, so go do it */
+			ExecSetParamPlan(prm->execPlan, econtext);
+			/* ExecSetParamPlan should have processed this param... */
+			Assert(prm->execPlan == NULL);
+		}
+
+		sparamdata->paramid	= paramid;
+		sparamdata->ptype = prm->ptype;
+		sparamdata->value = prm->value;
+		sparamdata->isnull = prm->isnull;
+
+		lparam = lappend(lparam, sparamdata);
+	}
+
+	return lparam;
+}
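Reviewer note on the param list built above: it is later flattened into shared memory by the Estimate/Serialize/Restore routines in params.c. The following is only a standalone toy sketch of that layout (a count, then a fixed header per parameter, followed by the payload for non-null pass-by-reference values). ToyParam and the toy_* functions are hypothetical names invented for illustration; they are not the patch's structs or any PostgreSQL API, and malloc stands in for palloc.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified parameter; not a PostgreSQL struct. */
typedef struct
{
	int			isnull;
	int			byval;		/* 1: value lives in ival; 0: in data */
	long		ival;
	const char *data;		/* by-reference payload */
	size_t		length;		/* payload length in bytes */
} ToyParam;

/* Space needed: a count, then one fixed header (+ payload) per param. */
static size_t
toy_estimate(const ToyParam *p, int n)
{
	size_t		size = sizeof(int);

	for (int i = 0; i < n; i++)
	{
		size += sizeof(ToyParam);
		if (!p[i].byval && !p[i].isnull)
			size += p[i].length;
	}
	return size;
}

/* Write the count, then header and optional payload for each param. */
static void
toy_serialize(const ToyParam *p, int n, char *buf)
{
	char	   *cur = buf;

	memcpy(cur, &n, sizeof(int));
	cur += sizeof(int);
	for (int i = 0; i < n; i++)
	{
		/* memcpy avoids assuming that cur is suitably aligned */
		memcpy(cur, &p[i], sizeof(ToyParam));
		cur += sizeof(ToyParam);
		if (!p[i].byval && !p[i].isnull)
		{
			memcpy(cur, p[i].data, p[i].length);
			cur += p[i].length;
		}
	}
}

/* Rebuild the params; by-reference payloads get fresh local copies. */
static ToyParam *
toy_restore(const char *buf, int *n_out)
{
	const char *cur = buf;
	ToyParam   *p;
	int			n;

	memcpy(&n, cur, sizeof(int));
	cur += sizeof(int);
	*n_out = n;
	if (n <= 0)
		return NULL;

	p = malloc(n * sizeof(ToyParam));
	assert(p != NULL);
	for (int i = 0; i < n; i++)
	{
		memcpy(&p[i], cur, sizeof(ToyParam));
		cur += sizeof(ToyParam);
		p[i].data = NULL;	/* the sender's pointer is meaningless here */
		if (!p[i].byval && !p[i].isnull)
		{
			char	   *copy = malloc(p[i].length + 1);

			assert(copy != NULL);
			memcpy(copy, cur, p[i].length);
			copy[p[i].length] = '\0';
			p[i].data = copy;
			cur += p[i].length;
		}
	}
	return p;
}
```

The sketch copies headers with memcpy rather than casting the buffer pointer, since a header that follows a variable-length payload is not guaranteed to be aligned.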
diff --git a/src/backend/executor/spi.c b/src/backend/executor/spi.c
index d544ad9..d8ca074 100644
--- a/src/backend/executor/spi.c
+++ b/src/backend/executor/spi.c
@@ -1774,7 +1774,7 @@ spi_dest_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
  *		store tuple retrieved by Executor into SPITupleTable
  *		of current SPI procedure
  */
-void
+bool
 spi_printtup(TupleTableSlot *slot, DestReceiver *self)
 {
 	SPITupleTable *tuptable;
@@ -1808,6 +1808,8 @@ spi_printtup(TupleTableSlot *slot, DestReceiver *self)
 	(tuptable->free)--;
 
 	MemoryContextSwitchTo(oldcxt);
+
+	return true;
 }
 
 /*
diff --git a/src/backend/executor/tqueue.c b/src/backend/executor/tqueue.c
new file mode 100644
index 0000000..39acda7
--- /dev/null
+++ b/src/backend/executor/tqueue.c
@@ -0,0 +1,304 @@
+/*-------------------------------------------------------------------------
+ *
+ * tqueue.c
+ *	  Use shm_mq to send & receive tuples between parallel backends
+ *
+ * A DestReceiver of type DestTupleQueue, which is a TQueueDestReceiver
+ * under the hood, writes tuples from the executor to a shm_mq.
+ *
+ * A TupleQueueFunnel helps manage the process of reading tuples from
+ * one or more shm_mq objects being used as tuple queues.
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/tqueue.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/tqueue.h"
+#include "miscadmin.h"
+
+typedef struct
+{
+	DestReceiver pub;
+	shm_mq_handle *handle;
+} TQueueDestReceiver;
+
+struct TupleQueueFunnel
+{
+	int		nqueues;
+	int		maxqueues;
+	int		nextqueue;
+	shm_mq_handle **queue;
+};
+
+/*
+ * Receive a tuple from the executor and write it to the tuple queue.
+ */
+static bool
+tqueueReceiveSlot(TupleTableSlot *slot, DestReceiver *self)
+{
+	TQueueDestReceiver *tqueue = (TQueueDestReceiver *) self;
+	HeapTuple	tuple;
+	shm_mq_result	result;
+
+	tuple = ExecMaterializeSlot(slot);
+	result = shm_mq_send(tqueue->handle, tuple->t_len, tuple->t_data, false);
+
+	if (result == SHM_MQ_DETACHED)
+		return false;
+	else if (result != SHM_MQ_SUCCESS)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("unable to send tuples")));
+
+	return true;
+}
+
+/*
+ * Prepare to receive tuples from executor.
+ */
+static void
+tqueueStartupReceiver(DestReceiver *self, int operation, TupleDesc typeinfo)
+{
+	/* do nothing */
+}
+
+/*
+ * Clean up at end of an executor run
+ */
+static void
+tqueueShutdownReceiver(DestReceiver *self)
+{
+	/* do nothing */
+}
+
+/*
+ * Destroy receiver when done with it
+ */
+static void
+tqueueDestroyReceiver(DestReceiver *self)
+{
+	pfree(self);
+}
+
+/*
+ * Create a DestReceiver that writes tuples to a tuple queue.
+ */
+DestReceiver *
+CreateTupleQueueDestReceiver(void)
+{
+	TQueueDestReceiver *self;
+
+	self = (TQueueDestReceiver *) palloc0(sizeof(TQueueDestReceiver));
+
+	self->pub.receiveSlot = tqueueReceiveSlot;
+	self->pub.rStartup = tqueueStartupReceiver;
+	self->pub.rShutdown = tqueueShutdownReceiver;
+	self->pub.rDestroy = tqueueDestroyReceiver;
+	self->pub.mydest = DestTupleQueue;
+
+	/* private fields will be set by SetTupleQueueDestReceiverParams */
+
+	return (DestReceiver *) self;
+}
+
+/*
+ * Set parameters for a TupleQueueDestReceiver
+ */
+void
+SetTupleQueueDestReceiverParams(DestReceiver *self,
+								shm_mq_handle *handle)
+{
+	TQueueDestReceiver *myState = (TQueueDestReceiver *) self;
+
+	myState->handle = handle;
+}
+
+/*
+ * Create a tuple queue funnel.
+ */
+TupleQueueFunnel *
+CreateTupleQueueFunnel(void)
+{
+	TupleQueueFunnel *funnel = palloc0(sizeof(TupleQueueFunnel));
+
+	funnel->maxqueues = 8;
+	funnel->queue = palloc(funnel->maxqueues * sizeof(shm_mq_handle *));
+
+	return funnel;
+}
+
+/*
+ * Detach all tuple queues that belong to funnel.
+ */
+void
+TupleQueueFunnelShutdown(TupleQueueFunnel *funnel)
+{
+	if (funnel)
+	{
+		int		i;
+		shm_mq_handle *mqh;
+		shm_mq	   *mq;
+		for (i = 0; i < funnel->nqueues; i++)
+		{
+			mqh = funnel->queue[i];
+			mq = shm_mq_get_queue(mqh);
+			shm_mq_detach(mq);
+		}
+	}
+}
+
+/*
+ * Destroy a tuple queue funnel.
+ */
+void
+DestroyTupleQueueFunnel(TupleQueueFunnel *funnel)
+{
+	if (funnel)
+	{
+		pfree(funnel->queue);
+		pfree(funnel);
+	}
+}
+
+/*
+ * Remember the shared memory queue handle in funnel.
+ */
+void
+RegisterTupleQueueOnFunnel(TupleQueueFunnel *funnel, shm_mq_handle *handle)
+{
+	if (funnel->nqueues < funnel->maxqueues)
+	{
+		funnel->queue[funnel->nqueues++] = handle;
+		return;
+	}
+
+	if (funnel->nqueues >= funnel->maxqueues)
+	{
+		int newsize = funnel->nqueues * 2;
+
+		Assert(funnel->nqueues == funnel->maxqueues);
+
+		funnel->queue = repalloc(funnel->queue,
+								 newsize * sizeof(shm_mq_handle *));
+		funnel->maxqueues = newsize;
+	}
+
+	funnel->queue[funnel->nqueues++] = handle;
+}
+
+/*
+ * Fetch a tuple from a tuple queue funnel.
+ *
+ * We try to read from the queues in round-robin fashion so as to avoid
+ * the situation where some workers get their tuples read expeditiously while
+ * others are barely ever serviced.
+ *
+ * Even when nowait = false, we read from the individual queues in
+ * non-blocking mode.  Even when shm_mq_receive() returns SHM_MQ_WOULD_BLOCK,
+ * it can still accumulate bytes from a partially-read message, so doing it
+ * this way should outperform doing a blocking read on each queue in turn.
+ *
+ * The return value is NULL if there are no remaining queues or if
+ * nowait = true and no queue returned a tuple without blocking.  *done, if
+ * not NULL, is set to true when there are no remaining queues and false in
+ * any other case.
+ */
+HeapTuple
+TupleQueueFunnelNext(TupleQueueFunnel *funnel, bool nowait, bool *done)
+{
+	int	waitpos = funnel->nextqueue;
+
+	/* Corner case: called before adding any queues, or after all are gone. */
+	if (funnel->nqueues == 0)
+	{
+		if (done != NULL)
+			*done = true;
+		return NULL;
+	}
+
+	if (done != NULL)
+		*done = false;
+
+	for (;;)
+	{
+		shm_mq_handle *mqh = funnel->queue[funnel->nextqueue];
+		shm_mq_result result;
+		Size	nbytes;
+		void   *data;
+
+		/* Attempt to read a message. */
+		result = shm_mq_receive(mqh, &nbytes, &data, true);
+
+		/*
+		 * Normally, we advance funnel->nextqueue to the next queue at this
+		 * point, but if we're pointing to a queue that we've just discovered
+		 * is detached, then forget that queue and leave the pointer where it
+		 * is, unless the number of remaining queues falls below that pointer,
+		 * in which case we make the pointer point to the first queue.
+		 */
+		if (result != SHM_MQ_DETACHED)
+			funnel->nextqueue = (funnel->nextqueue + 1) % funnel->nqueues;
+		else
+		{
+			--funnel->nqueues;
+			if (funnel->nqueues == 0)
+			{
+				if (done != NULL)
+					*done = true;
+				return NULL;
+			}
+
+			memmove(&funnel->queue[funnel->nextqueue],
+					&funnel->queue[funnel->nextqueue + 1],
+					sizeof(shm_mq_handle *)
+						* (funnel->nqueues - funnel->nextqueue));
+
+			if (funnel->nextqueue >= funnel->nqueues)
+				funnel->nextqueue = 0;
+
+			if (funnel->nextqueue < waitpos)
+				--waitpos;
+
+			continue;
+		}
+
+		/* If we got a message, return it. */
+		if (result == SHM_MQ_SUCCESS)
+		{
+			HeapTupleData htup;
+
+			/*
+			 * The tuple data we just read from the queue is only valid
+			 * until we again attempt to read from it.  Copy the tuple into
+			 * a single palloc'd chunk as callers will expect.
+			 */
+			ItemPointerSetInvalid(&htup.t_self);
+			htup.t_tableOid = InvalidOid;
+			htup.t_len = nbytes;
+			htup.t_data = data;
+			return heap_copytuple(&htup);
+		}
+
+		/*
+		 * If we've visited all of the queues, then we should either give up
+		 * and return NULL (if we're in non-blocking mode) or wait for the
+		 * process latch to be set (otherwise).
+		 */
+		if (funnel->nextqueue == waitpos)
+		{
+			if (nowait)
+				return NULL;
+			WaitLatch(MyLatch, WL_LATCH_SET, 0);
+			CHECK_FOR_INTERRUPTS();
+			ResetLatch(MyLatch);
+		}
+	}
+}
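Reviewer note: the round-robin read with removal of detached queues in TupleQueueFunnelNext can be illustrated by a standalone toy sketch. Everything below (ToyQueue, ToyFunnel, the toy_* functions) is hypothetical and invented for illustration; it does not use shm_mq, and it leaves out the would-block case, the waitpos bookkeeping and the latch wait, so it shows only the queue-rotation logic.

```c
#include <assert.h>
#include <string.h>

#define TOY_MAX_QUEUES 8

/* Hypothetical stand-ins for the shm_mq result codes. */
typedef enum { Q_SUCCESS, Q_DETACHED } QResult;

/* A toy queue that yields a fixed list of values and then "detaches". */
typedef struct
{
	const int  *items;
	int			nitems;
	int			pos;
} ToyQueue;

typedef struct
{
	int			nqueues;
	int			nextqueue;
	ToyQueue   *queue[TOY_MAX_QUEUES];
} ToyFunnel;

static QResult
toy_receive(ToyQueue *q, int *out)
{
	if (q->pos >= q->nitems)
		return Q_DETACHED;
	*out = q->items[q->pos++];
	return Q_SUCCESS;
}

/* Returns 1 and sets *out on success, 0 once every queue is detached. */
static int
toy_funnel_next(ToyFunnel *f, int *out)
{
	assert(f->nqueues <= TOY_MAX_QUEUES);

	while (f->nqueues > 0)
	{
		QResult		r = toy_receive(f->queue[f->nextqueue], out);

		if (r == Q_DETACHED)
		{
			/* Forget this queue and close the gap it leaves behind. */
			--f->nqueues;
			memmove(&f->queue[f->nextqueue],
					&f->queue[f->nextqueue + 1],
					sizeof(ToyQueue *) * (f->nqueues - f->nextqueue));
			if (f->nextqueue >= f->nqueues)
				f->nextqueue = 0;
			continue;
		}

		/* Got a value: advance so the next call starts at the next queue. */
		f->nextqueue = (f->nextqueue + 1) % f->nqueues;
		return 1;
	}
	return 0;
}

/* Drain the funnel into out[]; returns the number of values read. */
static int
toy_funnel_drain(ToyFunnel *f, int *out, int maxout)
{
	int			n = 0;

	while (n < maxout && toy_funnel_next(f, &out[n]))
		n++;
	return n;
}
```

With three toy queues holding {1, 2, 3}, {10} and {100, 200}, draining the funnel interleaves the values across queues instead of exhausting the first queue before touching the others, which is the fairness property the comment on TupleQueueFunnelNext describes.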
diff --git a/src/backend/executor/tstoreReceiver.c b/src/backend/executor/tstoreReceiver.c
index c1fdeb7..b0862ae 100644
--- a/src/backend/executor/tstoreReceiver.c
+++ b/src/backend/executor/tstoreReceiver.c
@@ -37,8 +37,8 @@ typedef struct
 } TStoreState;
 
 
-static void tstoreReceiveSlot_notoast(TupleTableSlot *slot, DestReceiver *self);
-static void tstoreReceiveSlot_detoast(TupleTableSlot *slot, DestReceiver *self);
+static bool tstoreReceiveSlot_notoast(TupleTableSlot *slot, DestReceiver *self);
+static bool tstoreReceiveSlot_detoast(TupleTableSlot *slot, DestReceiver *self);
 
 
 /*
@@ -90,19 +90,21 @@ tstoreStartupReceiver(DestReceiver *self, int operation, TupleDesc typeinfo)
  * Receive a tuple from the executor and store it in the tuplestore.
  * This is for the easy case where we don't have to detoast.
  */
-static void
+static bool
 tstoreReceiveSlot_notoast(TupleTableSlot *slot, DestReceiver *self)
 {
 	TStoreState *myState = (TStoreState *) self;
 
 	tuplestore_puttupleslot(myState->tstore, slot);
+
+	return true;
 }
 
 /*
  * Receive a tuple from the executor and store it in the tuplestore.
  * This is for the case where we have to detoast any toasted values.
  */
-static void
+static bool
 tstoreReceiveSlot_detoast(TupleTableSlot *slot, DestReceiver *self)
 {
 	TStoreState *myState = (TStoreState *) self;
@@ -152,6 +154,8 @@ tstoreReceiveSlot_detoast(TupleTableSlot *slot, DestReceiver *self)
 	/* And release any temporary detoasted values */
 	for (i = 0; i < nfree; i++)
 		pfree(DatumGetPointer(myState->tofree[i]));
+
+	return true;
 }
 
 /*
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 5bfb7c0..578b39a 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -361,6 +361,43 @@ _copySeqScan(const SeqScan *from)
 }
 
 /*
+ * _copyPartialSeqScan
+ */
+static PartialSeqScan *
+_copyPartialSeqScan(const PartialSeqScan *from)
+{
+	PartialSeqScan    *newnode = makeNode(PartialSeqScan);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopyScanFields((const Scan *) from, (Scan *) newnode);
+
+	return newnode;
+}
+
+/*
+ * _copyFunnel
+ */
+static Funnel *
+_copyFunnel(const Funnel *from)
+{
+	Funnel    *newnode = makeNode(Funnel);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopyScanFields((const Scan *) from, (Scan *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(num_workers);
+
+	return newnode;
+}
+
+/*
  * _copyIndexScan
  */
 static IndexScan *
@@ -4226,6 +4263,12 @@ copyObject(const void *from)
 		case T_SeqScan:
 			retval = _copySeqScan(from);
 			break;
+		case T_PartialSeqScan:
+			retval = _copyPartialSeqScan(from);
+			break;
+		case T_Funnel:
+			retval = _copyFunnel(from);
+			break;
 		case T_IndexScan:
 			retval = _copyIndexScan(from);
 			break;
diff --git a/src/backend/nodes/nodeFuncs.c b/src/backend/nodes/nodeFuncs.c
index b1e3e6e..e610d9d 100644
--- a/src/backend/nodes/nodeFuncs.c
+++ b/src/backend/nodes/nodeFuncs.c
@@ -3399,3 +3399,25 @@ raw_expression_tree_walker(Node *node,
 	}
 	return false;
 }
+
+/*
+ * planstate_tree_walker
+ *
+ * This routine will invoke walker on the node passed.  This is a useful
+ * way of starting the recursion when the walker's normal change of state
+ * is not appropriate for the outermost PlanState node.
+ */
+bool
+planstate_tree_walker(Node *node,
+					  ParallelContext *pcxt,
+					  bool (*walker) (),
+					  void *context)
+{
+	if (node == NULL)
+		return false;
+
+	/* Guard against stack overflow due to overly complex plan */
+	check_stack_depth();
+
+	return walker(node, pcxt, context);
+}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 482354c..f8dee8e 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -446,6 +446,24 @@ _outSeqScan(StringInfo str, const SeqScan *node)
 }
 
 static void
+_outPartialSeqScan(StringInfo str, const PartialSeqScan *node)
+{
+	WRITE_NODE_TYPE("PARTIALSEQSCAN");
+
+	_outScanInfo(str, (const Scan *) node);
+}
+
+static void
+_outFunnel(StringInfo str, const Funnel *node)
+{
+	WRITE_NODE_TYPE("FUNNEL");
+
+	_outScanInfo(str, (const Scan *) node);
+
+	WRITE_UINT_FIELD(num_workers);
+}
+
+static void
 _outIndexScan(StringInfo str, const IndexScan *node)
 {
 	WRITE_NODE_TYPE("INDEXSCAN");
@@ -3005,6 +3023,12 @@ _outNode(StringInfo str, const void *obj)
 			case T_SeqScan:
 				_outSeqScan(str, obj);
 				break;
+			case T_PartialSeqScan:
+				_outPartialSeqScan(str, obj);
+				break;
+			case T_Funnel:
+				_outFunnel(str, obj);
+				break;
 			case T_IndexScan:
 				_outIndexScan(str, obj);
 				break;
diff --git a/src/backend/nodes/params.c b/src/backend/nodes/params.c
index fb803f8..0050195 100644
--- a/src/backend/nodes/params.c
+++ b/src/backend/nodes/params.c
@@ -16,9 +16,22 @@
 #include "postgres.h"
 
 #include "nodes/params.h"
+#include "storage/shmem.h"
 #include "utils/datum.h"
 #include "utils/lsyscache.h"
 
+/*
+ * For each bind parameter, pass this structure followed by the value,
+ * except for pass-by-value parameters.
+ */
+typedef struct SerializedParamExternData
+{
+	Datum		value;			/* pass-by-val are directly stored */
+	Size		length;			/* length of parameter value */
+	bool		isnull;			/* is it NULL? */
+	uint16		pflags;			/* flag bits, same as in original Param */
+	Oid			ptype;			/* parameter's datatype, or 0 */
+} SerializedParamExternData;
 
 /*
  * Copy a ParamListInfo structure.
@@ -73,3 +86,355 @@ copyParamList(ParamListInfo from)
 
 	return retval;
 }
+
+/*
+ * Estimate the amount of space required to serialize the bound
+ * parameters.
+ */
+Size
+EstimateBoundParametersSpace(ParamListInfo paramInfo)
+{
+	Size		size;
+	int			i;
+
+	/* Add space required for saving numParams */
+	size = sizeof(int);
+
+	if (paramInfo)
+	{
+		/* Add space required for saving the param data */
+		for (i = 0; i < paramInfo->numParams; i++)
+		{
+			/*
+			 * for each parameter, calculate the size of fixed part
+			 * of parameter (SerializedParamExternData) and length of
+			 * parameter value.
+			 */
+			ParamExternData *oprm;
+			int16		typLen;
+			bool		typByVal;
+			Size		length;
+
+			length = sizeof(SerializedParamExternData);
+
+			oprm = &paramInfo->params[i];
+
+			get_typlenbyval(oprm->ptype, &typLen, &typByVal);
+
+			/*
+			 * pass-by-value parameters are directly stored in
+			 * SerializedParamExternData, so no need of additional
+			 * space for them.
+			 */
+			if (!(typByVal || oprm->isnull))
+			{
+				length += datumGetSize(oprm->value, typByVal, typLen);
+				size = add_size(size, length);
+
+				/* Allow space for terminating zero-byte */
+				size = add_size(size, 1);
+			}
+			else
+				size = add_size(size, length);
+		}
+	}
+
+	return size;
+}
+
+/*
+ * Serialize the bind parameters into memory, beginning at start_address.
+ * maxsize should be at least as large as the value returned by
+ * EstimateBoundParametersSpace.
+ */
+void
+SerializeBoundParams(ParamListInfo paramInfo, Size maxsize, char *start_address)
+{
+	char	   *curptr;
+	SerializedParamExternData *retval;
+	int i;
+
+	/*
+	 * First, we store the number of bind parameters, if there is
+	 * no bind parameter then no need to store any more information.
+	 */
+	if (paramInfo && paramInfo->numParams > 0)
+		* (int *) start_address = paramInfo->numParams;
+	else
+	{
+		* (int *) start_address = 0;
+		return;
+	}
+	curptr = start_address + sizeof(int);
+
+
+	for (i = 0; i < paramInfo->numParams; i++)
+	{
+		ParamExternData *oprm;
+		int16		typLen;
+		bool		typByVal;
+		Size		datumlength, length;
+		const char	*s;
+
+		Assert(curptr <= start_address + maxsize);
+		retval = (SerializedParamExternData *) curptr;
+		oprm = &paramInfo->params[i];
+
+		retval->isnull = oprm->isnull;
+		retval->pflags = oprm->pflags;
+		retval->ptype = oprm->ptype;
+		retval->value = oprm->value;
+
+		curptr = curptr + sizeof(SerializedParamExternData);
+
+		if (retval->isnull)
+			continue;
+
+		get_typlenbyval(oprm->ptype, &typLen, &typByVal);
+
+		if (!typByVal)
+		{
+			datumlength = datumGetSize(oprm->value, typByVal, typLen);
+			s = (char *) DatumGetPointer(oprm->value);
+			memcpy(curptr, s, datumlength);
+			length = datumlength;
+			curptr[length] = '\0';
+			retval->length = length;
+			curptr += length + 1;
+		}
+	}
+}
+
+/*
+ * RestoreBoundParams
+ *		Restore bind parameters from the specified address.
+ *
+ * The params are palloc'd in CurrentMemoryContext.
+ */
+ParamListInfo
+RestoreBoundParams(char *start_address)
+{
+	ParamListInfo retval;
+	Size		size;
+	int			num_params,i;
+	char	   *curptr;
+
+	num_params = * (int *) start_address;
+
+	if (num_params <= 0)
+		return NULL;
+
+	size = offsetof(ParamListInfoData, params) +
+						num_params * sizeof(ParamExternData);
+	retval = (ParamListInfo) palloc(size);
+	retval->paramFetch = NULL;
+	retval->paramFetchArg = NULL;
+	retval->parserSetup = NULL;
+	retval->parserSetupArg = NULL;
+	retval->numParams = num_params;
+
+	curptr = start_address + sizeof(int);
+
+	for (i = 0; i < num_params; i++)
+	{
+		SerializedParamExternData *nprm;
+		char	*s;
+		int16		typLen;
+		bool		typByVal;
+
+		nprm = (SerializedParamExternData *) curptr;
+
+		/* copy the parameter info */
+		retval->params[i].isnull = nprm->isnull;
+		retval->params[i].pflags = nprm->pflags;
+		retval->params[i].ptype = nprm->ptype;
+		retval->params[i].value = nprm->value;
+
+		curptr = curptr + sizeof(SerializedParamExternData);
+
+		if (nprm->isnull)
+			continue;
+
+		get_typlenbyval(nprm->ptype, &typLen, &typByVal);
+
+		if (!typByVal)
+		{
+			s = palloc(nprm->length + 1);
+			memcpy(s, curptr, nprm->length + 1);
+			retval->params[i].value = PointerGetDatum(s);
+
+			curptr += nprm->length + 1;
+		}
+	}
+
+	return retval;
+}
+
+/*
+ * Estimate the amount of space required to serialize the PARAM_EXEC
+ * parameters.
+ */
+Size
+EstimateExecParametersSpace(List *serialized_param_exec_vals)
+{
+	Size		size;
+	ListCell	*lparam;
+
+	/*
+	 * Add the space required to store the number of PARAM_EXEC
+	 * parameters that need to be serialized.
+	 */
+	size = sizeof(int);
+
+	foreach(lparam, serialized_param_exec_vals)
+	{
+		int16		typLen;
+		bool		typByVal;
+		Size		length;
+		SerializedParamExecData* param_val = (SerializedParamExecData*) lfirst(lparam);
+
+		length = sizeof(SerializedParamExecData);
+
+		get_typlenbyval(param_val->ptype, &typLen, &typByVal);
+
+		/*
+		 * Pass-by-value parameters are stored directly in
+		 * SerializedParamExecData, so they need no additional
+		 * space.
+		 */
+		if (!(typByVal || param_val->isnull))
+		{
+			length += datumGetSize(param_val->value, typByVal, typLen);
+			size = add_size(size, length);
+
+			/* Allow space for terminating zero-byte */
+			size = add_size(size, 1);
+		}
+		else
+			size = add_size(size, length);
+	}
+
+	return size;
+}
+
+/*
+ * Serialize the PARAM_EXEC parameters into the memory, beginning at
+ * start_address.  maxsize should be at least as large as the value
+ * returned by EstimateExecParametersSpace.
+ */
+void
+SerializeExecParams(List *serialized_param_exec_vals, Size maxsize,
+					char *start_address)
+{
+	char	   *curptr;
+	SerializedParamExecData *retval;
+	ListCell	*lparam;
+
+	/*
+	 * First, we store the number of PARAM_EXEC parameters that need to
+	 * be serialized.
+	 */
+	if (serialized_param_exec_vals)
+		* (int *) start_address = list_length(serialized_param_exec_vals);
+	else
+	{
+		* (int *) start_address = 0;
+		return;
+	}
+
+	curptr = start_address + sizeof(int);
+
+	foreach(lparam, serialized_param_exec_vals)
+	{
+		int16		typLen;
+		bool		typByVal;
+		Size		datumlength, length;
+		const char	*s;
+		SerializedParamExecData* param_val = (SerializedParamExecData*) lfirst(lparam);
+
+		retval = (SerializedParamExecData*) curptr;
+
+		retval->paramid	= param_val->paramid;
+		retval->value = param_val->value;
+		retval->isnull = param_val->isnull;
+		retval->ptype = param_val->ptype;
+
+		curptr = curptr + sizeof(SerializedParamExecData);
+
+		if (retval->isnull)
+			continue;
+
+		get_typlenbyval(retval->ptype, &typLen, &typByVal);
+
+		if (!typByVal)
+		{
+			datumlength = datumGetSize(retval->value, typByVal, typLen);
+			s = (char *) DatumGetPointer(retval->value);
+			memcpy(curptr, s, datumlength);
+			length = datumlength;
+			curptr[length] = '\0';
+			retval->length = length;
+			curptr += length + 1;
+		}
+	}
+}
+
+/*
+ * RestoreExecParams
+ *		Restore PARAM_EXEC parameters from the specified address.
+ *
+ * The params are palloc'd in CurrentMemoryContext.
+ */
+List *
+RestoreExecParams(char *start_address)
+{
+	List			*lparamexecvals = NIL;
+	//Size			size;
+	int				num_params,i;
+	char			*curptr;
+
+	num_params = * (int *) start_address;
+
+	if (num_params <= 0)
+		return NULL;
+
+	curptr = start_address + sizeof(int);
+
+	for (i = 0; i < num_params; i++)
+	{
+		SerializedParamExecData *nprm;
+		SerializedParamExecData	*outparam;
+		char	*s;
+		int16		typLen;
+		bool		typByVal;
+
+		nprm = (SerializedParamExecData *) curptr;
+
+		outparam = palloc0(sizeof(SerializedParamExecData));
+
+		/* copy the parameter info */
+		outparam->isnull = nprm->isnull;
+		outparam->value = nprm->value;
+		outparam->paramid = nprm->paramid;
+
+		curptr = curptr + sizeof(SerializedParamExecData);
+
+		if (nprm->isnull)
+			continue;
+
+		get_typlenbyval(nprm->ptype, &typLen, &typByVal);
+
+		if (!typByVal)
+		{
+			s = palloc(nprm->length + 1);
+			memcpy(s, curptr, nprm->length + 1);
+			outparam->value = CStringGetDatum(s);
+
+			curptr += nprm->length + 1;
+		}
+
+		lparamexecvals = lappend(lparamexecvals, outparam);
+	}
+
+	return lparamexecvals;
+}
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index f5a40fb..e3357cc 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -29,6 +29,7 @@
 #include <math.h>
 
 #include "nodes/parsenodes.h"
+#include "nodes/plannodes.h"
 #include "nodes/readfuncs.h"
 
 
@@ -1391,6 +1392,124 @@ _readRangeTblFunction(void)
 	READ_DONE();
 }
 
+/*
+ * _readPlanInvalItem
+ */
+static PlanInvalItem *
+_readPlanInvalItem(void)
+{
+	READ_LOCALS(PlanInvalItem);
+
+	READ_INT_FIELD(cacheId);
+	READ_UINT_FIELD(hashValue);
+
+	READ_DONE();
+}
+
+/*
+ * _readPlannedStmt
+ */
+static PlannedStmt *
+_readPlannedStmt(void)
+{
+	READ_LOCALS(PlannedStmt);
+
+	READ_ENUM_FIELD(commandType, CmdType);
+	READ_UINT_FIELD(queryId);
+	READ_BOOL_FIELD(hasReturning);
+	READ_BOOL_FIELD(hasModifyingCTE);
+	READ_BOOL_FIELD(canSetTag);
+	READ_BOOL_FIELD(transientPlan);
+	READ_NODE_FIELD(planTree);
+	READ_NODE_FIELD(rtable);
+	READ_NODE_FIELD(resultRelations);
+	READ_NODE_FIELD(utilityStmt);
+	READ_NODE_FIELD(subplans);
+	READ_BITMAPSET_FIELD(rewindPlanIDs);
+	READ_NODE_FIELD(rowMarks);
+	READ_NODE_FIELD(relationOids);
+	READ_NODE_FIELD(invalItems);
+	READ_INT_FIELD(nParamExec);
+	READ_BOOL_FIELD(hasRowSecurity);
+	READ_BOOL_FIELD(parallelModeNeeded);
+
+	READ_DONE();
+}
+
+/*
+ * _readPlan
+ */
+static Plan *
+_readPlan(void)
+{
+	READ_LOCALS(Plan);
+
+	READ_FLOAT_FIELD(startup_cost);
+	READ_FLOAT_FIELD(total_cost);
+	READ_FLOAT_FIELD(plan_rows);
+	READ_INT_FIELD(plan_width);
+	READ_NODE_FIELD(targetlist);
+	READ_NODE_FIELD(qual);
+	READ_NODE_FIELD(lefttree);
+	READ_NODE_FIELD(righttree);
+	READ_NODE_FIELD(initPlan);
+	READ_BITMAPSET_FIELD(extParam);
+	READ_BITMAPSET_FIELD(allParam);
+
+	READ_DONE();
+}
+
+/*
+ * _readScan
+ */
+static Scan *
+_readScan(void)
+{
+	Plan *local_plan;
+	READ_LOCALS(PartialSeqScan);
+
+	local_plan = _readPlan();
+	local_node->plan.startup_cost = local_plan->startup_cost;
+	local_node->plan.total_cost = local_plan->total_cost;
+	local_node->plan.plan_rows = local_plan->plan_rows;
+	local_node->plan.plan_width = local_plan->plan_width;
+	local_node->plan.targetlist = local_plan->targetlist;
+	local_node->plan.qual = local_plan->qual;
+	local_node->plan.lefttree = local_plan->lefttree;
+	local_node->plan.righttree = local_plan->righttree;
+	local_node->plan.initPlan = local_plan->initPlan;
+	local_node->plan.extParam = local_plan->extParam;
+	local_node->plan.allParam = local_plan->allParam;
+	READ_UINT_FIELD(scanrelid);
+
+	READ_DONE();
+}
+
+/*
+ * _readResult
+ */
+static Result *
+_readResult(void)
+{
+	Plan *local_plan;
+	READ_LOCALS(Result);
+
+	local_plan = _readPlan();
+	local_node->plan.startup_cost = local_plan->startup_cost;
+	local_node->plan.total_cost = local_plan->total_cost;
+	local_node->plan.plan_rows = local_plan->plan_rows;
+	local_node->plan.plan_width = local_plan->plan_width;
+	local_node->plan.targetlist = local_plan->targetlist;
+	local_node->plan.qual = local_plan->qual;
+	local_node->plan.lefttree = local_plan->lefttree;
+	local_node->plan.righttree = local_plan->righttree;
+	local_node->plan.initPlan = local_plan->initPlan;
+	local_node->plan.extParam = local_plan->extParam;
+	local_node->plan.allParam = local_plan->allParam;
+	READ_NODE_FIELD(resconstantqual);
+
+	READ_DONE();
+}
 
 /*
  * parseNodeString
@@ -1532,6 +1651,14 @@ parseNodeString(void)
 		return_value = _readNotifyStmt();
 	else if (MATCH("DECLARECURSOR", 13))
 		return_value = _readDeclareCursorStmt();
+	else if (MATCH("PLANINVALITEM", 13))
+		return_value = _readPlanInvalItem();
+	else if (MATCH("PLANNEDSTMT", 11))
+		return_value = _readPlannedStmt();
+	else if (MATCH("PARTIALSEQSCAN", 14))
+		return_value = _readScan();
+	else if (MATCH("RESULT", 6))
+		return_value = _readResult();
 	else
 	{
 		elog(ERROR, "badly formatted node string \"%.32s\"...", token);
diff --git a/src/backend/optimizer/path/Makefile b/src/backend/optimizer/path/Makefile
index 6864a62..6e462b1 100644
--- a/src/backend/optimizer/path/Makefile
+++ b/src/backend/optimizer/path/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = allpaths.o clausesel.o costsize.o equivclass.o indxpath.o \
-       joinpath.o joinrels.o pathkeys.o tidpath.o
+       joinpath.o joinrels.o pathkeys.o parallelpath.o tidpath.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 0b83189..fa6bbdf 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -471,6 +471,9 @@ set_plain_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
 	/* Consider sequential scan */
 	add_path(rel, create_seqscan_path(root, rel, required_outer));
 
+	/* Consider parallel scans */
+	create_parallelscan_paths(root, rel);
+
 	/* Consider index scans */
 	create_index_paths(root, rel);
 
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 0d302f6..8b6b46c 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -11,6 +11,8 @@
  *	cpu_tuple_cost		Cost of typical CPU time to process a tuple
  *	cpu_index_tuple_cost  Cost of typical CPU time to process an index tuple
  *	cpu_operator_cost	Cost of CPU time to execute an operator or function
+ *	cpu_tuple_comm_cost	Cost of CPU time to pass a tuple from worker to master backend
+ *	parallel_setup_cost	Cost of setting up shared memory for parallelism
  *
  * We expect that the kernel will typically do some amount of read-ahead
  * optimization; this in conjunction with seek costs means that seq_page_cost
@@ -101,11 +103,15 @@ double		random_page_cost = DEFAULT_RANDOM_PAGE_COST;
 double		cpu_tuple_cost = DEFAULT_CPU_TUPLE_COST;
 double		cpu_index_tuple_cost = DEFAULT_CPU_INDEX_TUPLE_COST;
 double		cpu_operator_cost = DEFAULT_CPU_OPERATOR_COST;
+double		cpu_tuple_comm_cost = DEFAULT_CPU_TUPLE_COMM_COST;
+double		parallel_setup_cost = DEFAULT_PARALLEL_SETUP_COST;
 
 int			effective_cache_size = DEFAULT_EFFECTIVE_CACHE_SIZE;
 
 Cost		disable_cost = 1.0e10;
 
+int	parallel_seqscan_degree = 0;
+
 bool		enable_seqscan = true;
 bool		enable_indexscan = true;
 bool		enable_indexonlyscan = true;
@@ -287,6 +293,86 @@ cost_samplescan(Path *path, PlannerInfo *root, RelOptInfo *baserel)
 }
 
 /*
+ * cost_patialseqscan
+ *	  Determines and returns the cost of scanning a relation partially.
+ *
+ * 'baserel' is the relation to be scanned
+ * 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
+ * 'nworkers' is the number of workers among which the work will be
+ *			distributed
+ */
+void
+cost_patialseqscan(Path *path, PlannerInfo *root,
+				   RelOptInfo *baserel, ParamPathInfo *param_info,
+				   int nworkers)
+{
+	Cost		startup_cost = 0;
+	Cost		run_cost = 0;
+
+	cost_seqscan(path, root, baserel, param_info);
+
+	startup_cost = path->startup_cost;
+
+	run_cost = path->total_cost - startup_cost;
+
+	/*
+	 * Account for the small communication cost that the scan incurs
+	 * via the ParallelHeapScanDesc.
+	 */
+	run_cost += 0.01;
+
+	/*
+	 * The run cost is shared equally by all workers.
+	 * The assumption here is that the disk access cost is also
+	 * shared equally between workers, which is generally true
+	 * unless there are too many workers working on a relatively
+	 * small number of blocks.  If we come across any such case,
+	 * then we can think of changing the current cost model for
+	 * partial sequential scan.
+	 */
+	run_cost = run_cost / (nworkers + 1);
+
+	path->startup_cost = startup_cost;
+	path->total_cost = startup_cost + run_cost;
+}
+
+/*
+ * cost_funnel
+ *	  Determines and returns the cost of a funnel path.
+ *
+ * 'baserel' is the relation to be scanned
+ * 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
+ */
+void
+cost_funnel(FunnelPath *path, PlannerInfo *root,
+			RelOptInfo *baserel, ParamPathInfo *param_info)
+{
+	Cost		startup_cost = 0;
+	Cost		run_cost = 0;
+
+	/* Should only be applied to base relations */
+	Assert(baserel->relid > 0);
+	Assert(baserel->rtekind == RTE_RELATION);
+
+	/* Mark the path with the correct row estimate */
+	if (param_info)
+		path->path.rows = param_info->ppi_rows;
+	else
+		path->path.rows = baserel->rows;
+
+	startup_cost = path->subpath->startup_cost;
+
+	run_cost = path->subpath->total_cost - path->subpath->startup_cost;
+
+	/* Parallel setup and communication cost. */
+	startup_cost += parallel_setup_cost;
+	run_cost += cpu_tuple_comm_cost * baserel->tuples;
+
+	path->path.startup_cost = startup_cost;
+	path->path.total_cost = (startup_cost + run_cost);
+}
+
+/*
  * cost_index
  *	  Determines and returns the cost of scanning a relation using an index.
  *
diff --git a/src/backend/optimizer/path/parallelpath.c b/src/backend/optimizer/path/parallelpath.c
new file mode 100644
index 0000000..bc71737
--- /dev/null
+++ b/src/backend/optimizer/path/parallelpath.c
@@ -0,0 +1,89 @@
+/*-------------------------------------------------------------------------
+ *
+ * parallelpath.c
+ *	  Routines to determine which conditions are usable for scanning
+ *	  a given relation, and create ParallelPaths accordingly.
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/optimizer/path/parallelpath.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heapam.h"
+#include "optimizer/cost.h"
+#include "optimizer/pathnode.h"
+#include "optimizer/paths.h"
+#include "parser/parsetree.h"
+#include "utils/rel.h"
+
+
+/*
+ * create_parallelscan_paths
+ *	  Create paths corresponding to parallel scans of the given rel.
+ *	  Currently we only support parallel sequential scan.
+ *
+ *	  Candidate paths are added to the rel's pathlist (using add_path).
+ */
+void
+create_parallelscan_paths(PlannerInfo *root, RelOptInfo *rel)
+{
+	int			num_parallel_workers = 0;
+	int			estimated_parallel_workers = 0;
+	Oid			reloid;
+	Relation	relation;
+	Path		*subpath;
+
+	/*
+	 * A parallel scan is possible only if the user has set
+	 * parallel_seqscan_degree to a value greater than 0
+	 * and the query is parallel-safe.
+	 */
+	if (parallel_seqscan_degree <= 0 || !root->glob->parallelModeOK)
+		return;
+
+	/*
+	 * There should be at least a thousand pages to scan for each worker.
+	 * This number is somewhat arbitrary; however, we don't want to
+	 * spawn workers to scan smaller relations, as that would be costly.
+	 */
+	estimated_parallel_workers = rel->pages / 1000;
+	
+	if (estimated_parallel_workers <= 0)
+		return;
+
+	reloid = planner_rt_fetch(rel->relid, root)->relid;
+
+	relation = heap_open(reloid, NoLock);
+
+	/*
+	 * Temporary relations can't be scanned by parallel workers as
+	 * they are visible only to local sessions.
+	 */
+	if (RelationUsesLocalBuffers(relation))
+	{
+		heap_close(relation, NoLock);
+		return;
+	}
+
+	heap_close(relation, NoLock);
+
+	if (parallel_seqscan_degree <= estimated_parallel_workers)
+		num_parallel_workers = parallel_seqscan_degree;
+	else
+		num_parallel_workers = estimated_parallel_workers;
+
+	/* Create the partial scan path which each worker needs to execute. */
+	subpath = create_partialseqscan_path(root, rel, NULL,
+										 num_parallel_workers);
+
+	/* Create the parallel scan path which master needs to execute. */
+	add_path(rel, (Path *) create_funnel_path(root, rel, subpath,
+											  num_parallel_workers));
+}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index dc2dcbf..1e2b3ba 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -60,6 +60,10 @@ static SeqScan *create_seqscan_plan(PlannerInfo *root, Path *best_path,
 					List *tlist, List *scan_clauses);
 static SampleScan *create_samplescan_plan(PlannerInfo *root, Path *best_path,
 					   List *tlist, List *scan_clauses);
+static Scan *create_partialseqscan_plan(PlannerInfo *root, Path *best_path,
+						List *tlist, List *scan_clauses);
+static Scan *create_funnel_plan(PlannerInfo *root,
+						FunnelPath *best_path);
 static Scan *create_indexscan_plan(PlannerInfo *root, IndexPath *best_path,
 					  List *tlist, List *scan_clauses, bool indexonly);
 static BitmapHeapScan *create_bitmap_scan_plan(PlannerInfo *root,
@@ -103,6 +107,11 @@ static void copy_path_costsize(Plan *dest, Path *src);
 static void copy_plan_costsize(Plan *dest, Plan *src);
 static SeqScan *make_seqscan(List *qptlist, List *qpqual, Index scanrelid);
 static SampleScan *make_samplescan(List *qptlist, List *qpqual, Index scanrelid);
+static PartialSeqScan *make_partialseqscan(List *qptlist, List *qpqual,
+									Index scanrelid);
+static Funnel *make_funnel(List *qptlist, List *qpqual,
+					Index scanrelid, int nworkers,
+					Plan *subplan);
 static IndexScan *make_indexscan(List *qptlist, List *qpqual, Index scanrelid,
 			   Oid indexid, List *indexqual, List *indexqualorig,
 			   List *indexorderby, List *indexorderbyorig,
@@ -233,6 +242,7 @@ create_plan_recurse(PlannerInfo *root, Path *best_path)
 	{
 		case T_SeqScan:
 		case T_SampleScan:
+		case T_PartialSeqScan:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -272,6 +282,10 @@ create_plan_recurse(PlannerInfo *root, Path *best_path)
 			plan = create_unique_plan(root,
 									  (UniquePath *) best_path);
 			break;
+		case T_Funnel:
+			plan = (Plan *) create_funnel_plan(root,
+											   (FunnelPath *) best_path);
+			break;
 		default:
 			elog(ERROR, "unrecognized node type: %d",
 				 (int) best_path->pathtype);
@@ -355,6 +369,13 @@ create_scan_plan(PlannerInfo *root, Path *best_path)
 												   scan_clauses);
 			break;
 
+		case T_PartialSeqScan:
+			plan = (Plan *) create_partialseqscan_plan(root,
+													   best_path,
+													   tlist,
+													   scan_clauses);
+			break;
+
 		case T_IndexScan:
 			plan = (Plan *) create_indexscan_plan(root,
 												  (IndexPath *) best_path,
@@ -559,6 +580,8 @@ disuse_physical_tlist(PlannerInfo *root, Plan *plan, Path *path)
 	{
 		case T_SeqScan:
 		case T_SampleScan:
+		case T_PartialSeqScan:
+		case T_Funnel:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -1186,6 +1209,107 @@ create_samplescan_plan(PlannerInfo *root, Path *best_path,
 }
 
 /*
+ * create_partialseqscan_plan
+ *
+ * Returns a partial seqscan plan for the base relation scanned by
+ * 'best_path' with restriction clauses 'scan_clauses' and targetlist
+ * 'tlist'.
+ */
+static Scan *
+create_partialseqscan_plan(PlannerInfo *root, Path *best_path,
+						   List *tlist, List *scan_clauses)
+{
+	Scan    *scan_plan;
+	Index		scan_relid = best_path->parent->relid;
+
+	/* it should be a base rel... */
+	Assert(scan_relid > 0);
+	Assert(best_path->parent->rtekind == RTE_RELATION);
+
+	/* Sort clauses into best execution order */
+	scan_clauses = order_qual_clauses(root, scan_clauses);
+
+	/* Reduce RestrictInfo list to bare expressions; ignore pseudoconstants */
+	scan_clauses = extract_actual_clauses(scan_clauses, false);
+
+	/* Replace any outer-relation variables with nestloop params */
+	if (best_path->param_info)
+	{
+		scan_clauses = (List *)
+			replace_nestloop_params(root, (Node *) scan_clauses);
+	}
+
+	scan_plan = (Scan *) make_partialseqscan(tlist,
+											 scan_clauses,
+											 scan_relid);
+
+	copy_path_costsize(&scan_plan->plan, best_path);
+
+	return scan_plan;
+}
+
+/*
+ * create_funnel_plan
+ *
+ * Returns a funnel plan for the base relation scanned by
+ * 'best_path'.
+ */
+static Scan *
+create_funnel_plan(PlannerInfo *root, FunnelPath *best_path)
+{
+	Scan    *scan_plan;
+	Plan	*subplan;
+	List	*tlist;
+	RelOptInfo *rel = best_path->path.parent;
+	Index	scan_relid = best_path->path.parent->relid;
+
+	/*
+	 * For table scans, rather than using the relation targetlist (which is
+	 * only those Vars actually needed by the query), we prefer to generate a
+	 * tlist containing all Vars in order.  This will allow the executor to
+	 * optimize away projection of the table tuples, if possible.  (Note that
+	 * planner.c may replace the tlist we generate here, forcing projection to
+	 * occur.)
+	 */
+	if (use_physical_tlist(root, rel))
+	{
+		tlist = build_physical_tlist(root, rel);
+		/* if fail because of dropped cols, use regular method */
+		if (tlist == NIL)
+			tlist = build_path_tlist(root, &best_path->path);
+	}
+	else
+	{
+		tlist = build_path_tlist(root, &best_path->path);
+	}
+
+	/* it should be a base rel... */
+	Assert(scan_relid > 0);
+	Assert(best_path->path.parent->rtekind == RTE_RELATION);
+
+	subplan = create_plan_recurse(root, best_path->subpath);
+
+	/*
+	 * The quals for the subplan and the top-level plan are the
+	 * same, as either all the quals are pushed to the subplan
+	 * (partialseqscan plan) or the parallel plan won't be
+	 * chosen.
+	 */
+	scan_plan = (Scan *) make_funnel(tlist,
+									 subplan->qual,
+									 scan_relid,
+									 best_path->num_workers,
+									 subplan);
+
+	copy_path_costsize(&scan_plan->plan, &best_path->path);
+
+	/* use parallel mode for parallel plans. */
+	root->glob->parallelModeNeeded = true;
+
+	return scan_plan;
+}
+
+/*
  * create_indexscan_plan
  *	  Returns an indexscan plan for the base relation scanned by 'best_path'
  *	  with restriction clauses 'scan_clauses' and targetlist 'tlist'.
@@ -3452,6 +3576,45 @@ make_samplescan(List *qptlist,
 	return node;
 }
 
+static PartialSeqScan *
+make_partialseqscan(List *qptlist,
+					List *qpqual,
+					Index scanrelid)
+{
+	PartialSeqScan *node = makeNode(PartialSeqScan);
+	Plan	   *plan = &node->plan;
+
+	/* cost should be inserted by caller */
+	plan->targetlist = qptlist;
+	plan->qual = qpqual;
+	plan->lefttree = NULL;
+	plan->righttree = NULL;
+	node->scanrelid = scanrelid;
+
+	return node;
+}
+
+static Funnel *
+make_funnel(List *qptlist,
+			List *qpqual,
+			Index scanrelid,
+			int nworkers,
+			Plan *subplan)
+{
+	Funnel *node = makeNode(Funnel);
+	Plan	   *plan = &node->scan.plan;
+
+	/* cost should be inserted by caller */
+	plan->targetlist = qptlist;
+	plan->qual = qpqual;
+	plan->lefttree = subplan;
+	plan->righttree = NULL;
+	node->scan.scanrelid = scanrelid;
+	node->num_workers = nworkers;
+
+	return node;
+}
+
 static IndexScan *
 make_indexscan(List *qptlist,
 			   List *qpqual,
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 7065e39..641b05f 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -293,6 +293,52 @@ standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
 	return result;
 }
 
+PlannedStmt	*
+create_parallel_worker_plannedstmt(PartialSeqScan *partialscan,
+								   List *rangetable,
+								   int num_exec_params)
+{
+	PlannedStmt	*result;
+	ListCell   *tlist;
+
+	/*
+	 * Avoid removing junk entries in worker as those are
+	 * required by upper nodes in master backend.
+	 */
+	foreach(tlist, partialscan->plan.targetlist)
+	{
+		TargetEntry *tle = (TargetEntry *) lfirst(tlist);
+
+		tle->resjunk = false;
+	}
+
+	/* build the PlannedStmt result */
+	result = makeNode(PlannedStmt);
+
+	result->commandType = CMD_SELECT;
+	result->queryId = 0;
+	result->hasReturning = 0;
+	result->hasModifyingCTE = 0;
+	result->canSetTag = 1;
+	result->transientPlan = 0;
+	result->planTree = (Plan*) partialscan;
+	result->rtable = rangetable;
+	result->resultRelations = NIL;
+	result->utilityStmt = NULL;
+	result->subplans = NIL;
+	result->rewindPlanIDs = NULL;
+	result->rowMarks = NIL;
+	result->nParamExec = num_exec_params;
+	/*
+	 * Don't bother to set parameters used for invalidation as
+	 * worker backend plans are not saved, so can't be invalidated.
+	 */
+	result->relationOids = NIL;
+	result->invalItems = NIL;
+	result->hasRowSecurity = false;
+
+	return result;
+}
 
 /*--------------------
  * subquery_planner
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 820f69d..90f4dfb 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -440,6 +440,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
 			{
 				SeqScan    *splan = (SeqScan *) plan;
 
@@ -461,6 +462,26 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 					fix_scan_list(root, splan->plan.qual, rtoffset);
 			}
 			break;
+		case T_Funnel:
+			{
+				Funnel    *splan = (Funnel *) plan;
+
+				/*
+				 * The target list for the partial sequential scan (lefttree of the
+				 * funnel plan) should be the same as for the funnel scan, as both
+				 * nodes need to produce the same projection.  We don't want to do
+				 * this assignment after fixing references, as that will be done
+				 * separately for the partial sequential scan node.
+				 */
+				splan->scan.plan.lefttree->targetlist = splan->scan.plan.targetlist;
+
+				splan->scan.scanrelid += rtoffset;
+				splan->scan.plan.targetlist =
+					fix_scan_list(root, splan->scan.plan.targetlist, rtoffset);
+				splan->scan.plan.qual =
+					fix_scan_list(root, splan->scan.plan.qual, rtoffset);
+			}
+			break;
 		case T_IndexScan:
 			{
 				IndexScan  *splan = (IndexScan *) plan;
@@ -2259,6 +2280,45 @@ fix_opfuncids_walker(Node *node, void *context)
 }
 
 /*
+ * fix_node_funcids
+ *		Set the opfuncid (procedure OID) in the OpExpr nodes
+ *		of a plan tree.
+ *
+ * We need this mainly to fix the opfuncid in the nodes of the plan
+ * tree after the worker backend reads the planned statement.
+ * Currently only a limited set of nodes can be executed by a
+ * worker backend, so we can enhance this API based on its
+ * usage in future.
+ */
+void
+fix_node_funcids(Plan *node)
+{
+	/*
+	 * do nothing when we get to the end of a leaf on tree.
+	 */
+	if (node == NULL)
+		return;
+
+	fix_opfuncids((Node*) node->qual);
+	fix_opfuncids((Node*) node->targetlist);
+
+	switch (nodeTag(node))
+	{
+		case T_Result:
+			fix_opfuncids((Node*) (((Result *)node)->resconstantqual));
+			break;
+		case T_PartialSeqScan:
+			break;
+		default:
+			elog(ERROR, "unrecognized node type: %d", (int) nodeTag(node));
+			break;
+	}
+
+	fix_node_funcids(node->lefttree);
+	fix_node_funcids(node->righttree);
+}
+
+/*
  * set_opfuncid
  *		Set the opfuncid (procedure OID) in an OpExpr node,
  *		if it hasn't been set already.
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 4708b87..92e9326 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2217,6 +2217,8 @@ finalize_plan(PlannerInfo *root, Plan *plan, Bitmapset *valid_params,
 
 		case T_SeqScan:
 		case T_SampleScan:
+		case T_PartialSeqScan:
+		case T_Funnel:
 			context.paramids = bms_add_members(context.paramids, scan_params);
 			break;
 
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index f7f33bb..81ca06c 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -732,6 +732,54 @@ create_samplescan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer
 }
 
 /*
+ * create_partialseqscan_path
+ *	  Creates a path corresponding to a partial sequential scan, returning the
+ *	  pathnode.
+ */
+Path *
+create_partialseqscan_path(PlannerInfo *root, RelOptInfo *rel,
+						   Relids required_outer, int nworkers)
+{
+	Path	   *pathnode = makeNode(Path);
+
+	pathnode->pathtype = T_PartialSeqScan;
+	pathnode->parent = rel;
+	pathnode->param_info = get_baserel_parampathinfo(root, rel,
+													 false);
+	pathnode->pathkeys = NIL;	/* partialseqscan has unordered result */
+
+	cost_patialseqscan(pathnode, root, rel, pathnode->param_info, nworkers);
+
+	return pathnode;
+}
+
+/*
+ * create_funnel_path
+ *
+ *	  Creates a path corresponding to a funnel scan, returning the
+ *	  pathnode.
+ */
+FunnelPath *
+create_funnel_path(PlannerInfo *root, RelOptInfo *rel,
+				   Path* subpath, int nworkers)
+{
+	FunnelPath	   *pathnode = makeNode(FunnelPath);
+
+	pathnode->path.pathtype = T_Funnel;
+	pathnode->path.parent = rel;
+	pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
+													 false);
+	pathnode->path.pathkeys = NIL;	/* seqscan has unordered result */
+
+	pathnode->subpath = subpath;
+	pathnode->num_workers = nworkers;
+
+	cost_funnel(pathnode, root, rel, pathnode->path.param_info);
+
+	return pathnode;
+}
+
+/*
  * create_index_path
  *	  Creates a path node for an index scan.
  *
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 71c2321..4aec92a 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -13,6 +13,7 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = autovacuum.o bgworker.o bgwriter.o checkpointer.o fork_process.o \
-	pgarch.o pgstat.o postmaster.o startup.o syslogger.o walwriter.o
+	pgarch.o pgstat.o postmaster.o startup.o syslogger.o \
+	walwriter.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index df8037b..8729bef 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -103,6 +103,7 @@
 #include "miscadmin.h"
 #include "pg_getopt.h"
 #include "pgstat.h"
+#include "optimizer/cost.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/bgworker_internals.h"
 #include "postmaster/fork_process.h"
@@ -848,6 +849,12 @@ PostmasterMain(int argc, char *argv[])
 		ereport(ERROR,
 				(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"archive\", \"hot_standby\", or \"logical\"")));
 
+	if (parallel_seqscan_degree >= MaxConnections)
+	{
+		write_stderr("%s: parallel_seqscan_degree must be less than max_connections\n", progname);
+		ExitPostmaster(1);
+	}
+
 	/*
 	 * Other one-time internal sanity checks can go here, if they are fast.
 	 * (Put any slow processing further down, after postmaster.pid creation.)
diff --git a/src/backend/storage/ipc/shm_mq.c b/src/backend/storage/ipc/shm_mq.c
index 126cb07..e482269 100644
--- a/src/backend/storage/ipc/shm_mq.c
+++ b/src/backend/storage/ipc/shm_mq.c
@@ -746,6 +746,15 @@ shm_mq_detach(shm_mq *mq)
 }
 
 /*
+ * Get the shm_mq from handle.
+ */
+shm_mq *
+shm_mq_get_queue(shm_mq_handle *mqh)
+{
+	return mqh->mqh_queue;
+}
+
+/*
  * Write bytes into a shared message queue.
  */
 static shm_mq_result
diff --git a/src/backend/tcop/dest.c b/src/backend/tcop/dest.c
index bcf3895..57014ee 100644
--- a/src/backend/tcop/dest.c
+++ b/src/backend/tcop/dest.c
@@ -34,6 +34,7 @@
 #include "commands/createas.h"
 #include "commands/matview.h"
 #include "executor/functions.h"
+#include "executor/tqueue.h"
 #include "executor/tstoreReceiver.h"
 #include "libpq/libpq.h"
 #include "libpq/pqformat.h"
@@ -44,9 +45,10 @@
  *		dummy DestReceiver functions
  * ----------------
  */
-static void
+static bool
 donothingReceive(TupleTableSlot *slot, DestReceiver *self)
 {
+	return true;
 }
 
 static void
@@ -129,6 +131,9 @@ CreateDestReceiver(CommandDest dest)
 
 		case DestTransientRel:
 			return CreateTransientRelDestReceiver(InvalidOid);
+
+		case DestTupleQueue:
+			return CreateTupleQueueDestReceiver();
 	}
 
 	/* should never get here */
@@ -162,6 +167,7 @@ EndCommand(const char *commandTag, CommandDest dest)
 		case DestCopyOut:
 		case DestSQLFunction:
 		case DestTransientRel:
+		case DestTupleQueue:
 			break;
 	}
 }
@@ -204,6 +210,7 @@ NullCommand(CommandDest dest)
 		case DestCopyOut:
 		case DestSQLFunction:
 		case DestTransientRel:
+		case DestTupleQueue:
 			break;
 	}
 }
@@ -248,6 +255,7 @@ ReadyForQuery(CommandDest dest)
 		case DestCopyOut:
 		case DestSQLFunction:
 		case DestTransientRel:
+		case DestTupleQueue:
 			break;
 	}
 }
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 7598318..f1542a0 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -42,6 +42,8 @@
 #include "catalog/pg_type.h"
 #include "commands/async.h"
 #include "commands/prepare.h"
+#include "executor/execParallel.h"
+#include "executor/tqueue.h"
 #include "libpq/libpq.h"
 #include "libpq/pqformat.h"
 #include "libpq/pqsignal.h"
@@ -1192,6 +1194,94 @@ exec_simple_query(const char *query_string)
 }
 
 /*
+ * exec_parallel_stmt
+ *
+ * Execute the plan for backend worker.
+ */
+void
+exec_parallel_stmt(ParallelStmt *parallelstmt)
+{
+	DestReceiver *receiver;
+	QueryDesc	*queryDesc;
+	MemoryContext oldcontext;
+	MemoryContext	plancontext;
+	BufferUsage bufusage_start;
+	BufferUsage bufusage_end = {0};
+
+	set_ps_display("SELECT", false);
+
+	/*
+	 * Unlike exec_simple_query(), in backend worker we won't allow
+	 * transaction control statements, so we can allow plancontext
+	 * to be created in TopTransaction context.
+	 */
+	plancontext = AllocSetContextCreate(CurrentMemoryContext,
+										"worker plan",
+										ALLOCSET_DEFAULT_MINSIZE,
+										ALLOCSET_DEFAULT_INITSIZE,
+										ALLOCSET_DEFAULT_MAXSIZE);
+
+	oldcontext = MemoryContextSwitchTo(plancontext);
+
+	receiver = CreateDestReceiver(DestTupleQueue);
+	SetTupleQueueDestReceiverParams(receiver, parallelstmt->responseq);
+
+	/* Create a QueryDesc for the query */
+	queryDesc = CreateQueryDesc(parallelstmt->plannedstmt, "",
+								GetActiveSnapshot(), InvalidSnapshot,
+								receiver, parallelstmt->params,
+								parallelstmt->inst_options);
+
+	PushActiveSnapshot(queryDesc->snapshot);
+
+	/* call ExecutorStart to prepare the plan for execution */
+	ExecutorStart(queryDesc, 0);
+
+	PopulateParamExecParams(queryDesc, parallelstmt->serialized_param_exec_vals);
+
+	bufusage_start = pgBufferUsage;
+
+	/* run the plan */
+	ExecutorRun(queryDesc, ForwardScanDirection, 0L);
+
+	/*
+	 * Calculate the buffer usage for this statement run; it is required
+	 * by plugins like pg_stat_statements to report the total usage for
+	 * statement execution.
+	 */
+	BufferUsageAccumDiff(&bufusage_end,
+						 &pgBufferUsage, &bufusage_start);
+
+	/* run cleanup too */
+	ExecutorFinish(queryDesc);
+
+	/* copy buffer usage into shared memory. */
+	memcpy(parallelstmt->buffer_usage,
+		   &bufusage_end,
+		   sizeof(BufferUsage));
+
+	/*
+	 * copy instrumentation information into shared memory if requested
+	 * by master backend.
+	 */
+	if (parallelstmt->inst_options)
+		memcpy(parallelstmt->instrument,
+			   queryDesc->planstate->instrument,
+			   sizeof(Instrumentation));
+
+	ExecutorEnd(queryDesc);
+
+	PopActiveSnapshot();
+
+	FreeQueryDesc(queryDesc);
+
+	if (!parallelstmt->inst_options)
+		(*receiver->rDestroy) (receiver);
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
+/*
  * exec_parse_message
  *
  * Execute a "Parse" protocol message.
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 9c14e8a..f2fb638 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1121,7 +1121,13 @@ RunFromStore(Portal portal, ScanDirection direction, long count,
 			if (!ok)
 				break;
 
-			(*dest->receiveSlot) (slot, dest);
+			/*
+			 * If we are not able to send the tuple, then we assume that
+			 * destination has closed and we won't be able to send any more
+			 * tuples so we just end the loop.
+			 */
+			if (!((*dest->receiveSlot) (slot, dest)))
+				break;
 
 			ExecClearTuple(slot);
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 595a609..c4e9531 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -607,6 +607,8 @@ const char *const config_group_names[] =
 	gettext_noop("Statistics / Query and Index Statistics Collector"),
 	/* AUTOVACUUM */
 	gettext_noop("Autovacuum"),
+	/* PARALLEL_QUERY */
+	gettext_noop("parallel_seqscan_degree"),
 	/* CLIENT_CONN */
 	gettext_noop("Client Connection Defaults"),
 	/* CLIENT_CONN_STATEMENT */
@@ -2545,6 +2547,16 @@ static struct config_int ConfigureNamesInt[] =
 	},
 
 	{
+		{"parallel_seqscan_degree", PGC_SUSET, PARALLEL_QUERY,
+			gettext_noop("Sets the maximum number of simultaneously running backend worker processes."),
+			NULL
+		},
+		&parallel_seqscan_degree,
+		0, 0, MAX_BACKENDS,
+		NULL, NULL, NULL
+	},
+
+	{
 		{"autovacuum_work_mem", PGC_SIGHUP, RESOURCES_MEM,
 			gettext_noop("Sets the maximum memory to be used by each autovacuum worker process."),
 			NULL,
@@ -2732,6 +2744,26 @@ static struct config_real ConfigureNamesReal[] =
 		DEFAULT_CPU_OPERATOR_COST, 0, DBL_MAX,
 		NULL, NULL, NULL
 	},
+	{
+		{"cpu_tuple_comm_cost", PGC_USERSET, QUERY_TUNING_COST,
+			gettext_noop("Sets the planner's estimate of the cost of "
+						 "passing each tuple (row) from worker to master backend."),
+			NULL
+		},
+		&cpu_tuple_comm_cost,
+		DEFAULT_CPU_TUPLE_COMM_COST, 0, DBL_MAX,
+		NULL, NULL, NULL
+	},
+	{
+		{"parallel_setup_cost", PGC_USERSET, QUERY_TUNING_COST,
+			gettext_noop("Sets the planner's estimate of the cost of "
+						 "setting up environment (shared memory) for parallelism."),
+			NULL
+		},
+		&parallel_setup_cost,
+		DEFAULT_PARALLEL_SETUP_COST, 0, DBL_MAX,
+		NULL, NULL, NULL
+	},
 
 	{
 		{"cursor_tuple_fraction", PGC_USERSET, QUERY_TUNING_OTHER,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 06dfc06..32ff938 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -291,6 +291,8 @@
 #cpu_tuple_cost = 0.01			# same scale as above
 #cpu_index_tuple_cost = 0.005		# same scale as above
 #cpu_operator_cost = 0.0025		# same scale as above
+#cpu_tuple_comm_cost = 0.1		# same scale as above
+#parallel_setup_cost = 0.0	# same scale as above
 #effective_cache_size = 4GB
 
 # - Genetic Query Optimizer -
@@ -501,6 +503,11 @@
 					# autovacuum, -1 means use
 					# vacuum_cost_limit
 
+#------------------------------------------------------------------------------
+# PARALLEL_QUERY PARAMETERS
+#------------------------------------------------------------------------------
+
+#parallel_seqscan_degree = 0		# max number of worker backend subprocesses
 
 #------------------------------------------------------------------------------
 # CLIENT CONNECTION DEFAULTS
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 31139cb..d56e839 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -96,8 +96,9 @@ extern Relation heap_openrv_extended(const RangeVar *relation,
 
 #define heap_close(r,l)  relation_close(r,l)
 
-/* struct definition appears in relscan.h */
+/* struct definitions appear in relscan.h */
 typedef struct HeapScanDescData *HeapScanDesc;
+typedef struct ParallelHeapScanDescData *ParallelHeapScanDesc;
 
 /*
  * HeapScanIsValid
@@ -121,9 +122,15 @@ extern void heap_setscanlimits(HeapScanDesc scan, BlockNumber startBlk,
 				   BlockNumber endBlk);
 extern void heapgetpage(HeapScanDesc scan, BlockNumber page);
 extern void heap_rescan(HeapScanDesc scan, ScanKey key);
+extern void heap_parallel_rescan(ParallelHeapScanDesc pscan, HeapScanDesc scan);
 extern void heap_endscan(HeapScanDesc scan);
 extern HeapTuple heap_getnext(HeapScanDesc scan, ScanDirection direction);
 
+extern Size heap_parallelscan_estimate(Snapshot snapshot);
+extern void heap_parallelscan_initialize(ParallelHeapScanDesc target,
+							 Relation relation, Snapshot snapshot);
+extern HeapScanDesc heap_beginscan_parallel(Relation, ParallelHeapScanDesc);
+
 extern bool heap_fetch(Relation relation, Snapshot snapshot,
 		   HeapTuple tuple, Buffer *userbuf, bool keep_buf,
 		   Relation stats_relation);
diff --git a/src/include/access/printtup.h b/src/include/access/printtup.h
index 46c4148..92ec882 100644
--- a/src/include/access/printtup.h
+++ b/src/include/access/printtup.h
@@ -25,11 +25,11 @@ extern void SendRowDescriptionMessage(TupleDesc typeinfo, List *targetlist,
 
 extern void debugStartup(DestReceiver *self, int operation,
 			 TupleDesc typeinfo);
-extern void debugtup(TupleTableSlot *slot, DestReceiver *self);
+extern bool debugtup(TupleTableSlot *slot, DestReceiver *self);
 
 /* XXX these are really in executor/spi.c */
 extern void spi_dest_startup(DestReceiver *self, int operation,
 				 TupleDesc typeinfo);
-extern void spi_printtup(TupleTableSlot *slot, DestReceiver *self);
+extern bool spi_printtup(TupleTableSlot *slot, DestReceiver *self);
 
 #endif   /* PRINTTUP_H */
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index f2482e9..90af7e1 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -20,6 +20,15 @@
 #include "access/itup.h"
 #include "access/tupdesc.h"
 
+/* Struct for parallel scan setup */
+typedef struct ParallelHeapScanDescData
+{
+	Oid			phs_relid;
+	BlockNumber	phs_nblocks;
+	slock_t		phs_mutex;
+	BlockNumber phs_cblock;
+	char		phs_snapshot_data[FLEXIBLE_ARRAY_MEMBER];
+}	ParallelHeapScanDescData;
 
 typedef struct HeapScanDescData
 {
@@ -49,6 +58,7 @@ typedef struct HeapScanDescData
 	BlockNumber rs_cblock;		/* current block # in scan, if any */
 	Buffer		rs_cbuf;		/* current buffer in scan, if any */
 	/* NB: if rs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
+	ParallelHeapScanDesc rs_parallel; /* parallel scan information */
 
 	/* these fields only used in page-at-a-time mode and for bitmap scans */
 	int			rs_cindex;		/* current tuple's index in vistuples */
diff --git a/src/include/executor/execParallel.h b/src/include/executor/execParallel.h
new file mode 100644
index 0000000..73006a8
--- /dev/null
+++ b/src/include/executor/execParallel.h
@@ -0,0 +1,65 @@
+/*--------------------------------------------------------------------
+ * execParallel.h
+ *		POSTGRES backend workers interface
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/executor/execParallel.h
+ *--------------------------------------------------------------------
+ */
+#ifndef EXECPARALLEL_H
+#define EXECPARALLEL_H
+
+/*---------------------------------------------------------------------
+ * External module API.
+ *---------------------------------------------------------------------
+ */
+
+#include "libpq/pqmq.h"
+#include "nodes/execnodes.h"
+#include "nodes/parsenodes.h"
+#include "nodes/plannodes.h"
+
+/* Table-of-contents constants for our dynamic shared memory segment. */
+#define	PARALLEL_KEY_PLANNEDSTMT	0
+#define	PARALLEL_KEY_PARAMS			1
+#define	PARALLEL_KEY_PARAMS_EXEC	2
+#define PARALLEL_KEY_BUFF_USAGE		3
+#define PARALLEL_KEY_INST_OPTIONS	4
+#define PARALLEL_KEY_INST_INFO		5
+#define PARALLEL_KEY_TUPLE_QUEUE	6
+#define PARALLEL_KEY_SCAN			7
+
+extern int	parallel_seqscan_degree;
+
+/* worker statement required for parallel execution. */
+typedef struct ParallelStmt
+{
+	PlannedStmt		*plannedstmt;
+	ParamListInfo	params;
+	List			*serialized_param_exec_vals;
+	shm_mq_handle	*responseq;
+	int				inst_options;
+	char			*instrument;
+	char			*buffer_usage;
+} ParallelStmt;
+
+extern void InitializeParallelWorkers(PlanState *planstate,
+									  List *serialized_param_exec_vals,
+									  EState *estate,
+									  char **inst_options_space,
+									  char **buffer_usage_space,
+									  shm_mq_handle ***responseqp,
+									  ParallelContext **pcxtp,
+									  int nWorkers);
+extern shm_toc *GetParallelShmToc(void);
+extern bool ExecParallelEstimate(Node *node, ParallelContext *pcxt,
+								 Size *pscan_size);
+extern bool ExecParallelInitializeDSM(Node *node, ParallelContext *pcxt,
+									  Size *pscan_size);
+extern bool ExecParallelBufferUsageAccum(Node *node);
+extern void ExecAssociateBufferStatsToDSM(BufferUsage *buf_usage,
+							  ParallelStmt *parallel_stmt);
+#endif   /* EXECPARALLEL_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 193a654..963e656 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -273,6 +273,8 @@ extern TupleDesc ExecCleanTypeFromTL(List *targetList, bool hasoid);
 extern TupleDesc ExecTypeFromExprList(List *exprList);
 extern void ExecTypeSetColNames(TupleDesc typeInfo, List *namesList);
 extern void UpdateChangedParamSet(PlanState *node, Bitmapset *newchg);
+extern void PopulateParamExecParams(QueryDesc *queryDesc,
+						List *serialized_param_exec_vals);
 
 typedef struct TupOutputState
 {
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index c9a2129..0c7847d 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -69,5 +69,12 @@ extern Instrumentation *InstrAlloc(int n, int instrument_options);
 extern void InstrStartNode(Instrumentation *instr);
 extern void InstrStopNode(Instrumentation *instr, double nTuples);
 extern void InstrEndLoop(Instrumentation *instr);
+extern void InstrAggNode(Instrumentation *instr1, Instrumentation *instr2);
+extern void
+	InstrAggBufferUsage(BufferUsage *buffer_usage_dst, BufferUsage *buffer_usage_add);
+extern void BufferUsageAccumDiff(BufferUsage *dst,
+					 const BufferUsage *add,
+					 const BufferUsage *sub);
+extern void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
 
 #endif   /* INSTRUMENT_H */
diff --git a/src/include/executor/nodeFunnel.h b/src/include/executor/nodeFunnel.h
new file mode 100644
index 0000000..27d0b3d
--- /dev/null
+++ b/src/include/executor/nodeFunnel.h
@@ -0,0 +1,25 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeFunnel.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeFunnel.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEFUNNEL_H
+#define NODEFUNNEL_H
+
+#include "nodes/execnodes.h"
+
+extern FunnelState *ExecInitFunnel(Funnel *node, EState *estate, int eflags);
+extern TupleTableSlot *ExecFunnel(FunnelState *node);
+extern void ExecEndFunnel(FunnelState *node);
+extern void FinishParallelSetupAndAccumStats(FunnelState *node);
+extern void ExecReScanFunnel(FunnelState *node);
+
+#endif   /* NODEFUNNEL_H */
diff --git a/src/include/executor/nodePartialSeqscan.h b/src/include/executor/nodePartialSeqscan.h
new file mode 100644
index 0000000..47b8f73
--- /dev/null
+++ b/src/include/executor/nodePartialSeqscan.h
@@ -0,0 +1,25 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodePartialSeqscan.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodePartialSeqscan.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEPARTIALSEQSCAN_H
+#define NODEPARTIALSEQSCAN_H
+
+#include "nodes/execnodes.h"
+
+extern PartialSeqScanState *ExecInitPartialSeqScan(PartialSeqScan *node,
+											EState *estate, int eflags);
+extern TupleTableSlot *ExecPartialSeqScan(PartialSeqScanState *node);
+extern void ExecEndPartialSeqScan(PartialSeqScanState *node);
+extern void ExecReScanPartialSeqScan(PartialSeqScanState *node);
+
+#endif   /* NODEPARTIALSEQSCAN_H */
diff --git a/src/include/executor/nodeSubplan.h b/src/include/executor/nodeSubplan.h
index 3732ad4..21c745e 100644
--- a/src/include/executor/nodeSubplan.h
+++ b/src/include/executor/nodeSubplan.h
@@ -24,4 +24,7 @@ extern void ExecReScanSetParamPlan(SubPlanState *node, PlanState *parent);
 
 extern void ExecSetParamPlan(SubPlanState *node, ExprContext *econtext);
 
+extern List *
+ExecAndFormSerializeParamExec(ExprContext *econtext, Bitmapset *params);
+
 #endif   /* NODESUBPLAN_H */
diff --git a/src/include/executor/tqueue.h b/src/include/executor/tqueue.h
new file mode 100644
index 0000000..d2ddb6e
--- /dev/null
+++ b/src/include/executor/tqueue.h
@@ -0,0 +1,35 @@
+/*-------------------------------------------------------------------------
+ *
+ * tqueue.h
+ *	  Use shm_mq to send & receive tuples between parallel backends
+ *
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/tqueue.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef TQUEUE_H
+#define TQUEUE_H
+
+#include "storage/shm_mq.h"
+#include "tcop/dest.h"
+
+/* Use this to send tuples to a shm_mq. */
+extern DestReceiver *CreateTupleQueueDestReceiver(void);
+extern void SetTupleQueueDestReceiverParams(DestReceiver *self,
+						shm_mq_handle *handle);
+
+/* Use these to receive tuples from a shm_mq. */
+typedef struct TupleQueueFunnel TupleQueueFunnel;
+extern TupleQueueFunnel *CreateTupleQueueFunnel(void);
+extern void TupleQueueFunnelShutdown(TupleQueueFunnel *funnel);
+extern void DestroyTupleQueueFunnel(TupleQueueFunnel *funnel);
+extern void RegisterTupleQueueOnFunnel(TupleQueueFunnel *, shm_mq_handle *);
+extern HeapTuple TupleQueueFunnelNext(TupleQueueFunnel *, bool nowait,
+					 bool *done);
+
+#endif   /* TQUEUE_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 541ee18..cc1174e 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -16,7 +16,9 @@
 
 #include "access/genam.h"
 #include "access/heapam.h"
+#include "access/parallel.h"
 #include "executor/instrument.h"
+#include "executor/tqueue.h"
 #include "lib/pairingheap.h"
 #include "nodes/params.h"
 #include "nodes/plannodes.h"
@@ -401,6 +403,18 @@ typedef struct EState
 	List	   *es_auxmodifytables;		/* List of secondary ModifyTableStates */
 
 	/*
+	 * This is required for parallel plan execution to fetch the
+	 * information from dsm.
+	 */
+	shm_toc		*toc;
+
+	/*
+	 * This is required to collect buffer usage stats from parallel
+	 * workers when requested by plugins.
+	 */
+	bool		total_time;	/* total time spent in ExecutorRun */
+
+	/*
 	 * this ExprContext is for per-output-tuple operations, such as constraint
 	 * checks and index-value computations.  It will be reset for each output
 	 * tuple.  Note that it will be created only if needed.
@@ -1050,6 +1064,11 @@ typedef struct PlanState
 	 * State for management of parameter-change-driven rescanning
 	 */
 	Bitmapset  *chgParam;		/* set of IDs of changed Params */
+	/*
+	 * This is required for parallel plan execution to fetch the
+	 * information from dsm.
+	 */
+	shm_toc			*toc;
 
 	/*
 	 * Other run-time state needed by most if not all node types.
@@ -1267,6 +1286,45 @@ typedef struct SampleScanState
 } SampleScanState;
 
 /*
+ * PartialSeqScanState extends ScanState by storing additional information
+ * related to scan.
+ */
+typedef struct PartialSeqScanState
+{
+	ScanState		ss;				/* its first field is NodeTag */
+	bool			scan_initialized; /* used to determine if the scan is initialized */
+} PartialSeqScanState;
+
+/*
+ * FunnelState extends ScanState by storing additional information
+ * related to parallel workers.
+ *		pcxt				parallel context for managing generic state information
+ *							required for parallelism.
+ *		responseq			shared memory queues to receive data from workers.
+ *		funnel				maintains the runtime information about queues used to
+ *							receive data from parallel workers.
+ *		inst_options_space	to accumulate instrumentation information from all
+ *							parallel workers.
+ *		buffer_usage_space	to accumulate buffer usage information from all
+ *							parallel workers.
+ *		fs_workersReady		indicates that workers are launched.
+ *		all_workers_done	indicates that all the data from workers has been received.
+ *		local_scan_done		indicates that local scan is completed.
+ */
+typedef struct FunnelState
+{
+	ScanState		ss;				/* its first field is NodeTag */
+	ParallelContext *pcxt;
+	shm_mq_handle	**responseq;
+	TupleQueueFunnel *funnel;
+	char			*inst_options_space;
+	char			*buffer_usage_space;
+	bool			fs_workersReady;
+	bool			all_workers_done;
+	bool			local_scan_done;
+} FunnelState;
+
+/*
  * These structs store information about index quals that don't have simple
  * constant right-hand sides.  See comments for ExecIndexBuildScanKeys()
  * for discussion.
diff --git a/src/include/nodes/nodeFuncs.h b/src/include/nodes/nodeFuncs.h
index 7b1b1d6..df00d3d 100644
--- a/src/include/nodes/nodeFuncs.h
+++ b/src/include/nodes/nodeFuncs.h
@@ -13,6 +13,7 @@
 #ifndef NODEFUNCS_H
 #define NODEFUNCS_H
 
+#include "access/parallel.h"
 #include "nodes/parsenodes.h"
 
 
@@ -63,4 +64,7 @@ extern Node *query_or_expression_tree_mutator(Node *node, Node *(*mutator) (),
 extern bool raw_expression_tree_walker(Node *node, bool (*walker) (),
 												   void *context);
 
+extern bool planstate_tree_walker(Node *node, ParallelContext *pcxt,
+					  bool (*walker) (), void *context);
+
 #endif   /* NODEFUNCS_H */
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 290cdb3..322b5e8 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -51,6 +51,8 @@ typedef enum NodeTag
 	T_BitmapOr,
 	T_Scan,
 	T_SeqScan,
+	T_PartialSeqScan,
+	T_Funnel,
 	T_IndexScan,
 	T_IndexOnlyScan,
 	T_BitmapIndexScan,
@@ -99,6 +101,8 @@ typedef enum NodeTag
 	T_ScanState,
 	T_SeqScanState,
 	T_SampleScanState,
+	T_PartialSeqScanState,
+	T_FunnelState,
 	T_IndexScanState,
 	T_IndexOnlyScanState,
 	T_BitmapIndexScanState,
@@ -223,6 +227,7 @@ typedef enum NodeTag
 	T_IndexOptInfo,
 	T_ParamPathInfo,
 	T_Path,
+	T_FunnelPath,
 	T_IndexPath,
 	T_BitmapHeapPath,
 	T_BitmapAndPath,
diff --git a/src/include/nodes/params.h b/src/include/nodes/params.h
index a0f7dd0..21c6f7a 100644
--- a/src/include/nodes/params.h
+++ b/src/include/nodes/params.h
@@ -14,6 +14,8 @@
 #ifndef PARAMS_H
 #define PARAMS_H
 
+#include "nodes/pg_list.h"
+
 /* To avoid including a pile of parser headers, reference ParseState thus: */
 struct ParseState;
 
@@ -96,11 +98,47 @@ typedef struct ParamExecData
 {
 	void	   *execPlan;		/* should be "SubPlanState *" */
 	Datum		value;
+	/*
+	 * parameter's datatype, or 0.  This is required so that
+	 * datum value can be read and used for other purposes like
+	 * passing it to worker backend via shared memory.  This is
+	 * required only for evaluation of initPlans; however, for
+	 * consistency we set this for SubPlan as well.  We leave it
+	 * unset for other cases such as CTE or RecursiveUnion, where this
+	 * structure is not used for evaluation of subplans.
+	 */
+	Oid			ptype;
 	bool		isnull;
 } ParamExecData;
 
+/*
+ * This structure is used to pass PARAM_EXEC parameters to backend
+ * workers.  For each PARAM_EXEC parameter, pass this structure
+ * followed by value except for pass-by-value parameters.
+ */
+typedef struct SerializedParamExecData
+{
+	int			paramid;			/* parameter id of this param */
+	Size		length;			/* length of parameter value */
+	Oid			ptype;			/* parameter's datatype, or 0 */
+	Datum		value;
+	bool		isnull;
+} SerializedParamExecData;
+
 
 /* Functions found in src/backend/nodes/params.c */
 extern ParamListInfo copyParamList(ParamListInfo from);
 
+extern Size
+EstimateBoundParametersSpace(ParamListInfo params);
+extern void
+SerializeBoundParams(ParamListInfo params, Size maxsize, char *start_address);
+extern ParamListInfo RestoreBoundParams(char *start_address);
+extern Size
+EstimateExecParametersSpace(List *serialized_param_exec_vals);
+extern void
+SerializeExecParams(List *serialized_param_exec_vals, Size maxsize,
+					char *start_address);
+List *
+RestoreExecParams(char *start_address);
 #endif   /* PARAMS_H */
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 761bdf4..5d705c1 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -292,6 +292,22 @@ typedef Scan SeqScan;
 typedef Scan SampleScan;
 
 /* ----------------
+ *		partial sequential scan node
+ * ----------------
+ */
+typedef SeqScan PartialSeqScan;
+
+/* ----------------
+ *		parallel sequential scan node
+ * ----------------
+ */
+typedef struct Funnel
+{
+	Scan		scan;
+	int			num_workers;
+} Funnel;
+
+/* ----------------
  *		index scan node
  *
  * indexqualorig is an implicitly-ANDed list of index qual expressions, each
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 53a8820..62b498b 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -754,6 +754,13 @@ typedef struct Path
 	/* pathkeys is a List of PathKey nodes; see above */
 } Path;
 
+typedef struct FunnelPath
+{
+	Path		path;
+	Path	    *subpath;	/* path for each worker */
+	int			num_workers;
+} FunnelPath;
+
 /* Macro for extracting a path's parameterization relids; beware double eval */
 #define PATH_REQ_OUTER(path)  \
 	((path)->param_info ? (path)->param_info->ppi_req_outer : (Relids) NULL)
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 24003ae..a1c9f59 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -26,6 +26,13 @@
 #define DEFAULT_CPU_TUPLE_COST	0.01
 #define DEFAULT_CPU_INDEX_TUPLE_COST 0.005
 #define DEFAULT_CPU_OPERATOR_COST  0.0025
+#define DEFAULT_CPU_TUPLE_COMM_COST 0.1
+/*
+ * XXX - We need some experiments to know what could be
+ * appropriate default values for parallel setup and startup
+ * cost.
+ */
+#define	DEFAULT_PARALLEL_SETUP_COST  0.0
 
 #define DEFAULT_EFFECTIVE_CACHE_SIZE  524288	/* measured in pages */
 
@@ -48,8 +55,11 @@ extern PGDLLIMPORT double random_page_cost;
 extern PGDLLIMPORT double cpu_tuple_cost;
 extern PGDLLIMPORT double cpu_index_tuple_cost;
 extern PGDLLIMPORT double cpu_operator_cost;
+extern PGDLLIMPORT double cpu_tuple_comm_cost;
+extern PGDLLIMPORT double parallel_setup_cost;
 extern PGDLLIMPORT int effective_cache_size;
 extern Cost disable_cost;
+extern int	parallel_seqscan_degree;
 extern bool enable_seqscan;
 extern bool enable_indexscan;
 extern bool enable_indexonlyscan;
@@ -69,6 +79,11 @@ extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,
 extern void cost_seqscan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 			 ParamPathInfo *param_info);
 extern void cost_samplescan(Path *path, PlannerInfo *root, RelOptInfo *baserel);
+extern void cost_patialseqscan(Path *path, PlannerInfo *root,
+						RelOptInfo *baserel, ParamPathInfo *param_info,
+						int nworkers);
+extern void cost_funnel(FunnelPath *path, PlannerInfo *root,
+					RelOptInfo *baserel, ParamPathInfo *param_info);
 extern void cost_index(IndexPath *path, PlannerInfo *root,
 		   double loop_count);
 extern void cost_bitmap_heap_scan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 161644c..6047fec 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -34,6 +34,10 @@ extern Path *create_seqscan_path(PlannerInfo *root, RelOptInfo *rel,
 					Relids required_outer);
 extern Path *create_samplescan_path(PlannerInfo *root, RelOptInfo *rel,
 					   Relids required_outer);
+extern Path *create_partialseqscan_path(PlannerInfo *root, RelOptInfo *rel,
+						Relids required_outer, int nworkers);
+extern FunnelPath *create_funnel_path(PlannerInfo *root,
+						RelOptInfo *rel, Path *subpath, int nworkers);
 extern IndexPath *create_index_path(PlannerInfo *root,
 				  IndexOptInfo *index,
 				  List *indexclauses,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 3e2378a..bd8eb67 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -55,6 +55,13 @@ extern void debug_print_rel(PlannerInfo *root, RelOptInfo *rel);
 #endif
 
 /*
+ * parallelpath.c
+ *	  routines to generate parallel scan paths
+ */
+
+extern void create_parallelscan_paths(PlannerInfo *root, RelOptInfo *rel);
+
+/*
  * indxpath.c
  *	  routines to generate index paths
  */
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index 52b077a..67a8582 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -133,6 +133,7 @@ extern bool query_is_distinct_for(Query *query, List *colnos, List *opids);
  */
 extern Plan *set_plan_references(PlannerInfo *root, Plan *plan);
 extern void fix_opfuncids(Node *node);
+extern void fix_node_funcids(Plan *node);
 extern void set_opfuncid(OpExpr *opexpr);
 extern void set_sa_opfuncid(ScalarArrayOpExpr *opexpr);
 extern void record_plan_function_dependency(PlannerInfo *root, Oid funcid);
diff --git a/src/include/optimizer/planner.h b/src/include/optimizer/planner.h
index b10a504..8c7ce75 100644
--- a/src/include/optimizer/planner.h
+++ b/src/include/optimizer/planner.h
@@ -14,6 +14,7 @@
 #ifndef PLANNER_H
 #define PLANNER_H
 
+#include "nodes/parsenodes.h"
 #include "nodes/plannodes.h"
 #include "nodes/relation.h"
 
@@ -29,6 +30,8 @@ extern PlannedStmt *planner(Query *parse, int cursorOptions,
 		ParamListInfo boundParams);
 extern PlannedStmt *standard_planner(Query *parse, int cursorOptions,
 				 ParamListInfo boundParams);
+extern PlannedStmt	*create_parallel_worker_plannedstmt(PartialSeqScan *partialscan,
+											List *rangetable, int num_exec_params);
 
 extern Plan *subquery_planner(PlannerGlobal *glob, Query *parse,
 				 PlannerInfo *parent_root,
diff --git a/src/include/storage/shm_mq.h b/src/include/storage/shm_mq.h
index 1a2ba04..7621a35 100644
--- a/src/include/storage/shm_mq.h
+++ b/src/include/storage/shm_mq.h
@@ -65,6 +65,9 @@ extern void shm_mq_set_handle(shm_mq_handle *, BackgroundWorkerHandle *);
 /* Break connection. */
 extern void shm_mq_detach(shm_mq *);
 
+/* Get the shm_mq from handle. */
+extern shm_mq *shm_mq_get_queue(shm_mq_handle *mqh);
+
 /* Send or receive messages. */
 extern shm_mq_result shm_mq_send(shm_mq_handle *mqh,
 			Size nbytes, const void *data, bool nowait);
diff --git a/src/include/tcop/dest.h b/src/include/tcop/dest.h
index 5bcca3f..91acd60 100644
--- a/src/include/tcop/dest.h
+++ b/src/include/tcop/dest.h
@@ -94,7 +94,8 @@ typedef enum
 	DestIntoRel,				/* results sent to relation (SELECT INTO) */
 	DestCopyOut,				/* results sent to COPY TO code */
 	DestSQLFunction,			/* results sent to SQL-language func mgr */
-	DestTransientRel			/* results sent to transient relation */
+	DestTransientRel,			/* results sent to transient relation */
+	DestTupleQueue				/* results sent to tuple queue */
 } CommandDest;
 
 /* ----------------
@@ -103,7 +104,9 @@ typedef enum
  *		pointers that the executor must call.
  *
  * Note: the receiveSlot routine must be passed a slot containing a TupleDesc
- * identical to the one given to the rStartup routine.
+ * identical to the one given to the rStartup routine.  It returns bool where
+ * a "true" value means "continue processing" and a "false" value means
+ * "stop early, just as if we'd reached the end of the scan".
  * ----------------
  */
 typedef struct _DestReceiver DestReceiver;
@@ -111,7 +114,7 @@ typedef struct _DestReceiver DestReceiver;
 struct _DestReceiver
 {
 	/* Called for each tuple to be output: */
-	void		(*receiveSlot) (TupleTableSlot *slot,
+	bool		(*receiveSlot) (TupleTableSlot *slot,
 											DestReceiver *self);
 	/* Per-executor-run initialization and shutdown: */
 	void		(*rStartup) (DestReceiver *self,
diff --git a/src/include/tcop/tcopprot.h b/src/include/tcop/tcopprot.h
index 96c5b8b..6f319c1 100644
--- a/src/include/tcop/tcopprot.h
+++ b/src/include/tcop/tcopprot.h
@@ -19,6 +19,7 @@
 #ifndef TCOPPROT_H
 #define TCOPPROT_H
 
+#include "executor/execParallel.h"
 #include "nodes/params.h"
 #include "nodes/parsenodes.h"
 #include "nodes/plannodes.h"
@@ -84,5 +85,6 @@ extern void set_debug_options(int debug_flag,
 extern bool set_plan_disabling_options(const char *arg,
 						   GucContext context, GucSource source);
 extern const char *get_stats_option_name(const char *arg);
+extern void exec_parallel_stmt(ParallelStmt *parallelscan);
 
 #endif   /* TCOPPROT_H */
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 7a58ddb..3505d31 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -85,6 +85,7 @@ enum config_group
 	STATS_MONITORING,
 	STATS_COLLECTOR,
 	AUTOVACUUM,
+	PARALLEL_QUERY,
 	CLIENT_CONN,
 	CLIENT_CONN_STATEMENT,
 	CLIENT_CONN_LOCALE,
#288Jeff Davis
pgsql@j-davis.com
In reply to: Amit Kapila (#287)
Re: Parallel Seq Scan

On Fri, 2015-07-03 at 17:35 +0530, Amit Kapila wrote:

Attached, find the rebased version of patch.

Comments:

* The heapam.c changes seem a little ad-hoc. Conceptually, which
portions should be affected by parallelism? How do we know we didn't
miss something?
* Why is initscan getting the number of blocks from the structure? Is it
just to avoid an extra syscall, or is there a correctness issue there?
Is initscan expecting that heap_parallelscan_initialize is always called
first (if parallel)? Please add a comment explaining above.
* What's the difference between scan->rs_nblocks and
scan->rs_parallel->phs_nblocks? Same for rs_rd->rd_id and phs_relid.
* It might be good to separate out some fields which differ between the
normal heap scan and the parallel heap scan. Perhaps put rs_ctup,
rs_cblock, and rs_cbuf into a separate structure, which is always NULL
during a parallel scan. That way we don't accidentally use a
non-parallel field when doing a parallel scan.
* Is there a reason that partial scans can't work with syncscan? It
looks like you're not choosing the starting block in the same way, so it
always starts at zero and never does syncscan. If we don't want to mix
syncscan and partial scan, that's fine, but it should be more explicit.

I'm trying to understand where tqueue.c fits in. It seems very closely
tied to the Funnel operator, because any change to the way Funnel works
would almost certainly require changes in tqueue.c. But "tqueue" is a
generic name for the file, so something seems off. Either we should
explicitly make it the supporting routines for the Funnel operator, or
we should try to generalize it a little.

I still have quite a bit to look at, but this is a start.

Regards,
Jeff Davis

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#289Amit Kapila
amit.kapila16@gmail.com
In reply to: Jeff Davis (#288)
Re: Parallel Seq Scan

On Mon, Jul 6, 2015 at 3:26 AM, Jeff Davis <pgsql@j-davis.com> wrote:

On Fri, 2015-07-03 at 17:35 +0530, Amit Kapila wrote:

Attached, find the rebased version of patch.

Comments:

* The heapam.c changes seem a little ad-hoc. Conceptually, which
portions should be affected by parallelism? How do we know we didn't
miss something?

The main reason for changing heapam.c is that we want multiple workers to
scan blocks in parallel, and heapam.c seems to be the best place to make
such a change. As of now, the changes are mainly required to identify the
next block for each worker to scan. So we can focus on that aspect and see
if anything is missing.

* Why is initscan getting the number of blocks from the structure? Is it
just to avoid an extra syscall, or is there a correctness issue there?

Yes, there is a correctness issue. All the parallel workers should see
the same scan information during the scan as is seen by the master backend.
The master backend fills this structure, and all workers then use it, so
that every participant works from the same values.

Is initscan expecting that heap_parallelscan_initialize is always called
first (if parallel)? Please add a comment explaining above.

okay.

* What's the difference between scan->rs_nblocks and
scan->rs_parallel->phs_nblocks?

scan->rs_parallel->phs_nblocks is initialized once in the master
backend and then propagated to all the worker backends, which use
that value to initialize scan->rs_nblocks (if the master backend
itself is involved in the scan, it uses the value in the same way).

Same for rs_rd->rd_id and phs_relid.

This is also similar to phs_nblocks. The basic idea is that the parallel
heap scan descriptor is formed in the master backend and contains all the
members that are required for performing the scan in the master as well
as in the worker backends. Once we initialize the parallel heap scan
descriptor, it is passed to all the worker backends and used by them
to scan the heap.

* It might be good to separate out some fields which differ between the
normal heap scan and the parallel heap scan. Perhaps put rs_ctup,
rs_cblock, and rs_cbuf into a separate structure, which is always NULL
during a parallel scan. That way we don't accidentally use a
non-parallel field when doing a parallel scan.

Another way to look at it is to separate out the fields that are
required for parallel scan, which the patch currently does by forming a
separate structure, ParallelHeapScanDescData.

* Is there a reason that partial scans can't work with syncscan? It
looks like you're not choosing the starting block in the same way, so it
always starts at zero and never does syncscan.

The reason why partial scan can't be mixed with sync scan is that in parallel
scan, it performs the scan of heap by synchronizing blocks (each parallel worker
scans a block and then asks for a next block to scan) among parallel workers.
Now if we try to make sync scans work along with it, the synchronization among
parallel workers will go for a toss. It might not be impossible to make that
work in some way, but not sure if it is important enough for sync scans to work
along with parallel scan.
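
The "scan a block, then ask for the next one" scheme described above can be sketched as a shared counter that every participant advances. This is only an illustration under stated assumptions: SharedScan, shared_scan_init and shared_scan_next_block are hypothetical names, and a C11 atomic stands in for whatever synchronization the patch really uses. The block count is fixed once by the master, which mirrors the role described for phs_nblocks.

```c
#include <stdatomic.h>
#include <stdint.h>

#define INVALID_BLOCK UINT32_MAX

/* Hypothetical stand-in for the shared parallel heap scan descriptor. */
typedef struct SharedScan
{
    uint32_t             nblocks; /* fixed once by the master before workers start */
    atomic_uint_fast64_t next;    /* next block number not yet handed out */
} SharedScan;

/* Called once by the master; workers only ever read nblocks afterwards. */
void shared_scan_init(SharedScan *scan, uint32_t nblocks)
{
    scan->nblocks = nblocks;
    atomic_init(&scan->next, 0);
}

/*
 * Claim the next unscanned block.  Each block number is returned to exactly
 * one caller; INVALID_BLOCK is returned once all blocks have been handed out.
 */
uint32_t shared_scan_next_block(SharedScan *scan)
{
    uint_fast64_t blk = atomic_fetch_add(&scan->next, 1);

    return blk < scan->nblocks ? (uint32_t) blk : INVALID_BLOCK;
}
```

Each worker (and the master, if it participates) would loop on shared_scan_next_block() until it returns INVALID_BLOCK. Because all participants read the same nblocks, none of them can disagree about where the scan ends, even if the relation grows while the scan is running.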

If we don't want to mix
syncscan and partial scan, that's fine, but it should be more explicit.

Makes sense to me. I think in initscan we should mark syncscan
as false for the parallel scan case.

I'm trying to understand where tqueue.c fits in. It seems very closely
tied to the Funnel operator, because any change to the way Funnel works
would almost certainly require changes in tqueue.c.

tqueue.c is mainly designed to pass tuples between parallel workers
and currently it is used in Funnel operator to gather the tuples generated
by all the parallel workers. I think we can use it for any other operator
which needs tuple communication among parallel workers.

But "tqueue" is a
generic name for the file, so something seems off. Either we should
explicitly make it the supporting routines for the Funnel operator, or
we should try to generalize it a little.

It has been designed to be a generic way of communicating tuples,
but let me know if you have any specific suggestions.

I still have quite a bit to look at, but this is a start.

Thanks for the review.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#290Jeff Davis
pgsql@j-davis.com
In reply to: Amit Kapila (#289)
Re: Parallel Seq Scan

On Mon, 2015-07-06 at 10:37 +0530, Amit Kapila wrote:

Or the other way to look at it could be separate out fields which are
required for parallel scan which is done currently by forming a
separate structure ParallelHeapScanDescData.

I was suggesting that you separate out both the normal scan fields and
the partial scan fields, that way we're sure that rs_nblocks is not
accessed during a parallel scan.

Or, you could try wrapping the parts of heapam.c that are affected by
parallelism into new static functions.

The reason why partial scan can't be mixed with sync scan is that in parallel
scan, it performs the scan of heap by synchronizing blocks (each parallel worker
scans a block and then asks for a next block to scan) among parallel workers.
Now if we try to make sync scans work along with it, the synchronization among
parallel workers will go for a toss. It might not be impossible to make that
work in some way, but not sure if it is important enough for sync scans to work
along with parallel scan.

I haven't tested it, but I think it would still be helpful. The block
accesses are still in order even during a partial scan, so why wouldn't
it help?

You might be concerned about the reporting of a block location, which
would become more noisy with increased parallelism. But in my original
testing, sync scans weren't very sensitive to slight deviations, because
of caching effects.

tqueue.c is mainly designed to pass tuples between parallel workers
and currently it is used in Funnel operator to gather the tuples
generated
by all the parallel workers. I think we can use it for any other
operator
which needs tuple communication among parallel workers.

Some specifics of the Funnel operator seem to be a part of tqueue, which
doesn't make sense to me. For instance, reading from the set of queues
in a round-robin fashion is part of the Funnel algorithm, and doesn't
seem suitable for a generic tuple communication mechanism (that would
never allow order-sensitive reading, for example).
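
The layering argued for here can be sketched as follows. It is a minimal illustration, not code from the patch: TupleQueue, queue_try_read, RoundRobinReader and round_robin_read are hypothetical names, and plain int arrays stand in for shared-memory queues. The generic layer only knows how to read one queue; the round-robin policy and its cursor live entirely in the consumer, so a different consumer could read the same queues in an order-sensitive way.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical generic queue: a fixed array consumed front to back. */
typedef struct TupleQueue
{
    const int *items;
    size_t     nitems;
    size_t     pos;
} TupleQueue;

/* Generic layer: non-blocking read from ONE queue, no scheduling policy. */
bool queue_try_read(TupleQueue *q, int *out)
{
    if (q->pos >= q->nitems)
        return false;
    *out = q->items[q->pos++];
    return true;
}

/* Consumer layer: the round-robin cursor belongs to the reader, not the queue. */
typedef struct RoundRobinReader
{
    TupleQueue *queues;
    size_t      nqueues;
    size_t      next;       /* index of the queue to try first on the next call */
} RoundRobinReader;

/*
 * Return the next item, visiting queues in rotation and skipping empty ones.
 * Returns false when every queue is empty (in this sketch, empty means done).
 */
bool round_robin_read(RoundRobinReader *r, int *out)
{
    for (size_t tried = 0; tried < r->nqueues; tried++)
    {
        TupleQueue *q = &r->queues[r->next];

        r->next = (r->next + 1) % r->nqueues;
        if (queue_try_read(q, out))
            return true;
    }
    return false;
}
```

A real queue would also have to distinguish "nothing available yet" from "sender finished", which this sketch deliberately ignores.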

Regards,
Jeff Davis


#291Haribabu Kommi
kommi.haribabu@gmail.com
In reply to: Amit Kapila (#287)
Re: Parallel Seq Scan

On Fri, Jul 3, 2015 at 10:05 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Attached, find the rebased version of patch.

Note - You need to first apply the assess-parallel-safety patch which you
can find at:
/messages/by-id/CAA4eK1JjsfE_dOsHTr_z1P_cBKi_X4C4X3d7Nv=VWX9fs7qdJA@mail.gmail.com

I ran some performance tests on a 16 core machine with large shared
buffers, so there is no IO involved.
With the default value of cpu_tuple_comm_cost, parallel plan is not
getting generated even if we are selecting 100K records from 40
million records. So I changed the value to '0' and collected the
performance readings.

Here are the performance numbers:

selectivity   Seq scan    Parallel scan (ms)
(millions)    (ms)        2 workers   4 workers   8 workers
0.1           11498.93     4821.40     3305.84     3291.90
0.4           10942.98     4967.46     3338.58     3374.00
0.8           11619.44     5189.61     3543.86     3534.40
1.5           12585.51     5718.07     4162.71     2994.90
2.7           14725.66     8346.96    10429.05     8049.11
5.4           18719.00    20212.33    21815.19    19026.99
7.2           21955.79    28570.74    28217.60    27042.27

The average table row size is around 500 bytes and the width of the
selected columns is around 36 bytes.
When the query selects more than 10% of the total table records,
the parallel scan performance drops.

Regards,
Hari Babu
Fujitsu Australia


#292Amit Kapila
amit.kapila16@gmail.com
In reply to: Jeff Davis (#290)
Re: Parallel Seq Scan

On Mon, Jul 6, 2015 at 10:54 PM, Jeff Davis <pgsql@j-davis.com> wrote:

On Mon, 2015-07-06 at 10:37 +0530, Amit Kapila wrote:

Or the other way to look at it could be separate out fields which are
required for parallel scan which is done currently by forming a
separate structure ParallelHeapScanDescData.

I was suggesting that you separate out both the normal scan fields and
the partial scan fields, that way we're sure that rs_nblocks is not
accessed during a parallel scan.

In the patch, rs_nblocks is used in partial scans as well; only the
way it is initialized has changed.

Or, you could try wrapping the parts of heapam.c that are affected by
parallelism into new static functions.

Sounds sensible to me, but I would like to hear from Robert before
making this change, in case he has a different opinion on this point, as
he originally wrote this part of the patch.

The reason why partial scan can't be mixed with sync scan is that in parallel
scan, it performs the scan of heap by synchronizing blocks (each parallel worker
scans a block and then asks for a next block to scan) among parallel workers.
Now if we try to make sync scans work along with it, the synchronization among
parallel workers will go for a toss. It might not be impossible to make that
work in some way, but not sure if it is important enough for sync scans to work
along with parallel scan.

I haven't tested it, but I think it would still be helpful. The block
accesses are still in order even during a partial scan, so why wouldn't
it help?

You might be concerned about the reporting of a block location, which
would become more noisy with increased parallelism. But in my original
testing, sync scans weren't very sensitive to slight deviations, because
of caching effects.

I am not sure how large a difference in blocks could be considered an
acceptable deviation.
In theory, making parallel scan perform sync scan could lead to a difference
of multiple blocks. Consider the case where 32 or more workers participate
in the scan and each gets one block to scan: it is possible that the
first worker scans the 1st block after the 32nd worker has scanned the
32nd block (and the difference could be even bigger).

tqueue.c is mainly designed to pass tuples between parallel workers
and currently it is used in Funnel operator to gather the tuples
generated
by all the parallel workers. I think we can use it for any other
operator
which needs tuple communication among parallel workers.

Some specifics of the Funnel operator seem to be a part of tqueue, which
doesn't make sense to me. For instance, reading from the set of queues
in a round-robin fashion is part of the Funnel algorithm, and doesn't
seem suitable for a generic tuple communication mechanism (that would
never allow order-sensitive reading, for example).

Okay, this makes sense to me, I think it is better to move Funnel
operator specific parts out of tqueue.c unless Robert or anybody else
feels otherwise.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#293Amit Kapila
amit.kapila16@gmail.com
In reply to: Haribabu Kommi (#291)
Re: Parallel Seq Scan

On Tue, Jul 7, 2015 at 6:19 AM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:

On Fri, Jul 3, 2015 at 10:05 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Attached, find the rebased version of patch.

Note - You need to first apply the assess-parallel-safety patch which you
can find at:
/messages/by-id/CAA4eK1JjsfE_dOsHTr_z1P_cBKi_X4C4X3d7Nv=VWX9fs7qdJA@mail.gmail.com

I ran some performance tests on a 16 core machine with large shared
buffers, so there is no IO involved.
With the default value of cpu_tuple_comm_cost, parallel plan is not
getting generated even if we are selecting 100K records from 40
million records. So I changed the value to '0' and collected the
performance readings.

More testing is still required to arrive at reasonable default values for
these parameters. I think that instead of 0, tests with 0.001 or 0.0025 as
the default of cpu_tuple_comm_cost and 100 or 1000 as the default of
parallel_setup_cost would have been more interesting.
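
To see how these two parameters interact, here is a deliberately simplified cost sketch. It is an assumption built from the description at the start of this thread (sequential cost divided by the number of workers) plus a fixed setup charge and a per-returned-tuple communication charge; it is not the patch's actual costing code, and parallel_scan_cost and parallel_is_cheaper are hypothetical names.

```c
#include <stdbool.h>

/*
 * Hypothetical simplified model, NOT the patch's costing code:
 *   setup charge + sequential cost split across workers
 *   + a communication charge for every tuple sent back to the master.
 * Precondition: nworkers > 0.
 */
double parallel_scan_cost(double seq_cost, int nworkers, double rows_returned,
                          double parallel_setup_cost,
                          double cpu_tuple_comm_cost)
{
    return parallel_setup_cost
        + seq_cost / nworkers
        + rows_returned * cpu_tuple_comm_cost;
}

/* True when the modelled parallel plan is cheaper than the plain seq scan. */
bool parallel_is_cheaper(double seq_cost, int nworkers, double rows_returned,
                         double parallel_setup_cost,
                         double cpu_tuple_comm_cost)
{
    return parallel_scan_cost(seq_cost, nworkers, rows_returned,
                              parallel_setup_cost, cpu_tuple_comm_cost)
        < seq_cost;
}
```

Under this model the communication term grows linearly with the number of rows returned, so a large cpu_tuple_comm_cost rules out the parallel plan even for selective queries, while a value of 0 removes the penalty entirely and lets the planner pick a parallel plan that may lose at high selectivity.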

Here are the performance numbers:

The average table row size is around 500 bytes and query selection
column width is around 36 bytes.
when the query selectivity goes more than 10% of total table records,
the parallel scan performance is dropping.

These are quite similar to what I have seen in my initial tests. I
think that if you add a complex condition to the filter, you will see gains
even at 25% or more selectivity (I added a factorial 10 calculation to the
filter to mimic a complex filter condition).

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#294Jeff Davis
pgsql@j-davis.com
In reply to: Amit Kapila (#292)
Re: Parallel Seq Scan

On Tue, 2015-07-07 at 09:27 +0530, Amit Kapila wrote:

I am not sure how many blocks difference could be considered okay for
deviation?

In my testing (a long time ago) deviations of tens of blocks didn't show
a problem.

However, an assumption of the sync scan work was that the CPU is
processing faster than the IO system; whereas the parallel scan patch
assumes that the IO system is faster than a single core. So perhaps the
features are incompatible after all. Only testing will say for sure.

Then again, syncscans are designed in such a way that they are unlikely
to hurt in any situation. Even if the scans diverge (or never converge
in the first place), it shouldn't be worse than starting at block zero
every time.

I'd prefer to leave syncscans intact for parallel scans unless you find
a reasonable situation where they perform worse. This shouldn't add any
complexity to the patch (if it does, let me know).
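
Keeping a synchronized start position in a parallel scan mostly comes down to handing out blocks from a non-zero start with wraparound. A minimal sketch of that mapping, with hypothetical names (nth_block_from_start, NO_MORE_BLOCKS) and assuming the start block is supplied by whatever mechanism reports the position of concurrent scans:

```c
#include <stdint.h>

#define NO_MORE_BLOCKS UINT32_MAX

/*
 * Map the k-th allocation (k = 0, 1, 2, ...) to a block number when the scan
 * starts at start_block instead of block zero and wraps around at the end of
 * the relation.  Every block is visited exactly once for k in [0, nblocks).
 */
uint32_t nth_block_from_start(uint32_t start_block, uint32_t nblocks,
                              uint64_t k)
{
    if (nblocks == 0 || k >= nblocks)
        return NO_MORE_BLOCKS;
    return (uint32_t) ((start_block + k) % nblocks);
}
```

With start_block equal to zero this degenerates to the plain 0 .. nblocks-1 order, which matches the observation that a sync scan that never converges should be no worse than starting at block zero.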

Regards,
Jeff Davis


#295Antonin Houska
ah@cybertec.at
In reply to: Amit Kapila (#287)
Re: Parallel Seq Scan

Amit Kapila <amit.kapila16@gmail.com> wrote:

Attached, find the rebased version of patch.

[I haven't read this thread so far, sorry for possibly redundant comment.]

I noticed that false is passed for the required_outer argument of
create_partialseqscan_path(), while NULL seems to be cleaner in terms of C
language.

But in terms of semantics, I'm not sure this is correct anyway. Why does
create_parallelscan_paths() not accept the actual rel->lateral_relids, just
like create_seqscan_path() does? (See set_plain_rel_pathlist().) If there's
reason for your approach, I think it's worth a comment.

BTW, emacs shows whitespace on otherwise empty line parallelpath.c:57.

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at


#296Amit Kapila
amit.kapila16@gmail.com
In reply to: Antonin Houska (#295)
Re: Parallel Seq Scan

On Wed, Jul 15, 2015 at 2:14 PM, Antonin Houska <ah@cybertec.at> wrote:

Amit Kapila <amit.kapila16@gmail.com> wrote:

Attached, find the rebased version of patch.

[I haven't read this thread so far, sorry for possibly redundant comment.]

I noticed that false is passed for the required_outer argument of
create_partialseqscan_path(), while NULL seems to be cleaner in terms of C
language.

But in terms of semantics, I'm not sure this is correct anyway. Why does
create_parallelscan_paths() not accept the actual rel->lateral_relids, just
like create_seqscan_path() does? (See set_plain_rel_pathlist().) If there's
reason for your approach, I think it's worth a comment.

Right, I think this is left over from the initial version, where parallel seq
scan was supported only for a single table scan. It should probably do the
same as create_seqscan_path() and then pass the result down to
create_partialseqscan_path() and get_baserel_parampathinfo().

Thanks, I will fix this in next version of patch.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#297Haribabu Kommi
kommi.haribabu@gmail.com
In reply to: Amit Kapila (#296)
Re: Parallel Seq Scan

On Thu, Jul 16, 2015 at 1:10 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Thanks, I will fix this in next version of patch.

I am posting in this thread as I am not sure whether this needs a
separate thread or not.

I went through the code and found that the newly added funnel node is
tightly coupled with partial seq scan. In order to add more parallel
plans alongside parallel seq scan, we need to remove the integration of
this node with partial seq scan.

To achieve the same, I have the following ideas.

Plan:
1) Add the funnel path immediately for every parallel path, similar to
the current parallel seq scan, but during plan generation generate the
funnel plan only for the top funnel path and ignore the remaining
funnel paths.

2) Instead of adding a funnel path immediately after the partial seq
scan path is generated, add the funnel path in grouping_planner once
the final rel path is generated, before creating the plan.

Execution:
The funnel execution varies based on the below plan node.
1) partial scan - Funnel does the local scan also and returns the tuples
2) partial agg - Funnel does the merging of aggregate results and
returns the final result.

Any other better ideas to achieve the same?

Regards,
Hari Babu
Fujitsu Australia


#298Amit Kapila
amit.kapila16@gmail.com
In reply to: Haribabu Kommi (#297)
Re: Parallel Seq Scan

On Fri, Jul 17, 2015 at 1:22 PM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:

On Thu, Jul 16, 2015 at 1:10 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Thanks, I will fix this in next version of patch.

I am posting in this thread as I am not sure whether this needs a
separate thread or not.

I went through the code and found that the newly added funnel node is
tightly coupled with partial seq scan. In order to add more parallel
plans alongside parallel seq scan, we need to remove the integration of
this node with partial seq scan.

This assumption is wrong. The Funnel node can execute any node beneath
it (refer to ExecFunnel->funnel_getnext->ExecProcNode; similarly, you
can see exec_parallel_stmt). Yes, the nodes currently supported under
the Funnel node are limited, such as partialseqscan and result. The reason,
as mentioned upthread, is that readfuncs.c doesn't have support for reading
Plan nodes, which a worker backend needs in order to read the PlannedStmt.
Of course we can add that, but as we are supporting parallelism for only a
limited set of nodes, I have not enhanced readfuncs.c. In general, though,
the basic infrastructure is designed in such a way that it can support
other nodes beneath it.

To achieve the same, I have the following ideas.

Execution:
The funnel execution varies based on the below plan node.
1) partial scan - Funnel does the local scan also and returns the tuples
2) partial agg - Funnel does the merging of aggregate results and
returns the final result.

Basically Funnel will execute any node beneath it, the Funnel node itself
is not responsible for doing local scan or any form of consolidation of
results, as of now, it has these 3 basic properties
– Has one child, runs multiple copies in parallel.
– Combines the results into a single tuple stream.
– Can run the child itself if no workers available.

Any other better ideas to achieve the same?

Refer slides 16-19 in Parallel Sequential Scan presentation in PGCon
https://www.pgcon.org/2015/schedule/events/785.en.html

I don't have very clear idea what is the best way to transform the nodes
in optimizer, but I think we can figure that out later unless majority people
see that as blocking factor.

Thanks for looking into patch!

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#299Haribabu Kommi
kommi.haribabu@gmail.com
In reply to: Amit Kapila (#298)
Re: Parallel Seq Scan

On Mon, Jul 20, 2015 at 3:31 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jul 17, 2015 at 1:22 PM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:

On Thu, Jul 16, 2015 at 1:10 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Thanks, I will fix this in next version of patch.

I am posting in this thread as I am not sure whether this needs a
separate thread or not.

I went through the code and found that the newly added funnel node is
tightly coupled with partial seq scan. In order to add more parallel
plans alongside parallel seq scan, we need to remove the integration of
this node with partial seq scan.

This assumption is wrong, Funnel node can execute any node beneath
it (Refer ExecFunnel->funnel_getnext->ExecProcNode, similarly you
can see exec_parallel_stmt).

Yes, the funnel node can execute any node beneath it. But during the planning
phase, the funnel path is added on top of the partial scan path. I just want
the same to be enhanced to support other parallel nodes.

Yes, the nodes currently supported under
the Funnel node are limited, such as partialseqscan and result. The reason,
as mentioned upthread, is that readfuncs.c doesn't have support for reading
Plan nodes, which a worker backend needs in order to read the PlannedStmt.
Of course we can add that, but as we are supporting parallelism for only a
limited set of nodes, I have not enhanced readfuncs.c. In general, though,
the basic infrastructure is designed in such a way that it can support
other nodes beneath it.

To achieve the same, I have the following ideas.

Execution:
The funnel execution varies based on the below plan node.
1) partial scan - Funnel does the local scan also and returns the tuples
2) partial agg - Funnel does the merging of aggregate results and
returns the final result.

Basically Funnel will execute any node beneath it, the Funnel node itself
is not responsible for doing local scan or any form of consolidation of
results, as of now, it has these 3 basic properties
– Has one child, runs multiple copies in parallel.
– Combines the results into a single tuple stream.
– Can run the child itself if no workers available.

+ if (!funnelstate->local_scan_done)
+ {
+ outerPlan = outerPlanState(funnelstate);
+
+ outerTupleSlot = ExecProcNode(outerPlan);

In the above code from the funnel_getnext function, the Funnel node directly
calls the node below it, so the scan is also done on the backend side. This
code should check the type of the node below, and only on that basis decide
whether to do the backend scan.

I feel that always executing the outer plan may not be correct for other
parallel nodes.

Any other better ideas to achieve the same?

Refer slides 16-19 in Parallel Sequential Scan presentation in PGCon
https://www.pgcon.org/2015/schedule/events/785.en.html

Thanks for the information.

I don't have very clear idea what is the best way to transform the nodes
in optimizer, but I think we can figure that out later unless majority people
see that as blocking factor.

I also don't see it as a blocking factor for parallel scan.
I wrote the above mail to get some feedback and suggestions from hackers on
how to proceed with adding other parallelism nodes along with parallel scan.

Regards,
Hari Babu
Fujitsu Australia


#300Robert Haas
robertmhaas@gmail.com
In reply to: Haribabu Kommi (#291)
Re: Parallel Seq Scan

On Mon, Jul 6, 2015 at 8:49 PM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:

I ran some performance tests on a 16 core machine with large shared
buffers, so there is no IO involved.
With the default value of cpu_tuple_comm_cost, parallel plan is not
getting generated even if we are selecting 100K records from 40
million records. So I changed the value to '0' and collected the
performance readings.

Here are the performance numbers:

selectivity   Seq scan    Parallel scan (ms)
(millions)    (ms)        2 workers   4 workers   8 workers
0.1           11498.93     4821.40     3305.84     3291.90
0.4           10942.98     4967.46     3338.58     3374.00
0.8           11619.44     5189.61     3543.86     3534.40
1.5           12585.51     5718.07     4162.71     2994.90
2.7           14725.66     8346.96    10429.05     8049.11
5.4           18719.00    20212.33    21815.19    19026.99
7.2           21955.79    28570.74    28217.60    27042.27

The average table row size is around 500 bytes and query selection
column width is around 36 bytes.
when the query selectivity goes more than 10% of total table records,
the parallel scan performance is dropping.

Thanks for doing this testing. I think that is quite valuable. I am
not too concerned about the fact that queries where more than 10% of
records are selected do not speed up. Obviously, it would be nice to
improve that, but I think that can be left as an area for future
improvement.

One thing I noticed that is a bit dismaying is that we don't get a lot
of benefit from having more workers. Look at the 0.1 data. At 2
workers, if we scaled perfectly, we would be 3x faster (since the
master can do work too), but we are actually 2.4x faster. Each
process is on the average 80% efficient. That's respectable. At 4
workers, we would be 5x faster with perfect scaling; here we are 3.5x
faster. So the third and fourth worker were about 50% efficient.
Hmm, not as good. But then going up to 8 workers bought us basically
nothing.
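
The arithmetic behind these figures can be reproduced with a small sketch. The function names (speedup, avg_efficiency, marginal_efficiency) are made up for illustration; the only inputs are the timings from the 0.1 row of the table and the assumption stated above that the master also does scan work, so N workers mean N + 1 participating processes.

```c
/* Speedup of a parallel run over the serial baseline. */
double speedup(double serial_ms, double parallel_ms)
{
    return serial_ms / parallel_ms;
}

/*
 * Average per-process efficiency, assuming the master scans too, so that
 * nworkers workers give nworkers + 1 participating processes.
 */
double avg_efficiency(double serial_ms, double parallel_ms, int nworkers)
{
    return speedup(serial_ms, parallel_ms) / (nworkers + 1);
}

/* Extra speedup gained per added worker between two runs. */
double marginal_efficiency(double serial_ms,
                           double ms_before, int workers_before,
                           double ms_after, int workers_after)
{
    return (speedup(serial_ms, ms_after) - speedup(serial_ms, ms_before))
        / (workers_after - workers_before);
}
```

For the 0.1 row, speedup(11498.93, 4821.40) is about 2.4 and avg_efficiency(..., 2) is about 0.8. Going from 2 to 4 workers, marginal_efficiency gives roughly half a process's worth of speedup per added worker, and going from 4 to 8 workers it is close to zero.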

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#301Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#300)
Re: Parallel Seq Scan

On Wed, Jul 22, 2015 at 9:14 PM, Robert Haas <robertmhaas@gmail.com> wrote:

One thing I noticed that is a bit dismaying is that we don't get a lot
of benefit from having more workers. Look at the 0.1 data. At 2
workers, if we scaled perfectly, we would be 3x faster (since the
master can do work too), but we are actually 2.4x faster. Each
process is on the average 80% efficient. That's respectable. At 4
workers, we would be 5x faster with perfect scaling; here we are 3.5x
faster. So the third and fourth worker were about 50% efficient.
Hmm, not as good. But then going up to 8 workers bought us basically
nothing.

I think the improvement also depends on how costly the qualification is.
If it is costly, then even for the same selectivity the gains will continue
up to a higher number of workers. For simple qualifications, the cost of
having more workers (processing data over multiple tuple queues) will start
to dominate the benefit we can achieve from them.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#302Kouhei Kaigai
kaigai@ak.jp.nec.com
In reply to: Amit Kapila (#301)
Re: Parallel Seq Scan

Hi Amit,

The latest v16 patch cannot be applied to the latest
master as is.
Commit 434873806a9b1c0edd53c2a9df7c93a8ba021147 changed various
lines in heapam.c, so it probably conflicts with this patch.

[kaigai@magro sepgsql]$ cat ~/patch/parallel_seqscan_v16.patch | patch -p1
patching file src/backend/access/common/printtup.c
patching file src/backend/access/heap/heapam.c
Hunk #4 succeeded at 499 (offset 10 lines).
Hunk #5 succeeded at 533 (offset 10 lines).
Hunk #6 FAILED at 678.
Hunk #7 succeeded at 790 (offset 10 lines).
Hunk #8 succeeded at 821 (offset 10 lines).
Hunk #9 FAILED at 955.
Hunk #10 succeeded at 1365 (offset 10 lines).
Hunk #11 succeeded at 1375 (offset 10 lines).
Hunk #12 succeeded at 1384 (offset 10 lines).
Hunk #13 succeeded at 1393 (offset 10 lines).
Hunk #14 succeeded at 1402 (offset 10 lines).
Hunk #15 succeeded at 1410 (offset 10 lines).
Hunk #16 succeeded at 1439 (offset 10 lines).
Hunk #17 succeeded at 1533 (offset 10 lines).
2 out of 17 hunks FAILED -- saving rejects to file src/backend/access/heap/heapam.c.rej
:

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Amit Kapila
Sent: Thursday, July 23, 2015 8:43 PM
To: Robert Haas
Cc: Haribabu Kommi; Gavin Flower; Jeff Davis; Andres Freund; Kaigai Kouhei(海外 浩平);
Amit Langote; Amit Langote; Fabrízio Mello; Thom Brown; Stephen Frost;
pgsql-hackers
Subject: Re: [HACKERS] Parallel Seq Scan

On Wed, Jul 22, 2015 at 9:14 PM, Robert Haas <robertmhaas@gmail.com> wrote:

One thing I noticed that is a bit dismaying is that we don't get a lot
of benefit from having more workers. Look at the 0.1 data. At 2
workers, if we scaled perfectly, we would be 3x faster (since the
master can do work too), but we are actually 2.4x faster. Each
process is on the average 80% efficient. That's respectable. At 4
workers, we would be 5x faster with perfect scaling; here we are 3.5x
faster. So the third and fourth worker were about 50% efficient.
Hmm, not as good. But then going up to 8 workers bought us basically
nothing.

I think the improvement also depends on how costly the qualification is.
If it is costly, then even for the same selectivity the gains will continue
up to a higher number of workers. For simple qualifications, the cost of
having more workers (processing data over multiple tuple queues) will start
to dominate the benefit we can achieve from them.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#303Haribabu Kommi
kommi.haribabu@gmail.com
In reply to: Amit Kapila (#301)
Re: Parallel Seq Scan

On Thu, Jul 23, 2015 at 9:42 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jul 22, 2015 at 9:14 PM, Robert Haas <robertmhaas@gmail.com> wrote:

One thing I noticed that is a bit dismaying is that we don't get a lot
of benefit from having more workers. Look at the 0.1 data. At 2
workers, if we scaled perfectly, we would be 3x faster (since the
master can do work too), but we are actually 2.4x faster. Each
process is on the average 80% efficient. That's respectable. At 4
workers, we would be 5x faster with perfect scaling; here we are 3.5x
faster. So the third and fourth worker were about 50% efficient.
Hmm, not as good. But then going up to 8 workers bought us basically
nothing.

I think the improvement also depends on how costly is the qualification,
if it is costly, even for same selectivity the gains will be shown till higher
number of clients and for simple qualifications, we will see that cost of
having more workers will start dominating (processing data over multiple
tuple queues) over the benefit we can achieve by them.

Yes, that's correct. When the qualification cost increases, the performance
gain also keeps increasing with the number of workers.

Instead of using all the configured workers per query, how about deciding the
number of workers based on the cost of the qualification? I am not sure whether
we have any information available to find out the qualification cost. This way
the workers would be distributed properly across all backends.
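
To make the idea concrete, here is a minimal sketch of such a heuristic. It is not from any patch: the function names, the cost model (qualification CPU cost divided across workers plus the master, plus a fixed overhead per worker) and all parameters are assumptions made purely for illustration.

```c
#include <assert.h>

/*
 * Toy cost model (an assumption, not the planner's): the qualification
 * work is split evenly across nworkers + 1 processes, and every worker
 * adds a fixed overhead for startup and tuple-queue traffic.
 */
static double
toy_parallel_cost(double qual_cost_per_tuple, double ntuples,
				  double per_worker_overhead, int nworkers)
{
	double		cpu = qual_cost_per_tuple * ntuples / (double) (nworkers + 1);

	return cpu + per_worker_overhead * (double) nworkers;
}

/*
 * Pick the worker count in [0, max_workers] with the lowest toy cost.
 * A cheap qualification yields few or no workers; an expensive one
 * uses more of the configured workers.
 */
static int
toy_choose_num_workers(double qual_cost_per_tuple, double ntuples,
					   double per_worker_overhead, int max_workers)
{
	int			best = 0;
	double		best_cost;
	int			n;

	assert(max_workers >= 0);
	best_cost = toy_parallel_cost(qual_cost_per_tuple, ntuples,
								  per_worker_overhead, 0);
	for (n = 1; n <= max_workers; n++)
	{
		double		cost = toy_parallel_cost(qual_cost_per_tuple, ntuples,
											 per_worker_overhead, n);

		if (cost < best_cost)
		{
			best_cost = cost;
			best = n;
		}
	}
	return best;
}
```

Under this toy model, a cheap qualification over a small table picks zero workers, while a costly qualification over a large table picks all configured workers. Whether the planner has a good enough estimate of the qualification cost is exactly the open question raised above.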

Regards,
Hari Babu
Fujitsu Australia


#304Kouhei Kaigai
kaigai@ak.jp.nec.com
In reply to: Haribabu Kommi (#303)
Re: Parallel Seq Scan

Hi Amit,

Could you explain the intention of the code around ExecInitFunnel()?

ExecInitFunnel() calls InitFunnel(), which opens the relation to be
scanned by the underlying PartialSeqScan and sets up ss_ScanTupleSlot
of its scanstate.
According to the comment of InitFunnel(), it opens the relation and
takes the appropriate lock on it. However, an equivalent initialization
is also done in InitPartialScanRelation().

Why does it acquire the relation lock twice?

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


#305Amit Kapila
amit.kapila16@gmail.com
In reply to: Kouhei Kaigai (#304)
Re: Parallel Seq Scan

On Wed, Jul 29, 2015 at 7:32 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

Hi Amit,

Could you tell me the code intention around ExecInitFunnel()?

ExecInitFunnel() calls InitFunnel() that opens the relation to be
scanned by the underlying PartialSeqScan and setup ss_ScanTupleSlot
of its scanstate.

The main need is for relation descriptor which is then required to set
the scan tuple's slot. Basically it is required for tuples flowing from
worker which will use the scan tuple slot of FunnelState.

According to the comment of InitFunnel(), it open the relation and
takes appropriate lock on it. However, an equivalent initialization
is also done on InitPartialScanRelation().

Why does it acquire the relation lock twice?

I think locking twice is not required; it is just that I have used the API
ExecOpenScanRelation(), which is also used during other nodes' initialisation,
and so it locks twice. I think in general it should be harmless.
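
One way to picture why a repeated acquisition is usually harmless: a backend that already holds a lock in a given mode only bumps a local count when it takes the lock again, and the lock is given up once every acquisition has been released. The sketch below is a toy model of that counting, with hypothetical names; it is not lock manager code and ignores lock modes and conflicts between backends.

```c
#include <assert.h>

/* Toy per-backend record for one lock; all names are hypothetical. */
typedef struct ToyLocalLock
{
	unsigned int relid;			/* relation the lock protects */
	int			nlocks;			/* how many times this backend took it */
} ToyLocalLock;

/* Re-acquiring a lock we already hold only increments the count. */
static void
toy_lock_acquire(ToyLocalLock *lock)
{
	lock->nlocks++;
}

/* Each release undoes exactly one acquisition. */
static void
toy_lock_release(ToyLocalLock *lock)
{
	assert(lock->nlocks > 0);
	lock->nlocks--;
}

/* The lock is held until the count drops back to zero. */
static int
toy_lock_is_held(const ToyLocalLock *lock)
{
	return lock->nlocks > 0;
}
```

In this model the second acquisition costs a counter increment and must be matched by a second release; the redundancy is wasted work rather than a correctness problem.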

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#306Kouhei Kaigai
kaigai@ak.jp.nec.com
In reply to: Amit Kapila (#305)
Re: Parallel Seq Scan

On Wed, Jul 29, 2015 at 7:32 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

Hi Amit,

Could you tell me the code intention around ExecInitFunnel()?

ExecInitFunnel() calls InitFunnel() that opens the relation to be
scanned by the underlying PartialSeqScan and setup ss_ScanTupleSlot
of its scanstate.

The main need is for relation descriptor which is then required to set
the scan tuple's slot. Basically it is required for tuples flowing from
worker which will use the scan tuple slot of FunnelState.

According to the comment of InitFunnel(), it open the relation and
takes appropriate lock on it. However, an equivalent initialization
is also done on InitPartialScanRelation().

Why does it acquire the relation lock twice?

I think locking twice is not required, it is just that I have used the API
ExecOpenScanRelation() which is used during other node's initialisation
due to which it lock's twice. I think in general it should be harmless.

Thanks, I understand the reason for the implementation.

It looks to me that this design is not problematic even if Funnel gains the
capability to have multiple sub-plans, and thus is not associated with a
particular relation, as long as the target-list and projection-info are
appropriately initialized.

Best regards,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


#307Kouhei Kaigai
kaigai@ak.jp.nec.com
In reply to: Kouhei Kaigai (#306)
Re: Parallel Seq Scan

Amit,

Let me ask three more detailed questions.

Why does Funnel have a valid qual of the subplan?
The 2nd argument of make_funnel() is the qualifier of the subplan
(PartialSeqScan); it is then initialized in ExecInitFunnel,
but never executed at run-time. Why does the Funnel node have a
useless qualifier expression here (even though it is harmless)?

Why is Funnel derived from Scan? Even though it constructs
a target-list compatible with the underlying partial-scan node,
that does not require the node to be derived from Scan as well.
For example, Sort or Append don't change the target-list
definition from their input, and also don't have their own qualifier.
It seems to me the definition below is more suitable...
    typedef struct Funnel
    {
        Plan        plan;
        int         num_workers;
    } Funnel;

Does ExecFunnel() need to have a special code path to handle
EvalPlanQual()? Probably, it just calls the underlying node in the
local context. ExecScan() of PartialSeqScan will check its
qualifier against estate->es_epqTuple[].

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

#308Amit Kapila
amit.kapila16@gmail.com
In reply to: Kouhei Kaigai (#307)
Re: Parallel Seq Scan

On Sun, Aug 2, 2015 at 8:06 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

Amit,

Let me ask three more detailed questions.

Why does Funnel have a valid qual of the subplan?
The 2nd argument of make_funnel() is the qualifier of the subplan
(PartialSeqScan); it is then initialized in ExecInitFunnel,
but never executed at run-time. Why does the Funnel node have a
useless qualifier expression here (even though it is harmless)?

The idea is that in some cases the qualification can't be
pushed down (consider a qualification that contains
parallel-restricted functions, i.e. functions that can only be
executed in the master backend) and must be executed only
in the master backend; then we need it in the Funnel node, so that it
can be evaluated for tuples passed up by worker backends. It is
currently not used for that, but I think we should retain it as it is,
because it can be used in some cases either as part of this
patch itself or in future. As of now, it is used in other
places in the patch (like during Explain) as well; we
might want to optimize that, but overall I think it is
required.
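
As a rough illustration of that use, the sketch below filters tuples arriving from workers with a qualification that may only run in the master. Everything here is a toy stand-in with hypothetical names (ToyTuple, ToyQual, toy_funnel_filter); it does not use the executor's slot or expression machinery.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for a tuple and a qualification; names are hypothetical. */
typedef struct ToyTuple
{
	int			value;
} ToyTuple;

typedef bool (*ToyQual) (const ToyTuple *tuple);

/*
 * Apply a master-only qualification to tuples received from workers.
 * A NULL qual means everything was pushed down, so all tuples pass.
 * Returns the number of tuples copied to 'out'.
 */
static size_t
toy_funnel_filter(const ToyTuple *in, size_t n, ToyQual master_qual,
				  ToyTuple *out)
{
	size_t		kept = 0;
	size_t		i;

	assert(in != NULL || n == 0);
	for (i = 0; i < n; i++)
	{
		if (master_qual == NULL || master_qual(&in[i]))
			out[kept++] = in[i];
	}
	return kept;
}

/* Example of a qual that, by assumption, must run in the master. */
static bool
toy_restricted_qual(const ToyTuple *tuple)
{
	return tuple->value % 2 == 0;
}
```

When the whole qualification can be pushed down to the workers, the master-side qual is simply empty and the node passes tuples through unchanged, which matches the "currently not used" state described above.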

Why is Funnel derived from Scan? Even though it constructs
a target-list compatible with the underlying partial-scan node,
that does not require the node to be derived from Scan as well.

It needs its own target-list for the reason mentioned above
for qual; another reason is that the same is required
for FunnelState, which in turn is required for the ScanSlot used to
retrieve tuples from workers. Also, it is not exactly the same
as partialseqscan, because when the partialseqscan
node is executed by a worker, we modify the targetlist as well;
refer to create_parallel_worker_plannedstmt().

Does ExecFunnel() need to have a special code path to handle
EvalPlanQual()? Probably, it just calls underlying node in the
local context. ExecScan() of PartialSeqScan will check its
qualifier towards estate->es_epqTuple[].

Isn't EvalPlanQual() called for the modifytable node, which
won't be allowed in parallel mode? So I think EvalPlanQual() handling
is not required for the ExecFunnel path.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#309Amit Kapila
amit.kapila16@gmail.com
In reply to: Kouhei Kaigai (#302)
2 attachment(s)
Re: Parallel Seq Scan

On Thu, Jul 23, 2015 at 7:43 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

Hi Amit,

The latest v16 patch cannot be applied to the latest
master as is.
434873806a9b1c0edd53c2a9df7c93a8ba021147 changed various
lines in heapam.c, so it probably conflicts with this.

Attached, find the rebased version of the patch. It fixes the comments raised
by Jeff Davis and Antonin Houska. The main changes in this version are that
it now supports sync scan along with parallel sequential scan (refer to
heapam.c), and that the patch has been split into two parts: the first contains
the code for the Funnel node and the infrastructure to support it, and the
second contains the code for the PartialSeqScan node and its infrastructure.

Note - To test the patch, you need to first apply the assess-parallel-safety
patch [1], then apply parallel_seqscan_funnel_v17.patch attached to
this mail, and then apply parallel_seqscan_partialseqscan_v17.patch attached
to this mail.

[1]: /messages/by-id/CAA4eK1Kd2SunKX=e5sSFSrFfc++_uHnt5_HyKd+XykFjDWZseQ@mail.gmail.com

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

parallel_seqscan_funnel_v17.patch (application/octet-stream)
diff --git a/src/backend/access/common/printtup.c b/src/backend/access/common/printtup.c
index baed981..639451a 100644
--- a/src/backend/access/common/printtup.c
+++ b/src/backend/access/common/printtup.c
@@ -26,9 +26,9 @@
 
 static void printtup_startup(DestReceiver *self, int operation,
 				 TupleDesc typeinfo);
-static void printtup(TupleTableSlot *slot, DestReceiver *self);
-static void printtup_20(TupleTableSlot *slot, DestReceiver *self);
-static void printtup_internal_20(TupleTableSlot *slot, DestReceiver *self);
+static bool printtup(TupleTableSlot *slot, DestReceiver *self);
+static bool printtup_20(TupleTableSlot *slot, DestReceiver *self);
+static bool printtup_internal_20(TupleTableSlot *slot, DestReceiver *self);
 static void printtup_shutdown(DestReceiver *self);
 static void printtup_destroy(DestReceiver *self);
 
@@ -299,7 +299,7 @@ printtup_prepare_info(DR_printtup *myState, TupleDesc typeinfo, int numAttrs)
  *		printtup --- print a tuple in protocol 3.0
  * ----------------
  */
-static void
+static bool
 printtup(TupleTableSlot *slot, DestReceiver *self)
 {
 	TupleDesc	typeinfo = slot->tts_tupleDescriptor;
@@ -376,13 +376,15 @@ printtup(TupleTableSlot *slot, DestReceiver *self)
 	/* Return to caller's context, and flush row's temporary memory */
 	MemoryContextSwitchTo(oldcontext);
 	MemoryContextReset(myState->tmpcontext);
+
+	return true;
 }
 
 /* ----------------
  *		printtup_20 --- print a tuple in protocol 2.0
  * ----------------
  */
-static void
+static bool
 printtup_20(TupleTableSlot *slot, DestReceiver *self)
 {
 	TupleDesc	typeinfo = slot->tts_tupleDescriptor;
@@ -452,6 +454,8 @@ printtup_20(TupleTableSlot *slot, DestReceiver *self)
 	/* Return to caller's context, and flush row's temporary memory */
 	MemoryContextSwitchTo(oldcontext);
 	MemoryContextReset(myState->tmpcontext);
+
+	return true;
 }
 
 /* ----------------
@@ -528,7 +532,7 @@ debugStartup(DestReceiver *self, int operation, TupleDesc typeinfo)
  *		debugtup - print one tuple for an interactive backend
  * ----------------
  */
-void
+bool
 debugtup(TupleTableSlot *slot, DestReceiver *self)
 {
 	TupleDesc	typeinfo = slot->tts_tupleDescriptor;
@@ -553,6 +557,8 @@ debugtup(TupleTableSlot *slot, DestReceiver *self)
 		printatt((unsigned) i + 1, typeinfo->attrs[i], value);
 	}
 	printf("\t----\n");
+
+	return true;
 }
 
 /* ----------------
@@ -564,7 +570,7 @@ debugtup(TupleTableSlot *slot, DestReceiver *self)
  * This is largely same as printtup_20, except we use binary formatting.
  * ----------------
  */
-static void
+static bool
 printtup_internal_20(TupleTableSlot *slot, DestReceiver *self)
 {
 	TupleDesc	typeinfo = slot->tts_tupleDescriptor;
@@ -636,4 +642,6 @@ printtup_internal_20(TupleTableSlot *slot, DestReceiver *self)
 	/* Return to caller's context, and flush row's temporary memory */
 	MemoryContextSwitchTo(oldcontext);
 	MemoryContextReset(myState->tmpcontext);
+
+	return true;
 }
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 8db1b35..b55c4dc 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -4414,7 +4414,7 @@ copy_dest_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
 /*
  * copy_dest_receive --- receive one tuple
  */
-static void
+static bool
 copy_dest_receive(TupleTableSlot *slot, DestReceiver *self)
 {
 	DR_copy    *myState = (DR_copy *) self;
@@ -4426,6 +4426,8 @@ copy_dest_receive(TupleTableSlot *slot, DestReceiver *self)
 	/* And send the data */
 	CopyOneRowTo(cstate, InvalidOid, slot->tts_values, slot->tts_isnull);
 	myState->processed++;
+
+	return true;
 }
 
 /*
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 41183f6..418b0f6 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -62,7 +62,7 @@ typedef struct
 static ObjectAddress CreateAsReladdr = {InvalidOid, InvalidOid, 0};
 
 static void intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo);
-static void intorel_receive(TupleTableSlot *slot, DestReceiver *self);
+static bool intorel_receive(TupleTableSlot *slot, DestReceiver *self);
 static void intorel_shutdown(DestReceiver *self);
 static void intorel_destroy(DestReceiver *self);
 
@@ -482,7 +482,7 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
 /*
  * intorel_receive --- receive one tuple
  */
-static void
+static bool
 intorel_receive(TupleTableSlot *slot, DestReceiver *self)
 {
 	DR_intorel *myState = (DR_intorel *) self;
@@ -507,6 +507,8 @@ intorel_receive(TupleTableSlot *slot, DestReceiver *self)
 				myState->bistate);
 
 	/* We know this is a newly created relation, so there are no indexes */
+
+	return true;
 }
 
 /*
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 4f32400..69d3b34 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -20,6 +20,7 @@
 #include "commands/defrem.h"
 #include "commands/prepare.h"
 #include "executor/hashjoin.h"
+#include "executor/nodeFunnel.h"
 #include "foreign/fdwapi.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
@@ -733,6 +734,7 @@ ExplainPreScanNode(PlanState *planstate, Bitmapset **rels_used)
 	{
 		case T_SeqScan:
 		case T_SampleScan:
+		case T_Funnel:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -940,6 +942,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_SampleScan:
 			pname = sname = "Sample Scan";
 			break;
+		case T_Funnel:
+			pname = sname = "Funnel";
+			break;
 		case T_IndexScan:
 			pname = sname = "Index Scan";
 			break;
@@ -1090,6 +1095,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	{
 		case T_SeqScan:
 		case T_SampleScan:
+		case T_Funnel:
 		case T_BitmapHeapScan:
 		case T_TidScan:
 		case T_SubqueryScan:
@@ -1234,6 +1240,16 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	}
 
 	/*
+	 * Aggregate instrumentation information of all the backend
+	 * workers for Funnel node.  Though we already accumulate this
+	 * information when last tuple is fetched from Funnel node, this
+	 * is to cover cases when we don't fetch all tuples from a node
+	 * such as for Limit node.
+	 */
+	if (es->analyze && nodeTag(plan) == T_Funnel)
+		FinishParallelSetupAndAccumStats((FunnelState *)planstate);
+
+	/*
 	 * We have to forcibly clean up the instrumentation state because we
 	 * haven't done ExecutorEnd yet.  This is pretty grotty ...
 	 *
@@ -1363,6 +1379,14 @@ ExplainNode(PlanState *planstate, List *ancestors,
 				show_instrumentation_count("Rows Removed by Filter", 1,
 										   planstate, es);
 			break;
+		case T_Funnel:
+			show_scan_qual(plan->qual, "Filter", planstate, ancestors, es);
+			if (plan->qual)
+				show_instrumentation_count("Rows Removed by Filter", 1,
+										   planstate, es);
+			ExplainPropertyInteger("Number of Workers",
+				((Funnel *) plan)->num_workers, es);
+			break;
 		case T_FunctionScan:
 			if (es->verbose)
 			{
@@ -2422,6 +2446,7 @@ ExplainTargetRel(Plan *plan, Index rti, ExplainState *es)
 	{
 		case T_SeqScan:
 		case T_SampleScan:
+		case T_Funnel:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 5492e59..750a59c 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -56,7 +56,7 @@ typedef struct
 static int	matview_maintenance_depth = 0;
 
 static void transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo);
-static void transientrel_receive(TupleTableSlot *slot, DestReceiver *self);
+static bool transientrel_receive(TupleTableSlot *slot, DestReceiver *self);
 static void transientrel_shutdown(DestReceiver *self);
 static void transientrel_destroy(DestReceiver *self);
 static void refresh_matview_datafill(DestReceiver *dest, Query *query,
@@ -422,7 +422,7 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
 /*
  * transientrel_receive --- receive one tuple
  */
-static void
+static bool
 transientrel_receive(TupleTableSlot *slot, DestReceiver *self)
 {
 	DR_transientrel *myState = (DR_transientrel *) self;
@@ -441,6 +441,8 @@ transientrel_receive(TupleTableSlot *slot, DestReceiver *self)
 				myState->bistate);
 
 	/* We know this is a newly created relation, so there are no indexes */
+
+	return true;
 }
 
 /*
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 08cba6f..8037417 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -13,17 +13,17 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = execAmi.o execCurrent.o execGrouping.o execIndexing.o execJunk.o \
-       execMain.o execProcnode.o execQual.o execScan.o execTuples.o \
+       execMain.o execParallel.o execProcnode.o execQual.o execScan.o execTuples.o \
        execUtils.o functions.o instrument.o nodeAppend.o nodeAgg.o \
        nodeBitmapAnd.o nodeBitmapOr.o \
-       nodeBitmapHeapscan.o nodeBitmapIndexscan.o nodeCustom.o nodeHash.o \
-       nodeHashjoin.o nodeIndexscan.o nodeIndexonlyscan.o \
+       nodeBitmapHeapscan.o nodeBitmapIndexscan.o nodeCustom.o nodeFunnel.o \
+       nodeHash.o nodeHashjoin.o nodeIndexscan.o nodeIndexonlyscan.o \
        nodeLimit.o nodeLockRows.o \
        nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
        nodeNestloop.o nodeFunctionscan.o nodeRecursiveunion.o nodeResult.o \
        nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
        nodeValuesscan.o nodeCtescan.o nodeWorktablescan.o \
        nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
-       nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o spi.o
+       nodeForeignscan.o nodeWindowAgg.o tqueue.o tstoreReceiver.o spi.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 93e1e9a..4915151 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -24,6 +24,7 @@
 #include "executor/nodeCustom.h"
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeFunctionscan.h"
+#include "executor/nodeFunnel.h"
 #include "executor/nodeGroup.h"
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
@@ -160,6 +161,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSampleScan((SampleScanState *) node);
 			break;
 
+		case T_FunnelState:
+			ExecReScanFunnel((FunnelState *) node);
+			break;
+
 		case T_IndexScanState:
 			ExecReScanIndexScan((IndexScanState *) node);
 			break;
@@ -467,6 +472,9 @@ ExecSupportsBackwardScan(Plan *node)
 			/* Simplify life for tablesample methods by disallowing this */
 			return false;
 
+		case T_Funnel:
+			return false;
+
 		case T_IndexScan:
 			return IndexSupportsBackwardScan(((IndexScan *) node)->indexid) &&
 				TargetListSupportsBackwardScan(node->targetlist);
diff --git a/src/backend/executor/execCurrent.c b/src/backend/executor/execCurrent.c
index bcd287f..650fcc5 100644
--- a/src/backend/executor/execCurrent.c
+++ b/src/backend/executor/execCurrent.c
@@ -262,6 +262,7 @@ search_plan_tree(PlanState *node, Oid table_oid)
 			 */
 		case T_SeqScanState:
 		case T_SampleScanState:
+		case T_FunnelState:
 		case T_IndexScanState:
 		case T_IndexOnlyScanState:
 		case T_BitmapHeapScanState:
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index b62e88b..e35cbbe 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -45,9 +45,11 @@
 #include "commands/matview.h"
 #include "commands/trigger.h"
 #include "executor/execdebug.h"
+#include "executor/execParallel.h"
 #include "foreign/fdwapi.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
+#include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
 #include "parser/parsetree.h"
 #include "storage/bufmgr.h"
@@ -323,6 +325,9 @@ standard_ExecutorRun(QueryDesc *queryDesc,
 	operation = queryDesc->operation;
 	dest = queryDesc->dest;
 
+	/* inform executor to collect buffer usage stats from parallel workers. */
+	estate->total_time = queryDesc->totaltime ? 1 : 0;
+
 	/*
 	 * startup tuple receiver, if we will be emitting tuples
 	 */
@@ -354,7 +359,15 @@ standard_ExecutorRun(QueryDesc *queryDesc,
 		(*dest->rShutdown) (dest);
 
 	if (queryDesc->totaltime)
+	{
+		/*
+		 * Accumulate the stats by parallel workers before stopping the
+		 * node.
+		 */
+		(void) planstate_tree_walker((Node*) queryDesc->planstate,
+									 NULL, ExecParallelBufferUsageAccum, 0);
 		InstrStopNode(queryDesc->totaltime, estate->es_processed);
+	}
 
 	MemoryContextSwitchTo(oldcontext);
 }
@@ -1581,7 +1594,15 @@ ExecutePlan(EState *estate,
 		 * practice, this is probably always the case at this point.)
 		 */
 		if (sendTuples)
-			(*dest->receiveSlot) (slot, dest);
+		{
+			/*
+			 * If we are not able to send the tuple, then we assume that
+			 * destination has closed and we won't be able to send any more
+			 * tuples so we just end the loop.
+			 */
+			if (!((*dest->receiveSlot) (slot, dest)))
+				break;
+		}
 
 		/*
 		 * Count tuples processed, if this is a SELECT.  (For other operation
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
new file mode 100644
index 0000000..806f060
--- /dev/null
+++ b/src/backend/executor/execParallel.c
@@ -0,0 +1,559 @@
+/*-------------------------------------------------------------------------
+ *
+ * execParallel.c
+ *	  Support routines for setting up backend workers for parallel execution.
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/execParallel.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execParallel.h"
+#include "executor/nodeFunnel.h"
+#include "nodes/nodeFuncs.h"
+#include "optimizer/planmain.h"
+#include "optimizer/planner.h"
+#include "tcop/tcopprot.h"
+
+
+#define PARALLEL_TUPLE_QUEUE_SIZE					65536
+
+static void ParallelQueryMain(dsm_segment *seg, shm_toc *toc);
+static void
+EstimateParallelSupportInfoSpace(ParallelContext *pcxt, ParamListInfo params,
+								 List *serialized_param_exec_vals,
+								 int instOptions, Size *params_size,
+								 Size *params_exec_size);
+static void
+StoreParallelSupportInfo(ParallelContext *pcxt, ParamListInfo params,
+						 List *serialized_param_exec_vals,
+						 int instOptions, Size params_size,
+						 Size params_exec_size,
+						 char **inst_options_space,
+						 char **buffer_usage_space);
+static void
+EstimatePlannedStmtSpace(ParallelContext *pcxt, PlanState* planstate,
+						 char *plannedstmt_str, Size *plannedstmt_len,
+						 Size *pscan_size);
+static void
+StorePlannedStmt(ParallelContext *pcxt, PlanState* planstate,
+				 char *plannedstmt_str, Size plannedstmt_size,
+				 Size pscan_size);
+static void EstimateResponseQueueSpace(ParallelContext *pcxt);
+static void
+StoreResponseQueue(ParallelContext *pcxt,
+				   shm_mq_handle ***responseqp);
+static void
+ExecParallelGetPlannedStmt(shm_toc *toc, PlannedStmt **plannedstmt);
+static void
+GetParallelSupportInfo(shm_toc *toc, ParamListInfo *params,
+					   List **serialized_param_exec_vals,
+					   int *inst_options, char **instrument,
+					   char **buffer_usage);
+static void
+SetupResponseQueue(dsm_segment *seg, shm_toc *toc, shm_mq **mq,
+				   shm_mq_handle **responseq);
+
+
+/*
+ * This is required for parallel plan execution to fetch the information
+ * from dsm.
+ */
+static shm_toc *parallel_shm_toc = NULL;
+
+/*
+ * EstimateParallelSupportInfoSpace
+ *
+ * Estimate the amount of space required to record information of bind
+ * parameters, PARAM_EXEC parameters and instrumentation information that
+ * need to be retrieved from parallel workers.
+ */
+void
+EstimateParallelSupportInfoSpace(ParallelContext *pcxt, ParamListInfo params,
+								 List *serialized_param_exec_vals,
+								 int instOptions, Size *params_size,
+								 Size *params_exec_size)
+{
+	*params_size = EstimateBoundParametersSpace(params);
+	shm_toc_estimate_chunk(&pcxt->estimator, *params_size);
+
+	*params_exec_size = EstimateExecParametersSpace(serialized_param_exec_vals);
+	shm_toc_estimate_chunk(&pcxt->estimator, *params_exec_size);
+
+	/*
+	 * We expect each worker to populate the BufferUsage structure
+	 * allocated by master backend and then master backend will aggregate
+	 * all the usage along with it's own, so account it for each worker.
+	 */
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   sizeof(BufferUsage) * pcxt->nworkers);
+
+	/* account for instrumentation options. */
+	shm_toc_estimate_chunk(&pcxt->estimator, sizeof(int));
+
+	/*
+	 * We expect each worker to populate the instrumentation structure
+	 * allocated by master backend and then master backend will aggregate
+	 * all the information, so account it for each worker.
+	 */
+	if (instOptions)
+	{
+		shm_toc_estimate_chunk(&pcxt->estimator,
+							   sizeof(Instrumentation) * pcxt->nworkers);
+		/* keys for parallel support information. */
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+	}
+
+	/* keys for parallel support information. */
+	shm_toc_estimate_keys(&pcxt->estimator, 4);
+}
+
+/*
+ * StoreParallelSupportInfo
+ * 
+ * Sets up the bind parameters, PARAM_EXEC parameters and instrumentation
+ * information required for parallel execution.
+ */
+void
+StoreParallelSupportInfo(ParallelContext *pcxt, ParamListInfo params,
+						 List *serialized_param_exec_vals,
+						 int instOptions, Size params_size,
+						 Size params_exec_size,
+						 char **inst_options_space,
+						 char **buffer_usage_space)
+{
+	char	*paramsdata;
+	char	*paramsexecdata;
+	int		*inst_options;
+
+	/*
+	 * Store bind parameter's list in dynamic shared memory.  This is
+	 * used for parameters in prepared query.
+	 */
+	paramsdata = shm_toc_allocate(pcxt->toc, params_size);
+	SerializeBoundParams(params, params_size, paramsdata);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_PARAMS, paramsdata);
+
+	/*
+	 * Store PARAM_EXEC parameters list in dynamic shared memory.  This is
+	 * used for evaluation plan->initPlan params.
+	 */
+	paramsexecdata = shm_toc_allocate(pcxt->toc, params_exec_size);
+	SerializeExecParams(serialized_param_exec_vals, params_exec_size, paramsexecdata);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_PARAMS_EXEC, paramsexecdata);
+
+	/*
+	 * Allocate space for BufferUsage information to be filled by
+	 * each worker.
+	 */
+	*buffer_usage_space =
+			shm_toc_allocate(pcxt->toc, sizeof(BufferUsage) * pcxt->nworkers);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFF_USAGE, *buffer_usage_space);
+
+	/* Store instrument options in dynamic shared memory. */
+	inst_options = shm_toc_allocate(pcxt->toc, sizeof(int));
+	*inst_options = instOptions;
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_INST_OPTIONS, inst_options);
+
+	/*
+	 * Allocate space for instrumentation information to be filled by
+	 * each worker.
+	 */
+	if (instOptions)
+	{
+		*inst_options_space =
+			shm_toc_allocate(pcxt->toc, sizeof(Instrumentation) * pcxt->nworkers);
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_INST_INFO, *inst_options_space);
+	}
+}
+
+/*
+ * EstimatePlannedStmtSpace
+ *
+ * Estimate the amount of space required to record the planned statement
+ * and the parallel-node-specific information that needs to be copied
+ * to parallel workers.
+ */
+void
+EstimatePlannedStmtSpace(ParallelContext *pcxt, PlanState* planstate,
+						 char *plannedstmt_str, Size *plannedstmt_len,
+						 Size *pscan_size)
+{
+	/* Estimate space for planned statement. */
+	*plannedstmt_len = strlen(plannedstmt_str) + 1;
+	shm_toc_estimate_chunk(&pcxt->estimator, *plannedstmt_len);
+
+	/* keys for planned statement information. */
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+	(void) planstate_tree_walker((Node*)planstate, pcxt, ExecParallelEstimate,
+								 pscan_size);
+}
+
+/*
+ * StorePlannedStmt
+ *
+ * Sets up the planned statement and node specific information.
+ */
+void
+StorePlannedStmt(ParallelContext *pcxt, PlanState* planstate,
+				 char *plannedstmt_str, Size plannedstmt_size,
+				 Size pscan_size)
+{
+	char		*plannedstmtdata;
+
+	/* Store planned statement in dynamic shared memory. */
+	plannedstmtdata = shm_toc_allocate(pcxt->toc, plannedstmt_size);
+	memcpy(plannedstmtdata, plannedstmt_str, plannedstmt_size);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_PLANNEDSTMT, plannedstmtdata);
+
+	(void) planstate_tree_walker((Node*)planstate, pcxt, ExecParallelInitializeDSM,
+								 &pscan_size);
+}
+
+/*
+ * EstimateResponseQueueSpace
+ *
+ * Estimate the amount of space required for the tuple queues that need
+ * to be established between the parallel workers and the master
+ * backend.
+ */
+void
+EstimateResponseQueueSpace(ParallelContext *pcxt)
+{
+	/* Estimate space for tuple queues. */
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   (Size) PARALLEL_TUPLE_QUEUE_SIZE * pcxt->nworkers);
+
+	/* keys for response queue. */
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/*
+ * StoreResponseQueue
+ *
+ * Sets up the response queues that backend workers use to return tuples
+ * to the main backend.  Workers are launched later, by LaunchParallelWorkers.
+ */
+void
+StoreResponseQueue(ParallelContext *pcxt,
+				   shm_mq_handle ***responseqp)
+{
+	shm_mq		*mq;
+	char		*tuple_queue_space;
+	int			i;
+
+	/* Allocate memory for shared memory queue handles. */
+	*responseqp = (shm_mq_handle**) palloc(pcxt->nworkers * sizeof(shm_mq_handle*));
+
+	/*
+	 * Establish one message queue per worker in dynamic shared memory.
+	 * These queues should be used to transmit tuple data.
+	 */
+	tuple_queue_space =
+	   shm_toc_allocate(pcxt->toc, PARALLEL_TUPLE_QUEUE_SIZE * pcxt->nworkers);
+	for (i = 0; i < pcxt->nworkers; ++i)
+	{
+		mq = shm_mq_create(tuple_queue_space + i * PARALLEL_TUPLE_QUEUE_SIZE,
+						   (Size) PARALLEL_TUPLE_QUEUE_SIZE);
+		
+		shm_mq_set_receiver(mq, MyProc);
+
+		/*
+		 * Attach the queue before launching a worker, so that we'll automatically
+		 * detach the queue if we error out.  Otherwise, the worker might sit
+		 * there trying to write the queue long after we've gone away.
+		 */
+		(*responseqp)[i] = shm_mq_attach(mq, pcxt->seg, NULL);
+	}
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_TUPLE_QUEUE, tuple_queue_space);
+}
+
+/*
+ * ExecParallelEstimate
+ *
+ * Estimate the amount of space required to record the information of a
+ * parallel node that needs to be copied to parallel workers.
+ */
+bool
+ExecParallelEstimate(Node *node, ParallelContext *pcxt,
+					 Size *pscan_size)
+{
+	if (node == NULL)
+		return false;
+
+	/*
+	 * As of now, we support only a few nodes that can be passed to parallel
+	 * workers, so handle only those nodes.
+	 */
+	switch (nodeTag(node))
+	{
+		default:
+			break;
+	}
+
+	return false;
+}
+
+/*
+ *	ExecParallelInitializeDSM
+ *
+ *		Store the information of parallel node in dsm.
+ */
+bool
+ExecParallelInitializeDSM(Node *node, ParallelContext *pcxt,
+						  Size *pscan_size)
+{
+	if (node == NULL)
+		return false;
+
+	/*
+	 * As of now, we support only a few nodes that can be passed to parallel
+	 * workers, so handle only those nodes.
+	 */
+	switch (nodeTag(node))
+	{
+		default:
+			break;
+	}
+
+	return false;
+}
+
+/*
+ * InitializeParallelWorkers
+ *
+ * Sets up the required infrastructure for backend workers to perform
+ * execution and return results to the main backend.
+ */
+void
+InitializeParallelWorkers(PlanState *planstate,
+						  List *serialized_param_exec_vals,
+						  EState *estate,
+						  char **inst_options_space,
+						  char **buffer_usage_space,
+						  shm_mq_handle ***responseqp,
+						  ParallelContext **pcxtp,
+						  int nWorkers)
+{
+	Size		params_size, params_exec_size, pscan_size, plannedstmt_size;
+	char		*plannedstmt_str;
+	PlannedStmt	*plannedstmt;
+	ParallelContext *pcxt;
+
+	pcxt = CreateParallelContext(ParallelQueryMain, nWorkers);
+
+	plannedstmt = create_parallel_worker_plannedstmt(planstate->plan,
+													 estate->es_range_table,
+													 estate->es_plannedstmt->nParamExec);
+	plannedstmt_str = nodeToString(plannedstmt);
+
+	EstimatePlannedStmtSpace(pcxt, planstate, plannedstmt_str,
+							 &plannedstmt_size, &pscan_size);
+	EstimateParallelSupportInfoSpace(pcxt, estate->es_param_list_info,
+									 serialized_param_exec_vals,
+									 estate->es_instrument, &params_size,
+									 &params_exec_size);
+	EstimateResponseQueueSpace(pcxt);
+
+	InitializeParallelDSM(pcxt);
+	
+	StorePlannedStmt(pcxt, planstate, plannedstmt_str,
+					 plannedstmt_size, pscan_size);
+	StoreParallelSupportInfo(pcxt, estate->es_param_list_info,
+							 serialized_param_exec_vals,
+							 estate->es_instrument,
+							 params_size,
+							 params_exec_size,
+							 inst_options_space,
+							 buffer_usage_space);
+	StoreResponseQueue(pcxt, responseqp);
+
+	/* Return results to caller. */
+	*pcxtp = pcxt;
+}
+
+/*
+ * GetParallelSupportInfo
+ *
+ * Look up based on keys in dynamic shared memory segment and get the
+ * bind parameters, PARAM_EXEC parameters and instrumentation information
+ * required to perform parallel operation.
+ */
+void
+GetParallelSupportInfo(shm_toc *toc, ParamListInfo *params,
+					   List **serialized_param_exec_vals,
+					   int *inst_options, char **instrument,
+					   char **buffer_usage)
+{
+	char		*paramsdata;
+	char		*paramsexecdata;
+	char		*inst_options_space;
+	char		*buffer_usage_space;
+	int			*instoptions;
+
+	if (params)
+	{
+		paramsdata = shm_toc_lookup(toc, PARALLEL_KEY_PARAMS);
+		*params = RestoreBoundParams(paramsdata);
+	}
+
+	if (serialized_param_exec_vals)
+	{
+		paramsexecdata = shm_toc_lookup(toc, PARALLEL_KEY_PARAMS_EXEC);
+		*serialized_param_exec_vals = RestoreExecParams(paramsexecdata);
+	}
+
+	if (inst_options)
+	{
+		instoptions	= shm_toc_lookup(toc, PARALLEL_KEY_INST_OPTIONS);
+		*inst_options = *instoptions;
+		if (*inst_options)
+		{
+			inst_options_space = shm_toc_lookup(toc, PARALLEL_KEY_INST_INFO);
+			*instrument = (inst_options_space +
+				ParallelWorkerNumber * sizeof(Instrumentation));
+		}
+	}
+
+	if (buffer_usage)
+	{
+		buffer_usage_space = shm_toc_lookup(toc, PARALLEL_KEY_BUFF_USAGE);
+		*buffer_usage = (buffer_usage_space +
+					 ParallelWorkerNumber * sizeof(BufferUsage));
+	}
+}
+
+/*
+ * ExecParallelGetPlannedStmt
+ *
+ * Look up based on keys in dynamic shared memory segment and get the
+ * planned statement required to perform parallel operation.
+ */
+void
+ExecParallelGetPlannedStmt(shm_toc *toc, PlannedStmt **plannedstmt)
+{
+	char		*plannedstmtdata;
+
+	plannedstmtdata = shm_toc_lookup(toc, PARALLEL_KEY_PLANNEDSTMT);
+
+	*plannedstmt = (PlannedStmt *) stringToNode(plannedstmtdata);
+
+	/* Fill in opfuncid values if missing */
+	fix_node_funcids((*plannedstmt)->planTree);
+}
+
+/*
+ * SetupResponseQueue
+ *
+ * Look up based on keys in dynamic shared memory segment and get the
+ * tuple queue information for a particular worker, attach to the queue
+ * and redirect all further responses from worker backend via that queue.
+ */
+void
+SetupResponseQueue(dsm_segment *seg, shm_toc *toc, shm_mq **mq,
+				   shm_mq_handle **responseq)
+{
+	char		*tuple_queue_space;
+
+	tuple_queue_space = shm_toc_lookup(toc, PARALLEL_KEY_TUPLE_QUEUE);
+	*mq = (shm_mq *) (tuple_queue_space +
+		ParallelWorkerNumber * PARALLEL_TUPLE_QUEUE_SIZE);
+
+	shm_mq_set_sender(*mq, MyProc);
+	*responseq = shm_mq_attach(*mq, seg, NULL);
+}
+
+/*
+ * GetParallelShmToc
+ */
+shm_toc *
+GetParallelShmToc(void)
+{
+	return parallel_shm_toc;
+}
+
+/*
+ * ParallelQueryMain
+ *
+ * Execute the operation to return the tuples or other information to
+ * the node that drives parallelism.
+ */
+void
+ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
+{
+	shm_mq			*mq;
+	shm_mq_handle	*responseq;
+	PlannedStmt		*plannedstmt;
+	ParamListInfo	params;
+	List			*serialized_param_exec_vals;
+	int				inst_options;
+	char			*instrument = NULL;
+	char			*buffer_usage = NULL;
+	ParallelStmt	*parallelstmt;
+
+	SetupResponseQueue(seg, toc, &mq, &responseq);
+
+	ExecParallelGetPlannedStmt(toc, &plannedstmt);
+	GetParallelSupportInfo(toc, &params, &serialized_param_exec_vals,
+						   &inst_options, &instrument, &buffer_usage);
+
+	parallelstmt = palloc(sizeof(ParallelStmt));
+
+	parallelstmt->plannedstmt = plannedstmt;
+	parallelstmt->params	= params;
+	parallelstmt->serialized_param_exec_vals = serialized_param_exec_vals;
+	parallelstmt->inst_options = inst_options;
+	parallelstmt->instrument = instrument;
+	parallelstmt->buffer_usage = buffer_usage;
+	parallelstmt->responseq = responseq;
+
+	parallel_shm_toc = toc;
+
+	/* Execute the worker command. */
+	exec_parallel_stmt(parallelstmt);
+
+	/*
+	 * Once we are done with sending tuples, detach from shared memory
+	 * message queue used to send tuples.
+	 */
+	shm_mq_detach(mq);
+}
+
+/*
+ * ExecParallelBufferUsageAccum
+ *
+ * Recursively accumulate the stats for all the funnel nodes in a plan
+ * state tree.
+ */
+bool
+ExecParallelBufferUsageAccum(Node *node)
+{
+	if (node == NULL)
+		return false;
+
+	switch (nodeTag(node))
+	{
+		case T_FunnelState:
+			{
+				FinishParallelSetupAndAccumStats((FunnelState*)node);
+				return true;
+			}
+			break;
+		default:
+			break;
+	}
+
+	(void) planstate_tree_walker((Node*)((PlanState *)node)->lefttree, NULL,
+								 ExecParallelBufferUsageAccum, 0);
+	(void) planstate_tree_walker((Node*)((PlanState *)node)->righttree, NULL,
+								 ExecParallelBufferUsageAccum, 0);
+	return false;
+}
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 03c2feb..c181bf2 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -100,6 +100,7 @@
 #include "executor/nodeMergejoin.h"
 #include "executor/nodeModifyTable.h"
 #include "executor/nodeNestloop.h"
+#include "executor/nodeFunnel.h"
 #include "executor/nodeRecursiveunion.h"
 #include "executor/nodeResult.h"
 #include "executor/nodeSamplescan.h"
@@ -196,6 +197,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 													  estate, eflags);
 			break;
 
+		case T_Funnel:
+			result = (PlanState *) ExecInitFunnel((Funnel *) node,
+												  estate, eflags);
+			break;
+
 		case T_IndexScan:
 			result = (PlanState *) ExecInitIndexScan((IndexScan *) node,
 													 estate, eflags);
@@ -416,6 +422,10 @@ ExecProcNode(PlanState *node)
 			result = ExecSampleScan((SampleScanState *) node);
 			break;
 
+		case T_FunnelState:
+			result = ExecFunnel((FunnelState *) node);
+			break;
+
 		case T_IndexScanState:
 			result = ExecIndexScan((IndexScanState *) node);
 			break;
@@ -658,6 +668,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSampleScan((SampleScanState *) node);
 			break;
 
+		case T_FunnelState:
+			ExecEndFunnel((FunnelState *) node);
+			break;
+
 		case T_IndexScanState:
 			ExecEndIndexScan((IndexScanState *) node);
 			break;
diff --git a/src/backend/executor/execTuples.c b/src/backend/executor/execTuples.c
index a05d8b1..d5619bd 100644
--- a/src/backend/executor/execTuples.c
+++ b/src/backend/executor/execTuples.c
@@ -1313,7 +1313,7 @@ do_tup_output(TupOutputState *tstate, Datum *values, bool *isnull)
 	ExecStoreVirtualTuple(slot);
 
 	/* send the tuple to the receiver */
-	(*tstate->dest->receiveSlot) (slot, tstate->dest);
+	(void) (*tstate->dest->receiveSlot) (slot, tstate->dest);
 
 	/* clean up */
 	ExecClearTuple(slot);
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index 3c611b9..a0d6441 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -976,3 +976,27 @@ ShutdownExprContext(ExprContext *econtext, bool isCommit)
 
 	MemoryContextSwitchTo(oldcontext);
 }
+
+/*
+ * Populate the values of PARAM_EXEC parameters.
+ *
+ * This is used by worker backends to fill in the values of PARAM_EXEC
+ * parameters after fetching the same from dynamic shared memory.
+ * This needs to be called before ExecutorRun.
+ */
+void
+PopulateParamExecParams(QueryDesc *queryDesc,
+						List *serialized_param_exec_vals)
+{
+	ListCell	*lparam;
+
+	foreach(lparam, serialized_param_exec_vals)
+	{
+		SerializedParamExecData* param_val = (SerializedParamExecData*) lfirst(lparam);
+
+		queryDesc->estate->es_param_exec_vals[param_val->paramid].value =
+																param_val->value;
+		queryDesc->estate->es_param_exec_vals[param_val->paramid].isnull =
+																param_val->isnull;
+	}
+}
diff --git a/src/backend/executor/functions.c b/src/backend/executor/functions.c
index 812a610..863bd64 100644
--- a/src/backend/executor/functions.c
+++ b/src/backend/executor/functions.c
@@ -167,7 +167,7 @@ static Datum postquel_get_single_result(TupleTableSlot *slot,
 static void sql_exec_error_callback(void *arg);
 static void ShutdownSQLFunction(Datum arg);
 static void sqlfunction_startup(DestReceiver *self, int operation, TupleDesc typeinfo);
-static void sqlfunction_receive(TupleTableSlot *slot, DestReceiver *self);
+static bool sqlfunction_receive(TupleTableSlot *slot, DestReceiver *self);
 static void sqlfunction_shutdown(DestReceiver *self);
 static void sqlfunction_destroy(DestReceiver *self);
 
@@ -1903,7 +1903,7 @@ sqlfunction_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
 /*
  * sqlfunction_receive --- receive one tuple
  */
-static void
+static bool
 sqlfunction_receive(TupleTableSlot *slot, DestReceiver *self)
 {
 	DR_sqlfunction *myState = (DR_sqlfunction *) self;
@@ -1913,6 +1913,8 @@ sqlfunction_receive(TupleTableSlot *slot, DestReceiver *self)
 
 	/* Store the filtered tuple into the tuplestore */
 	tuplestore_puttupleslot(myState->tstore, slot);
+
+	return true;
 }
 
 /*
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index f5351eb..639eb04 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -19,9 +19,6 @@
 
 BufferUsage pgBufferUsage;
 
-static void BufferUsageAccumDiff(BufferUsage *dst,
-					 const BufferUsage *add, const BufferUsage *sub);
-
 
 /* Allocate new instrumentation structure(s) */
 Instrumentation *
@@ -127,8 +124,29 @@ InstrEndLoop(Instrumentation *instr)
 	instr->tuplecount = 0;
 }
 
+/*
+ * Aggregate the instrumentation information.  This is used to aggregate
+ * the information of worker backends.  We only need to sum the buffer
+ * usage and tuple count statistics, as for other timing related statistics
+ * it is sufficient to have the master backend's information.
+ */
+void
+InstrAggNode(Instrumentation *instr1, Instrumentation *instr2)
+{
+	/* count the returned tuples */
+	instr1->tuplecount += instr2->tuplecount;
+
+	instr1->nfiltered1 += instr2->nfiltered1;
+	instr1->nfiltered2 += instr2->nfiltered2;
+
+	/* Add the worker's buffer usage to the node's totals */
+	if (instr1->need_bufusage)
+		BufferUsageAdd(&instr1->bufusage, &instr2->bufusage);
+
+}
+
 /* dst += add - sub */
-static void
+void
 BufferUsageAccumDiff(BufferUsage *dst,
 					 const BufferUsage *add,
 					 const BufferUsage *sub)
@@ -148,3 +166,21 @@ BufferUsageAccumDiff(BufferUsage *dst,
 	INSTR_TIME_ACCUM_DIFF(dst->blk_write_time,
 						  add->blk_write_time, sub->blk_write_time);
 }
+
+/* dst += add */
+void
+BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
+{
+	dst->shared_blks_hit += add->shared_blks_hit;
+	dst->shared_blks_read += add->shared_blks_read;
+	dst->shared_blks_dirtied += add->shared_blks_dirtied;
+	dst->shared_blks_written += add->shared_blks_written;
+	dst->local_blks_hit += add->local_blks_hit;
+	dst->local_blks_read += add->local_blks_read;
+	dst->local_blks_dirtied += add->local_blks_dirtied;
+	dst->local_blks_written += add->local_blks_written;
+	dst->temp_blks_read += add->temp_blks_read;
+	dst->temp_blks_written += add->temp_blks_written;
+	INSTR_TIME_ADD(dst->blk_read_time, add->blk_read_time);
+	INSTR_TIME_ADD(dst->blk_write_time, add->blk_write_time);
+}
diff --git a/src/backend/executor/nodeFunnel.c b/src/backend/executor/nodeFunnel.c
new file mode 100644
index 0000000..5266cb8
--- /dev/null
+++ b/src/backend/executor/nodeFunnel.c
@@ -0,0 +1,436 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeFunnel.c
+ *	  Support routines for scanning a relation via multiple workers.
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeFunnel.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * INTERFACE ROUTINES
+ *		ExecFunnel				scans a relation using worker backends.
+ *		ExecInitFunnel			creates and initializes a funnel node.
+ *		ExecEndFunnel			releases any storage allocated.
+ *		ExecReScanFunnel		re-initializes the workers and rescans a relation via them.
+ */
+#include "postgres.h"
+
+#include "access/relscan.h"
+#include "executor/execdebug.h"
+#include "executor/execParallel.h"
+#include "executor/nodeFunnel.h"
+#include "executor/nodeSubplan.h"
+#include "utils/rel.h"
+
+
+static TupleTableSlot *funnel_getnext(FunnelState *funnelstate);
+static void ExecAccumulateInstInfo(FunnelState *node);
+static void ExecAccumulateBufUsageInfo(FunnelState *node);
+
+/* ----------------------------------------------------------------
+ *						Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		InitFunnel
+ *
+ *		Set up parallel state information
+ * ----------------------------------------------------------------
+ */
+static void
+InitFunnel(FunnelState *node, EState *estate, int eflags)
+{
+	Relation	currentRelation;
+
+	/*
+	 * get the relation object id from the relid'th entry in the range table,
+	 * open that relation and acquire appropriate lock on it.
+	 */
+	currentRelation = ExecOpenScanRelation(estate,
+										   ((SeqScan *) node->ss.ps.plan)->scanrelid,
+										   eflags);
+
+	node->ss.ss_currentRelation = currentRelation;
+
+	/* and report the scan tuple slot's rowtype */
+	ExecAssignScanType(&node->ss, RelationGetDescr(currentRelation));
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitFunnel
+ * ----------------------------------------------------------------
+ */
+FunnelState *
+ExecInitFunnel(Funnel *node, EState *estate, int eflags)
+{
+	FunnelState *funnelstate;
+
+	/* Funnel node doesn't have innerPlan node. */
+	Assert(innerPlan(node) == NULL);
+
+	/*
+	 * create state structure
+	 */
+	funnelstate = makeNode(FunnelState);
+	funnelstate->ss.ps.plan = (Plan *) node;
+	funnelstate->ss.ps.state = estate;
+	funnelstate->fs_workersReady = false;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * create expression context for node
+	 */
+	ExecAssignExprContext(estate, &funnelstate->ss.ps);
+
+	/*
+	 * initialize child expressions
+	 */
+	funnelstate->ss.ps.targetlist = (List *)
+		ExecInitExpr((Expr *) node->scan.plan.targetlist,
+					 (PlanState *) funnelstate);
+	funnelstate->ss.ps.qual = (List *)
+		ExecInitExpr((Expr *) node->scan.plan.qual,
+					 (PlanState *) funnelstate);
+
+	/*
+	 * tuple table initialization
+	 */
+	ExecInitResultTupleSlot(estate, &funnelstate->ss.ps);
+	ExecInitScanTupleSlot(estate, &funnelstate->ss);
+
+	InitFunnel(funnelstate, estate, eflags);
+
+	/*
+	 * now initialize outer plan
+	 */
+	outerPlanState(funnelstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+
+	funnelstate->ss.ps.ps_TupFromTlist = false;
+
+	/*
+	 * Initialize result tuple type and projection info.
+	 */
+	ExecAssignResultTypeFromTL(&funnelstate->ss.ps);
+	ExecAssignScanProjectionInfo(&funnelstate->ss);
+
+	return funnelstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecFunnel(node)
+ *
+ *		Scans the relation via multiple workers and returns
+ *		the next qualifying tuple.
+ * ----------------------------------------------------------------
+ */
+TupleTableSlot *
+ExecFunnel(FunnelState *node)
+{
+	int			i;
+	TupleTableSlot *slot;
+
+	/*
+	 * Initialize the parallel context and workers on first execution.
+	 * We do this on first execution rather than during node initialization,
+	 * as it needs to allocate a large dynamic shared memory segment, so it is
+	 * better to do it only if it is really needed.
+	 */
+	if (!node->pcxt)
+	{
+		EState		*estate = node->ss.ps.state;
+		ExprContext *econtext = node->ss.ps.ps_ExprContext;
+		bool		any_worker_launched = false;
+		List		*serialized_param_exec;
+
+		/*
+		 * Evaluate the InitPlan and pass the PARAM_EXEC params, so that
+		 * values can be shared with worker backend.  This is different
+		 * from the way InitPlans are evaluated (lazy evaluation) at other
+		 * places as instead of sharing the InitPlan to all the workers
+		 * and let them execute, we pass the values which can be directly
+		 * used by worker backends.
+		 */
+		serialized_param_exec = ExecAndFormSerializeParamExec(econtext,
+											node->ss.ps.plan->lefttree->allParam);
+
+		/* Initialize the workers required to execute funnel node. */
+		InitializeParallelWorkers(node->ss.ps.lefttree,
+								  serialized_param_exec,
+								  estate,
+								  &node->inst_options_space,
+								  &node->buffer_usage_space,
+								  &node->responseq,
+								  &node->pcxt,
+								  ((Funnel *)(node->ss.ps.plan))->num_workers);
+
+		outerPlanState(node)->toc = node->pcxt->toc;
+
+		/*
+		 * Register backend workers. If the required number of workers are
+		 * not available then we perform the scan with available workers and
+		 * if there are no more workers available, then the funnel node will
+		 * just scan locally.
+		 */
+		LaunchParallelWorkers(node->pcxt);
+
+		node->funnel = CreateTupleQueueFunnel();
+
+		for (i = 0; i < node->pcxt->nworkers; ++i)
+		{
+			if (node->pcxt->worker[i].bgwhandle)
+			{
+				shm_mq_set_handle((node->responseq)[i], node->pcxt->worker[i].bgwhandle);
+				RegisterTupleQueueOnFunnel(node->funnel, (node->responseq)[i]);
+				any_worker_launched = true;
+			}
+		}
+
+		if (any_worker_launched)
+			node->fs_workersReady = true;
+	}
+	
+	slot = funnel_getnext(node);
+	
+	if (TupIsNull(slot))
+	{
+
+		/*
+		 * Destroy the parallel context once we complete fetching all
+		 * the tuples, this will ensure that if in the same statement
+		 * we need to have Funnel node for multiple parts of statement,
+		 * it won't accumulate a lot of dsm segments and workers can be made
+		 * available for use by other parts of the statement.
+		 */
+		FinishParallelSetupAndAccumStats(node);
+	}
+	return slot;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndFunnel
+ *
+ *		frees any storage allocated through C routines.
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndFunnel(FunnelState *node)
+{
+	Relation	relation;
+
+	relation = node->ss.ss_currentRelation;
+
+	/*
+	 * Free the exprcontext
+	 */
+	ExecFreeExprContext(&node->ss.ps);
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+
+	/*
+	 * close the heap relation.
+	 */
+	ExecCloseScanRelation(relation);
+
+	ExecEndNode(outerPlanState(node));
+
+	FinishParallelSetupAndAccumStats(node);
+}
+
+/*
+ * funnel_getnext
+ *
+ * Get the next tuple from shared memory queue.  This function
+ * is responsible for fetching tuples from all the queues associated
+ * with worker backends used in funnel scan and, if there is no
+ * data available from queues or no worker is available, it
+ * fetches the data from the local node.
+ */
+static TupleTableSlot *
+funnel_getnext(FunnelState *funnelstate)
+{
+	PlanState		*outerPlan;
+	TupleTableSlot	*outerTupleSlot;
+	TupleTableSlot	*slot;
+	HeapTuple		tup;
+
+	if (funnelstate->ss.ps.ps_ProjInfo)
+		slot = funnelstate->ss.ps.ps_ProjInfo->pi_slot;
+	else
+		slot = funnelstate->ss.ss_ScanTupleSlot;
+
+	while ((!funnelstate->all_workers_done  && funnelstate->fs_workersReady) ||
+			!funnelstate->local_scan_done)
+	{
+		if (!funnelstate->all_workers_done && funnelstate->fs_workersReady)
+		{
+			/* wait only if local scan is done */
+			tup = TupleQueueFunnelNext(funnelstate->funnel,
+									   !funnelstate->local_scan_done,
+									   &funnelstate->all_workers_done);
+
+			if (HeapTupleIsValid(tup))
+			{
+				ExecStoreTuple(tup,		/* tuple to store */
+							   slot,	/* slot to store in */
+							   InvalidBuffer, /* buffer associated with this
+											   * tuple */
+							   true);	/* pfree this pointer if not from heap */
+
+				return slot;
+			}
+		}
+		if (!funnelstate->local_scan_done)
+		{
+			outerPlan = outerPlanState(funnelstate);
+
+			outerTupleSlot = ExecProcNode(outerPlan);
+
+			if (!TupIsNull(outerTupleSlot))
+				return outerTupleSlot;
+
+			funnelstate->local_scan_done = true;
+		}
+	}
+
+	return ExecClearTuple(slot);
+}
+
+/* ----------------------------------------------------------------
+ *		FinishParallelSetupAndAccumStats
+ *
+ *		Destroy the setup for parallel workers.  Collect all the
+ *		stats after workers are stopped, else some work done by
+ *		workers won't be accounted.
+ * ----------------------------------------------------------------
+ */
+void
+FinishParallelSetupAndAccumStats(FunnelState *node)
+{
+	if (node->pcxt)
+	{
+		/*
+		 * Ensure all workers have finished before destroying the parallel
+		 * context to ensure a clean exit.
+		 */
+		if (node->fs_workersReady)
+		{
+			TupleQueueFunnelShutdown(node->funnel);
+			WaitForParallelWorkersToFinish(node->pcxt);
+		}
+
+		/* destroy the tuple queue */
+		DestroyTupleQueueFunnel(node->funnel);
+		node->funnel = NULL;
+
+		/*
+		 * Aggregate the buffer usage stats from all workers.  This is
+		 * required by external modules like pg_stat_statements.
+		 */
+		ExecAccumulateBufUsageInfo(node);
+
+		/*
+		 * Aggregate instrumentation information of all the backend
+		 * workers for Funnel node.  This has to be done before we
+		 * destroy the parallel context.
+		 */
+		if (node->ss.ps.state->es_instrument)
+			ExecAccumulateInstInfo(node);
+
+		/* destroy parallel context. */
+		DestroyParallelContext(node->pcxt);
+		node->pcxt = NULL;
+
+		node->fs_workersReady = false;
+		node->all_workers_done = false;
+		node->local_scan_done = false;
+	}
+}
+
+/* ----------------------------------------------------------------
+ *		ExecAccumulateInstInfo
+ *
+ *		Accumulate instrumentation information of all the workers
+ * ----------------------------------------------------------------
+ */
+static void ExecAccumulateInstInfo(FunnelState *node)
+{
+	int i;
+	Instrumentation *instrument_worker;
+	int nworkers;
+	char *inst_info_workers;
+	
+	if (node->pcxt)
+	{
+		nworkers = node->pcxt->nworkers;
+		inst_info_workers = node->inst_options_space;
+		for (i = 0; i < nworkers; i++)
+		{
+			instrument_worker = (Instrumentation *)(inst_info_workers + (i * sizeof(Instrumentation)));
+			InstrAggNode(node->ss.ps.instrument, instrument_worker);
+		}
+	}
+}
+
+/* ----------------------------------------------------------------
+ *		ExecAccumulateBufUsageInfo
+ *
+ *		Accumulate buffer usage information of all the workers
+ * ----------------------------------------------------------------
+ */
+static void ExecAccumulateBufUsageInfo(FunnelState *node)
+{
+	int i;
+	int nworkers;
+	BufferUsage *buffer_usage_worker;
+	char *buffer_usage;
+
+	if (node->pcxt)
+	{
+		nworkers = node->pcxt->nworkers;
+		buffer_usage = node->buffer_usage_space;
+
+		for (i = 0; i < nworkers; i++)
+		{
+			buffer_usage_worker = (BufferUsage *)(buffer_usage + (i * sizeof(BufferUsage)));
+			BufferUsageAdd(&pgBufferUsage, buffer_usage_worker);
+		}
+	}
+}
+
+/* ----------------------------------------------------------------
+ *						Join Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecReScanFunnel
+ *
+ *		Re-initialize the workers and rescans a relation via them.
+ * ----------------------------------------------------------------
+ */
+void
+ExecReScanFunnel(FunnelState *node)
+{
+	/*
+	 * Re-initialize the parallel context and workers to perform
+	 * rescan of relation.  We want to gracefully shut down all the
+	 * workers so that they are able to propagate any error
+	 * or other information to master backend before dying.
+	 */
+	FinishParallelSetupAndAccumStats(node);
+
+	ExecReScan(node->ss.ps.lefttree);
+}
diff --git a/src/backend/executor/nodeNestloop.c b/src/backend/executor/nodeNestloop.c
index e66bcda..c447062 100644
--- a/src/backend/executor/nodeNestloop.c
+++ b/src/backend/executor/nodeNestloop.c
@@ -144,6 +144,7 @@ ExecNestLoop(NestLoopState *node)
 			{
 				NestLoopParam *nlp = (NestLoopParam *) lfirst(lc);
 				int			paramno = nlp->paramno;
+				TupleDesc	tdesc = outerTupleSlot->tts_tupleDescriptor;
 				ParamExecData *prm;
 
 				prm = &(econtext->ecxt_param_exec_vals[paramno]);
@@ -154,6 +155,7 @@ ExecNestLoop(NestLoopState *node)
 				prm->value = slot_getattr(outerTupleSlot,
 										  nlp->paramval->varattno,
 										  &(prm->isnull));
+				prm->ptype = tdesc->attrs[nlp->paramval->varattno-1]->atttypid;
 				/* Flag parameter value as changed */
 				innerPlan->chgParam = bms_add_member(innerPlan->chgParam,
 													 paramno);
diff --git a/src/backend/executor/nodeSubplan.c b/src/backend/executor/nodeSubplan.c
index 9eb4d63..6afd55a 100644
--- a/src/backend/executor/nodeSubplan.c
+++ b/src/backend/executor/nodeSubplan.c
@@ -30,11 +30,14 @@
 #include <math.h>
 
 #include "access/htup_details.h"
+#include "catalog/pg_type.h"
 #include "executor/executor.h"
 #include "executor/nodeSubplan.h"
 #include "nodes/makefuncs.h"
+#include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
 #include "utils/array.h"
+#include "utils/datum.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 
@@ -281,12 +284,14 @@ ExecScanSubPlan(SubPlanState *node,
 	forboth(l, subplan->parParam, pvar, node->args)
 	{
 		int			paramid = lfirst_int(l);
+		ExprState	*exprstate = (ExprState *) lfirst(pvar);
 		ParamExecData *prm = &(econtext->ecxt_param_exec_vals[paramid]);
 
-		prm->value = ExecEvalExprSwitchContext((ExprState *) lfirst(pvar),
+		prm->value = ExecEvalExprSwitchContext(exprstate,
 											   econtext,
 											   &(prm->isnull),
 											   NULL);
+		prm->ptype = exprType((Node *) exprstate->expr);
 		planstate->chgParam = bms_add_member(planstate->chgParam, paramid);
 	}
 
@@ -399,6 +404,7 @@ ExecScanSubPlan(SubPlanState *node,
 			prmdata = &(econtext->ecxt_param_exec_vals[paramid]);
 			Assert(prmdata->execPlan == NULL);
 			prmdata->value = slot_getattr(slot, col, &(prmdata->isnull));
+			prmdata->ptype = tdesc->attrs[col-1]->atttypid;
 			col++;
 		}
 
@@ -551,6 +557,7 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
 		 !TupIsNull(slot);
 		 slot = ExecProcNode(planstate))
 	{
+		TupleDesc	tdesc = slot->tts_tupleDescriptor;
 		int			col = 1;
 		ListCell   *plst;
 		bool		isnew;
@@ -568,6 +575,7 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
 			Assert(prmdata->execPlan == NULL);
 			prmdata->value = slot_getattr(slot, col,
 										  &(prmdata->isnull));
+			prmdata->ptype = tdesc->attrs[col-1]->atttypid;
 			col++;
 		}
 		slot = ExecProject(node->projRight, NULL);
@@ -954,6 +962,7 @@ ExecSetParamPlan(SubPlanState *node, ExprContext *econtext)
 	ListCell   *l;
 	bool		found = false;
 	ArrayBuildStateAny *astate = NULL;
+	Oid			ptype;
 
 	if (subLinkType == ANY_SUBLINK ||
 		subLinkType == ALL_SUBLINK)
@@ -961,6 +970,8 @@ ExecSetParamPlan(SubPlanState *node, ExprContext *econtext)
 	if (subLinkType == CTE_SUBLINK)
 		elog(ERROR, "CTE subplans should not be executed via ExecSetParamPlan");
 
+	ptype = exprType((Node *) node->xprstate.expr);
+
 	/* Initialize ArrayBuildStateAny in caller's context, if needed */
 	if (subLinkType == ARRAY_SUBLINK)
 		astate = initArrayResultAny(subplan->firstColType,
@@ -983,12 +994,14 @@ ExecSetParamPlan(SubPlanState *node, ExprContext *econtext)
 	forboth(l, subplan->parParam, pvar, node->args)
 	{
 		int			paramid = lfirst_int(l);
+		ExprState	*exprstate = (ExprState *) lfirst(pvar);
 		ParamExecData *prm = &(econtext->ecxt_param_exec_vals[paramid]);
 
-		prm->value = ExecEvalExprSwitchContext((ExprState *) lfirst(pvar),
+		prm->value = ExecEvalExprSwitchContext(exprstate,
 											   econtext,
 											   &(prm->isnull),
 											   NULL);
+		prm->ptype = exprType((Node *) exprstate->expr);
 		planstate->chgParam = bms_add_member(planstate->chgParam, paramid);
 	}
 
@@ -1011,6 +1024,7 @@ ExecSetParamPlan(SubPlanState *node, ExprContext *econtext)
 
 			prm->execPlan = NULL;
 			prm->value = BoolGetDatum(true);
+			prm->ptype = ptype;
 			prm->isnull = false;
 			found = true;
 			break;
@@ -1062,6 +1076,7 @@ ExecSetParamPlan(SubPlanState *node, ExprContext *econtext)
 			prm->execPlan = NULL;
 			prm->value = heap_getattr(node->curTuple, i, tdesc,
 									  &(prm->isnull));
+			prm->ptype = tdesc->attrs[i-1]->atttypid;
 			i++;
 		}
 	}
@@ -1084,6 +1099,7 @@ ExecSetParamPlan(SubPlanState *node, ExprContext *econtext)
 											true);
 		prm->execPlan = NULL;
 		prm->value = node->curArray;
+		prm->ptype = ptype;
 		prm->isnull = false;
 	}
 	else if (!found)
@@ -1096,6 +1112,7 @@ ExecSetParamPlan(SubPlanState *node, ExprContext *econtext)
 
 			prm->execPlan = NULL;
 			prm->value = BoolGetDatum(false);
+			prm->ptype = ptype;
 			prm->isnull = false;
 		}
 		else
@@ -1108,6 +1125,7 @@ ExecSetParamPlan(SubPlanState *node, ExprContext *econtext)
 
 				prm->execPlan = NULL;
 				prm->value = (Datum) 0;
+				prm->ptype = VOIDOID;
 				prm->isnull = true;
 			}
 		}
@@ -1238,3 +1256,47 @@ ExecAlternativeSubPlan(AlternativeSubPlanState *node,
 					   isNull,
 					   isDone);
 }
+
+/*
+ * ExecAndFormSerializeParamExec
+ *
+ * Execute the subplan stored in each PARAM_EXEC param, if it has not been
+ * executed yet, and form the serialized structure required for passing
+ * the params to a worker backend.
+ */
+List *
+ExecAndFormSerializeParamExec(ExprContext *econtext, Bitmapset *params)
+{
+	List	*lparam = NIL;
+	SerializedParamExecData *sparamdata;
+	ParamExecData *prm;
+	int		paramid;
+
+	paramid = -1;
+	while ((paramid = bms_next_member(params, paramid)) >= 0)
+	{
+		/*
+		 * PARAM_EXEC params (internal executor parameters) are stored in the
+		 * ecxt_param_exec_vals array, and can be accessed by array index.
+		 */
+		sparamdata = palloc0(sizeof(SerializedParamExecData));
+
+		prm = &(econtext->ecxt_param_exec_vals[paramid]);
+		if (prm->execPlan != NULL)
+		{
+			/* Parameter not evaluated yet, so go do it */
+			ExecSetParamPlan(prm->execPlan, econtext);
+			/* ExecSetParamPlan should have processed this param... */
+			Assert(prm->execPlan == NULL);
+		}
+
+		sparamdata->paramid	= paramid;
+		sparamdata->ptype = prm->ptype;
+		sparamdata->value = prm->value;
+		sparamdata->isnull = prm->isnull;
+
+		lparam = lappend(lparam, sparamdata);
+	}
+
+	return lparam;
+}
diff --git a/src/backend/executor/spi.c b/src/backend/executor/spi.c
index d544ad9..d8ca074 100644
--- a/src/backend/executor/spi.c
+++ b/src/backend/executor/spi.c
@@ -1774,7 +1774,7 @@ spi_dest_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
  *		store tuple retrieved by Executor into SPITupleTable
  *		of current SPI procedure
  */
-void
+bool
 spi_printtup(TupleTableSlot *slot, DestReceiver *self)
 {
 	SPITupleTable *tuptable;
@@ -1808,6 +1808,8 @@ spi_printtup(TupleTableSlot *slot, DestReceiver *self)
 	(tuptable->free)--;
 
 	MemoryContextSwitchTo(oldcxt);
+
+	return true;
 }
 
 /*
diff --git a/src/backend/executor/tqueue.c b/src/backend/executor/tqueue.c
new file mode 100644
index 0000000..39acda7
--- /dev/null
+++ b/src/backend/executor/tqueue.c
@@ -0,0 +1,304 @@
+/*-------------------------------------------------------------------------
+ *
+ * tqueue.c
+ *	  Use shm_mq to send & receive tuples between parallel backends
+ *
+ * A DestReceiver of type DestTupleQueue, which is a TQueueDestReceiver
+ * under the hood, writes tuples from the executor to a shm_mq.
+ *
+ * A TupleQueueFunnel helps manage the process of reading tuples from
+ * one or more shm_mq objects being used as tuple queues.
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/tqueue.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/tqueue.h"
+#include "miscadmin.h"
+
+typedef struct
+{
+	DestReceiver pub;
+	shm_mq_handle *handle;
+} TQueueDestReceiver;
+
+struct TupleQueueFunnel
+{
+	int		nqueues;
+	int		maxqueues;
+	int		nextqueue;
+	shm_mq_handle **queue;
+};
+
+/*
+ * Receive a tuple.
+ */
+static bool
+tqueueReceiveSlot(TupleTableSlot *slot, DestReceiver *self)
+{
+	TQueueDestReceiver *tqueue = (TQueueDestReceiver *) self;
+	HeapTuple	tuple;
+	shm_mq_result	result;
+
+	tuple = ExecMaterializeSlot(slot);
+	result = shm_mq_send(tqueue->handle, tuple->t_len, tuple->t_data, false);
+
+	if (result == SHM_MQ_DETACHED)
+		return false;
+	else if (result != SHM_MQ_SUCCESS)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("unable to send tuples")));
+
+	return true;
+}
+
+/*
+ * Prepare to receive tuples from executor.
+ */
+static void
+tqueueStartupReceiver(DestReceiver *self, int operation, TupleDesc typeinfo)
+{
+	/* do nothing */
+}
+
+/*
+ * Clean up at end of an executor run
+ */
+static void
+tqueueShutdownReceiver(DestReceiver *self)
+{
+	/* do nothing */
+}
+
+/*
+ * Destroy receiver when done with it
+ */
+static void
+tqueueDestroyReceiver(DestReceiver *self)
+{
+	pfree(self);
+}
+
+/*
+ * Create a DestReceiver that writes tuples to a tuple queue.
+ */
+DestReceiver *
+CreateTupleQueueDestReceiver(void)
+{
+	TQueueDestReceiver *self;
+
+	self = (TQueueDestReceiver *) palloc0(sizeof(TQueueDestReceiver));
+
+	self->pub.receiveSlot = tqueueReceiveSlot;
+	self->pub.rStartup = tqueueStartupReceiver;
+	self->pub.rShutdown = tqueueShutdownReceiver;
+	self->pub.rDestroy = tqueueDestroyReceiver;
+	self->pub.mydest = DestTupleQueue;
+
+	/* private fields will be set by SetTupleQueueDestReceiverParams */
+
+	return (DestReceiver *) self;
+}
+
+/*
+ * Set parameters for a TupleQueueDestReceiver
+ */
+void
+SetTupleQueueDestReceiverParams(DestReceiver *self,
+								shm_mq_handle *handle)
+{
+	TQueueDestReceiver *myState = (TQueueDestReceiver *) self;
+
+	myState->handle = handle;
+}
+
+/*
+ * Create a tuple queue funnel.
+ */
+TupleQueueFunnel *
+CreateTupleQueueFunnel(void)
+{
+	TupleQueueFunnel *funnel = palloc0(sizeof(TupleQueueFunnel));
+
+	funnel->maxqueues = 8;
+	funnel->queue = palloc(funnel->maxqueues * sizeof(shm_mq_handle *));
+
+	return funnel;
+}
+
+/*
+ * Detach all tuple queues that belong to funnel.
+ */
+void
+TupleQueueFunnelShutdown(TupleQueueFunnel *funnel)
+{
+	if (funnel)
+	{
+		int		i;
+		shm_mq_handle *mqh;
+		shm_mq	   *mq;
+		for (i = 0; i < funnel->nqueues; i++)
+		{
+			mqh = funnel->queue[i];
+			mq = shm_mq_get_queue(mqh);
+			shm_mq_detach(mq);
+		}
+	}
+}
+
+/*
+ * Destroy a tuple queue funnel.
+ */
+void
+DestroyTupleQueueFunnel(TupleQueueFunnel *funnel)
+{
+	if (funnel)
+	{
+		pfree(funnel->queue);
+		pfree(funnel);
+	}
+}
+
+/*
+ * Remember the shared memory queue handle in funnel.
+ */
+void
+RegisterTupleQueueOnFunnel(TupleQueueFunnel *funnel, shm_mq_handle *handle)
+{
+	if (funnel->nqueues < funnel->maxqueues)
+	{
+		funnel->queue[funnel->nqueues++] = handle;
+		return;
+	}
+
+	if (funnel->nqueues >= funnel->maxqueues)
+	{
+		int newsize = funnel->nqueues * 2;
+
+		Assert(funnel->nqueues == funnel->maxqueues);
+
+		funnel->queue = repalloc(funnel->queue,
+								 newsize * sizeof(shm_mq_handle *));
+		funnel->maxqueues = newsize;
+	}
+
+	funnel->queue[funnel->nqueues++] = handle;
+}
+
+/*
+ * Fetch a tuple from a tuple queue funnel.
+ *
+ * We try to read from the queues in round-robin fashion so as to avoid
+ * the situation where some workers get their tuples read expeditiously while
+ * others are barely ever serviced.
+ *
+ * Even when nowait = false, we read from the individual queues in
+ * non-blocking mode.  Even when shm_mq_receive() returns SHM_MQ_WOULD_BLOCK,
+ * it can still accumulate bytes from a partially-read message, so doing it
+ * this way should outperform doing a blocking read on each queue in turn.
+ *
+ * The return value is NULL if there are no remaining queues or if
+ * nowait = true and no queue returned a tuple without blocking.  *done, if
+ * not NULL, is set to true when there are no remaining queues and false in
+ * any other case.
+ */
+HeapTuple
+TupleQueueFunnelNext(TupleQueueFunnel *funnel, bool nowait, bool *done)
+{
+	int	waitpos = funnel->nextqueue;
+
+	/* Corner case: called before adding any queues, or after all are gone. */
+	if (funnel->nqueues == 0)
+	{
+		if (done != NULL)
+			*done = true;
+		return NULL;
+	}
+
+	if (done != NULL)
+		*done = false;
+
+	for (;;)
+	{
+		shm_mq_handle *mqh = funnel->queue[funnel->nextqueue];
+		shm_mq_result result;
+		Size	nbytes;
+		void   *data;
+
+		/* Attempt to read a message. */
+		result = shm_mq_receive(mqh, &nbytes, &data, true);
+
+		/*
+		 * Normally, we advance funnel->nextqueue to the next queue at this
+		 * point, but if we're pointing to a queue that we've just discovered
+		 * is detached, then forget that queue and leave the pointer where it
+		 * is until the number of remaining queues falls below that pointer and
+		 * at that point make the pointer point to the first queue.
+		 */
+		if (result != SHM_MQ_DETACHED)
+			funnel->nextqueue = (funnel->nextqueue + 1) % funnel->nqueues;
+		else
+		{
+			--funnel->nqueues;
+			if (funnel->nqueues == 0)
+			{
+				if (done != NULL)
+					*done = true;
+				return NULL;
+			}
+
+			memmove(&funnel->queue[funnel->nextqueue],
+					&funnel->queue[funnel->nextqueue + 1],
+					sizeof(shm_mq_handle *)
+						* (funnel->nqueues - funnel->nextqueue));
+
+			if (funnel->nextqueue >= funnel->nqueues)
+				funnel->nextqueue = 0;
+
+			if (funnel->nextqueue < waitpos)
+				--waitpos;
+
+			continue;
+		}
+
+		/* If we got a message, return it. */
+		if (result == SHM_MQ_SUCCESS)
+		{
+			HeapTupleData htup;
+
+			/*
+			 * The tuple data we just read from the queue is only valid
+			 * until we again attempt to read from it.  Copy the tuple into
+			 * a single palloc'd chunk as callers will expect.
+			 */
+			ItemPointerSetInvalid(&htup.t_self);
+			htup.t_tableOid = InvalidOid;
+			htup.t_len = nbytes;
+			htup.t_data = data;
+			return heap_copytuple(&htup);
+		}
+
+		/*
+		 * If we've visited all of the queues, then we should either give up
+		 * and return NULL (if we're in non-blocking mode) or wait for the
+		 * process latch to be set (otherwise).
+		 */
+		if (funnel->nextqueue == waitpos)
+		{
+			if (nowait)
+				return NULL;
+			WaitLatch(MyLatch, WL_LATCH_SET, 0);
+			CHECK_FOR_INTERRUPTS();
+			ResetLatch(MyLatch);
+		}
+	}
+}
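The round-robin polling described in the TupleQueueFunnelNext comment above can be sketched outside PostgreSQL as follows. This is a minimal toy model, not the patch's code: ToyQueue, ToyFunnel, toy_receive and toy_demo_drain are hypothetical names, plain ints stand in for tuples, and only the nowait = true path is modelled (no latch wait). It keeps the same bookkeeping as the patch: a detached queue is removed with memmove, nextqueue wraps to 0 when it runs off the end, and waitpos is decremented when the removal shifts it.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins for shm_mq results; not the PostgreSQL API. */
typedef enum { Q_SUCCESS, Q_WOULD_BLOCK, Q_DETACHED } QResult;

typedef struct
{
	const int  *items;				/* pending values, standing in for tuples */
	int			nitems;
	int			pos;
	int			detach_when_empty;	/* sender exits once the queue is drained */
} ToyQueue;

typedef struct
{
	int			nqueues;
	int			nextqueue;
	ToyQueue   *queue[8];
} ToyFunnel;

static QResult
toy_receive(ToyQueue *q, int *out)
{
	if (q->pos < q->nitems)
	{
		*out = q->items[q->pos++];
		return Q_SUCCESS;
	}
	return q->detach_when_empty ? Q_DETACHED : Q_WOULD_BLOCK;
}

/*
 * Non-blocking fetch.  Returns 1 and sets *out when a value was read.
 * Returns 0 with *done = 1 when no queues remain, or 0 with *done = 0
 * when a full pass over the live queues found nothing to read.
 */
static int
toy_funnel_next(ToyFunnel *f, int *out, int *done)
{
	int			waitpos = f->nextqueue;

	*done = 0;
	if (f->nqueues == 0)
	{
		*done = 1;
		return 0;
	}

	for (;;)
	{
		QResult		r;

		assert(f->nextqueue < f->nqueues);
		r = toy_receive(f->queue[f->nextqueue], out);

		if (r == Q_DETACHED)
		{
			/* Forget this queue and close the gap in the array. */
			--f->nqueues;
			if (f->nqueues == 0)
			{
				*done = 1;
				return 0;
			}
			memmove(&f->queue[f->nextqueue],
					&f->queue[f->nextqueue + 1],
					sizeof(ToyQueue *) * (f->nqueues - f->nextqueue));
			if (f->nextqueue >= f->nqueues)
				f->nextqueue = 0;
			if (f->nextqueue < waitpos)
				--waitpos;
			continue;
		}

		/* Live queue: advance so the next call starts with its neighbour. */
		f->nextqueue = (f->nextqueue + 1) % f->nqueues;
		if (r == Q_SUCCESS)
			return 1;
		if (f->nextqueue == waitpos)
			return 0;			/* visited every queue without success */
	}
}

/* Drain three toy queues holding {1,2}, {10} and {100,101,102}. */
int
toy_demo_drain(int *out, int cap)
{
	static const int a[] = {1, 2};
	static const int b[] = {10};
	static const int c[] = {100, 101, 102};
	ToyQueue	qa = {a, 2, 0, 1};
	ToyQueue	qb = {b, 1, 0, 1};
	ToyQueue	qc = {c, 3, 0, 1};
	ToyFunnel	f = {3, 0, {&qa, &qb, &qc}};
	int			n = 0;
	int			v;
	int			done = 0;

	while (n < cap && toy_funnel_next(&f, &v, &done))
		out[n++] = v;
	return n;
}
```

Calling toy_demo_drain(out, 16) interleaves the three sources instead of exhausting the first queue before touching the others, which is the fairness property the comment is after.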
diff --git a/src/backend/executor/tstoreReceiver.c b/src/backend/executor/tstoreReceiver.c
index c1fdeb7..b0862ae 100644
--- a/src/backend/executor/tstoreReceiver.c
+++ b/src/backend/executor/tstoreReceiver.c
@@ -37,8 +37,8 @@ typedef struct
 } TStoreState;
 
 
-static void tstoreReceiveSlot_notoast(TupleTableSlot *slot, DestReceiver *self);
-static void tstoreReceiveSlot_detoast(TupleTableSlot *slot, DestReceiver *self);
+static bool tstoreReceiveSlot_notoast(TupleTableSlot *slot, DestReceiver *self);
+static bool tstoreReceiveSlot_detoast(TupleTableSlot *slot, DestReceiver *self);
 
 
 /*
@@ -90,19 +90,21 @@ tstoreStartupReceiver(DestReceiver *self, int operation, TupleDesc typeinfo)
  * Receive a tuple from the executor and store it in the tuplestore.
  * This is for the easy case where we don't have to detoast.
  */
-static void
+static bool
 tstoreReceiveSlot_notoast(TupleTableSlot *slot, DestReceiver *self)
 {
 	TStoreState *myState = (TStoreState *) self;
 
 	tuplestore_puttupleslot(myState->tstore, slot);
+
+	return true;
 }
 
 /*
  * Receive a tuple from the executor and store it in the tuplestore.
  * This is for the case where we have to detoast any toasted values.
  */
-static void
+static bool
 tstoreReceiveSlot_detoast(TupleTableSlot *slot, DestReceiver *self)
 {
 	TStoreState *myState = (TStoreState *) self;
@@ -152,6 +154,8 @@ tstoreReceiveSlot_detoast(TupleTableSlot *slot, DestReceiver *self)
 	/* And release any temporary detoasted values */
 	for (i = 0; i < nfree; i++)
 		pfree(DatumGetPointer(myState->tofree[i]));
+
+	return true;
 }
 
 /*
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 1d3dd22..11d8191 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -382,6 +382,27 @@ _copySampleScan(const SampleScan *from)
 }
 
 /*
+ * _copyFunnel
+ */
+static Funnel *
+_copyFunnel(const Funnel *from)
+{
+	Funnel    *newnode = makeNode(Funnel);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopyScanFields((const Scan *) from, (Scan *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(num_workers);
+
+	return newnode;
+}
+
+/*
  * _copyIndexScan
  */
 static IndexScan *
@@ -4239,6 +4260,9 @@ copyObject(const void *from)
 		case T_SampleScan:
 			retval = _copySampleScan(from);
 			break;
+		case T_Funnel:
+			retval = _copyFunnel(from);
+			break;
 		case T_IndexScan:
 			retval = _copyIndexScan(from);
 			break;
diff --git a/src/backend/nodes/nodeFuncs.c b/src/backend/nodes/nodeFuncs.c
index c517dfd..0cf34db 100644
--- a/src/backend/nodes/nodeFuncs.c
+++ b/src/backend/nodes/nodeFuncs.c
@@ -3412,3 +3412,25 @@ raw_expression_tree_walker(Node *node,
 	}
 	return false;
 }
+
+/*
+ * planstate_tree_walker
+ *
+ * This routine will invoke walker on the node passed.  This is a useful
+ * way of starting the recursion when the walker's normal change of state
+ * is not appropriate for the outermost PlanState node.
+ */
+bool
+planstate_tree_walker(Node *node,
+					  ParallelContext *pcxt,
+					  bool (*walker) (),
+					  void *context)
+{
+	if (node == NULL)
+		return false;
+
+	/* Guard against stack overflow due to overly complex plan */
+	check_stack_depth();
+
+	return walker(node, pcxt, context);
+}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 152e715..232b950 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -456,6 +456,16 @@ _outSampleScan(StringInfo str, const SampleScan *node)
 }
 
 static void
+_outFunnel(StringInfo str, const Funnel *node)
+{
+	WRITE_NODE_TYPE("FUNNEL");
+
+	_outScanInfo(str, (const Scan *) node);
+
+	WRITE_UINT_FIELD(num_workers);
+}
+
+static void
 _outIndexScan(StringInfo str, const IndexScan *node)
 {
 	WRITE_NODE_TYPE("INDEXSCAN");
@@ -3005,6 +3015,9 @@ _outNode(StringInfo str, const void *obj)
 			case T_SampleScan:
 				_outSampleScan(str, obj);
 				break;
+			case T_Funnel:
+				_outFunnel(str, obj);
+				break;
 			case T_IndexScan:
 				_outIndexScan(str, obj);
 				break;
diff --git a/src/backend/nodes/params.c b/src/backend/nodes/params.c
index fb803f8..e81afbd 100644
--- a/src/backend/nodes/params.c
+++ b/src/backend/nodes/params.c
@@ -16,9 +16,22 @@
 #include "postgres.h"
 
 #include "nodes/params.h"
+#include "storage/shmem.h"
 #include "utils/datum.h"
 #include "utils/lsyscache.h"
 
+/*
+ * For each bind parameter, pass this structure, followed by the value
+ * itself unless the parameter is pass-by-value.
+ */
+typedef struct SerializedParamExternData
+{
+	Datum		value;			/* pass-by-value datums are stored directly */
+	Size		length;			/* length of parameter value */
+	bool		isnull;			/* is it NULL? */
+	uint16		pflags;			/* flag bits, same as in original Param */
+	Oid			ptype;			/* parameter's datatype, or 0 */
+} SerializedParamExternData;
 
 /*
  * Copy a ParamListInfo structure.
@@ -73,3 +86,354 @@ copyParamList(ParamListInfo from)
 
 	return retval;
 }
+
+/*
+ * Estimate the amount of space required to serialize the bound
+ * parameters.
+ */
+Size
+EstimateBoundParametersSpace(ParamListInfo paramInfo)
+{
+	Size		size;
+	int			i;
+
+	/* Add space required for saving numParams */
+	size = sizeof(int);
+
+	if (paramInfo)
+	{
+		/* Add space required for saving the param data */
+		for (i = 0; i < paramInfo->numParams; i++)
+		{
+			/*
+			 * for each parameter, calculate the size of fixed part
+			 * of parameter (SerializedParamExternData) and length of
+			 * parameter value.
+			 */
+			ParamExternData *oprm;
+			int16		typLen;
+			bool		typByVal;
+			Size		length;
+
+			length = sizeof(SerializedParamExternData);
+
+			oprm = &paramInfo->params[i];
+
+			get_typlenbyval(oprm->ptype, &typLen, &typByVal);
+
+			/*
+			 * pass-by-value parameters are directly stored in
+			 * SerializedParamExternData, so no need of additional
+			 * space for them.
+			 */
+			if (!(typByVal || oprm->isnull))
+			{
+				length += datumGetSize(oprm->value, typByVal, typLen);
+				size = add_size(size, length);
+
+				/* Allow space for terminating zero-byte */
+				size = add_size(size, 1);
+			}
+			else
+				size = add_size(size, length);
+		}
+	}
+
+	return size;
+}
+
+/*
+ * Serialize the bind parameters into memory, beginning at start_address.
+ * maxsize should be at least as large as the value returned by
+ * EstimateBoundParametersSpace.
+ */
+void
+SerializeBoundParams(ParamListInfo paramInfo, Size maxsize, char *start_address)
+{
+	char	   *curptr;
+	SerializedParamExternData *retval;
+	int i;
+
+	/*
+	 * First, we store the number of bind parameters.  If there are
+	 * none, there is no need to store any more information.
+	 */
+	if (paramInfo && paramInfo->numParams > 0)
+		* (int *) start_address = paramInfo->numParams;
+	else
+	{
+		* (int *) start_address = 0;
+		return;
+	}
+	curptr = start_address + sizeof(int);
+
+
+	for (i = 0; i < paramInfo->numParams; i++)
+	{
+		ParamExternData *oprm;
+		int16		typLen;
+		bool		typByVal;
+		Size		datumlength, length;
+		const char	*s;
+
+		Assert (curptr <= start_address + maxsize);
+		retval = (SerializedParamExternData*) curptr;
+		oprm = &paramInfo->params[i];
+
+		retval->isnull = oprm->isnull;
+		retval->pflags = oprm->pflags;
+		retval->ptype = oprm->ptype;
+		retval->value = oprm->value;
+
+		curptr = curptr + sizeof(SerializedParamExternData);
+
+		if (retval->isnull)
+			continue;
+
+		get_typlenbyval(oprm->ptype, &typLen, &typByVal);
+
+		if (!typByVal)
+		{
+			datumlength = datumGetSize(oprm->value, typByVal, typLen);
+			s = (char *) DatumGetPointer(oprm->value);
+			memcpy(curptr, s, datumlength);
+			length = datumlength;
+			curptr[length] = '\0';
+			retval->length = length;
+			curptr += length + 1;
+		}
+	}
+}
+
+/*
+ * RestoreBoundParams
+ *		Restore bind parameters from the specified address.
+ *
+ * The params are palloc'd in CurrentMemoryContext.
+ */
+ParamListInfo
+RestoreBoundParams(char *start_address)
+{
+	ParamListInfo retval;
+	Size		size;
+	int			num_params,i;
+	char	   *curptr;
+
+	num_params = * (int *) start_address;
+
+	if (num_params <= 0)
+		return NULL;
+
+	size = offsetof(ParamListInfoData, params) +
+						num_params * sizeof(ParamExternData);
+	retval = (ParamListInfo) palloc(size);
+	retval->paramFetch = NULL;
+	retval->paramFetchArg = NULL;
+	retval->parserSetup = NULL;
+	retval->parserSetupArg = NULL;
+	retval->numParams = num_params;
+
+	curptr = start_address + sizeof(int);
+
+	for (i = 0; i < num_params; i++)
+	{
+		SerializedParamExternData *nprm;
+		char	*s;
+		int16		typLen;
+		bool		typByVal;
+
+		nprm = (SerializedParamExternData *) curptr;
+
+		/* copy the parameter info */
+		retval->params[i].isnull = nprm->isnull;
+		retval->params[i].pflags = nprm->pflags;
+		retval->params[i].ptype = nprm->ptype;
+		retval->params[i].value = nprm->value;
+
+		curptr = curptr + sizeof(SerializedParamExternData);
+
+		if (nprm->isnull)
+			continue;
+
+		get_typlenbyval(nprm->ptype, &typLen, &typByVal);
+
+		if (!typByVal)
+		{
+			s = palloc(nprm->length + 1);
+			memcpy(s, curptr, nprm->length + 1);
+			retval->params[i].value = PointerGetDatum(s);
+
+			curptr += nprm->length + 1;
+		}
+	}
+
+	return retval;
+}
+
+/*
+ * Estimate the amount of space required to serialize the PARAM_EXEC
+ * parameters.
+ */
+Size
+EstimateExecParametersSpace(List *serialized_param_exec_vals)
+{
+	Size		size;
+	ListCell	*lparam;
+
+	/*
+	 * Add space required for saving the number of PARAM_EXEC parameters
+	 * that need to be serialized.
+	 */
+	size = sizeof(int);
+
+	foreach(lparam, serialized_param_exec_vals)
+	{
+		int16		typLen;
+		bool		typByVal;
+		Size		length;
+		SerializedParamExecData* param_val = (SerializedParamExecData*) lfirst(lparam);
+
+		length = sizeof(SerializedParamExecData);
+
+		get_typlenbyval(param_val->ptype, &typLen, &typByVal);
+
+		/*
+		 * pass-by-value parameters are directly stored in
+		 * SerializedParamExecData, so no need of additional
+		 * space for them.
+		 */
+		if (!(typByVal || param_val->isnull))
+		{
+			length += datumGetSize(param_val->value, typByVal, typLen);
+			size = add_size(size, length);
+
+			/* Allow space for terminating zero-byte */
+			size = add_size(size, 1);
+		}
+		else
+			size = add_size(size, length);
+	}
+
+	return size;
+}
+
+/*
+ * Serialize the PARAM_EXEC parameters into the memory, beginning at
+ * start_address.  maxsize should be at least as large as the value
+ * returned by EstimateExecParametersSpace.
+ */
+void
+SerializeExecParams(List *serialized_param_exec_vals, Size maxsize,
+					char *start_address)
+{
+	char	   *curptr;
+	SerializedParamExecData *retval;
+	ListCell	*lparam;
+
+	/*
+	 * First, we store the number of PARAM_EXEC parameters that need to
+	 * be serialized.
+	 */
+	if (serialized_param_exec_vals)
+		* (int *) start_address = list_length(serialized_param_exec_vals);
+	else
+	{
+		* (int *) start_address = 0;
+		return;
+	}
+
+	curptr = start_address + sizeof(int);
+
+	foreach(lparam, serialized_param_exec_vals)
+	{
+		int16		typLen;
+		bool		typByVal;
+		Size		datumlength, length;
+		const char	*s;
+		SerializedParamExecData* param_val = (SerializedParamExecData*) lfirst(lparam);
+
+		retval = (SerializedParamExecData*) curptr;
+
+		retval->paramid	= param_val->paramid;
+		retval->value = param_val->value;
+		retval->isnull = param_val->isnull;
+		retval->ptype = param_val->ptype;
+
+		curptr = curptr + sizeof(SerializedParamExecData);
+
+		if (retval->isnull)
+			continue;
+
+		get_typlenbyval(retval->ptype, &typLen, &typByVal);
+
+		if (!typByVal)
+		{
+			datumlength = datumGetSize(retval->value, typByVal, typLen);
+			s = (char *) DatumGetPointer(retval->value);
+			memcpy(curptr, s, datumlength);
+			length = datumlength;
+			curptr[length] = '\0';
+			retval->length = length;
+			curptr += length + 1;
+		}
+	}
+}
+
+/*
+ * RestoreExecParams
+ *		Restore PARAM_EXEC parameters from the specified address.
+ *
+ * The params are palloc'd in CurrentMemoryContext.
+ */
+List *
+RestoreExecParams(char *start_address)
+{
+	List			*lparamexecvals = NIL;
+	int				num_params, i;
+	char			*curptr;
+
+	num_params = * (int *) start_address;
+
+	if (num_params <= 0)
+		return NULL;
+
+	curptr = start_address + sizeof(int);
+
+	for (i = 0; i < num_params; i++)
+	{
+		SerializedParamExecData *nprm;
+		SerializedParamExecData	*outparam;
+		char	*s;
+		int16		typLen;
+		bool		typByVal;
+
+		nprm = (SerializedParamExecData *) curptr;
+
+		outparam = palloc0(sizeof(SerializedParamExecData));
+
+		/* copy the parameter info */
+		outparam->isnull = nprm->isnull;
+		outparam->value = nprm->value;
+		outparam->paramid = nprm->paramid;
+
+		curptr = curptr + sizeof(SerializedParamExecData);
+
+		/* NULL params carry no payload, but must still reach the list */
+		typByVal = true;
+		if (!nprm->isnull)
+			get_typlenbyval(nprm->ptype, &typLen, &typByVal);
+
+		if (!typByVal)
+		{
+			s = palloc(nprm->length + 1);
+			memcpy(s, curptr, nprm->length + 1);
+			outparam->value = PointerGetDatum(s);
+
+			curptr += nprm->length + 1;
+		}
+
+		lparamexecvals = lappend(lparamexecvals, outparam);
+	}
+
+	return lparamexecvals;
+}
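The buffer layout used by the Estimate/Serialize/Restore functions above (an int count, then for each parameter a fixed-size header, followed by the payload plus a terminating zero byte only for non-NULL by-reference values) can be sketched in standalone C. This is a toy under stated assumptions, not the patch's code: ToyHeader, ToyParam and the toy_* functions are hypothetical, a byval flag stands in for the get_typlenbyval() lookup, and malloc stands in for palloc. The sketch copies headers with memcpy rather than casting the cursor to a struct pointer, because the cursor can end up unaligned after a variable-length payload; whether the direct casts in the patch are safe depends on the platform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical fixed-size header, loosely modelled on the serialized structs. */
typedef struct
{
	int			paramid;
	int			isnull;
	int			byval;		/* stands in for a get_typlenbyval() lookup */
	long		value;		/* meaningful only when byval is set */
	size_t		length;		/* payload length when !byval */
} ToyHeader;

typedef struct
{
	ToyHeader	hdr;
	const char *data;		/* payload for by-reference values, else NULL */
} ToyParam;

/* Space needed: count + one header per param + payload and NUL if by-ref. */
size_t
toy_estimate(const ToyParam *p, int n)
{
	size_t		size = sizeof(int);
	int			i;

	for (i = 0; i < n; i++)
	{
		size += sizeof(ToyHeader);
		if (!p[i].hdr.isnull && !p[i].hdr.byval)
			size += p[i].hdr.length + 1;
	}
	return size;
}

/* Write the params into buf, which must hold toy_estimate() bytes. */
void
toy_serialize(const ToyParam *p, int n, char *buf)
{
	char	   *cur = buf;
	int			i;

	memcpy(cur, &n, sizeof(int));
	cur += sizeof(int);

	for (i = 0; i < n; i++)
	{
		memcpy(cur, &p[i].hdr, sizeof(ToyHeader));
		cur += sizeof(ToyHeader);

		if (!p[i].hdr.isnull && !p[i].hdr.byval)
		{
			memcpy(cur, p[i].data, p[i].hdr.length);
			cur[p[i].hdr.length] = '\0';
			cur += p[i].hdr.length + 1;
		}
	}
}

/* Rebuild up to cap params from buf; by-ref payloads are malloc'd copies. */
int
toy_restore(const char *buf, ToyParam *out, int cap)
{
	const char *cur = buf;
	int			n;
	int			i;

	memcpy(&n, cur, sizeof(int));
	cur += sizeof(int);
	assert(n <= cap);

	for (i = 0; i < n; i++)
	{
		memcpy(&out[i].hdr, cur, sizeof(ToyHeader));
		cur += sizeof(ToyHeader);
		out[i].data = NULL;

		/* NULL and by-value params have no payload but are still returned. */
		if (!out[i].hdr.isnull && !out[i].hdr.byval)
		{
			char	   *s = malloc(out[i].hdr.length + 1);

			memcpy(s, cur, out[i].hdr.length + 1);
			out[i].data = s;
			cur += out[i].hdr.length + 1;
		}
	}
	return n;
}

/* Round-trip one by-value, one NULL and one by-reference param. */
int
toy_roundtrip_ok(void)
{
	ToyParam	in[3] = {
		{{1, 0, 1, 42, 0}, NULL},
		{{2, 1, 0, 0, 0}, NULL},
		{{3, 0, 0, 0, 5}, "hello"}
	};
	ToyParam	out[3];
	char	   *buf = malloc(toy_estimate(in, 3));
	int			n;
	int			ok;

	toy_serialize(in, 3, buf);
	n = toy_restore(buf, out, 3);
	ok = (n == 3 && out[0].hdr.value == 42 && out[1].hdr.isnull &&
		  out[2].data != NULL && strcmp(out[2].data, "hello") == 0);
	free((void *) out[2].data);
	free(buf);
	return ok;
}
```

In this sketch a NULL parameter still occupies a slot in the restored array, so the receiver sees the same number of parameters the sender wrote.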
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 23e0b36..e0fe8d5 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -29,6 +29,7 @@
 #include <math.h>
 
 #include "nodes/parsenodes.h"
+#include "nodes/plannodes.h"
 #include "nodes/readfuncs.h"
 
 
@@ -1366,6 +1367,49 @@ _readTableSampleClause(void)
 	READ_DONE();
 }
 
+/*
+ * _readPlanInvalItem
+ */
+static PlanInvalItem *
+_readPlanInvalItem(void)
+{
+	READ_LOCALS(PlanInvalItem);
+
+	READ_INT_FIELD(cacheId);
+	READ_UINT_FIELD(hashValue);
+
+	READ_DONE();
+}
+
+/*
+ * _readPlannedStmt
+ */
+static PlannedStmt *
+_readPlannedStmt(void)
+{
+	READ_LOCALS(PlannedStmt);
+
+	READ_ENUM_FIELD(commandType, CmdType);
+	READ_UINT_FIELD(queryId);
+	READ_BOOL_FIELD(hasReturning);
+	READ_BOOL_FIELD(hasModifyingCTE);
+	READ_BOOL_FIELD(canSetTag);
+	READ_BOOL_FIELD(transientPlan);
+	READ_NODE_FIELD(planTree);
+	READ_NODE_FIELD(rtable);
+	READ_NODE_FIELD(resultRelations);
+	READ_NODE_FIELD(utilityStmt);
+	READ_NODE_FIELD(subplans);
+	READ_BITMAPSET_FIELD(rewindPlanIDs);
+	READ_NODE_FIELD(rowMarks);
+	READ_NODE_FIELD(relationOids);
+	READ_NODE_FIELD(invalItems);
+	READ_INT_FIELD(nParamExec);
+	READ_BOOL_FIELD(hasRowSecurity);
+	READ_BOOL_FIELD(parallelModeNeeded);
+
+	READ_DONE();
+}
 
 /*
  * parseNodeString
@@ -1505,6 +1549,10 @@ parseNodeString(void)
 		return_value = _readNotifyStmt();
 	else if (MATCH("DECLARECURSOR", 13))
 		return_value = _readDeclareCursorStmt();
+	else if (MATCH("PLANINVALITEM", 13))
+		return_value = _readPlanInvalItem();
+	else if (MATCH("PLANNEDSTMT", 11))
+		return_value = _readPlannedStmt();
 	else
 	{
 		elog(ERROR, "badly formatted node string \"%.32s\"...", token);
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 7069f60..78d976a 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -11,6 +11,8 @@
  *	cpu_tuple_cost		Cost of typical CPU time to process a tuple
  *	cpu_index_tuple_cost  Cost of typical CPU time to process an index tuple
  *	cpu_operator_cost	Cost of CPU time to execute an operator or function
+ *	cpu_tuple_comm_cost	Cost of CPU time to pass a tuple from worker to master backend
+ *	parallel_setup_cost	Cost of setting up shared memory for parallelism
  *
  * We expect that the kernel will typically do some amount of read-ahead
  * optimization; this in conjunction with seek costs means that seq_page_cost
@@ -102,11 +104,15 @@ double		random_page_cost = DEFAULT_RANDOM_PAGE_COST;
 double		cpu_tuple_cost = DEFAULT_CPU_TUPLE_COST;
 double		cpu_index_tuple_cost = DEFAULT_CPU_INDEX_TUPLE_COST;
 double		cpu_operator_cost = DEFAULT_CPU_OPERATOR_COST;
+double		cpu_tuple_comm_cost = DEFAULT_CPU_TUPLE_COMM_COST;
+double		parallel_setup_cost = DEFAULT_PARALLEL_SETUP_COST;
 
 int			effective_cache_size = DEFAULT_EFFECTIVE_CACHE_SIZE;
 
 Cost		disable_cost = 1.0e10;
 
+int	parallel_seqscan_degree = 0;
+
 bool		enable_seqscan = true;
 bool		enable_indexscan = true;
 bool		enable_indexonlyscan = true;
@@ -290,6 +296,42 @@ cost_samplescan(Path *path, PlannerInfo *root,
 }
 
 /*
+ * cost_funnel
+ *	  Determines and returns the cost of funnel path.
+ *
+ * 'baserel' is the relation to be scanned
+ * 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
+ */
+void
+cost_funnel(FunnelPath *path, PlannerInfo *root,
+			RelOptInfo *baserel, ParamPathInfo *param_info)
+{
+	Cost		startup_cost = 0;
+	Cost		run_cost = 0;
+
+	/* Should only be applied to base relations */
+	Assert(baserel->relid > 0);
+	Assert(baserel->rtekind == RTE_RELATION);
+
+	/* Mark the path with the correct row estimate */
+	if (param_info)
+		path->path.rows = param_info->ppi_rows;
+	else
+		path->path.rows = baserel->rows;
+
+	startup_cost = path->subpath->startup_cost;
+
+	run_cost = path->subpath->total_cost - path->subpath->startup_cost;
+
+	/* Parallel setup and communication cost. */
+	startup_cost += parallel_setup_cost;
+	run_cost += cpu_tuple_comm_cost * baserel->tuples;
+
+	path->path.startup_cost = startup_cost;
+	path->path.total_cost = (startup_cost + run_cost);
+}
+
+/*
  * cost_index
  *	  Determines and returns the cost of scanning a relation using an index.
  *
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 404c6f5..68d8837 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -60,6 +60,8 @@ static SeqScan *create_seqscan_plan(PlannerInfo *root, Path *best_path,
 					List *tlist, List *scan_clauses);
 static SampleScan *create_samplescan_plan(PlannerInfo *root, Path *best_path,
 					   List *tlist, List *scan_clauses);
+static Funnel *create_funnel_plan(PlannerInfo *root,
+						FunnelPath *best_path);
 static Scan *create_indexscan_plan(PlannerInfo *root, IndexPath *best_path,
 					  List *tlist, List *scan_clauses, bool indexonly);
 static BitmapHeapScan *create_bitmap_scan_plan(PlannerInfo *root,
@@ -104,6 +106,9 @@ static void copy_plan_costsize(Plan *dest, Plan *src);
 static SeqScan *make_seqscan(List *qptlist, List *qpqual, Index scanrelid);
 static SampleScan *make_samplescan(List *qptlist, List *qpqual, Index scanrelid,
 				TableSampleClause *tsc);
+static Funnel *make_funnel(List *qptlist, List *qpqual,
+					Index scanrelid, int nworkers,
+					Plan *subplan);
 static IndexScan *make_indexscan(List *qptlist, List *qpqual, Index scanrelid,
 			   Oid indexid, List *indexqual, List *indexqualorig,
 			   List *indexorderby, List *indexorderbyorig,
@@ -273,6 +278,10 @@ create_plan_recurse(PlannerInfo *root, Path *best_path)
 			plan = create_unique_plan(root,
 									  (UniquePath *) best_path);
 			break;
+		case T_Funnel:
+			plan = (Plan *) create_funnel_plan(root,
+											   (FunnelPath *) best_path);
+			break;
 		default:
 			elog(ERROR, "unrecognized node type: %d",
 				 (int) best_path->pathtype);
@@ -560,6 +569,7 @@ disuse_physical_tlist(PlannerInfo *root, Plan *plan, Path *path)
 	{
 		case T_SeqScan:
 		case T_SampleScan:
+		case T_Funnel:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -1194,6 +1204,67 @@ create_samplescan_plan(PlannerInfo *root, Path *best_path,
 }
 
 /*
+ * create_funnel_plan
+ *
+ * Returns a funnel plan for the base relation scanned by
+ * 'best_path'.
+ */
+static Funnel *
+create_funnel_plan(PlannerInfo *root, FunnelPath *best_path)
+{
+	Funnel  *funnel_plan;
+	Plan	*subplan;
+	List	*tlist;
+	RelOptInfo *rel = best_path->path.parent;
+	Index	scan_relid = best_path->path.parent->relid;
+
+	/*
+	 * For table scans, rather than using the relation targetlist (which is
+	 * only those Vars actually needed by the query), we prefer to generate a
+	 * tlist containing all Vars in order.  This will allow the executor to
+	 * optimize away projection of the table tuples, if possible.  (Note that
+	 * planner.c may replace the tlist we generate here, forcing projection to
+	 * occur.)
+	 */
+	if (use_physical_tlist(root, rel))
+	{
+			tlist = build_physical_tlist(root, rel);
+			/* if fail because of dropped cols, use regular method */
+			if (tlist == NIL)
+				tlist = build_path_tlist(root, &best_path->path);
+	}
+	else
+	{
+		tlist = build_path_tlist(root, &best_path->path);
+	}
+
+	/* it should be a base rel... */
+	Assert(scan_relid > 0);
+	Assert(best_path->path.parent->rtekind == RTE_RELATION);
+
+	subplan = create_plan_recurse(root, best_path->subpath);
+
+	/*
+	 * quals for subplan and top level plan are same
+	 * as either all the quals are pushed to subplan
+	 * (partialseqscan plan) or parallel plan won't be
+	 * chosen.
+	 */
+	funnel_plan = make_funnel(tlist,
+							  subplan->qual,
+							  scan_relid,
+							  best_path->num_workers,
+							  subplan);
+
+	copy_path_costsize(&funnel_plan->scan.plan, &best_path->path);
+
+	/* use parallel mode for parallel plans. */
+	root->glob->parallelModeNeeded = true;
+
+	return funnel_plan;
+}
+
+/*
  * create_indexscan_plan
  *	  Returns an indexscan plan for the base relation scanned by 'best_path'
  *	  with restriction clauses 'scan_clauses' and targetlist 'tlist'.
@@ -3462,6 +3533,27 @@ make_samplescan(List *qptlist,
 	return node;
 }
 
+static Funnel *
+make_funnel(List *qptlist,
+			List *qpqual,
+			Index scanrelid,
+			int nworkers,
+			Plan *subplan)
+{
+	Funnel *node = makeNode(Funnel);
+	Plan	   *plan = &node->scan.plan;
+
+	/* cost should be inserted by caller */
+	plan->targetlist = qptlist;
+	plan->qual = qpqual;
+	plan->lefttree = subplan;
+	plan->righttree = NULL;
+	node->scan.scanrelid = scanrelid;
+	node->num_workers = nworkers;
+
+	return node;
+}
+
 static IndexScan *
 make_indexscan(List *qptlist,
 			   List *qpqual,
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 2467570..11f095e 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -312,6 +312,52 @@ standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
 	return result;
 }
 
+PlannedStmt	*
+create_parallel_worker_plannedstmt(Plan *plan,
+								   List *rangetable,
+								   int num_exec_params)
+{
+	PlannedStmt	*result;
+	ListCell   *tlist;
+
+	/*
+	 * Avoid removing junk entries in worker as those are
+	 * required by upper nodes in master backend.
+	 */
+	foreach(tlist, plan->targetlist)
+	{
+		TargetEntry *tle = (TargetEntry *) lfirst(tlist);
+
+		tle->resjunk = false;
+	}
+
+	/* build the PlannedStmt result */
+	result = makeNode(PlannedStmt);
+
+	result->commandType = CMD_SELECT;
+	result->queryId = 0;
+	result->hasReturning = 0;
+	result->hasModifyingCTE = 0;
+	result->canSetTag = 1;
+	result->transientPlan = 0;
+	result->planTree = plan;
+	result->rtable = rangetable;
+	result->resultRelations = NIL;
+	result->utilityStmt = NULL;
+	result->subplans = NIL;
+	result->rewindPlanIDs = NULL;
+	result->rowMarks = NIL;
+	result->nParamExec = num_exec_params;
+	/*
+	 * Don't bother to set parameters used for invalidation as
+	 * worker backend plans are not saved, so can't be invalidated.
+	 */
+	result->relationOids = NIL;
+	result->invalItems = NIL;
+	result->hasRowSecurity = false;
+
+	return result;
+}
 
 /*--------------------
  * subquery_planner
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index ee8710d..12f6635 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -465,6 +465,25 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 					fix_scan_expr(root, (Node *) splan->tablesample, rtoffset);
 			}
 			break;
+		case T_Funnel:
+			{
+				Funnel    *splan = (Funnel *) plan;
+
+				/*
+				 * target list for lefttree of funnel plan should be same as for
+				 * funnel scan as both nodes need to produce same projection.
+				 * We don't want to do this assignment after fixing references
+				 * as that will be done separately for lefttree node.
+				 */
+				splan->scan.plan.lefttree->targetlist = splan->scan.plan.targetlist;
+
+				splan->scan.scanrelid += rtoffset;
+				splan->scan.plan.targetlist =
+					fix_scan_list(root, splan->scan.plan.targetlist, rtoffset);
+				splan->scan.plan.qual =
+					fix_scan_list(root, splan->scan.plan.qual, rtoffset);
+			}
+			break;
 		case T_IndexScan:
 			{
 				IndexScan  *splan = (IndexScan *) plan;
@@ -2265,6 +2284,40 @@ fix_opfuncids_walker(Node *node, void *context)
 }
 
 /*
+ * fix_node_funcids
+ *		Set the opfuncid (procedure OID) in an OpExpr node,
+ *		for plan tree.
+ *
+ * We need it mainly to fix the opfuncid in nodes of plantree
+ * after reading the planned statement by worker backend.
+ * Currently the set of nodes that can be executed by a worker
+ * backend is limited; we can enhance this API based on its
+ * usage in future.
+ */
+void
+fix_node_funcids(Plan *node)
+{
+	/*
+	 * do nothing when we get to the end of a leaf on tree.
+	 */
+	if (node == NULL)
+		return;
+
+	fix_opfuncids((Node*) node->qual);
+	fix_opfuncids((Node*) node->targetlist);
+
+	switch (nodeTag(node))
+	{
+		default:
+			elog(ERROR, "unrecognized node type: %d", (int) nodeTag(node));
+			break;
+	}
+
+	fix_node_funcids(node->lefttree);
+	fix_node_funcids(node->righttree);
+}
+
+/*
  * set_opfuncid
  *		Set the opfuncid (procedure OID) in an OpExpr node,
  *		if it hasn't been set already.
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index d0bc412..073a7f5 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2243,6 +2243,10 @@ finalize_plan(PlannerInfo *root, Plan *plan, Bitmapset *valid_params,
 			context.paramids = bms_add_members(context.paramids, scan_params);
 			break;
 
+		case T_Funnel:
+			context.paramids = bms_add_members(context.paramids, scan_params);
+			break;
+
 		case T_IndexScan:
 			finalize_primnode((Node *) ((IndexScan *) plan)->indexqual,
 							  &context);
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 935bc2b..276ad96 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -732,6 +732,32 @@ create_samplescan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer
 }
 
 /*
+ * create_funnel_path
+ *
+ *	  Creates a path corresponding to a funnel scan, returning the
+ *	  pathnode.
+ */
+FunnelPath *
+create_funnel_path(PlannerInfo *root, RelOptInfo *rel, Path* subpath,
+				   Relids required_outer, int nworkers)
+{
+	FunnelPath	   *pathnode = makeNode(FunnelPath);
+
+	pathnode->path.pathtype = T_Funnel;
+	pathnode->path.parent = rel;
+	pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
+														  required_outer);
+	pathnode->path.pathkeys = NIL;	/* Funnel has unordered result */
+
+	pathnode->subpath = subpath;
+	pathnode->num_workers = nworkers;
+
+	cost_funnel(pathnode, root, rel, pathnode->path.param_info);
+
+	return pathnode;
+}
+
+/*
  * create_index_path
  *	  Creates a path node for an index scan.
  *
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 000524d..4eb879b 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -103,6 +103,7 @@
 #include "miscadmin.h"
 #include "pg_getopt.h"
 #include "pgstat.h"
+#include "optimizer/cost.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/bgworker_internals.h"
 #include "postmaster/fork_process.h"
@@ -857,6 +858,12 @@ PostmasterMain(int argc, char *argv[])
 		ereport(ERROR,
 				(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"archive\", \"hot_standby\", or \"logical\"")));
 
+	if (parallel_seqscan_degree >= MaxConnections)
+	{
+		write_stderr("%s: parallel_seqscan_degree must be less than max_connections\n", progname);
+		ExitPostmaster(1);
+	}
+
 	/*
 	 * Other one-time internal sanity checks can go here, if they are fast.
 	 * (Put any slow processing further down, after postmaster.pid creation.)
diff --git a/src/backend/storage/ipc/shm_mq.c b/src/backend/storage/ipc/shm_mq.c
index 0e60dbc..c78f165 100644
--- a/src/backend/storage/ipc/shm_mq.c
+++ b/src/backend/storage/ipc/shm_mq.c
@@ -746,6 +746,15 @@ shm_mq_detach(shm_mq *mq)
 }
 
 /*
+ * Get the shm_mq from handle.
+ */
+shm_mq *
+shm_mq_get_queue(shm_mq_handle *mqh)
+{
+	return mqh->mqh_queue;
+}
+
+/*
  * Write bytes into a shared message queue.
  */
 static shm_mq_result
diff --git a/src/backend/tcop/dest.c b/src/backend/tcop/dest.c
index bcf3895..57014ee 100644
--- a/src/backend/tcop/dest.c
+++ b/src/backend/tcop/dest.c
@@ -34,6 +34,7 @@
 #include "commands/createas.h"
 #include "commands/matview.h"
 #include "executor/functions.h"
+#include "executor/tqueue.h"
 #include "executor/tstoreReceiver.h"
 #include "libpq/libpq.h"
 #include "libpq/pqformat.h"
@@ -44,9 +45,10 @@
  *		dummy DestReceiver functions
  * ----------------
  */
-static void
+static bool
 donothingReceive(TupleTableSlot *slot, DestReceiver *self)
 {
+	return true;
 }
 
 static void
@@ -129,6 +131,9 @@ CreateDestReceiver(CommandDest dest)
 
 		case DestTransientRel:
 			return CreateTransientRelDestReceiver(InvalidOid);
+
+		case DestTupleQueue:
+			return CreateTupleQueueDestReceiver();
 	}
 
 	/* should never get here */
@@ -162,6 +167,7 @@ EndCommand(const char *commandTag, CommandDest dest)
 		case DestCopyOut:
 		case DestSQLFunction:
 		case DestTransientRel:
+		case DestTupleQueue:
 			break;
 	}
 }
@@ -204,6 +210,7 @@ NullCommand(CommandDest dest)
 		case DestCopyOut:
 		case DestSQLFunction:
 		case DestTransientRel:
+		case DestTupleQueue:
 			break;
 	}
 }
@@ -248,6 +255,7 @@ ReadyForQuery(CommandDest dest)
 		case DestCopyOut:
 		case DestSQLFunction:
 		case DestTransientRel:
+		case DestTupleQueue:
 			break;
 	}
 }
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 7598318..f1542a0 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -42,6 +42,8 @@
 #include "catalog/pg_type.h"
 #include "commands/async.h"
 #include "commands/prepare.h"
+#include "executor/execParallel.h"
+#include "executor/tqueue.h"
 #include "libpq/libpq.h"
 #include "libpq/pqformat.h"
 #include "libpq/pqsignal.h"
@@ -1192,6 +1194,94 @@ exec_simple_query(const char *query_string)
 }
 
 /*
+ * exec_parallel_stmt
+ *
+ * Execute the plan for backend worker.
+ */
+void
+exec_parallel_stmt(ParallelStmt *parallelstmt)
+{
+	DestReceiver *receiver;
+	QueryDesc	*queryDesc;
+	MemoryContext oldcontext;
+	MemoryContext	plancontext;
+	BufferUsage bufusage_start;
+	BufferUsage bufusage_end = {0};
+
+	set_ps_display("SELECT", false);
+
+	/*
+	 * Unlike exec_simple_query(), in backend worker we won't allow
+	 * transaction control statements, so we can allow plancontext
+	 * to be created in TopTransaction context.
+	 */
+	plancontext = AllocSetContextCreate(CurrentMemoryContext,
+										"worker plan",
+										ALLOCSET_DEFAULT_MINSIZE,
+										ALLOCSET_DEFAULT_INITSIZE,
+										ALLOCSET_DEFAULT_MAXSIZE);
+
+	oldcontext = MemoryContextSwitchTo(plancontext);
+
+	receiver = CreateDestReceiver(DestTupleQueue);
+	SetTupleQueueDestReceiverParams(receiver, parallelstmt->responseq);
+
+	/* Create a QueryDesc for the query */
+	queryDesc = CreateQueryDesc(parallelstmt->plannedstmt, "",
+								GetActiveSnapshot(), InvalidSnapshot,
+								receiver, parallelstmt->params,
+								parallelstmt->inst_options);
+
+	PushActiveSnapshot(queryDesc->snapshot);
+
+	/* call ExecutorStart to prepare the plan for execution */
+	ExecutorStart(queryDesc, 0);
+
+	PopulateParamExecParams(queryDesc, parallelstmt->serialized_param_exec_vals);
+
+	bufusage_start = pgBufferUsage;
+
+	/* run the plan */
+	ExecutorRun(queryDesc, ForwardScanDirection, 0L);
+
+	/*
+	 * Calculate the buffer usage for this statement run; it is required
+	 * by plugins like pg_stat_statements to report the total usage for
+	 * statement execution.
+	 */
+	BufferUsageAccumDiff(&bufusage_end,
+						 &pgBufferUsage, &bufusage_start);
+
+	/* run cleanup too */
+	ExecutorFinish(queryDesc);
+
+	/* copy buffer usage into shared memory. */
+	memcpy(parallelstmt->buffer_usage,
+		   &bufusage_end,
+		   sizeof(BufferUsage));
+
+	/*
+	 * copy instrumentation information into shared memory if requested
+	 * by master backend.
+	 */
+	if (parallelstmt->inst_options)
+		memcpy(parallelstmt->instrument,
+			   queryDesc->planstate->instrument,
+			   sizeof(Instrumentation));
+
+	ExecutorEnd(queryDesc);
+
+	PopActiveSnapshot();
+
+	FreeQueryDesc(queryDesc);
+
+	if (!parallelstmt->inst_options)
+		(*receiver->rDestroy) (receiver);
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
+/*
  * exec_parse_message
  *
  * Execute a "Parse" protocol message.
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 9c14e8a..f2fb638 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1121,7 +1121,13 @@ RunFromStore(Portal portal, ScanDirection direction, long count,
 			if (!ok)
 				break;
 
-			(*dest->receiveSlot) (slot, dest);
+			/*
+			 * If we are not able to send the tuple, then we assume that
+			 * destination has closed and we won't be able to send any more
+			 * tuples so we just end the loop.
+			 */
+			if (!((*dest->receiveSlot) (slot, dest)))
+				break;
 
 			ExecClearTuple(slot);
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index b3dac51..e4751f0 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -607,6 +607,8 @@ const char *const config_group_names[] =
 	gettext_noop("Statistics / Query and Index Statistics Collector"),
 	/* AUTOVACUUM */
 	gettext_noop("Autovacuum"),
+	/* PARALLEL_QUERY */
+	gettext_noop("parallel_seqscan_degree"),
 	/* CLIENT_CONN */
 	gettext_noop("Client Connection Defaults"),
 	/* CLIENT_CONN_STATEMENT */
@@ -2545,6 +2547,16 @@ static struct config_int ConfigureNamesInt[] =
 	},
 
 	{
+		{"parallel_seqscan_degree", PGC_SUSET, PARALLEL_QUERY,
+			gettext_noop("Sets the maximum number of simultaneously running backend worker processes."),
+			NULL
+		},
+		&parallel_seqscan_degree,
+		0, 0, MAX_BACKENDS,
+		NULL, NULL, NULL
+	},
+
+	{
 		{"autovacuum_work_mem", PGC_SIGHUP, RESOURCES_MEM,
 			gettext_noop("Sets the maximum memory to be used by each autovacuum worker process."),
 			NULL,
@@ -2721,6 +2733,26 @@ static struct config_real ConfigureNamesReal[] =
 		DEFAULT_CPU_OPERATOR_COST, 0, DBL_MAX,
 		NULL, NULL, NULL
 	},
+	{
+		{"cpu_tuple_comm_cost", PGC_USERSET, QUERY_TUNING_COST,
+			gettext_noop("Sets the planner's estimate of the cost of "
+						 "passing each tuple (row) from worker to master backend."),
+			NULL
+		},
+		&cpu_tuple_comm_cost,
+		DEFAULT_CPU_TUPLE_COMM_COST, 0, DBL_MAX,
+		NULL, NULL, NULL
+	},
+	{
+		{"parallel_setup_cost", PGC_USERSET, QUERY_TUNING_COST,
+			gettext_noop("Sets the planner's estimate of the cost of "
+						 "setting up environment (shared memory) for parallelism."),
+			NULL
+		},
+		&parallel_setup_cost,
+		DEFAULT_PARALLEL_SETUP_COST, 0, DBL_MAX,
+		NULL, NULL, NULL
+	},
 
 	{
 		{"cursor_tuple_fraction", PGC_USERSET, QUERY_TUNING_OTHER,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e5d275d..9f75a5b 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -290,6 +290,8 @@
 #cpu_tuple_cost = 0.01			# same scale as above
 #cpu_index_tuple_cost = 0.005		# same scale as above
 #cpu_operator_cost = 0.0025		# same scale as above
+#cpu_tuple_comm_cost = 0.1		# same scale as above
+#parallel_setup_cost = 0.0	# same scale as above
 #effective_cache_size = 4GB
 
 # - Genetic Query Optimizer -
@@ -500,6 +502,11 @@
 					# autovacuum, -1 means use
 					# vacuum_cost_limit
 
+#------------------------------------------------------------------------------
+# PARALLEL_QUERY PARAMETERS
+#------------------------------------------------------------------------------
+
+#parallel_seqscan_degree = 0		# max number of worker backend subprocesses
 
 #------------------------------------------------------------------------------
 # CLIENT CONNECTION DEFAULTS
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 75e6b72..b3e3202 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -126,6 +126,7 @@ extern void heap_rescan_set_params(HeapScanDesc scan, ScanKey key,
 extern void heap_endscan(HeapScanDesc scan);
 extern HeapTuple heap_getnext(HeapScanDesc scan, ScanDirection direction);
 
+
 extern bool heap_fetch(Relation relation, Snapshot snapshot,
 		   HeapTuple tuple, Buffer *userbuf, bool keep_buf,
 		   Relation stats_relation);
diff --git a/src/include/access/printtup.h b/src/include/access/printtup.h
index 46c4148..92ec882 100644
--- a/src/include/access/printtup.h
+++ b/src/include/access/printtup.h
@@ -25,11 +25,11 @@ extern void SendRowDescriptionMessage(TupleDesc typeinfo, List *targetlist,
 
 extern void debugStartup(DestReceiver *self, int operation,
 			 TupleDesc typeinfo);
-extern void debugtup(TupleTableSlot *slot, DestReceiver *self);
+extern bool debugtup(TupleTableSlot *slot, DestReceiver *self);
 
 /* XXX these are really in executor/spi.c */
 extern void spi_dest_startup(DestReceiver *self, int operation,
 				 TupleDesc typeinfo);
-extern void spi_printtup(TupleTableSlot *slot, DestReceiver *self);
+extern bool spi_printtup(TupleTableSlot *slot, DestReceiver *self);
 
 #endif   /* PRINTTUP_H */
diff --git a/src/include/executor/execParallel.h b/src/include/executor/execParallel.h
new file mode 100644
index 0000000..b6a09dd
--- /dev/null
+++ b/src/include/executor/execParallel.h
@@ -0,0 +1,65 @@
+/*--------------------------------------------------------------------
+ * execParallel.h
+ *		POSTGRES parallel execution interface
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/executor/execParallel.h
+ *--------------------------------------------------------------------
+ */
+#ifndef EXECPARALLEL_H
+#define EXECPARALLEL_H
+
+/*---------------------------------------------------------------------
+ * External module API.
+ *---------------------------------------------------------------------
+ */
+
+#include "libpq/pqmq.h"
+#include "nodes/execnodes.h"
+#include "nodes/parsenodes.h"
+#include "nodes/plannodes.h"
+
+/* Table-of-contents constants for our dynamic shared memory segment. */
+#define	PARALLEL_KEY_PLANNEDSTMT	0
+#define	PARALLEL_KEY_PARAMS			1
+#define	PARALLEL_KEY_PARAMS_EXEC	2
+#define PARALLEL_KEY_BUFF_USAGE		3
+#define PARALLEL_KEY_INST_OPTIONS	4
+#define PARALLEL_KEY_INST_INFO		5
+#define PARALLEL_KEY_TUPLE_QUEUE	6
+#define PARALLEL_KEY_SCAN			7
+
+extern int	parallel_seqscan_degree;
+
+/* worker statement required for parallel execution. */
+typedef struct ParallelStmt
+{
+	PlannedStmt		*plannedstmt;
+	ParamListInfo	params;
+	List			*serialized_param_exec_vals;
+	shm_mq_handle	*responseq;
+	int				inst_options;
+	char			*instrument;
+	char			*buffer_usage;
+} ParallelStmt;
+
+extern void InitializeParallelWorkers(PlanState *planstate,
+									  List *serialized_param_exec_vals,
+									  EState *estate,
+									  char **inst_options_space,
+									  char **buffer_usage_space,
+									  shm_mq_handle ***responseqp,
+									  ParallelContext **pcxtp,
+									  int nWorkers);
+extern shm_toc *GetParallelShmToc(void);
+extern bool ExecParallelEstimate(Node *node, ParallelContext *pcxt,
+								 Size *pscan_size);
+extern bool ExecParallelInitializeDSM(Node *node, ParallelContext *pcxt,
+									  Size *pscan_size);
+extern bool ExecParallelBufferUsageAccum(Node *node);
+extern void ExecAssociateBufferStatsToDSM(BufferUsage *buf_usage,
+							  ParallelStmt *parallel_stmt);
+#endif   /* EXECPARALLEL_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 193a654..963e656 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -273,6 +273,8 @@ extern TupleDesc ExecCleanTypeFromTL(List *targetList, bool hasoid);
 extern TupleDesc ExecTypeFromExprList(List *exprList);
 extern void ExecTypeSetColNames(TupleDesc typeInfo, List *namesList);
 extern void UpdateChangedParamSet(PlanState *node, Bitmapset *newchg);
+extern void PopulateParamExecParams(QueryDesc *queryDesc,
+						List *serialized_param_exec_vals);
 
 typedef struct TupOutputState
 {
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index c9a2129..0c7847d 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -69,5 +69,12 @@ extern Instrumentation *InstrAlloc(int n, int instrument_options);
 extern void InstrStartNode(Instrumentation *instr);
 extern void InstrStopNode(Instrumentation *instr, double nTuples);
 extern void InstrEndLoop(Instrumentation *instr);
+extern void InstrAggNode(Instrumentation *instr1, Instrumentation *instr2);
+extern void
+	InstrAggBufferUsage(BufferUsage *buffer_usage_dst, BufferUsage *buffer_usage_add);
+extern void BufferUsageAccumDiff(BufferUsage *dst,
+					 const BufferUsage *add,
+					 const BufferUsage *sub);
+extern void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
 
 #endif   /* INSTRUMENT_H */
diff --git a/src/include/executor/nodeFunnel.h b/src/include/executor/nodeFunnel.h
new file mode 100644
index 0000000..b996244
--- /dev/null
+++ b/src/include/executor/nodeFunnel.h
@@ -0,0 +1,25 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeFunnel.h
+ *		prototypes for nodeFunnel.c
+ *
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeFunnel.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEFUNNEL_H
+#define NODEFUNNEL_H
+
+#include "nodes/execnodes.h"
+
+extern FunnelState *ExecInitFunnel(Funnel *node, EState *estate, int eflags);
+extern TupleTableSlot *ExecFunnel(FunnelState *node);
+extern void ExecEndFunnel(FunnelState *node);
+extern void FinishParallelSetupAndAccumStats(FunnelState *node);
+extern void ExecReScanFunnel(FunnelState *node);
+
+#endif   /* NODEFUNNEL_H */
diff --git a/src/include/executor/nodeSubplan.h b/src/include/executor/nodeSubplan.h
index 3732ad4..21c745e 100644
--- a/src/include/executor/nodeSubplan.h
+++ b/src/include/executor/nodeSubplan.h
@@ -24,4 +24,7 @@ extern void ExecReScanSetParamPlan(SubPlanState *node, PlanState *parent);
 
 extern void ExecSetParamPlan(SubPlanState *node, ExprContext *econtext);
 
+extern List *
+ExecAndFormSerializeParamExec(ExprContext *econtext, Bitmapset *params);
+
 #endif   /* NODESUBPLAN_H */
diff --git a/src/include/executor/tqueue.h b/src/include/executor/tqueue.h
new file mode 100644
index 0000000..ce16936
--- /dev/null
+++ b/src/include/executor/tqueue.h
@@ -0,0 +1,35 @@
+/*-------------------------------------------------------------------------
+ *
+ * tqueue.h
+ *	  prototypes for tqueue.c
+ *
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/tqueue.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef TQUEUE_H
+#define TQUEUE_H
+
+#include "storage/shm_mq.h"
+#include "tcop/dest.h"
+
+/* Use this to send tuples to a shm_mq. */
+extern DestReceiver *CreateTupleQueueDestReceiver(void);
+extern void SetTupleQueueDestReceiverParams(DestReceiver *self,
+						shm_mq_handle *handle);
+
+/* Use these to receive tuples from a shm_mq. */
+typedef struct TupleQueueFunnel TupleQueueFunnel;
+extern TupleQueueFunnel *CreateTupleQueueFunnel(void);
+extern void TupleQueueFunnelShutdown(TupleQueueFunnel *funnel);
+extern void DestroyTupleQueueFunnel(TupleQueueFunnel *funnel);
+extern void RegisterTupleQueueOnFunnel(TupleQueueFunnel *, shm_mq_handle *);
+extern HeapTuple TupleQueueFunnelNext(TupleQueueFunnel *, bool nowait,
+					 bool *done);
+
+#endif   /* TQUEUE_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 5796de8..8f10c4e 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -16,7 +16,9 @@
 
 #include "access/genam.h"
 #include "access/heapam.h"
+#include "access/parallel.h"
 #include "executor/instrument.h"
+#include "executor/tqueue.h"
 #include "lib/pairingheap.h"
 #include "nodes/params.h"
 #include "nodes/plannodes.h"
@@ -401,6 +403,18 @@ typedef struct EState
 	List	   *es_auxmodifytables;		/* List of secondary ModifyTableStates */
 
 	/*
+	 * This is required for parallel plan execution to fetch the
+	 * information from dsm.
+	 */
+	shm_toc		*toc;
+
+	/*
+	 * This is required to collect buffer usage stats from parallel
+	 * workers when requested by plugins.
+	 */
+	bool		total_time;	/* total time spent in ExecutorRun */
+
+	/*
 	 * this ExprContext is for per-output-tuple operations, such as constraint
 	 * checks and index-value computations.  It will be reset for each output
 	 * tuple.  Note that it will be created only if needed.
@@ -1047,6 +1061,11 @@ typedef struct PlanState
 	 * State for management of parameter-change-driven rescanning
 	 */
 	Bitmapset  *chgParam;		/* set of IDs of changed Params */
+	/*
+	 * This is required for parallel plan execution to fetch the
+	 * information from dsm.
+	 */
+	shm_toc			*toc;
 
 	/*
 	 * Other run-time state needed by most if not all node types.
@@ -1273,6 +1292,35 @@ typedef struct SampleScanState
 } SampleScanState;
 
 /*
+ * FunnelState extends ScanState by storing additional information
+ * related to parallel workers.
+ *		pcxt				parallel context for managing generic state information
+ *							required for parallelism.
+ *		responseq			shared memory queues to receive data from workers.
+ *		funnel				maintains the runtime information about queues used to
+ *							receive data from parallel workers.
+ *		inst_options_space	to accumulate instrumentation information from all
+ *							parallel workers.
+ *		buffer_usage_space	to accumulate buffer usage information from all
+ *							parallel workers.
+ *		fs_workersReady		indicates that workers are launched.
+ *		all_workers_done	indicates that all the data from workers has been received.
+ *		local_scan_done		indicates that local scan is completed.
+ */
+typedef struct FunnelState
+{
+	ScanState		ss;				/* its first field is NodeTag */
+	ParallelContext *pcxt;
+	shm_mq_handle	**responseq;
+	TupleQueueFunnel *funnel;
+	char			*inst_options_space;
+	char			*buffer_usage_space;
+	bool			fs_workersReady;
+	bool			all_workers_done;
+	bool			local_scan_done;
+} FunnelState;
+
+/*
  * These structs store information about index quals that don't have simple
  * constant right-hand sides.  See comments for ExecIndexBuildScanKeys()
  * for discussion.
diff --git a/src/include/nodes/nodeFuncs.h b/src/include/nodes/nodeFuncs.h
index 7b1b1d6..df00d3d 100644
--- a/src/include/nodes/nodeFuncs.h
+++ b/src/include/nodes/nodeFuncs.h
@@ -13,6 +13,7 @@
 #ifndef NODEFUNCS_H
 #define NODEFUNCS_H
 
+#include "access/parallel.h"
 #include "nodes/parsenodes.h"
 
 
@@ -63,4 +64,7 @@ extern Node *query_or_expression_tree_mutator(Node *node, Node *(*mutator) (),
 extern bool raw_expression_tree_walker(Node *node, bool (*walker) (),
 												   void *context);
 
+extern bool planstate_tree_walker(Node *node, ParallelContext *pcxt,
+					  bool (*walker) (), void *context);
+
 #endif   /* NODEFUNCS_H */
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 748e434..f456004 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -52,6 +52,7 @@ typedef enum NodeTag
 	T_Scan,
 	T_SeqScan,
 	T_SampleScan,
+	T_Funnel,
 	T_IndexScan,
 	T_IndexOnlyScan,
 	T_BitmapIndexScan,
@@ -99,6 +100,7 @@ typedef enum NodeTag
 	T_ScanState,
 	T_SeqScanState,
 	T_SampleScanState,
+	T_FunnelState,
 	T_IndexScanState,
 	T_IndexOnlyScanState,
 	T_BitmapIndexScanState,
@@ -223,6 +225,7 @@ typedef enum NodeTag
 	T_IndexOptInfo,
 	T_ParamPathInfo,
 	T_Path,
+	T_FunnelPath,
 	T_IndexPath,
 	T_BitmapHeapPath,
 	T_BitmapAndPath,
diff --git a/src/include/nodes/params.h b/src/include/nodes/params.h
index a0f7dd0..21c6f7a 100644
--- a/src/include/nodes/params.h
+++ b/src/include/nodes/params.h
@@ -14,6 +14,8 @@
 #ifndef PARAMS_H
 #define PARAMS_H
 
+#include "nodes/pg_list.h"
+
 /* To avoid including a pile of parser headers, reference ParseState thus: */
 struct ParseState;
 
@@ -96,11 +98,47 @@ typedef struct ParamExecData
 {
 	void	   *execPlan;		/* should be "SubPlanState *" */
 	Datum		value;
+	/*
+	 * parameter's datatype, or 0.  This is required so that
+	 * datum value can be read and used for other purposes like
+	 * passing it to worker backend via shared memory.  This is
+	 * required only for evaluation of initPlan's, however for
+	 * consistency we set this for Subplan as well.  We left it
+	 * for other cases like CTE or RecursiveUnion cases where this
+	 * structure is not used for evaluation of subplans.
+	 */
+	Oid			ptype;
 	bool		isnull;
 } ParamExecData;
 
+/*
+ * This structure is used to pass PARAM_EXEC parameters to backend
+ * workers.  For each PARAM_EXEC parameter, pass this structure
+ * followed by value except for pass-by-value parameters.
+ */
+typedef struct SerializedParamExecData
+{
+	int			paramid;			/* parameter id of this param */
+	Size		length;			/* length of parameter value */
+	Oid			ptype;			/* parameter's datatype, or 0 */
+	Datum		value;
+	bool		isnull;
+} SerializedParamExecData;
+
 
 /* Functions found in src/backend/nodes/params.c */
 extern ParamListInfo copyParamList(ParamListInfo from);
 
+extern Size
+EstimateBoundParametersSpace(ParamListInfo params);
+extern void
+SerializeBoundParams(ParamListInfo params, Size maxsize, char *start_address);
+extern ParamListInfo RestoreBoundParams(char *start_address);
+extern Size
+EstimateExecParametersSpace(List *serialized_param_exec_vals);
+extern void
+SerializeExecParams(List *serialized_param_exec_vals, Size maxsize,
+					char *start_address);
+extern List *
+RestoreExecParams(char *start_address);
 #endif   /* PARAMS_H */
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index cc259f1..69302af 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -296,6 +296,16 @@ typedef struct SampleScan
 	struct TableSampleClause *tablesample;
 } SampleScan;
 
+/* ------------
+ *		Funnel node
+ * ------------
+ */
+typedef struct Funnel
+{
+	Scan		scan;
+	int			num_workers;
+} Funnel;
+
 /* ----------------
  *		index scan node
  *
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 79bed33..f2faa1f 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -761,6 +761,13 @@ typedef struct Path
 	/* pathkeys is a List of PathKey nodes; see above */
 } Path;
 
+typedef struct FunnelPath
+{
+	Path		path;
+	Path	    *subpath;	/* path for each worker */
+	int			num_workers;
+} FunnelPath;
+
 /* Macro for extracting a path's parameterization relids; beware double eval */
 #define PATH_REQ_OUTER(path)  \
 	((path)->param_info ? (path)->param_info->ppi_req_outer : (Relids) NULL)
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index dd43e45..994ea83 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -26,6 +26,13 @@
 #define DEFAULT_CPU_TUPLE_COST	0.01
 #define DEFAULT_CPU_INDEX_TUPLE_COST 0.005
 #define DEFAULT_CPU_OPERATOR_COST  0.0025
+#define DEFAULT_CPU_TUPLE_COMM_COST 0.1
+/*
+ * XXX - We need some experiments to know what could be
+ * appropriate default values for parallel setup and startup
+ * cost.
+ */
+#define	DEFAULT_PARALLEL_SETUP_COST  0.0
 
 #define DEFAULT_EFFECTIVE_CACHE_SIZE  524288	/* measured in pages */
 
@@ -48,8 +55,11 @@ extern PGDLLIMPORT double random_page_cost;
 extern PGDLLIMPORT double cpu_tuple_cost;
 extern PGDLLIMPORT double cpu_index_tuple_cost;
 extern PGDLLIMPORT double cpu_operator_cost;
+extern PGDLLIMPORT double cpu_tuple_comm_cost;
+extern PGDLLIMPORT double parallel_setup_cost;
 extern PGDLLIMPORT int effective_cache_size;
 extern Cost disable_cost;
+extern int	parallel_seqscan_degree;
 extern bool enable_seqscan;
 extern bool enable_indexscan;
 extern bool enable_indexonlyscan;
@@ -70,6 +80,8 @@ extern void cost_seqscan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 			 ParamPathInfo *param_info);
 extern void cost_samplescan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 				ParamPathInfo *param_info);
+extern void cost_funnel(FunnelPath *path, PlannerInfo *root,
+				RelOptInfo *baserel, ParamPathInfo *param_info);
 extern void cost_index(IndexPath *path, PlannerInfo *root,
 		   double loop_count);
 extern void cost_bitmap_heap_scan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 161644c..9d31b93 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -34,6 +34,9 @@ extern Path *create_seqscan_path(PlannerInfo *root, RelOptInfo *rel,
 					Relids required_outer);
 extern Path *create_samplescan_path(PlannerInfo *root, RelOptInfo *rel,
 					   Relids required_outer);
+extern FunnelPath *create_funnel_path(PlannerInfo *root,
+						RelOptInfo *rel, Path *subpath, Relids required_outer,
+						int nworkers);
 extern IndexPath *create_index_path(PlannerInfo *root,
 				  IndexOptInfo *index,
 				  List *indexclauses,
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index 52b077a..67a8582 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -133,6 +133,7 @@ extern bool query_is_distinct_for(Query *query, List *colnos, List *opids);
  */
 extern Plan *set_plan_references(PlannerInfo *root, Plan *plan);
 extern void fix_opfuncids(Node *node);
+extern void fix_node_funcids(Plan *node);
 extern void set_opfuncid(OpExpr *opexpr);
 extern void set_sa_opfuncid(ScalarArrayOpExpr *opexpr);
 extern void record_plan_function_dependency(PlannerInfo *root, Oid funcid);
diff --git a/src/include/optimizer/planner.h b/src/include/optimizer/planner.h
index b10a504..dea968a 100644
--- a/src/include/optimizer/planner.h
+++ b/src/include/optimizer/planner.h
@@ -14,6 +14,7 @@
 #ifndef PLANNER_H
 #define PLANNER_H
 
+#include "nodes/parsenodes.h"
 #include "nodes/plannodes.h"
 #include "nodes/relation.h"
 
@@ -29,6 +30,8 @@ extern PlannedStmt *planner(Query *parse, int cursorOptions,
 		ParamListInfo boundParams);
 extern PlannedStmt *standard_planner(Query *parse, int cursorOptions,
 				 ParamListInfo boundParams);
+extern PlannedStmt	*create_parallel_worker_plannedstmt(Plan *plan,
+											List *rangetable, int num_exec_params);
 
 extern Plan *subquery_planner(PlannerGlobal *glob, Query *parse,
 				 PlannerInfo *parent_root,
diff --git a/src/include/storage/shm_mq.h b/src/include/storage/shm_mq.h
index 1a2ba04..7621a35 100644
--- a/src/include/storage/shm_mq.h
+++ b/src/include/storage/shm_mq.h
@@ -65,6 +65,9 @@ extern void shm_mq_set_handle(shm_mq_handle *, BackgroundWorkerHandle *);
 /* Break connection. */
 extern void shm_mq_detach(shm_mq *);
 
+/* Get the shm_mq from handle. */
+extern shm_mq *shm_mq_get_queue(shm_mq_handle *mqh);
+
 /* Send or receive messages. */
 extern shm_mq_result shm_mq_send(shm_mq_handle *mqh,
 			Size nbytes, const void *data, bool nowait);
diff --git a/src/include/tcop/dest.h b/src/include/tcop/dest.h
index 5bcca3f..91acd60 100644
--- a/src/include/tcop/dest.h
+++ b/src/include/tcop/dest.h
@@ -94,7 +94,8 @@ typedef enum
 	DestIntoRel,				/* results sent to relation (SELECT INTO) */
 	DestCopyOut,				/* results sent to COPY TO code */
 	DestSQLFunction,			/* results sent to SQL-language func mgr */
-	DestTransientRel			/* results sent to transient relation */
+	DestTransientRel,			/* results sent to transient relation */
+	DestTupleQueue				/* results sent to tuple queue */
 } CommandDest;
 
 /* ----------------
@@ -103,7 +104,9 @@ typedef enum
  *		pointers that the executor must call.
  *
  * Note: the receiveSlot routine must be passed a slot containing a TupleDesc
- * identical to the one given to the rStartup routine.
+ * identical to the one given to the rStartup routine.  It returns bool where
+ * a "true" value means "continue processing" and a "false" value means
+ * "stop early, just as if we'd reached the end of the scan".
  * ----------------
  */
 typedef struct _DestReceiver DestReceiver;
@@ -111,7 +114,7 @@ typedef struct _DestReceiver DestReceiver;
 struct _DestReceiver
 {
 	/* Called for each tuple to be output: */
-	void		(*receiveSlot) (TupleTableSlot *slot,
+	bool		(*receiveSlot) (TupleTableSlot *slot,
 											DestReceiver *self);
 	/* Per-executor-run initialization and shutdown: */
 	void		(*rStartup) (DestReceiver *self,
diff --git a/src/include/tcop/tcopprot.h b/src/include/tcop/tcopprot.h
index 96c5b8b..6f319c1 100644
--- a/src/include/tcop/tcopprot.h
+++ b/src/include/tcop/tcopprot.h
@@ -19,6 +19,7 @@
 #ifndef TCOPPROT_H
 #define TCOPPROT_H
 
+#include "executor/execParallel.h"
 #include "nodes/params.h"
 #include "nodes/parsenodes.h"
 #include "nodes/plannodes.h"
@@ -84,5 +85,6 @@ extern void set_debug_options(int debug_flag,
 extern bool set_plan_disabling_options(const char *arg,
 						   GucContext context, GucSource source);
 extern const char *get_stats_option_name(const char *arg);
+extern void exec_parallel_stmt(ParallelStmt *parallelscan);
 
 #endif   /* TCOPPROT_H */
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 7a58ddb..3505d31 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -85,6 +85,7 @@ enum config_group
 	STATS_MONITORING,
 	STATS_COLLECTOR,
 	AUTOVACUUM,
+	PARALLEL_QUERY,
 	CLIENT_CONN,
 	CLIENT_CONN_STATEMENT,
 	CLIENT_CONN_LOCALE,
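
A note on the tcop/dest.h hunk above, where receiveSlot changes from returning void to returning bool ("true" means continue processing, "false" means stop early, as if the scan had ended). The following is only a minimal standalone sketch of that contract, not code from the patch. DemoReceiver, demo_receive and demo_run are hypothetical names made up for illustration, a plain int stands in for a TupleTableSlot, and the capacity counter is just an assumed reason for a destination to stop accepting tuples.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for DestReceiver; not the PostgreSQL struct. */
typedef struct DemoReceiver DemoReceiver;

struct DemoReceiver
{
	/* Called for each tuple: true = continue, false = stop early. */
	bool		(*receiveSlot) (int tuple, DemoReceiver *self);
	int			capacity;		/* tuples the destination will accept */
	int			received;		/* tuples accepted so far */
};

/* Accept tuples until capacity is used up, then ask the caller to stop. */
static bool
demo_receive(int tuple, DemoReceiver *self)
{
	(void) tuple;				/* the payload is irrelevant to the sketch */
	if (self->received >= self->capacity)
		return false;
	self->received++;
	return true;
}

/*
 * Offer tuples to the receiver one at a time and stop as soon as it
 * returns false, just as if the end of the input had been reached.
 * Returns the number of tuples the receiver accepted.
 */
static int
demo_run(DemoReceiver *dest, const int *tuples, int ntuples)
{
	int			i;

	assert(dest != NULL && dest->receiveSlot != NULL);
	for (i = 0; i < ntuples; i++)
	{
		if (!dest->receiveSlot(tuples[i], dest))
			break;				/* receiver asked us to stop early */
	}
	return i;
}
```

With a receiver whose capacity is 2 and four input tuples, demo_run stops at the third tuple and reports two accepted. The point of the bool return is that the producer learns the destination is no longer interested without needing an error path.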
Attachment: parallel_seqscan_partialseqscan_v17.patch (application/octet-stream)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 3701d8e..831329a 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -63,6 +63,7 @@
 #include "storage/predicate.h"
 #include "storage/procarray.h"
 #include "storage/smgr.h"
+#include "storage/spin.h"
 #include "storage/standby.h"
 #include "utils/datum.h"
 #include "utils/inval.h"
@@ -80,12 +81,16 @@ bool		synchronize_seqscans = true;
 static HeapScanDesc heap_beginscan_internal(Relation relation,
 						Snapshot snapshot,
 						int nkeys, ScanKey key,
+						ParallelHeapScanDesc parallel_scan,
 						bool allow_strat,
 						bool allow_sync,
 						bool allow_pagemode,
 						bool is_bitmapscan,
 						bool is_samplescan,
 						bool temp_snap);
+static BlockNumber heap_parallelscan_nextpage(HeapScanDesc scan,
+								bool *pscan_finished);
+static void heap_parallelscan_initialize_startblock(HeapScanDesc scan);
 static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
 					TransactionId xid, CommandId cid, int options);
 static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
@@ -226,7 +231,10 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
 	 * results for a non-MVCC snapshot, the caller must hold some higher-level
 	 * lock that ensures the interesting tuple(s) won't change.)
 	 */
-	scan->rs_nblocks = RelationGetNumberOfBlocks(scan->rs_rd);
+	if (scan->rs_parallel != NULL)
+		scan->rs_nblocks = scan->rs_parallel->phs_nblocks;
+	else
+		scan->rs_nblocks = RelationGetNumberOfBlocks(scan->rs_rd);
 
 	/*
 	 * If the table is large relative to NBuffers, use a bulk-read access
@@ -272,7 +280,10 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
 	else if (allow_sync && synchronize_seqscans)
 	{
 		scan->rs_syncscan = true;
-		scan->rs_startblock = ss_get_location(scan->rs_rd, scan->rs_nblocks);
+		if (scan->rs_parallel != NULL)
+			heap_parallelscan_initialize_startblock(scan);
+		else
+			scan->rs_startblock = ss_get_location(scan->rs_rd, scan->rs_nblocks);
 	}
 	else
 	{
@@ -496,7 +507,32 @@ heapgettup(HeapScanDesc scan,
 				tuple->t_data = NULL;
 				return;
 			}
-			page = scan->rs_startblock; /* first page */
+			if (scan->rs_parallel != NULL)
+			{
+				bool	pscan_finished;
+
+				page = heap_parallelscan_nextpage(scan, &pscan_finished);
+
+				/*
+				 * Return NULL if the scan is finished.  It can happen that by
+				 * the time one of the workers starts its scan, the others have
+				 * already finished scanning the relation, so this worker has
+				 * nothing left to scan.  Report the scan location before
+				 * finishing so that the final state of the position hint is
+				 * back at the start of the rel.
+				 */
+				if (pscan_finished)
+				{
+					if (scan->rs_syncscan)
+						ss_report_location(scan->rs_rd, page);
+
+					Assert(!BufferIsValid(scan->rs_cbuf));
+					tuple->t_data = NULL;
+					return;
+				}
+			}
+			else
+				page = scan->rs_startblock; /* first page */
 			heapgetpage(scan, page);
 			lineoff = FirstOffsetNumber;		/* first offnum */
 			scan->rs_inited = true;
@@ -519,6 +555,9 @@ heapgettup(HeapScanDesc scan,
 	}
 	else if (backward)
 	{
+		/* backward parallel scan not supported */
+		Assert(scan->rs_parallel == NULL);
+
 		if (!scan->rs_inited)
 		{
 			/*
@@ -671,11 +710,22 @@ heapgettup(HeapScanDesc scan,
 		}
 		else
 		{
-			page++;
-			if (page >= scan->rs_nblocks)
-				page = 0;
-			finished = (page == scan->rs_startblock) ||
-				(scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks == 0 : false);
+			if (scan->rs_parallel != NULL)
+			{
+				bool	pscan_finished = false;
+
+				page = heap_parallelscan_nextpage(scan, &pscan_finished);
+				finished = pscan_finished;
+			}
+			else
+			{
+				page++;
+				if (page >= scan->rs_nblocks)
+					page = 0;
+
+				finished = (page == scan->rs_startblock) ||
+						   (scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
+			}
 
 			/*
 			 * Report our new scan position for synchronization purposes. We
@@ -773,7 +823,32 @@ heapgettup_pagemode(HeapScanDesc scan,
 				tuple->t_data = NULL;
 				return;
 			}
-			page = scan->rs_startblock; /* first page */
+			if (scan->rs_parallel != NULL)
+			{
+				bool	pscan_finished;
+
+				page = heap_parallelscan_nextpage(scan, &pscan_finished);
+
+				/*
+				 * Return NULL if the scan is finished.  It can happen that by
+				 * the time one of the workers starts its scan, the others have
+				 * already finished scanning the relation, so this worker has
+				 * nothing left to scan.  Report the scan location before
+				 * finishing so that the final state of the position hint is
+				 * back at the start of the rel.
+				 */
+				if (pscan_finished)
+				{
+					if (scan->rs_syncscan)
+						ss_report_location(scan->rs_rd, page);
+
+					Assert(!BufferIsValid(scan->rs_cbuf));
+					tuple->t_data = NULL;
+					return;
+				}
+			}
+			else
+				page = scan->rs_startblock; /* first page */
 			heapgetpage(scan, page);
 			lineindex = 0;
 			scan->rs_inited = true;
@@ -793,6 +868,9 @@ heapgettup_pagemode(HeapScanDesc scan,
 	}
 	else if (backward)
 	{
+		/* backward parallel scan not supported */
+		Assert(scan->rs_parallel == NULL);
+
 		if (!scan->rs_inited)
 		{
 			/*
@@ -934,11 +1012,22 @@ heapgettup_pagemode(HeapScanDesc scan,
 		}
 		else
 		{
-			page++;
-			if (page >= scan->rs_nblocks)
-				page = 0;
-			finished = (page == scan->rs_startblock) ||
-				(scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks == 0 : false);
+			if (scan->rs_parallel != NULL)
+			{
+				bool	pscan_finished = false;
+
+				page = heap_parallelscan_nextpage(scan, &pscan_finished);
+				finished = pscan_finished;
+			}
+			else
+			{
+				page++;
+				if (page >= scan->rs_nblocks)
+					page = 0;
+
+				finished = (page == scan->rs_startblock) ||
+						   (scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
+			}
 
 			/*
 			 * Report our new scan position for synchronization purposes. We
@@ -1341,7 +1430,7 @@ HeapScanDesc
 heap_beginscan(Relation relation, Snapshot snapshot,
 			   int nkeys, ScanKey key)
 {
-	return heap_beginscan_internal(relation, snapshot, nkeys, key,
+	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
 								   true, true, true, false, false, false);
 }
 
@@ -1351,7 +1440,7 @@ heap_beginscan_catalog(Relation relation, int nkeys, ScanKey key)
 	Oid			relid = RelationGetRelid(relation);
 	Snapshot	snapshot = RegisterSnapshot(GetCatalogSnapshot(relid));
 
-	return heap_beginscan_internal(relation, snapshot, nkeys, key,
+	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
 								   true, true, true, false, false, true);
 }
 
@@ -1360,7 +1449,7 @@ heap_beginscan_strat(Relation relation, Snapshot snapshot,
 					 int nkeys, ScanKey key,
 					 bool allow_strat, bool allow_sync)
 {
-	return heap_beginscan_internal(relation, snapshot, nkeys, key,
+	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
 								   allow_strat, allow_sync, true,
 								   false, false, false);
 }
@@ -1369,7 +1458,7 @@ HeapScanDesc
 heap_beginscan_bm(Relation relation, Snapshot snapshot,
 				  int nkeys, ScanKey key)
 {
-	return heap_beginscan_internal(relation, snapshot, nkeys, key,
+	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
 								   false, false, true, true, false, false);
 }
 
@@ -1378,7 +1467,7 @@ heap_beginscan_sampling(Relation relation, Snapshot snapshot,
 						int nkeys, ScanKey key,
 					  bool allow_strat, bool allow_sync, bool allow_pagemode)
 {
-	return heap_beginscan_internal(relation, snapshot, nkeys, key,
+	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
 								   allow_strat, allow_sync, allow_pagemode,
 								   false, true, false);
 }
@@ -1386,6 +1475,7 @@ heap_beginscan_sampling(Relation relation, Snapshot snapshot,
 static HeapScanDesc
 heap_beginscan_internal(Relation relation, Snapshot snapshot,
 						int nkeys, ScanKey key,
+						ParallelHeapScanDesc parallel_scan,
 						bool allow_strat,
 						bool allow_sync,
 						bool allow_pagemode,
@@ -1418,6 +1508,7 @@ heap_beginscan_internal(Relation relation, Snapshot snapshot,
 	scan->rs_allow_strat = allow_strat;
 	scan->rs_allow_sync = allow_sync;
 	scan->rs_temp_snap = temp_snap;
+	scan->rs_parallel = parallel_scan;
 
 	/*
 	 * we can use page-at-a-time mode if it's an MVCC-safe snapshot
@@ -1532,6 +1623,159 @@ heap_endscan(HeapScanDesc scan)
 }
 
 /* ----------------
+ *		heap_parallelscan_estimate - estimate storage for ParallelHeapScanDesc
+ *
+ *		Sadly, this doesn't reduce to a constant, because the size required
+ *		to serialize the snapshot can vary.
+ * ----------------
+ */
+Size
+heap_parallelscan_estimate(Snapshot snapshot)
+{
+	return add_size(offsetof(ParallelHeapScanDescData, phs_snapshot_data),
+					EstimateSnapshotSpace(snapshot));
+}
+
+/* ----------------
+ *		heap_parallelscan_initialize - initialize ParallelHeapScanDesc
+ *
+ *		Must allow as many bytes of shared memory as returned by
+ *		heap_parallelscan_estimate.  Call this just once in the leader
+ *		process; then, individual workers attach via heap_beginscan_parallel.
+ * ----------------
+ */
+void
+heap_parallelscan_initialize(ParallelHeapScanDesc target, Relation relation,
+							 Snapshot snapshot)
+{
+	target->phs_relid = RelationGetRelid(relation);
+	target->phs_nblocks = RelationGetNumberOfBlocks(relation);
+	SpinLockInit(&target->phs_mutex);
+	target->phs_cblock = InvalidBlockNumber;
+	target->phs_firstpass = true;
+	SerializeSnapshot(snapshot, target->phs_snapshot_data);
+}
+
+/* ----------------
+ *		heap_parallelscan_initialize_startblock - initialize the startblock for
+ *					parallel scan.
+ *
+ *		Only the first worker to reach this point initializes the start
+ *		block for the scan; the others reuse that value to detect the
+ *		end of the scan.
+ * ----------------
+ */
+static void
+heap_parallelscan_initialize_startblock(HeapScanDesc scan)
+{
+	ParallelHeapScanDesc parallel_scan;
+
+	Assert(scan->rs_parallel);
+
+	parallel_scan = scan->rs_parallel;
+
+	/*
+	 * InvalidBlockNumber indicates that this initialization is done for
+	 * first worker.
+	 */
+	SpinLockAcquire(&parallel_scan->phs_mutex);
+	if (parallel_scan->phs_cblock == InvalidBlockNumber)
+	{
+		scan->rs_startblock = ss_get_location(scan->rs_rd, scan->rs_nblocks);
+		parallel_scan->phs_cblock = scan->rs_startblock;
+		parallel_scan->phs_startblock = scan->rs_startblock;
+	}
+	else
+		scan->rs_startblock = parallel_scan->phs_startblock;
+	SpinLockRelease(&parallel_scan->phs_mutex);
+}
+
+/* ----------------
+ *		heap_parallelscan_nextpage - get the next page to scan
+ *
+ *		Reaching the position where the parallel scan started indicates the
+ *		end of the scan.  Note, however, that other backends could still be
+ *		scanning if they grabbed a page to scan and aren't done with it yet.
+ *		Resets the current position for the parallel scan to the beginning
+ *		of the relation if the next page to scan is past the last page of
+ *		the relation.
+ * ----------------
+ */
+static BlockNumber
+heap_parallelscan_nextpage(HeapScanDesc scan,
+						   bool *pscan_finished)
+{
+	BlockNumber	page = InvalidBlockNumber;
+	ParallelHeapScanDesc parallel_scan;
+
+	Assert(scan->rs_parallel);
+
+	parallel_scan = scan->rs_parallel;
+
+	*pscan_finished = false;
+
+	/* we treat InvalidBlockNumber specially here to avoid overflow */
+	SpinLockAcquire(&parallel_scan->phs_mutex);
+	if (parallel_scan->phs_cblock != InvalidBlockNumber)
+		page = parallel_scan->phs_cblock++;
+
+	if (page >= scan->rs_nblocks)
+	{
+		parallel_scan->phs_cblock = 0;
+		page = parallel_scan->phs_cblock++;
+	}
+
+	/*
+	 * The scan position equals the start position twice: once at the
+	 * start of the scan and again at the end of the scan.
+	 */
+	if (parallel_scan->phs_firstpass && page == parallel_scan->phs_startblock)
+		parallel_scan->phs_firstpass = false;
+	else if (!parallel_scan->phs_firstpass && page == parallel_scan->phs_startblock)
+	{
+		*pscan_finished = true;
+		parallel_scan->phs_cblock--;
+	}
+	SpinLockRelease(&parallel_scan->phs_mutex);
+
+	return page;
+}
+
+/* ----------------
+ *		heap_beginscan_parallel - join a parallel scan
+ *
+ *		Caller must hold a suitable lock on the correct relation.
+ * ----------------
+ */
+HeapScanDesc
+heap_beginscan_parallel(Relation relation, ParallelHeapScanDesc parallel_scan)
+{
+	Snapshot		snapshot;
+
+	Assert(RelationGetRelid(relation) == parallel_scan->phs_relid);
+	snapshot = RestoreSnapshot(parallel_scan->phs_snapshot_data);
+	RegisterSnapshot(snapshot);
+
+	return heap_beginscan_internal(relation, snapshot, 0, NULL, parallel_scan,
+								   true, true, true, false, false, true);
+}
+
+/* ----------------
+ *		heap_parallel_rescan		- restart a parallel relation scan
+ * ----------------
+ */
+void
+heap_parallel_rescan(ParallelHeapScanDesc pscan,
+					 HeapScanDesc scan)
+{
+	if (pscan != NULL)
+		scan->rs_parallel = pscan;
+
+	heap_rescan(scan,			/* scan desc */
+				NULL);			/* new scan keys */
+}
+
+/* ----------------
  *		heap_getnext	- retrieve next tuple in scan
  *
  *		Fix to work with index relations.
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 69d3b34..1a0a550 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -734,6 +734,7 @@ ExplainPreScanNode(PlanState *planstate, Bitmapset **rels_used)
 	{
 		case T_SeqScan:
 		case T_SampleScan:
+		case T_PartialSeqScan:
 		case T_Funnel:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
@@ -942,6 +943,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_SampleScan:
 			pname = sname = "Sample Scan";
 			break;
+		case T_PartialSeqScan:
+			pname = sname = "Partial Seq Scan";
+			break;
 		case T_Funnel:
 			pname = sname = "Funnel";
 			break;
@@ -1095,6 +1099,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	{
 		case T_SeqScan:
 		case T_SampleScan:
+		case T_PartialSeqScan:
 		case T_Funnel:
 		case T_BitmapHeapScan:
 		case T_TidScan:
@@ -1370,6 +1375,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 							 planstate, ancestors, es);
 			/* FALL THRU to print additional fields the same as SeqScan */
 		case T_SeqScan:
+		case T_PartialSeqScan:
 		case T_ValuesScan:
 		case T_CteScan:
 		case T_WorkTableScan:
@@ -2446,6 +2452,7 @@ ExplainTargetRel(Plan *plan, Index rti, ExplainState *es)
 	{
 		case T_SeqScan:
 		case T_SampleScan:
+		case T_PartialSeqScan:
 		case T_Funnel:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 8037417..be1f47e 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -20,8 +20,8 @@ OBJS = execAmi.o execCurrent.o execGrouping.o execIndexing.o execJunk.o \
        nodeHash.o nodeHashjoin.o nodeIndexscan.o nodeIndexonlyscan.o \
        nodeLimit.o nodeLockRows.o \
        nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
-       nodeNestloop.o nodeFunctionscan.o nodeRecursiveunion.o nodeResult.o \
-       nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
+       nodeNestloop.o nodeFunctionscan.o nodePartialSeqscan.o nodeRecursiveunion.o \
+       nodeResult.o nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
        nodeValuesscan.o nodeCtescan.o nodeWorktablescan.o \
        nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
        nodeForeignscan.o nodeWindowAgg.o tqueue.o tstoreReceiver.o spi.o
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 4915151..dc45c20 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -38,6 +38,7 @@
 #include "executor/nodeMergejoin.h"
 #include "executor/nodeModifyTable.h"
 #include "executor/nodeNestloop.h"
+#include "executor/nodePartialSeqscan.h"
 #include "executor/nodeRecursiveunion.h"
 #include "executor/nodeResult.h"
 #include "executor/nodeSamplescan.h"
@@ -161,6 +162,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSampleScan((SampleScanState *) node);
 			break;
 
+		case T_PartialSeqScanState:
+			ExecReScanPartialSeqScan((PartialSeqScanState *) node);
+			break;
+
 		case T_FunnelState:
 			ExecReScanFunnel((FunnelState *) node);
 			break;
@@ -473,6 +478,7 @@ ExecSupportsBackwardScan(Plan *node)
 			return false;
 
 		case T_Funnel:
+		case T_PartialSeqScan:
 			return false;
 
 		case T_IndexScan:
diff --git a/src/backend/executor/execCurrent.c b/src/backend/executor/execCurrent.c
index 650fcc5..7a44462 100644
--- a/src/backend/executor/execCurrent.c
+++ b/src/backend/executor/execCurrent.c
@@ -262,6 +262,7 @@ search_plan_tree(PlanState *node, Oid table_oid)
 			 */
 		case T_SeqScanState:
 		case T_SampleScanState:
+		case T_PartialSeqScanState:
 		case T_FunnelState:
 		case T_IndexScanState:
 		case T_IndexOnlyScanState:
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 806f060..b1cbe7e 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -17,6 +17,7 @@
 
 #include "executor/execParallel.h"
 #include "executor/nodeFunnel.h"
+#include "executor/nodePartialSeqscan.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/planmain.h"
 #include "optimizer/planner.h"
@@ -295,6 +296,24 @@ ExecParallelEstimate(Node *node, ParallelContext *pcxt,
 	 */
 	switch (nodeTag(node))
 	{
+		case T_ResultState:
+			{
+				PlanState *planstate = ((ResultState*)node)->ps.lefttree;
+
+				return planstate_tree_walker((Node*)planstate, pcxt,
+											 ExecParallelEstimate, pscan_size);
+			}
+		case T_PartialSeqScanState:
+			{
+				EState		*estate = ((PartialSeqScanState*)node)->ss.ps.state;
+
+				*pscan_size = heap_parallelscan_estimate(estate->es_snapshot);
+				shm_toc_estimate_chunk(&pcxt->estimator, *pscan_size);
+
+				/* key for partial scan information */
+				shm_toc_estimate_keys(&pcxt->estimator, 1);
+				return true;
+			}
 		default:
 			break;
 	}
@@ -311,6 +330,8 @@ bool
 ExecParallelInitializeDSM(Node *node, ParallelContext *pcxt,
 						  Size *pscan_size)
 {
+	ParallelHeapScanDesc pscan;
+
 	if (node == NULL)
 		return false;
 
@@ -320,6 +341,25 @@ ExecParallelInitializeDSM(Node *node, ParallelContext *pcxt,
 	 */
 	switch (nodeTag(node))
 	{
+		case T_ResultState:
+			{
+				PlanState *planstate = ((ResultState*)node)->ps.lefttree;
+
+				return planstate_tree_walker((Node*)planstate, pcxt,
+											 ExecParallelInitializeDSM, pscan_size);
+			}
+		case T_PartialSeqScanState:
+			{
+				EState	*estate = ((PartialSeqScanState*)node)->ss.ps.state;
+
+				/* Store parallel heap scan descriptor in dynamic shared memory. */
+				pscan = shm_toc_allocate(pcxt->toc, *pscan_size);
+				heap_parallelscan_initialize(pscan,
+											 ((PartialSeqScanState*)node)->ss.ss_currentRelation,
+											 estate->es_snapshot);
+				shm_toc_insert(pcxt->toc, PARALLEL_KEY_SCAN, pscan);
+				return true;
+			}
 		default:
 			break;
 	}
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index c181bf2..e24a439 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -100,6 +100,7 @@
 #include "executor/nodeMergejoin.h"
 #include "executor/nodeModifyTable.h"
 #include "executor/nodeNestloop.h"
+#include "executor/nodePartialSeqscan.h"
 #include "executor/nodeFunnel.h"
 #include "executor/nodeRecursiveunion.h"
 #include "executor/nodeResult.h"
@@ -197,6 +198,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 													  estate, eflags);
 			break;
 
+		case T_PartialSeqScan:
+			result = (PlanState *) ExecInitPartialSeqScan((PartialSeqScan *) node,
+														  estate, eflags);
+			break;
+
 		case T_Funnel:
 			result = (PlanState *) ExecInitFunnel((Funnel *) node,
 												  estate, eflags);
@@ -422,6 +428,10 @@ ExecProcNode(PlanState *node)
 			result = ExecSampleScan((SampleScanState *) node);
 			break;
 
+		case T_PartialSeqScanState:
+			result = ExecPartialSeqScan((PartialSeqScanState *) node);
+			break;
+
 		case T_FunnelState:
 			result = ExecFunnel((FunnelState *) node);
 			break;
@@ -668,6 +678,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSampleScan((SampleScanState *) node);
 			break;
 
+		case T_PartialSeqScanState:
+			ExecEndPartialSeqScan((PartialSeqScanState *) node);
+			break;
+
 		case T_FunnelState:
 			ExecEndFunnel((FunnelState *) node);
 			break;
diff --git a/src/backend/executor/nodePartialSeqscan.c b/src/backend/executor/nodePartialSeqscan.c
new file mode 100644
index 0000000..c18dce0
--- /dev/null
+++ b/src/backend/executor/nodePartialSeqscan.c
@@ -0,0 +1,308 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodePartialSeqscan.c
+ *	  Support routines for partial sequential scans of relations.
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodePartialSeqscan.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * INTERFACE ROUTINES
+ *		ExecPartialSeqScan				scans a relation partially.
+ *		PartialSeqNext					retrieve next tuple from heap.
+ *		ExecInitPartialSeqScan			creates and initializes a partial seqscan node.
+ *		ExecEndPartialSeqScan			releases any storage allocated.
+ */
+#include "postgres.h"
+
+#include "access/relscan.h"
+#include "executor/execdebug.h"
+#include "executor/execParallel.h"
+#include "executor/nodePartialSeqscan.h"
+#include "utils/rel.h"
+
+
+
+/* ----------------------------------------------------------------
+ *						Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		PartialSeqNext
+ *
+ *		This is a workhorse for ExecPartialSeqScan
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+PartialSeqNext(PartialSeqScanState *node)
+{
+	HeapTuple	tuple;
+	HeapScanDesc scandesc;
+	EState	   *estate;
+	ScanDirection direction;
+	TupleTableSlot *slot;
+
+	/*
+	 * get information from the estate and scan state
+	 */
+	scandesc = node->ss.ss_currentScanDesc;
+	estate = node->ss.ps.state;
+	direction = estate->es_direction;
+	slot = node->ss.ss_ScanTupleSlot;
+
+	/*
+	 * get the next tuple from the table
+	 */
+	tuple = heap_getnext(scandesc, direction);
+
+	/*
+	 * save the tuple and the buffer returned to us by the access methods in
+	 * our scan tuple slot and return the slot.  Note: we pass 'false' because
+	 * tuples returned by heap_getnext() are pointers onto disk pages and were
+	 * not created with palloc() and so should not be pfree()'d.  Note also
+	 * that ExecStoreTuple will increment the refcount of the buffer; the
+	 * refcount will not be dropped until the tuple table slot is cleared.
+	 */
+	if (tuple)
+		ExecStoreTuple(tuple,	/* tuple to store */
+					   slot,	/* slot to store in */
+					   scandesc->rs_cbuf,		/* buffer associated with this
+												 * tuple */
+					   false);	/* don't pfree this pointer */
+	else
+		ExecClearTuple(slot);
+
+	return slot;
+}
+
+/*
+ * PartialSeqRecheck -- access method routine to recheck a tuple in EvalPlanQual
+ */
+static bool
+PartialSeqRecheck(PartialSeqScanState *node, TupleTableSlot *slot)
+{
+	/*
+	 * Note that unlike IndexScan, PartialSeqScan never uses keys in
+	 * heap_beginscan (and this is very bad), so there are no keys to
+	 * recheck here.
+	 */
+	return true;
+}
+
+/* ----------------------------------------------------------------
+ *		InitPartialScanRelation
+ *
+ *		Set up to access the scan relation.
+ * ----------------------------------------------------------------
+ */
+static void
+InitPartialScanRelation(PartialSeqScanState *node, EState *estate, int eflags)
+{
+	Relation	currentRelation;
+	shm_toc		*toc;
+
+	/*
+	 * get the relation object id from the relid'th entry in the range table,
+	 * open that relation and acquire appropriate lock on it.
+	 */
+	currentRelation = ExecOpenScanRelation(estate,
+										   ((Scan *) node->ss.ps.plan)->scanrelid,
+										   eflags);
+
+	/*
+	 * The parallel scan descriptor is initialized and stored in a dynamic
+	 * shared memory segment by the master backend, and parallel workers
+	 * retrieve it from shared memory.  For parallel workers, we set 'toc'
+	 * (the place to look up the parallel scan descriptor) from the dsm
+	 * attachment, whereas the master backend stores it directly in the
+	 * partial scan state node after initializing workers.
+	 */
+	toc = GetParallelShmToc();
+	if (toc)
+		node->ss.ps.toc = toc;
+
+	node->ss.ss_currentRelation = currentRelation;
+
+	/* and report the scan tuple slot's rowtype */
+	ExecAssignScanType(&node->ss, RelationGetDescr(currentRelation));
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitPartialSeqScan
+ * ----------------------------------------------------------------
+ */
+PartialSeqScanState *
+ExecInitPartialSeqScan(PartialSeqScan *node, EState *estate, int eflags)
+{
+	PartialSeqScanState *scanstate;
+
+	/*
+	 * Once upon a time it was possible to have an outerPlan of a SeqScan, but
+	 * not any more.
+	 */
+	Assert(outerPlan(node) == NULL);
+	Assert(innerPlan(node) == NULL);
+
+	/*
+	 * create state structure
+	 */
+	scanstate = makeNode(PartialSeqScanState);
+	scanstate->ss.ps.plan = (Plan *) node;
+	scanstate->ss.ps.state = estate;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * create expression context for node
+	 */
+	ExecAssignExprContext(estate, &scanstate->ss.ps);
+
+	/*
+	 * initialize child expressions
+	 */
+	scanstate->ss.ps.targetlist = (List *)
+		ExecInitExpr((Expr *) node->plan.targetlist,
+					 (PlanState *) scanstate);
+	scanstate->ss.ps.qual = (List *)
+		ExecInitExpr((Expr *) node->plan.qual,
+					 (PlanState *) scanstate);
+
+	/*
+	 * tuple table initialization
+	 */
+	ExecInitResultTupleSlot(estate, &scanstate->ss.ps);
+	ExecInitScanTupleSlot(estate, &scanstate->ss);
+
+	/*
+	 * initialize scan relation
+	 */
+	InitPartialScanRelation(scanstate, estate, eflags);
+
+	scanstate->ss.ps.ps_TupFromTlist = false;
+
+	/*
+	 * Initialize result tuple type and projection info.
+	 */
+	ExecAssignResultTypeFromTL(&scanstate->ss.ps);
+	ExecAssignScanProjectionInfo(&scanstate->ss);
+
+	return scanstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecPartialSeqScan(node)
+ *
+ *		Scans the relation and returns the next qualifying tuple.
+ *		We call the ExecScan() routine and pass it the appropriate
+ *		access method functions.
+ * ----------------------------------------------------------------
+ */
+TupleTableSlot *
+ExecPartialSeqScan(PartialSeqScanState *node)
+{
+	/*
+	 * Initialize the scan on first execution.  Normally a scan is
+	 * initialized during the ExecutorStart phase, but this node needs the
+	 * ParallelHeapScanDesc to initialize its scan, and that descriptor is
+	 * set up by the Funnel node only during the ExecutorRun phase, so
+	 * initialization is deferred until here.
+	if (!node->scan_initialized)
+	{
+		ParallelHeapScanDesc pscan;
+
+		/*
+		 * The parallel scan descriptor is initialized and stored in a dynamic
+		 * shared memory segment by the master backend; both the parallel
+		 * workers and the master backend's local scan retrieve it from shared
+		 * memory.  If a heap scan descriptor already exists at this point,
+		 * this is a rescan, so we re-initialize it instead of starting anew.
+		 */
+		Assert(node->ss.ps.toc);
+
+		pscan = shm_toc_lookup(node->ss.ps.toc, PARALLEL_KEY_SCAN);
+
+		if (!node->ss.ss_currentScanDesc)
+		{
+			node->ss.ss_currentScanDesc =
+				heap_beginscan_parallel(node->ss.ss_currentRelation, pscan);
+		}
+		else
+		{
+			heap_parallel_rescan(pscan, node->ss.ss_currentScanDesc);
+		}
+
+		node->scan_initialized = true;
+	}
+
+	return ExecScan((ScanState *) node,
+					(ExecScanAccessMtd) PartialSeqNext,
+					(ExecScanRecheckMtd) PartialSeqRecheck);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndPartialSeqScan
+ *
+ *		frees any storage allocated through C routines.
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndPartialSeqScan(PartialSeqScanState *node)
+{
+	Relation	relation;
+	HeapScanDesc scanDesc;
+
+	/*
+	 * get information from node
+	 */
+	relation = node->ss.ss_currentRelation;
+	scanDesc = node->ss.ss_currentScanDesc;
+
+	/*
+	 * Free the exprcontext
+	 */
+	ExecFreeExprContext(&node->ss.ps);
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+
+	/*
+	 * close heap scan
+	 */
+	if (scanDesc)
+		heap_endscan(scanDesc);
+
+	/*
+	 * close the heap relation.
+	 */
+	ExecCloseScanRelation(relation);
+}
+
+/* ----------------------------------------------------------------
+ *						Join Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecReScanPartialSeqScan
+ *
+ *		Rescans the relation.
+ * ----------------------------------------------------------------
+ */
+void
+ExecReScanPartialSeqScan(PartialSeqScanState *node)
+{
+	if (node->scan_initialized)
+		node->scan_initialized = false;
+
+	ExecScanReScan((ScanState *) node);
+}
diff --git a/src/backend/executor/nodeResult.c b/src/backend/executor/nodeResult.c
index 8d3dde0..b348bfd 100644
--- a/src/backend/executor/nodeResult.c
+++ b/src/backend/executor/nodeResult.c
@@ -75,6 +75,13 @@ ExecResult(ResultState *node)
 	econtext = node->ps.ps_ExprContext;
 
 	/*
+	 * A Result node can be added as a gating node on top of a PartialSeqScan
+	 * node, so we need to pass the toc information down to its outer node.
+	 */
+	if (node->ps.toc)
+		outerPlanState(node)->toc = node->ps.toc;
+
+	/*
 	 * check constant qualifications like (2 > 1), if not already done
 	 */
 	if (node->rs_checkqual)
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 11d8191..afed75e 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -382,6 +382,22 @@ _copySampleScan(const SampleScan *from)
 }
 
 /*
+ * _copyPartialSeqScan
+ */
+static PartialSeqScan *
+_copyPartialSeqScan(const SeqScan *from)
+{
+	PartialSeqScan    *newnode = makeNode(PartialSeqScan);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopyScanFields((const Scan *) from, (Scan *) newnode);
+
+	return newnode;
+}
+
+/*
  * _copyFunnel
  */
 static Funnel *
@@ -4260,6 +4276,9 @@ copyObject(const void *from)
 		case T_SampleScan:
 			retval = _copySampleScan(from);
 			break;
+		case T_PartialSeqScan:
+			retval = _copyPartialSeqScan(from);
+			break;
 		case T_Funnel:
 			retval = _copyFunnel(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 232b950..2c66490 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -456,6 +456,14 @@ _outSampleScan(StringInfo str, const SampleScan *node)
 }
 
 static void
+_outPartialSeqScan(StringInfo str, const SeqScan *node)
+{
+	WRITE_NODE_TYPE("PARTIALSEQSCAN");
+
+	_outScanInfo(str, (const Scan *) node);
+}
+
+static void
 _outFunnel(StringInfo str, const Funnel *node)
 {
 	WRITE_NODE_TYPE("FUNNEL");
@@ -3015,6 +3023,9 @@ _outNode(StringInfo str, const void *obj)
 			case T_SampleScan:
 				_outSampleScan(str, obj);
 				break;
+			case T_PartialSeqScan:
+				_outPartialSeqScan(str, obj);
+				break;
 			case T_Funnel:
 				_outFunnel(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index e0fe8d5..e340543 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1412,6 +1412,81 @@ _readPlannedStmt(void)
 }
 
 /*
+ * _readPlan
+ */
+static Plan *
+_readPlan(void)
+{
+	READ_LOCALS(Plan);
+
+	READ_FLOAT_FIELD(startup_cost);
+	READ_FLOAT_FIELD(total_cost);
+	READ_FLOAT_FIELD(plan_rows);
+	READ_INT_FIELD(plan_width);
+	READ_NODE_FIELD(targetlist);
+	READ_NODE_FIELD(qual);
+	READ_NODE_FIELD(lefttree);
+	READ_NODE_FIELD(righttree);
+	READ_NODE_FIELD(initPlan);
+	READ_BITMAPSET_FIELD(extParam);
+	READ_BITMAPSET_FIELD(allParam);
+
+	READ_DONE();
+}
+
+/*
+ * _readScan
+ */
+static Scan *
+_readScan(void)
+{
+	Plan *local_plan;
+	READ_LOCALS(PartialSeqScan);
+
+	local_plan = _readPlan();
+	local_node->plan.startup_cost = local_plan->startup_cost;
+	local_node->plan.total_cost = local_plan->total_cost;
+	local_node->plan.plan_rows = local_plan->plan_rows;
+	local_node->plan.plan_width = local_plan->plan_width;
+	local_node->plan.targetlist = local_plan->targetlist;
+	local_node->plan.qual = local_plan->qual;
+	local_node->plan.lefttree = local_plan->lefttree;
+	local_node->plan.righttree = local_plan->righttree;
+	local_node->plan.initPlan = local_plan->initPlan;
+	local_node->plan.extParam = local_plan->extParam;
+	local_node->plan.allParam = local_plan->allParam;
+	READ_UINT_FIELD(scanrelid);
+
+	READ_DONE();
+}
+
+/*
+ * _readResult
+ */
+static Result *
+_readResult(void)
+{
+	Plan *local_plan;
+	READ_LOCALS(Result);
+
+	local_plan = _readPlan();
+	local_node->plan.startup_cost = local_plan->startup_cost;
+	local_node->plan.total_cost = local_plan->total_cost;
+	local_node->plan.plan_rows = local_plan->plan_rows;
+	local_node->plan.plan_width = local_plan->plan_width;
+	local_node->plan.targetlist = local_plan->targetlist;
+	local_node->plan.qual = local_plan->qual;
+	local_node->plan.lefttree = local_plan->lefttree;
+	local_node->plan.righttree = local_plan->righttree;
+	local_node->plan.initPlan = local_plan->initPlan;
+	local_node->plan.extParam = local_plan->extParam;
+	local_node->plan.allParam = local_plan->allParam;
+	READ_NODE_FIELD(resconstantqual);
+
+	READ_DONE();
+}
+
+/*
  * parseNodeString
  *
  * Given a character string representing a node tree, parseNodeString creates
@@ -1553,6 +1628,10 @@ parseNodeString(void)
 		return_value = _readPlanInvalItem();
 	else if (MATCH("PLANNEDSTMT", 11))
 		return_value = _readPlannedStmt();
+	else if (MATCH("PARTIALSEQSCAN", 14))
+		return_value = _readScan();
+	else if (MATCH("RESULT", 6))
+		return_value = _readResult();
 	else
 	{
 		elog(ERROR, "badly formatted node string \"%.32s\"...", token);
diff --git a/src/backend/optimizer/path/Makefile b/src/backend/optimizer/path/Makefile
index 6864a62..6e462b1 100644
--- a/src/backend/optimizer/path/Makefile
+++ b/src/backend/optimizer/path/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = allpaths.o clausesel.o costsize.o equivclass.o indxpath.o \
-       joinpath.o joinrels.o pathkeys.o tidpath.o
+       joinpath.o joinrels.o pathkeys.o parallelpath.o tidpath.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 8fc1cfd..c2ae95d 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -477,6 +477,9 @@ set_plain_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
 	/* Consider sequential scan */
 	add_path(rel, create_seqscan_path(root, rel, required_outer));
 
+	/* Consider parallel scans */
+	create_parallelscan_paths(root, rel, required_outer);
+
 	/* Consider index scans */
 	create_index_paths(root, rel);
 
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 78d976a..55da0c2 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -296,6 +296,50 @@ cost_samplescan(Path *path, PlannerInfo *root,
 }
 
 /*
+ * cost_patialseqscan
+ *	  Determines and returns the cost of scanning a relation partially.
+ *
+ * 'baserel' is the relation to be scanned
+ * 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
+ * 'nworkers' is the number of workers among which the work will be
+ *			distributed
+ */
+void
+cost_patialseqscan(Path *path, PlannerInfo *root,
+				   RelOptInfo *baserel, ParamPathInfo *param_info,
+				   int nworkers)
+{
+	Cost		startup_cost = 0;
+	Cost		run_cost = 0;
+
+	cost_seqscan(path, root, baserel, param_info);
+
+	startup_cost = path->startup_cost;
+
+	run_cost = path->total_cost - startup_cost;
+
+	/*
+	 * Account for the small cost of scan-related communication via the
+	 * ParallelHeapScanDesc.
+	 */
+	run_cost += 0.01;
+
+	/*
+	 * The runtime cost is shared equally by the master backend and
+	 * all workers (hence nworkers + 1).  The assumption here is that
+	 * disk access cost is also shared equally, which is generally
+	 * true unless too many workers are working on a relatively small
+	 * number of blocks.  If we come across any such case, then we
+	 * can think of changing the current cost model for partial
+	 * sequential scan.
+	 */
+	run_cost = run_cost / (nworkers + 1);
+
+	path->startup_cost = startup_cost;
+	path->total_cost = startup_cost + run_cost;
+}
+
+/*
  * cost_funnel
  *	  Determines and returns the cost of funnel path.
  *
diff --git a/src/backend/optimizer/path/parallelpath.c b/src/backend/optimizer/path/parallelpath.c
new file mode 100644
index 0000000..e813ba1
--- /dev/null
+++ b/src/backend/optimizer/path/parallelpath.c
@@ -0,0 +1,93 @@
+/*-------------------------------------------------------------------------
+ *
+ * parallelpath.c
+ *	  Routines to determine parallel paths for scanning a given relation.
+ *
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/optimizer/path/parallelpath.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heapam.h"
+#include "optimizer/cost.h"
+#include "optimizer/pathnode.h"
+#include "optimizer/paths.h"
+#include "parser/parsetree.h"
+#include "utils/rel.h"
+
+
+/*
+ * create_parallelscan_paths
+ *	  Create paths corresponding to parallel scans of the given rel.
+ *	  Currently we only support partial sequential scan.
+ *
+ *	  Candidate paths are added to the rel's pathlist (using add_path).
+ */
+void
+create_parallelscan_paths(PlannerInfo *root, RelOptInfo *rel,
+						  Relids required_outer)
+{
+	int			num_parallel_workers = 0;
+	int			estimated_parallel_workers = 0;
+	Oid			reloid;
+	Relation	relation;
+	Path		*subpath;
+
+	/*
+	 * A parallel scan is possible only if parallel_seqscan_degree is set to
+	 * a value greater than 0 and the query is parallel-safe.
+	 */
+	if (parallel_seqscan_degree <= 0 || !root->glob->parallelModeOK)
+		return;
+
+	/*
+	 * There should be at least a thousand pages to scan for each worker.
+	 * This number is somewhat arbitrary; however, we don't want to
+	 * spawn workers to scan smaller relations, as that would be costly.
+	 */
+	estimated_parallel_workers = rel->pages / 1000;
+
+	if (estimated_parallel_workers <= 0)
+		return;
+
+	reloid = planner_rt_fetch(rel->relid, root)->relid;
+
+	relation = heap_open(reloid, NoLock);
+
+	/*
+	 * Temporary relations can't be scanned by parallel workers as
+	 * they are visible only to local sessions.
+	 */
+	if (RelationUsesLocalBuffers(relation))
+	{
+		heap_close(relation, NoLock);
+		return;
+	}
+
+	heap_close(relation, NoLock);
+
+	/* Use the smaller of the configured degree and the estimated count. */
+	if (parallel_seqscan_degree <= estimated_parallel_workers)
+		num_parallel_workers = parallel_seqscan_degree;
+	else
+		num_parallel_workers = estimated_parallel_workers;
+
+	/*
+	 * Create the partial scan path which each worker backend needs to
+	 * execute.
+	 */
+	subpath = create_partialseqscan_path(root, rel, required_outer,
+										 num_parallel_workers);
+
+	/* Create the funnel path which master backend needs to execute. */
+	add_path(rel, (Path *) create_funnel_path(root, rel, subpath,
+											  required_outer,
+											  num_parallel_workers));
+}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 68d8837..6c95341 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -60,6 +60,8 @@ static SeqScan *create_seqscan_plan(PlannerInfo *root, Path *best_path,
 					List *tlist, List *scan_clauses);
 static SampleScan *create_samplescan_plan(PlannerInfo *root, Path *best_path,
 					   List *tlist, List *scan_clauses);
+static Scan *create_partialseqscan_plan(PlannerInfo *root, Path *best_path,
+						List *tlist, List *scan_clauses);
 static Funnel *create_funnel_plan(PlannerInfo *root,
 						FunnelPath *best_path);
 static Scan *create_indexscan_plan(PlannerInfo *root, IndexPath *best_path,
@@ -106,6 +108,8 @@ static void copy_plan_costsize(Plan *dest, Plan *src);
 static SeqScan *make_seqscan(List *qptlist, List *qpqual, Index scanrelid);
 static SampleScan *make_samplescan(List *qptlist, List *qpqual, Index scanrelid,
 				TableSampleClause *tsc);
+static PartialSeqScan *make_partialseqscan(List *qptlist, List *qpqual,
+									Index scanrelid);
 static Funnel *make_funnel(List *qptlist, List *qpqual,
 					Index scanrelid, int nworkers,
 					Plan *subplan);
@@ -239,6 +243,7 @@ create_plan_recurse(PlannerInfo *root, Path *best_path)
 	{
 		case T_SeqScan:
 		case T_SampleScan:
+		case T_PartialSeqScan:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -365,6 +370,13 @@ create_scan_plan(PlannerInfo *root, Path *best_path)
 												   scan_clauses);
 			break;
 
+		case T_PartialSeqScan:
+			plan = (Plan *) create_partialseqscan_plan(root,
+													   best_path,
+													   tlist,
+													   scan_clauses);
+			break;
+
 		case T_IndexScan:
 			plan = (Plan *) create_indexscan_plan(root,
 												  (IndexPath *) best_path,
@@ -569,6 +581,7 @@ disuse_physical_tlist(PlannerInfo *root, Plan *plan, Path *path)
 	{
 		case T_SeqScan:
 		case T_SampleScan:
+		case T_PartialSeqScan:
 		case T_Funnel:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
@@ -1204,6 +1217,46 @@ create_samplescan_plan(PlannerInfo *root, Path *best_path,
 }
 
 /*
+ * create_partialseqscan_plan
+ *
+ * Returns a partial seqscan plan for the base relation scanned by
+ * 'best_path' with restriction clauses 'scan_clauses' and targetlist
+ * 'tlist'.
+ */
+static Scan *
+create_partialseqscan_plan(PlannerInfo *root, Path *best_path,
+						   List *tlist, List *scan_clauses)
+{
+	Scan    *scan_plan;
+	Index		scan_relid = best_path->parent->relid;
+
+	/* it should be a base rel... */
+	Assert(scan_relid > 0);
+	Assert(best_path->parent->rtekind == RTE_RELATION);
+
+	/* Sort clauses into best execution order */
+	scan_clauses = order_qual_clauses(root, scan_clauses);
+
+	/* Reduce RestrictInfo list to bare expressions; ignore pseudoconstants */
+	scan_clauses = extract_actual_clauses(scan_clauses, false);
+
+	/* Replace any outer-relation variables with nestloop params */
+	if (best_path->param_info)
+	{
+		scan_clauses = (List *)
+			replace_nestloop_params(root, (Node *) scan_clauses);
+	}
+
+	scan_plan = (Scan *) make_partialseqscan(tlist,
+											 scan_clauses,
+											 scan_relid);
+
+	copy_path_costsize(&scan_plan->plan, best_path);
+
+	return scan_plan;
+}
+
+/*
  * create_funnel_plan
  *
  * Returns a funnel plan for the base relation scanned by
@@ -3533,6 +3586,24 @@ make_samplescan(List *qptlist,
 	return node;
 }
 
+static PartialSeqScan *
+make_partialseqscan(List *qptlist,
+					List *qpqual,
+					Index scanrelid)
+{
+	PartialSeqScan *node = makeNode(PartialSeqScan);
+	Plan	   *plan = &node->plan;
+
+	/* cost should be inserted by caller */
+	plan->targetlist = qptlist;
+	plan->qual = qpqual;
+	plan->lefttree = NULL;
+	plan->righttree = NULL;
+	node->scanrelid = scanrelid;
+
+	return node;
+}
+
 static Funnel *
 make_funnel(List *qptlist,
 			List *qpqual,
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 12f6635..39c35c6 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -442,6 +442,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
 			{
 				SeqScan    *splan = (SeqScan *) plan;
 
@@ -2308,6 +2309,11 @@ fix_node_funcids(Plan *node)
 
 	switch (nodeTag(node))
 	{
+		case T_Result:
+			fix_opfuncids((Node *) (((Result *) node)->resconstantqual));
+			break;
+		case T_PartialSeqScan:
+			break;
 		default:
 			elog(ERROR, "unrecognized node type: %d", (int) nodeTag(node));
 			break;
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 073a7f5..37b5909 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2243,6 +2243,7 @@ finalize_plan(PlannerInfo *root, Plan *plan, Bitmapset *valid_params,
 			context.paramids = bms_add_members(context.paramids, scan_params);
 			break;
 
+		case T_PartialSeqScan:
 		case T_Funnel:
 			context.paramids = bms_add_members(context.paramids, scan_params);
 			break;
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 276ad96..ef2725a 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -732,6 +732,28 @@ create_samplescan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer
 }
 
 /*
+ * create_partialseqscan_path
+ *	  Creates a path corresponding to a partial sequential scan, returning the
+ *	  pathnode.
+ */
+Path *
+create_partialseqscan_path(PlannerInfo *root, RelOptInfo *rel,
+						   Relids required_outer, int nworkers)
+{
+	Path	   *pathnode = makeNode(Path);
+
+	pathnode->pathtype = T_PartialSeqScan;
+	pathnode->parent = rel;
+	pathnode->param_info = get_baserel_parampathinfo(root, rel,
+													 required_outer);
+	pathnode->pathkeys = NIL;	/* partialseqscan has unordered result */
+
+	cost_patialseqscan(pathnode, root, rel, pathnode->param_info, nworkers);
+
+	return pathnode;
+}
+
+/*
  * create_funnel_path
  *
  *	  Creates a path corresponding to a funnel scan, returning the
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index b3e3202..ead8411 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -96,8 +96,9 @@ extern Relation heap_openrv_extended(const RangeVar *relation,
 
 #define heap_close(r,l)  relation_close(r,l)
 
-/* struct definition appears in relscan.h */
+/* struct definitions appear in relscan.h */
 typedef struct HeapScanDescData *HeapScanDesc;
+typedef struct ParallelHeapScanDescData *ParallelHeapScanDesc;
 
 /*
  * HeapScanIsValid
@@ -121,11 +122,16 @@ extern void heap_setscanlimits(HeapScanDesc scan, BlockNumber startBlk,
 				   BlockNumber endBlk);
 extern void heapgetpage(HeapScanDesc scan, BlockNumber page);
 extern void heap_rescan(HeapScanDesc scan, ScanKey key);
+extern void heap_parallel_rescan(ParallelHeapScanDesc pscan, HeapScanDesc scan);
 extern void heap_rescan_set_params(HeapScanDesc scan, ScanKey key,
 					 bool allow_strat, bool allow_sync, bool allow_pagemode);
 extern void heap_endscan(HeapScanDesc scan);
 extern HeapTuple heap_getnext(HeapScanDesc scan, ScanDirection direction);
 
+extern Size heap_parallelscan_estimate(Snapshot snapshot);
+extern void heap_parallelscan_initialize(ParallelHeapScanDesc target,
+							 Relation relation, Snapshot snapshot);
+extern HeapScanDesc heap_beginscan_parallel(Relation, ParallelHeapScanDesc);
 
 extern bool heap_fetch(Relation relation, Snapshot snapshot,
 		   HeapTuple tuple, Buffer *userbuf, bool keep_buf,
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 6e62319..f962f83 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -20,6 +20,17 @@
 #include "access/itup.h"
 #include "access/tupdesc.h"
 
+/* Struct for parallel scan setup */
+typedef struct ParallelHeapScanDescData
+{
+	Oid			phs_relid;
+	BlockNumber	phs_nblocks;
+	slock_t		phs_mutex;
+	BlockNumber phs_cblock;
+	BlockNumber phs_startblock;
+	bool		phs_firstpass;
+	char		phs_snapshot_data[FLEXIBLE_ARRAY_MEMBER];
+}	ParallelHeapScanDescData;
 
 typedef struct HeapScanDescData
 {
@@ -49,6 +60,7 @@ typedef struct HeapScanDescData
 	BlockNumber rs_cblock;		/* current block # in scan, if any */
 	Buffer		rs_cbuf;		/* current buffer in scan, if any */
 	/* NB: if rs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
+	ParallelHeapScanDesc rs_parallel; /* parallel scan information */
 
 	/* these fields only used in page-at-a-time mode and for bitmap scans */
 	int			rs_cindex;		/* current tuple's index in vistuples */
diff --git a/src/include/executor/nodePartialSeqscan.h b/src/include/executor/nodePartialSeqscan.h
new file mode 100644
index 0000000..f97c706
--- /dev/null
+++ b/src/include/executor/nodePartialSeqscan.h
@@ -0,0 +1,25 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodePartialSeqscan.h
+ *		prototypes for nodePartialSeqscan.c
+ *
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodePartialSeqscan.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEPARTIALSEQSCAN_H
+#define NODEPARTIALSEQSCAN_H
+
+#include "nodes/execnodes.h"
+
+extern PartialSeqScanState *ExecInitPartialSeqScan(PartialSeqScan *node,
+											EState *estate, int eflags);
+extern TupleTableSlot *ExecPartialSeqScan(PartialSeqScanState *node);
+extern void ExecEndPartialSeqScan(PartialSeqScanState *node);
+extern void ExecReScanPartialSeqScan(PartialSeqScanState *node);
+
+#endif   /* NODEPARTIALSEQSCAN_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 8f10c4e..9cb31b5 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1292,6 +1292,16 @@ typedef struct SampleScanState
 } SampleScanState;
 
 /*
+ * PartialSeqScanState extends ScanState by storing additional information
+ * related to scan.
+ */
+typedef struct PartialSeqScanState
+{
+	ScanState		ss;				/* its first field is NodeTag */
+	bool			scan_initialized; /* used to determine if the scan is initialized */
+} PartialSeqScanState;
+
+/*
  * FunnelState extends ScanState by storing additional information
  * related to parallel workers.
  *		pcxt				parallel context for managing generic state information
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index f456004..bd87a84 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -52,6 +52,7 @@ typedef enum NodeTag
 	T_Scan,
 	T_SeqScan,
 	T_SampleScan,
+	T_PartialSeqScan,
 	T_Funnel,
 	T_IndexScan,
 	T_IndexOnlyScan,
@@ -100,6 +101,7 @@ typedef enum NodeTag
 	T_ScanState,
 	T_SeqScanState,
 	T_SampleScanState,
+	T_PartialSeqScanState,
 	T_FunnelState,
 	T_IndexScanState,
 	T_IndexOnlyScanState,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 69302af..2d25a01 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -296,6 +296,12 @@ typedef struct SampleScan
 	struct TableSampleClause *tablesample;
 } SampleScan;
 
+/* ----------------
+ *		partial sequential scan node
+ * ----------------
+ */
+typedef SeqScan PartialSeqScan;
+
 /* ------------
  *		Funnel node
  * ------------
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 994ea83..7592560 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -80,6 +80,9 @@ extern void cost_seqscan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 			 ParamPathInfo *param_info);
 extern void cost_samplescan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 				ParamPathInfo *param_info);
+extern void cost_patialseqscan(Path *path, PlannerInfo *root,
+						RelOptInfo *baserel, ParamPathInfo *param_info,
+						int nworkers);
 extern void cost_funnel(FunnelPath *path, PlannerInfo *root,
 				RelOptInfo *baserel, ParamPathInfo *param_info);
 extern void cost_index(IndexPath *path, PlannerInfo *root,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 9d31b93..a2b1f3d 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -34,6 +34,8 @@ extern Path *create_seqscan_path(PlannerInfo *root, RelOptInfo *rel,
 					Relids required_outer);
 extern Path *create_samplescan_path(PlannerInfo *root, RelOptInfo *rel,
 					   Relids required_outer);
+extern Path *create_partialseqscan_path(PlannerInfo *root, RelOptInfo *rel,
+						Relids required_outer, int nworkers);
 extern FunnelPath *create_funnel_path(PlannerInfo *root,
 						RelOptInfo *rel, Path *subpath, Relids required_outer,
 						int nworkers);
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 87123a5..e7db9ab 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -55,6 +55,14 @@ extern void debug_print_rel(PlannerInfo *root, RelOptInfo *rel);
 #endif
 
 /*
+ * parallelpath.c
+ *	  routines to generate parallel scan paths
+ */
+
+extern void create_parallelscan_paths(PlannerInfo *root, RelOptInfo *rel,
+								Relids required_outer);
+
+/*
  * indxpath.c
  *	  routines to generate index paths
  */
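
For readers skimming the patch, the worker-count choice in create_parallelscan_paths and the cost adjustment in cost_patialseqscan can be sketched as standalone C. This is only an illustration under stated assumptions: the sketch_ function names and the PAGES_PER_WORKER constant are invented here and are not part of the patch, and the planner state (parallel-safety check, temporary-relation check) is left out.

```c
#include <assert.h>

/* Hypothetical name for the patch's "thousand pages per worker" threshold. */
#define PAGES_PER_WORKER 1000

/*
 * Number of workers to plan for: the smaller of the configured degree
 * (parallel_seqscan_degree in the patch) and rel_pages / 1000.  Zero means
 * that no parallel path would be considered.
 */
int
sketch_num_workers(int degree, unsigned int rel_pages)
{
	int			estimated;

	if (degree <= 0)
		return 0;

	estimated = (int) (rel_pages / PAGES_PER_WORKER);
	if (estimated <= 0)
		return 0;

	return (degree <= estimated) ? degree : estimated;
}

/*
 * Total cost of a partial scan: take the run cost of a plain sequential
 * scan, add 0.01 for communication, and divide by nworkers + 1.  The
 * startup cost is carried over unchanged.
 */
double
sketch_partial_total_cost(double startup_cost, double seq_total_cost,
						  int nworkers)
{
	double		run_cost;

	assert(nworkers >= 0);
	assert(seq_total_cost >= startup_cost);

	run_cost = seq_total_cost - startup_cost;
	run_cost += 0.01;
	run_cost /= (nworkers + 1);

	return startup_cost + run_cost;
}
```

With a degree of 16 and a relation of 8,500 pages, the sketch plans 8 workers; with fewer than 1,000 pages it plans none, regardless of the configured degree.
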
#310Haribabu Kommi
kommi.haribabu@gmail.com
In reply to: Amit Kapila (#309)
Re: Parallel Seq Scan

On Thu, Sep 3, 2015 at 8:21 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Jul 23, 2015 at 7:43 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

Hi Amit,

The latest v16 patch cannot be applied to the latest
master as is.
434873806a9b1c0edd53c2a9df7c93a8ba021147 changed various
lines in heapam.c, so it probably conflicts with this.

Attached, find the rebased version of the patch. It fixes the comments raised
by Jeff Davis and Antonin Houska. The main changes in this version are that
it now supports sync scan along with parallel sequential scan (refer
heapam.c), and that the patch has been split into two parts: the first
contains the code for the Funnel node and the infrastructure to support it,
and the second contains the code for the PartialSeqScan node and its
infrastructure.

Thanks for the updated patch.

Parallel scan has a problem with subqueries; please see below.

postgres=# explain select * from test01 where kinkocord not in (select
kinkocord from test02 where tenpocord = '001');
                                            QUERY PLAN
--------------------------------------------------------------------------------------------------
 Funnel on test01  (cost=0.00..155114352184.12 rows=20000008 width=435)
   Filter: (NOT (SubPlan 1))
   Number of Workers: 16
   ->  Partial Seq Scan on test01  (cost=0.00..155114352184.12 rows=20000008 width=435)
         Filter: (NOT (SubPlan 1))
         SubPlan 1
           ->  Materialize  (cost=0.00..130883.67 rows=385333 width=5)
                 ->  Funnel on test02  (cost=0.00..127451.01 rows=385333 width=5)
                       Filter: (tenpocord = '001'::bpchar)
                       Number of Workers: 16
                       ->  Partial Seq Scan on test02  (cost=0.00..127451.01 rows=385333 width=5)
                             Filter: (tenpocord = '001'::bpchar)
   SubPlan 1
     ->  Materialize  (cost=0.00..130883.67 rows=385333 width=5)
           ->  Funnel on test02  (cost=0.00..127451.01 rows=385333 width=5)
                 Filter: (tenpocord = '001'::bpchar)
                 Number of Workers: 16
                 ->  Partial Seq Scan on test02  (cost=0.00..127451.01 rows=385333 width=5)
                       Filter: (tenpocord = '001'::bpchar)
(19 rows)

postgres=# explain analyze select * from test01 where kinkocord not in
(select kinkocord from test02 where tenpocord = '001');
ERROR: badly formatted node string "SUBPLAN :subLinkType 2 :testexpr"...
CONTEXT: parallel worker, pid 32879
postgres=#

Also, the number of workers (16) shown in the explain analyze plan is not
what is actually allotted, because in my configuration I set
max_worker_processes to only 8. I feel the plan should show the allotted
workers, not the planned workers. If the query is slow because of a lack
of workers while the plan shows 16 workers, the user may think that the
query is slow even with 16 workers, when that is not actually the case.

Regards,
Hari Babu
Fujitsu Australia

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#311Robert Haas
robertmhaas@gmail.com
In reply to: Haribabu Kommi (#310)
Re: Parallel Seq Scan

On Wed, Sep 9, 2015 at 2:17 AM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:

Also, the number of workers (16) shown in the explain analyze plan is not
what is actually allotted, because in my configuration I set
max_worker_processes to only 8. I feel the plan should show the allotted
workers, not the planned workers. If the query is slow because of a lack
of workers while the plan shows 16 workers, the user may think that the
query is slow even with 16 workers, when that is not actually the case.

I would expect EXPLAIN should show the # of workers planned, and
EXPLAIN ANALYZE should show both the planned and actual values.
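
A minimal sketch of that distinction, assuming the launched count is simply capped by the free worker slots. The sketch_ function names and the "Workers Launched" label are hypothetical; "Number of Workers" is the label that appears in the plan quoted earlier in this thread.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Workers actually launched: the planned count, capped by the number of
 * free background worker slots (a hypothetical input here).
 */
int
sketch_workers_launched(int planned, int free_slots)
{
	assert(planned >= 0 && free_slots >= 0);

	return (planned < free_slots) ? planned : free_slots;
}

/*
 * Format the worker lines of a plan node.  Plain EXPLAIN prints only the
 * planned count; EXPLAIN ANALYZE prints both planned and launched.
 * Returns the number of characters written, as snprintf does.
 */
int
sketch_format_workers(char *buf, size_t buflen, int planned, int launched,
					  int analyze)
{
	assert(buf != NULL && buflen > 0);
	assert(planned >= 0 && launched >= 0 && launched <= planned);

	if (analyze)
		return snprintf(buf, buflen,
						"Number of Workers: %d\nWorkers Launched: %d\n",
						planned, launched);

	return snprintf(buf, buflen, "Number of Workers: %d\n", planned);
}
```

For the case reported above (16 workers planned, 8 slots available), the sketch would print 16 under EXPLAIN, and both 16 and 8 under EXPLAIN ANALYZE.
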

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#312Amit Kapila
amit.kapila16@gmail.com
In reply to: Haribabu Kommi (#310)
Re: Parallel Seq Scan

On Wed, Sep 9, 2015 at 11:47 AM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:

Parallel scan has a problem with subqueries; please see below.

postgres=# explain analyze select * from test01 where kinkocord not in
(select kinkocord from test02 where tenpocord = '001');
ERROR: badly formatted node string "SUBPLAN :subLinkType 2 :testexpr"...
CONTEXT: parallel worker, pid 32879
postgres=#

The problem here is that readfuncs.c doesn't have support for reading
SubPlan nodes. I have added support for some of the nodes, but it seems
the SubPlan node also needs to be added. Now I think this is okay if the
SubPlan is any node other than Funnel, but if the SubPlan contains a
Funnel, then each worker needs to spawn other workers to execute the
SubPlan, which I am not sure is the best way. Another possibility could be
to store the results of the SubPlan in a tuplestore or some other way and
then pass those to the workers, which again doesn't sound like a promising
way considering we might have a hashed SubPlan for which we need to build
a hash table. Yet another way could be, for such cases, to execute the
Filter in the master node only.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#313Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#311)
Re: Parallel Seq Scan

On Wed, Sep 9, 2015 at 8:09 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Sep 9, 2015 at 2:17 AM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:

Also, the number of workers (16) shown in the EXPLAIN ANALYZE plan is not
what was actually allotted, because in my configuration I set
max_worker_processes to only 8. I feel the plan should show the allotted
workers, not the planned workers. If query execution is slow because of
a lack of workers while the plan shows 16 workers, the user may think
the query is slow even with 16 workers, when in fact fewer were running.

I would expect EXPLAIN should show the # of workers planned, and
EXPLAIN ANALYZE should show both the planned and actual values.

Sounds sensible; I will look into doing it that way.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#314Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#312)
Re: Parallel Seq Scan

On Wed, Sep 9, 2015 at 11:07 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Sep 9, 2015 at 11:47 AM, Haribabu Kommi <kommi.haribabu@gmail.com>
wrote:

With subquery, parallel scan is having some problem, please refer below.

postgres=# explain analyze select * from test01 where kinkocord not in
(select kinkocord from test02 where tenpocord = '001');
ERROR: badly formatted node string "SUBPLAN :subLinkType 2 :testexpr"...
CONTEXT: parallel worker, pid 32879
postgres=#

The problem here is that readfuncs.c doesn't have support for reading
SubPlan nodes. I have added support for some of the nodes, but it seems
the SubPlan node also needs to be added. Now I think this is okay if the
SubPlan is any node other than Funnel, but if the SubPlan contains a
Funnel, then each worker needs to spawn other workers to execute the
SubPlan, which I am not sure is the best way. Another possibility could be
to store the results of the SubPlan in a tuplestore or some other way and
then pass those to the workers, which again doesn't sound like a promising
way considering we might have a hashed SubPlan for which we need to build
a hash table. Yet another way could be, for such cases, to execute the
Filter in the master node only.

IIUC, there are two separate issues here:

1. We need to have readfuncs support for all the right plan nodes.
Maybe we should just bite the bullet and add readfuncs support for all
plan nodes. But if not, we can add support for whatever we need.

2. I think it's probably a good idea - at least for now, and maybe
forever - to avoid nesting parallel plans inside of other parallel
plans. It's hard to imagine that being a win in a case like this, and
it certainly adds a lot more cases to think about.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#315Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#314)
Re: Parallel Seq Scan

On Thu, Sep 10, 2015 at 4:16 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Sep 9, 2015 at 11:07 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Sep 9, 2015 at 11:47 AM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:

With subquery, parallel scan is having some problem, please refer below.

postgres=# explain analyze select * from test01 where kinkocord not in
(select kinkocord from test02 where tenpocord = '001');
ERROR: badly formatted node string "SUBPLAN :subLinkType 2 :testexpr"...
CONTEXT: parallel worker, pid 32879
postgres=#

The problem here is that readfuncs.c doesn't have support for reading
SubPlan nodes. I have added support for some of the nodes, but it seems
the SubPlan node also needs to be added. Now I think this is okay if the
SubPlan is any node other than Funnel, but if the SubPlan contains a
Funnel, then each worker needs to spawn other workers to execute the
SubPlan, which I am not sure is the best way. Another possibility could be
to store the results of the SubPlan in a tuplestore or some other way and
then pass those to the workers, which again doesn't sound like a promising
way considering we might have a hashed SubPlan for which we need to build
a hash table. Yet another way could be, for such cases, to execute the
Filter in the master node only.

IIUC, there are two separate issues here:

Yes.

1. We need to have readfuncs support for all the right plan nodes.
Maybe we should just bite the bullet and add readfuncs support for all
plan nodes. But if not, we can add support for whatever we need.

2. I think it's probably a good idea - at least for now, and maybe
forever - to avoid nesting parallel plans inside of other parallel
plans. It's hard to imagine that being a win in a case like this, and
it certainly adds a lot more cases to think about.

I also think that avoiding nested parallel plans is a good step forward.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#316Haribabu Kommi
kommi.haribabu@gmail.com
In reply to: Amit Kapila (#315)
Re: Parallel Seq Scan

On Thu, Sep 10, 2015 at 2:12 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Sep 10, 2015 at 4:16 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Sep 9, 2015 at 11:07 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Sep 9, 2015 at 11:47 AM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:

With subquery, parallel scan is having some problem, please refer below.

postgres=# explain analyze select * from test01 where kinkocord not in
(select kinkocord from test02 where tenpocord = '001');
ERROR: badly formatted node string "SUBPLAN :subLinkType 2 :testexpr"...
CONTEXT: parallel worker, pid 32879
postgres=#

The problem here is that readfuncs.c doesn't have support for reading
SubPlan nodes. I have added support for some of the nodes, but it seems
the SubPlan node also needs to be added. Now I think this is okay if the
SubPlan is any node other than Funnel, but if the SubPlan contains a
Funnel, then each worker needs to spawn other workers to execute the
SubPlan, which I am not sure is the best way. Another possibility could be
to store the results of the SubPlan in a tuplestore or some other way and
then pass those to the workers, which again doesn't sound like a promising
way considering we might have a hashed SubPlan for which we need to build
a hash table. Yet another way could be, for such cases, to execute the
Filter in the master node only.

IIUC, there are two separate issues here:

Yes.

1. We need to have readfuncs support for all the right plan nodes.
Maybe we should just bite the bullet and add readfuncs support for all
plan nodes. But if not, we can add support for whatever we need.

2. I think it's probably a good idea - at least for now, and maybe
forever - to avoid nesting parallel plans inside of other parallel
plans. It's hard to imagine that being a win in a case like this, and
it certainly adds a lot more cases to think about.

I also think that avoiding nested parallel plans is a good step forward.

I reviewed parallel_seqscan_funnel_v17.patch; my comments follow.
I will continue my review with parallel_seqscan_partialseqscan_v17.patch.

+ if (inst_options)
+ {
+ instoptions = shm_toc_lookup(toc, PARALLEL_KEY_INST_OPTIONS);
+ *inst_options = *instoptions;
+ if (inst_options)

This checks the same pointer variable a second time; it should be
if (*inst_options), as in the estimate and store functions.
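
The difference between the two checks can be shown with a small standalone
sketch. The function name and the "stored" argument are made up for
illustration; "stored" stands in for the value the patch looks up in shared
memory:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy stand-in for the lookup in the patch.  Returns 1 if instrumentation
 * was requested, else 0.
 */
static int
fetch_inst_options(const int *stored, int *inst_options)
{
    int         requested = 0;

    if (inst_options)           /* did the caller ask for the value at all? */
    {
        *inst_options = *stored;
        if (*inst_options)      /* test the value, not the pointer again */
            requested = 1;
    }
    return requested;
}
```

With the inner test written as if (inst_options), the branch would be taken
even when the stored options are zero, because the pointer itself was already
known to be non-NULL.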

+ if (funnelstate->ss.ps.ps_ProjInfo)
+ slot = funnelstate->ss.ps.ps_ProjInfo->pi_slot;
+ else
+ slot = funnelstate->ss.ss_ScanTupleSlot;

Currently there is no projection info for the Funnel node, so it always uses
the scan tuple slot. If that ever changes, ExecFunnel will need an ExecProject
call, which is not present today. We can either document this or add the call.

+ if (!((*dest->receiveSlot) (slot, dest)))
+ break;

and

+void
+TupleQueueFunnelShutdown(TupleQueueFunnel *funnel)
+{
+ if (funnel)
+ {
+ int i;
+ shm_mq_handle *mqh;
+ shm_mq   *mq;
+ for (i = 0; i < funnel->nqueues; i++)
+ {
+ mqh = funnel->queue[i];
+ mq = shm_mq_get_queue(mqh);
+ shm_mq_detach(mq);
+ }
+ }
+}

Using this function, the backend detaches from the message queues, so that
workers trying to put results into the queues get SHM_MQ_DETACHED. The worker
then finishes executing the plan. For this reason all the printtup return
types are changed from void to bool.

But this way a worker doesn't exit until it tries to put a tuple in the queue.
If no tuples satisfy the condition, it may take a long time for the workers
to exit. Am I correct? I am not sure how frequently such scenarios occur.

+ if (parallel_seqscan_degree >= MaxConnections)
+ {
+ write_stderr("%s: parallel_scan_degree must be less than max_connections\n", progname);
+ ExitPostmaster(1);
+ }

This error check works only at server start. A user can still set
parallel_seqscan_degree to more than max_connections at the superuser
session level, etc.

+ if (!parallelstmt->inst_options)
+ (*receiver->rDestroy) (receiver);

Why does the receiver need to be destroyed only when there is no
instrumentation?

Regards,
Hari Babu
Fujitsu Australia

#317Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#315)
Re: Parallel Seq Scan

On Thu, Sep 10, 2015 at 12:12 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

2. I think it's probably a good idea - at least for now, and maybe
forever - to avoid nesting parallel plans inside of other parallel
plans. It's hard to imagine that being a win in a case like this, and
it certainly adds a lot more cases to think about.

I also think that avoiding nested parallel plans is a good step forward.

Doing that as a part of the assess parallel safety patch was trivial, so I did.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#318Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#309)
Re: Parallel Seq Scan

On Thu, Sep 3, 2015 at 6:21 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Attached, find the rebased version of patch. It fixes the comments raised
by Jeff Davis and Antonin Houska. The main changes in this version are
now it supports sync scan along with parallel sequential scan (refer
heapam.c)
and the patch has been split into two parts, first contains the code for
Funnel node and infrastructure to support the same and second contains
the code for PartialSeqScan node and its infrastructure.

+ if (es->analyze && nodeTag(plan) == T_Funnel)

Why not IsA()?
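
For reference, the idiom can be sketched standalone. This is a toy: ToyNode,
the two tags, and is_funnel() are made up for illustration, and the two
macros only mirror the shape of PostgreSQL's nodeTag() and IsA() so that the
block compiles on its own:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy node tags; the real list lives in PostgreSQL's nodes.h. */
typedef enum ToyNodeTag
{
    T_Invalid = 0,
    T_Funnel,
    T_SeqScan
} ToyNodeTag;

/* Every node struct starts with its tag, as in PostgreSQL. */
typedef struct ToyNode
{
    ToyNodeTag  type;
} ToyNode;

/* Same shape as the real macros, redefined here for a standalone build. */
#define nodeTag(nodeptr)     (((const ToyNode *) (nodeptr))->type)
#define IsA(nodeptr, _type_) (nodeTag(nodeptr) == T_##_type_)

/* Hypothetical caller: reads better than nodeTag(plan) == T_Funnel. */
static bool
is_funnel(const ToyNode *plan)
{
    return IsA(plan, Funnel);
}
```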

+ FinishParallelSetupAndAccumStats((FunnelState *)planstate);

Shouldn't there be a space before planstate?

+    /* inform executor to collect buffer usage stats from parallel workers. */
+    estate->total_time = queryDesc->totaltime ? 1 : 0;

Boy, the comment sure doesn't seem to match the code.

+         * Accumulate the stats by parallel workers before stopping the
+         * node.

Suggest: "Accumulate stats from parallel workers before stopping node".

+             * If we are not able to send the tuple, then we assume that
+             * destination has closed and we won't be able to send any more
+             * tuples so we just end the loop.

Suggest: "If we are not able to send the tuple, we assume the
destination has closed and no more tuples can be sent. If that's the
case, end the loop."

+static void
+EstimateParallelSupportInfoSpace(ParallelContext *pcxt, ParamListInfo params,
+                                 List *serialized_param_exec_vals,
+                                 int instOptions, Size *params_size,
+                                 Size *params_exec_size);
+static void
+StoreParallelSupportInfo(ParallelContext *pcxt, ParamListInfo params,
+                         List *serialized_param_exec_vals,
+                         int instOptions, Size params_size,
+                         Size params_exec_size,
+                         char **inst_options_space,
+                         char **buffer_usage_space);

Whitespace doesn't look like PostgreSQL style. Maybe run pgindent on
the newly-added files?

+/*
+ * This is required for parallel plan execution to fetch the information
+ * from dsm.
+ */

This comment doesn't really say anything. Can we get a better one?

+    /*
+     * We expect each worker to populate the BufferUsage structure
+     * allocated by master backend and then master backend will aggregate
+     * all the usage along with it's own, so account it for each worker.
+     */

This also needs improvement. Especially because...

+    /*
+     * We expect each worker to populate the instrumentation structure
+     * allocated by master backend and then master backend will aggregate
+     * all the information, so account it for each worker.
+     */

...it's almost identical to this one.

+     * Store bind parameter's list in dynamic shared memory.  This is
+     * used for parameters in prepared query.

s/bind parameter's list/bind parameters/. I think you could drop the
second sentence, too.

+    /*
+     * Store PARAM_EXEC parameters list in dynamic shared memory.  This is
+     * used for evaluation plan->initPlan params.
+     */

So is the previous block for PARAM_EXTERN and this is PARAM_EXEC? If
so, maybe that could be more clearly laid out.

+GetParallelSupportInfo(shm_toc *toc, ParamListInfo *params,

Could this be a static function? Will it really be needed outside this file?

And is there any use case for letting some of the arguments be NULL?
Seems kind of an awkward API.

+bool
+ExecParallelBufferUsageAccum(Node *node)
+{
+    if (node == NULL)
+        return false;
+
+    switch (nodeTag(node))
+    {
+        case T_FunnelState:
+            {
+                FinishParallelSetupAndAccumStats((FunnelState*)node);
+                return true;
+            }
+            break;
+        default:
+            break;
+    }
+
+    (void) planstate_tree_walker((Node*)((PlanState *)node)->lefttree, NULL,
+                                 ExecParallelBufferUsageAccum, 0);
+    (void) planstate_tree_walker((Node*)((PlanState *)node)->righttree, NULL,
+                                 ExecParallelBufferUsageAccum, 0);
+    return false;
+}

This seems wacky. I mean, isn't the point of planstate_tree_walker()
that the callback itself doesn't have to handle recursion like this?
And if not, then this wouldn't be adequate anyway, because some
planstate nodes have children that are not in lefttree or righttree
(cf. explain.c).
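
The usual walker pattern can be sketched with a toy tree. Everything here
(ToyNode, toy_tree_walker, count_funnels) is hypothetical and only imitates
the shape of planstate_tree_walker(); the point is that the callback handles
the node it is given and hands all recursion into children back to the
walker:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_CHILDREN 4

typedef struct ToyNode
{
    bool        is_funnel;
    int         nchildren;
    struct ToyNode *children[TOY_MAX_CHILDREN];
} ToyNode;

typedef bool (*toy_walker_fn) (ToyNode *node, void *context);

/*
 * Generic walker: knows where every child lives, calls the callback on each
 * one, and stops early if the callback returns true.
 */
static bool
toy_tree_walker(ToyNode *node, toy_walker_fn walker, void *context)
{
    for (int i = 0; i < node->nchildren; i++)
    {
        if (walker(node->children[i], context))
            return true;
    }
    return false;
}

/* Callback: act on interesting nodes, let the walker do the recursion. */
static bool
count_funnels(ToyNode *node, void *context)
{
    if (node == NULL)
        return false;
    if (node->is_funnel)
    {
        (*(int *) context)++;
        return false;           /* counted; do not descend below a funnel */
    }
    return toy_tree_walker(node, count_funnels, context);
}
```

Because only the walker knows the child layout, a node type with extra
children needs a change in one place rather than in every callback.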

+    currentRelation = ExecOpenScanRelation(estate,
+                                           ((SeqScan *) node->ss.ps.plan)->scanrelid,
+                                           eflags);

I can't see how this can possibly be remotely correct. The funnel
node shouldn't be limited to scanning a baserel (cf. fdw_scan_tlist).

+void ExecAccumulateInstInfo(FunnelState *node)

Another place where pgindent would help. There are a bunch of others
I noticed too, but I'm just mentioning a few here to make the point.

+ buffer_usage_worker = (BufferUsage *)(buffer_usage + (i * sizeof(BufferUsage)));

Cast it to a BufferUsage * first. Then you can use &foo[i] to find
the i'th element.
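
A minimal sketch of that suggestion, using a made-up two-field ToyBufferUsage
in place of the real struct and a hypothetical total_blks_hit() helper:

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for BufferUsage; the real struct has more counters. */
typedef struct ToyBufferUsage
{
    long        shared_blks_hit;
    long        shared_blks_read;
} ToyBufferUsage;

/* Sum one counter over nworkers entries stored back to back in "space". */
static long
total_blks_hit(char *space, int nworkers)
{
    ToyBufferUsage *usage = (ToyBufferUsage *) space;   /* cast once */
    long        total = 0;

    for (int i = 0; i < nworkers; i++)
    {
        ToyBufferUsage *worker = &usage[i]; /* i'th element, no byte math */

        total += worker->shared_blks_hit;
    }
    return total;
}
```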

+    /*
+     * Re-initialize the parallel context and workers to perform
+     * rescan of relation.  We want to gracefully shutdown all the
+     * workers so that they should be able to propagate any error
+     * or other information to master backend before dying.
+     */
+    FinishParallelSetupAndAccumStats(node);

Somehow, this makes me feel like that function is badly named.

+/*
+ * _readPlanInvalItem
+ */
+static PlanInvalItem *
+_readPlanInvalItem(void)
+{
+    READ_LOCALS(PlanInvalItem);
+
+    READ_INT_FIELD(cacheId);
+    READ_UINT_FIELD(hashValue);
+
+    READ_DONE();
+}

I don't see why we should need to be able to copy PlanInvalItems. In
fact, it seems like a bad idea.

+#parallel_setup_cost = 0.0  # same scale as above
+#define DEFAULT_PARALLEL_SETUP_COST  0.0

This value is probably a bit on the low side.

+int parallel_seqscan_degree = 0;

I think we should have a GUC for the maximum degree of parallelism in
a query generally, not the maximum degree of parallel sequential scan.

+    if (parallel_seqscan_degree >= MaxConnections)
+    {
+        write_stderr("%s: parallel_scan_degree must be less than max_connections\n", progname);
+        ExitPostmaster(1);
+    }

I think this check is thoroughly unnecessary. It's comparing to the
wrong thing anyway, because what actually matters is
max_worker_processes, not max_connections. But in any case there is
no need for the check. If somebody stupidly tries an unreasonable
value for the maximum degree of parallelism, they won't get that many
workers, but nothing will break. It's no worse than setting any other
query planner costing parameter to an insane value.

--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -126,6 +126,7 @@ extern void heap_rescan_set_params(HeapScanDesc scan, ScanKey key,
 extern void heap_endscan(HeapScanDesc scan);
 extern HeapTuple heap_getnext(HeapScanDesc scan, ScanDirection direction);

+
 extern bool heap_fetch(Relation relation, Snapshot snapshot,

Stray whitespace change.

More later, that's what I noticed on a first read through.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#319Peter Geoghegan
pg@heroku.com
In reply to: Robert Haas (#318)
Re: Parallel Seq Scan

On Wed, Jul 22, 2015 at 10:44 AM, Robert Haas <robertmhaas@gmail.com> wrote:

One thing I noticed that is a bit dismaying is that we don't get a lot
of benefit from having more workers. Look at the 0.1 data. At 2
workers, if we scaled perfectly, we would be 3x faster (since the
master can do work too), but we are actually 2.4x faster. Each
process is on the average 80% efficient. That's respectable. At 4
workers, we would be 5x faster with perfect scaling; here we are 3.5x
faster. So the third and fourth worker were about 50% efficient.
Hmm, not as good. But then going up to 8 workers bought us basically
nothing.

...sorry for bumping up this mail from July...

I don't think you meant to imply it, but why should we be able to
scale perfectly? Even when the table fits entirely in shared_buffers,
I would expect memory bandwidth to become the bottleneck before a
large number of workers are added. Context switching might also be
problematic.

I have almost no sense of whether this is below or above par, which is
what I'm really curious about. FWIW, I think that parallel sort will
scale somewhat better.
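
The efficiency figures in the quoted paragraph follow from simple arithmetic,
sketched here. The function names are made up for illustration; the only
assumption is the one stated in the quote, that the master scans too, so N
workers would ideally give an (N + 1)x speedup:

```c
#include <assert.h>

/* Average per-process efficiency relative to perfect (N + 1)x scaling. */
static double
average_efficiency(double speedup, int nworkers)
{
    return speedup / (nworkers + 1);
}

/* Marginal efficiency of the workers added between two measurements. */
static double
marginal_efficiency(double speedup_lo, int nworkers_lo,
                    double speedup_hi, int nworkers_hi)
{
    return (speedup_hi - speedup_lo) / (nworkers_hi - nworkers_lo);
}
```

With the quoted numbers, 2.4x at 2 workers gives 2.4 / 3 = 0.8, and going
from 2.4x at 2 workers to 3.5x at 4 workers gives (3.5 - 2.4) / 2 = 0.55,
which is the "about 50%" figure.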

--
Peter Geoghegan

#320Haribabu Kommi
kommi.haribabu@gmail.com
In reply to: Robert Haas (#317)
Re: Parallel Seq Scan

On Thu, Sep 17, 2015 at 6:10 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Sep 10, 2015 at 12:12 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

2. I think it's probably a good idea - at least for now, and maybe
forever - to avoid nesting parallel plans inside of other parallel
plans. It's hard to imagine that being a win in a case like this, and
it certainly adds a lot more cases to think about.

I also think that avoiding nested parallel plans is a good step forward.

Doing that as a part of the assess parallel safety patch was trivial, so I did.

I tried with the latest HEAD code; the problem seems to be present in other
scenarios.

postgres=# explain select * from tbl a where exists (select 1 from tbl
b where a.f1=b.f1 limit 0);
QUERY PLAN
--------------------------------------------------------------------------------------
Funnel on tbl a (cost=0.00..397728310227.27 rows=5000000 width=214)
Filter: (SubPlan 1)
Number of Workers: 10
-> Partial Seq Scan on tbl a (cost=0.00..397727310227.27 rows=5000000 width=214)
Filter: (SubPlan 1)
SubPlan 1
-> Limit (cost=0.00..437500.00 rows=1 width=0)
-> Seq Scan on tbl b (cost=0.00..437500.00 rows=1 width=0)
Filter: (a.f1 = f1)
SubPlan 1
-> Limit (cost=0.00..437500.00 rows=1 width=0)
-> Seq Scan on tbl b (cost=0.00..437500.00 rows=1 width=0)
Filter: (a.f1 = f1)
(13 rows)

postgres=# explain analyze select * from tbl a where exists (select 1
from tbl b where a.f1=b.f1 limit 0);
ERROR: badly formatted node string "SUBPLAN :subLinkType 0 :testexpr"...
LOG: worker process: parallel worker for PID 8775 (PID 9121) exited with exit code 1
ERROR: badly formatted node string "SUBPLAN :subLinkType 0 :testexpr"...
ERROR: badly formatted node string "SUBPLAN :subLinkType 0 :testexpr"...
LOG: worker process: parallel worker for PID 8775 (PID 9116) exited with exit code 1
LOG: worker process: parallel worker for PID 8775 (PID 9119) exited with exit code 1
ERROR: badly formatted node string "SUBPLAN :subLinkType 0 :testexpr"...
ERROR: badly formatted node string "SUBPLAN :subLinkType 0 :testexpr"...
LOG: worker process: parallel worker for PID 8775 (PID 9117) exited with exit code 1
LOG: worker process: parallel worker for PID 8775 (PID 9114) exited with exit code 1
ERROR: badly formatted node string "SUBPLAN :subLinkType 0 :testexpr"...
ERROR: badly formatted node string "SUBPLAN :subLinkType 0 :testexpr"...
LOG: worker process: parallel worker for PID 8775 (PID 9118) exited with exit code 1
ERROR: badly formatted node string "SUBPLAN :subLinkType 0 :testexpr"...
ERROR: badly formatted node string "SUBPLAN :subLinkType 0 :testexpr"...
CONTEXT: parallel worker, pid 9115
STATEMENT: explain analyze select * from tbl a where exists (select 1
from tbl b where a.f1=b.f1 limit 0);
LOG: worker process: parallel worker for PID 8775 (PID 9115) exited with exit code 1
LOG: worker process: parallel worker for PID 8775 (PID 9120) exited with exit code 1
ERROR: badly formatted node string "SUBPLAN :subLinkType 0 :testexpr"...
CONTEXT: parallel worker, pid 9115

Regards,
Hari Babu
Fujitsu Australia

#321Robert Haas
robertmhaas@gmail.com
In reply to: Haribabu Kommi (#316)
Re: Parallel Seq Scan

On Mon, Sep 14, 2015 at 11:04 PM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:

Using this function, the backend detaches from the message queues, so that
workers trying to put results into the queues get SHM_MQ_DETACHED. The worker
then finishes executing the plan. For this reason all the printtup return
types are changed from void to bool.

But this way a worker doesn't exit until it tries to put a tuple in the queue.
If no tuples satisfy the condition, it may take a long time for the workers
to exit. Am I correct? I am not sure how frequently such scenarios occur.

Yes, that's a problem. It's probably not that bad as long as the only
thing that can occur under a Funnel node is a sequential scan,
although even then the filter condition on the sequential scan could
be something expensive or highly selective. But it will get a lot
worse when we get the ability to push joins below the funnel.

I welcome ideas for solving this problem. Basically, the problem is
that we may need to shut down the executor before execution is
complete. This can happen because we're beneath a limit node; it can
also happen because we're on the inner side of a semijoin and have
already found one match. Presumably, parallel plans in such case will
be rare. But there may be cases where they happen, and so we need
some way to handle it.

One idea is that the workers could exit by throwing an ERROR, maybe
after setting some flag first to say, hey, this isn't a *real* error,
we're just doing this to achieve a non-local transfer of control. But
then we need to make sure that any instrumentation statistics still
get handled properly, which is maybe not so easy. And it seems like
there might be other problems with things not getting shut down
properly as well. Any code that expects a non-local exit to lead to a
(sub)transaction abort potentially gets broken by this approach.

Another idea is to try to gradually enrich the set of places that
check for shutdown. So for example we could add a check at the
beginning of ExecProcNode() to return NULL if the flag's been set;
that would probably dampen the amount of additional work
that could get done in many common scenarios. But that might break a
bunch of things too, and it's far from a complete solution anyway: for
example, we could be stuck down inside some user-defined function, and
I don't see that there's much choice in that case but to run the function
to conclusion.
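
A toy sketch of the flag-check idea, under loud assumptions: ToyNode,
toy_exec_proc_node(), and toy_shutdown_requested are invented for this
example, the "executor" is a filter node over a counting scan, and a real
flag would be set from another process rather than by the caller. It only
shows that a check at the top of the per-node entry point is reached once
per row, even when the filter rejects every row:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct ToyNode ToyNode;
struct ToyNode
{
    ToyNode    *child;          /* NULL for the leaf scan */
    int         next_value;     /* leaf: next row to emit */
    int         limit;          /* leaf: emits rows 0 .. limit - 1 */
    int         threshold;      /* filter: passes rows > threshold */
    int         rows_scanned;   /* leaf: rows produced so far */
};

/* Hypothetical flag; in a real system the master would set it. */
static volatile bool toy_shutdown_requested = false;

/* Returns true and fills *row if a row is available, else false. */
static bool
toy_exec_proc_node(ToyNode *node, int *row)
{
    /* Checked on every call, so a filter that rejects rows still notices. */
    if (toy_shutdown_requested)
        return false;

    if (node->child == NULL)
    {
        /* Leaf scan: emit the next row, if any. */
        if (node->next_value >= node->limit)
            return false;
        *row = node->next_value++;
        node->rows_scanned++;
        return true;
    }

    /* Filter node: pull from the child until a row qualifies. */
    while (toy_exec_proc_node(node->child, row))
    {
        if (*row > node->threshold)
            return true;
    }
    return false;
}
```

As the message notes, this does nothing for time spent inside a single long
call such as a user-defined function; it only bounds the work between
per-node calls.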

This problem essentially happens because we're hoping that the workers
in parallel mode will "run ahead" of the master, producing tuples for
it to read before it gets to the point of sitting and waiting for
them. Indeed, if the master does end up sitting and waiting, we've
missed the boat entirely. But running ahead opens up the problem that
the master could always decide it doesn't need any tuples after all.

Anyone have a smart idea for how to attack this?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#322Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#309)
Re: Parallel Seq Scan

On Thu, Sep 3, 2015 at 6:21 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

[ new patches ]

+ pscan = shm_toc_lookup(node->ss.ps.toc, PARALLEL_KEY_SCAN);

This is total nonsense. You can't hard-code the key that's used for
the scan, because we need to be able to support more than one parallel
operator beneath the same funnel. For example:

Append
-> Partial Seq Scan
-> Partial Seq Scan

Each partial sequential scan needs to have a *separate* key, which
will need to be stored in either the Plan or the PlanState or both
(not sure exactly). Each partial seq scan needs to get assigned a
unique key there in the master, probably starting from 0 or 100 or
something and counting up, and then this code needs to extract that
value and use it to look up the correct data for that scan.
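
One way to picture the per-scan key scheme is the toy below. It is only a
sketch: ToyToc is a made-up array-backed table standing in for the shared
memory table of contents, the starting key of 100 is one of the arbitrary
values mentioned above, and toy_register_scan() is a hypothetical name:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_FIRST_SCAN_KEY 100      /* hypothetical starting value */
#define TOY_TOC_CAPACITY   8

typedef struct ToyTocEntry
{
    uint64_t    key;
    void       *address;
} ToyTocEntry;

typedef struct ToyToc
{
    int         nentries;
    ToyTocEntry entries[TOY_TOC_CAPACITY];
} ToyToc;

/* Each scan node remembers the key under which its shared state lives. */
typedef struct ToyPartialSeqScan
{
    uint64_t    toc_key;
} ToyPartialSeqScan;

static int
toy_toc_insert(ToyToc *toc, uint64_t key, void *address)
{
    if (toc->nentries >= TOY_TOC_CAPACITY)
        return -1;
    toc->entries[toc->nentries].key = key;
    toc->entries[toc->nentries].address = address;
    toc->nentries++;
    return 0;
}

static void *
toy_toc_lookup(const ToyToc *toc, uint64_t key)
{
    for (int i = 0; i < toc->nentries; i++)
    {
        if (toc->entries[i].key == key)
            return toc->entries[i].address;
    }
    return NULL;
}

/* Master side: hand out the next unused key and publish the scan state. */
static int
toy_register_scan(ToyToc *toc, uint64_t *next_key,
                  ToyPartialSeqScan *scan, void *shared_state)
{
    scan->toc_key = (*next_key)++;
    return toy_toc_insert(toc, scan->toc_key, shared_state);
}
```

A worker executing a given scan node would then look up scan->toc_key
instead of a fixed constant, so two scans under one Append find different
shared state.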

+               case T_ResultState:
+                       {
+                               PlanState *planstate = ((ResultState*)node)->ps.lefttree;
+
+                               return planstate_tree_walker((Node*)planstate, pcxt,
+                                                            ExecParallelInitializeDSM, pscan_size);
+                       }

This looks like another instance of using the walker incorrectly.
Nodes where you just want to let the walk continue shouldn't need to
be enumerated; dispatching like this should be the default case.

+               case T_Result:
+                       fix_opfuncids((Node*) (((Result *)node)->resconstantqual));
+                       break;

Seems similarly wrong.

+ * cost_patialseqscan

Typo. The actual function name has the same typo.

+       num_parallel_workers = parallel_seqscan_degree;
+       if (parallel_seqscan_degree <= estimated_parallel_workers)
+               num_parallel_workers = parallel_seqscan_degree;
+       else
+               num_parallel_workers = estimated_parallel_workers;

Use Min?
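
That is, the five quoted lines collapse to a single expression. In the
standalone sketch below, Min() is redefined with the same shape as the macro
in PostgreSQL's c.h so the block compiles on its own, and
choose_num_workers() is a hypothetical wrapper:

```c
#include <assert.h>

/* Same shape as PostgreSQL's Min(); redefined here for a standalone build. */
#define Min(x, y) ((x) < (y) ? (x) : (y))

/* Hypothetical wrapper around the computation in the quoted hunk. */
static int
choose_num_workers(int parallel_seqscan_degree, int estimated_parallel_workers)
{
    return Min(parallel_seqscan_degree, estimated_parallel_workers);
}
```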

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#323Amit Kapila
amit.kapila16@gmail.com
In reply to: Haribabu Kommi (#320)
Re: Parallel Seq Scan

On Thu, Sep 17, 2015 at 6:29 AM, Haribabu Kommi <kommi.haribabu@gmail.com>
wrote:

On Thu, Sep 17, 2015 at 6:10 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Sep 10, 2015 at 12:12 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

2. I think it's probably a good idea - at least for now, and maybe
forever - to avoid nesting parallel plans inside of other parallel
plans. It's hard to imagine that being a win in a case like this, and
it certainly adds a lot more cases to think about.

I also think that avoiding nested parallel plans is a good step forward.

Doing that as a part of the assess parallel safety patch was trivial, so I did.

I tried with the latest HEAD code; the problem seems to be present in other
scenarios.

As mentioned previously [1], we have to do two different things to make
this work. Robert seems to have taken care of one of those (basically the
second point in mail [1]); the other one still needs to be taken care of,
which is to provide support for reading subplans in readfuncs.c, and that
will solve the problem you are seeing now.

[1]: /messages/by-id/CA+TgmobeqxZtP4crqtx36Mx7xtty-FsMFpuuRsVJOi8B6QRTGA@mail.gmail.com

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#324Haribabu Kommi
kommi.haribabu@gmail.com
In reply to: Amit Kapila (#323)
Re: Parallel Seq Scan

On Thu, Sep 17, 2015 at 12:03 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

As mentioned previously [1], we have to do two different things to make
this work. Robert seems to have taken care of one of those (basically the
second point in mail [1]); the other one still needs to be taken care of,
which is to provide support for reading subplans in readfuncs.c, and that
will solve the problem you are seeing now.

Thanks for the information.
During my test, I saw the plan change from parallel seq scan to seq scan
for the first reported query, so I thought all scenarios had been corrected
so as not to generate the parallel seq scan.

Regards,
Hari Babu
Fujitsu Australia

#325Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#317)
Re: Parallel Seq Scan

On Thu, Sep 17, 2015 at 1:40 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Sep 10, 2015 at 12:12 AM, Amit Kapila <amit.kapila16@gmail.com>

wrote:

2. I think it's probably a good idea - at least for now, and maybe
forever - to avoid nesting parallel plans inside of other parallel
plans. It's hard to imagine that being a win in a case like this, and
it certainly adds a lot more cases to think about.

I also think that avoiding nested parallel plans is a good step forward.

Doing that as a part of the assess parallel safety patch was trivial, so
I did.

As per my understanding, what you have done there will not prohibit such
cases.

+    * For now, we don't try to use parallel mode if we're running inside
+    * a parallel worker.  We might eventually be able to relax this
+    * restriction, but for now it seems best not to have parallel workers
+    * trying to create their own parallel workers.
+    */
+   glob->parallelModeOK = (cursorOptions & CURSOR_OPT_PARALLEL_OK) != 0 &&
+       IsUnderPostmaster && dynamic_shared_memory_type != DSM_IMPL_NONE &&
+       parse->commandType == CMD_SELECT && !parse->hasModifyingCTE &&
+       parse->utilityStmt == NULL && !IsParallelWorker() &&
+       !contain_parallel_unsafe((Node *) parse);

IIUC, you are referring to the !IsParallelWorker() check in the above code.
If so, then I think it won't work: we generate the plan in the master
backend, so a parallel worker will never exercise this code. I have tested
it with the example below, and it still generates the SubPlan as a Funnel.

CREATE TABLE t1(c1, c2) AS SELECT g, repeat('x', 5) FROM
generate_series(1, 10000000) g;

CREATE TABLE t2(c1, c2) AS SELECT g, repeat('x', 5) FROM
generate_series(1, 1000000) g;

set parallel_seqscan_degree=2;
set cpu_tuple_comm_cost=0.01;

explain select * from t1 where c1 not in (select c1 from t2 where c2 = 'xxxx');
                                   QUERY PLAN
------------------------------------------------------------------------------------
 Funnel on t1  (cost=11536.88..126809.17 rows=3432492 width=36)
   Filter: (NOT (hashed SubPlan 1))
   Number of Workers: 2
   ->  Partial Seq Scan on t1  (cost=11536.88..58159.32 rows=3432492 width=36)
         Filter: (NOT (hashed SubPlan 1))
         SubPlan 1
           ->  Funnel on t2  (cost=0.00..11528.30 rows=3433 width=4)
                 Filter: (c2 = 'xxxx'::text)
                 Number of Workers: 2
                 ->  Partial Seq Scan on t2  (cost=0.00..4662.68 rows=3433 width=4)
                       Filter: (c2 = 'xxxx'::text)
   SubPlan 1
     ->  Funnel on t2  (cost=0.00..11528.30 rows=3433 width=4)
           Filter: (c2 = 'xxxx'::text)
           Number of Workers: 2
           ->  Partial Seq Scan on t2  (cost=0.00..4662.68 rows=3433 width=4)
                 Filter: (c2 = 'xxxx'::text)
(17 rows)

Here the subplan is generated before the top-level plan, and while
generating the subplan we can't predict whether it is okay to generate it
as a Funnel, because the top-level plan might turn out to be a non-Funnel
plan. Also, if such a subplan is actually an InitPlan, then we are safe (as
we execute InitPlans in the master backend and then pass the result to the
parallel worker) even if the top-level plan is a Funnel. I think the place
where we can catch this is during generation of the Funnel path: basically,
we can check whether any node beneath the Funnel node has a 'filter' or
'targetlist' containing another Funnel node. If so, we have two options to
proceed:
a. Mark such a filter or target list as non-pushable, which will indicate
that it needs to be executed only in the master backend. If we go with this
option, then we have to make the Funnel node capable of evaluating Filter
and Targetlist, which is not a big thing.
b. Don't choose the current path as a Funnel path.

I prefer the second one, as it seems simpler than the first and there
doesn't seem to be much benefit in going with the first.
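
Option (b) can be pictured with a toy check like the one below. This is only
a sketch under stated assumptions: ToyPath, ToySubPlan and
toy_funnel_path_allowed are hypothetical names made up for illustration, and
nothing here is taken from the patch or from the planner's real data
structures.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_SUBPLANS 4

/* Hypothetical summary of a subplan referenced by a filter or target list. */
typedef struct ToySubPlan
{
    int is_initplan;      /* evaluated once in the master backend */
    int contains_funnel;  /* the subplan tree has a Funnel node in it */
} ToySubPlan;

/* Hypothetical path node: a chain of nodes beneath a would-be Funnel. */
typedef struct ToyPath
{
    int nsubplans;                          /* subplans used by this node */
    ToySubPlan subplans[TOY_MAX_SUBPLANS];
    struct ToyPath *child;                  /* next node beneath, or NULL */
} ToyPath;

/* Returns 1 if a Funnel may be placed on top of 'path', 0 otherwise. */
static int
toy_funnel_path_allowed(const ToyPath *path)
{
    const ToyPath *node;
    int i;

    for (node = path; node != NULL; node = node->child)
    {
        assert(node->nsubplans <= TOY_MAX_SUBPLANS);
        for (i = 0; i < node->nsubplans; i++)
        {
            const ToySubPlan *sp = &node->subplans[i];

            /* InitPlans run in the master, so a Funnel inside one is fine. */
            if (sp->contains_funnel && !sp->is_initplan)
                return 0;
        }
    }
    return 1;
}
```

With this shape, the t1/t2 example above would be rejected as a Funnel path,
because the hashed SubPlan in the filter is not an InitPlan and contains a
Funnel of its own.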

Any better ideas?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#326Amit Kapila
amit.kapila16@gmail.com
In reply to: Haribabu Kommi (#316)
Re: Parallel Seq Scan

On Tue, Sep 15, 2015 at 8:34 AM, Haribabu Kommi <kommi.haribabu@gmail.com>
wrote:

I reviewed the parallel_seqscan_funnel_v17.patch and following are my
comments.

I will continue my review with the
parallel_seqscan_partialseqscan_v17.patch.

Thanks for the review.

+ if (inst_options)
+ {
+ instoptions = shm_toc_lookup(toc, PARALLEL_KEY_INST_OPTIONS);
+ *inst_options = *instoptions;
+ if (inst_options)

Same pointer variable check, it should be if (*inst_options) as per the
estimate and store functions.
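
The difference between the two checks can be seen in a small stand-alone
sketch. The function names and the plain int "shared" value are hypothetical
stand-ins chosen for illustration; they are not the patch's code.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Buggy shape: the inner test looks at the pointer again, so it is always
 * true once the outer test has passed.
 */
static int
toy_fetch_options_buggy(const int *shared, int *inst_options)
{
    assert(shared != NULL);
    if (inst_options)
    {
        *inst_options = *shared;
        if (inst_options)       /* tests the pointer, not the value */
            return 1;
    }
    return 0;
}

/* Fixed shape: the inner test looks at the value that was just copied. */
static int
toy_fetch_options_fixed(const int *shared, int *inst_options)
{
    assert(shared != NULL);
    if (inst_options)
    {
        *inst_options = *shared;
        if (*inst_options)      /* tests the copied value */
            return 1;
    }
    return 0;
}
```

When the shared value is zero, the buggy variant still reports that
instrumentation is wanted, while the fixed variant does not.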

Makes sense, will change in the next version of the patch.

+ if (funnelstate->ss.ps.ps_ProjInfo)
+ slot = funnelstate->ss.ps.ps_ProjInfo->pi_slot;
+ else
+ slot = funnelstate->ss.ss_ScanTupleSlot;

Currently, there will not be a projinfo for funnel node.

No, that's not true, it has projinfo for the cases where it is required.

So always it uses the scan tuple slot. In case if it is different, we need
to add the ExecProject call in ExecFunnel function.

It will not use the scan tuple slot when the column list contains an
expression, or in other cases where projection is required. Currently we
don't need a separate ExecProject, as there is no case where workers won't
do the projection for us. However, in future we might need it for something
like restrictive functions. I think it is better to add a comment explaining
this, which I will do in the next version. Does that make sense?

+ if (!((*dest->receiveSlot) (slot, dest)))
+ break;

and

+void
+TupleQueueFunnelShutdown(TupleQueueFunnel *funnel)
+{
+ if (funnel)
+ {
+ int i;
+ shm_mq_handle *mqh;
+ shm_mq   *mq;
+ for (i = 0; i < funnel->nqueues; i++)
+ {
+ mqh = funnel->queue[i];
+ mq = shm_mq_get_queue(mqh);
+ shm_mq_detach(mq);
+ }
+ }
+}

Using this function, the backend detaches from the message queue, so that
workers which are trying to put results into the queues get
SHM_MQ_DETACHED as the result. The worker then finishes execution of the
plan. For this reason all the printtup return types are changed from void
to bool.

But this way the worker doesn't exit until it tries to put a tuple in the
queue. If there are no valid tuples that satisfy the condition, then it may
take time for the workers to exit. Am I correct? I am not sure how
frequently such scenarios can occur.

Yes, you are right. The main reason to keep it like this is that we
want to finish execution without going into error path, so that
the collected stats can be communicated back. I am not sure trying
to do better in this case is worth it at this point.

+ if (parallel_seqscan_degree >= MaxConnections)
+ {
+ write_stderr("%s: parallel_scan_degree must be less than max_connections\n", progname);
+ ExitPostmaster(1);
+ }

This check works only during server start. A user can still set
parallel_seqscan_degree higher than max_connections afterwards, for example
at the session level as a superuser.

I think we can remove this check as pointed out by Robert as well.

+ if (!parallelstmt->inst_options)
+ (*receiver->rDestroy) (receiver);

Why does the receiver need to be destroyed only when there is no
instrumentation?

No, the receiver should be destroyed unconditionally. This is a remnant of
the previous versions, when the receiver was created for the
no-instrumentation case.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#327Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#325)
Re: Parallel Seq Scan

On Thu, Sep 17, 2015 at 2:54 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

As per my understanding, what you have done there will not prohibit such
cases.

+    * For now, we don't try to use parallel mode if we're running inside
+    * a parallel worker.  We might eventually be able to relax this
+    * restriction, but for now it seems best not to have parallel workers
+    * trying to create their own parallel workers.
+    */
+   glob->parallelModeOK = (cursorOptions & CURSOR_OPT_PARALLEL_OK) != 0 &&
+       IsUnderPostmaster && dynamic_shared_memory_type != DSM_IMPL_NONE &&
+       parse->commandType == CMD_SELECT && !parse->hasModifyingCTE &&
+       parse->utilityStmt == NULL && !IsParallelWorker() &&
+       !contain_parallel_unsafe((Node *) parse);

IIUC, you are referring to the !IsParallelWorker() check in the above code.
If so, then I think it won't work: we generate the plan in the master
backend, so a parallel worker will never exercise this code. I have tested
it with the example below, and it still generates the SubPlan as a Funnel.

You're right. That's still a good check, because some function called
in the worker might try to execute a query all of its own, but it
doesn't prevent the case you are talking about.

Here the subplan is generated before the top-level plan, and while
generating the subplan we can't predict whether it is okay to generate it
as a Funnel, because the top-level plan might turn out to be a non-Funnel
plan. Also, if such a subplan is actually an InitPlan, then we are safe (as
we execute InitPlans in the master backend and then pass the result to the
parallel worker) even if the top-level plan is a Funnel. I think the place
where we can catch this is during generation of the Funnel path: basically,
we can check whether any node beneath the Funnel node has a 'filter' or
'targetlist' containing another Funnel node. If so, we have two options to
proceed:
a. Mark such a filter or target list as non-pushable, which will indicate
that it needs to be executed only in the master backend. If we go with this
option, then we have to make the Funnel node capable of evaluating Filter
and Targetlist, which is not a big thing.
b. Don't choose the current path as a Funnel path.

I prefer the second one, as it seems simpler than the first and there
doesn't seem to be much benefit in going with the first.

Any better ideas?

I haven't studied the planner logic in enough detail yet to have a
clear opinion on this. But what I do think is that this is a very
good reason why we should bite the bullet and add outfuncs/readfuncs
support for all Plan nodes. Otherwise, we're going to have to scan
subplans for nodes we're not expecting to see there, which seems
silly. We eventually want to allow all of those nodes in the worker
anyway.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#328Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#322)
Re: Parallel Seq Scan

On Thu, Sep 17, 2015 at 6:58 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Sep 3, 2015 at 6:21 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

[ new patches ]

+ pscan = shm_toc_lookup(node->ss.ps.toc, PARALLEL_KEY_SCAN);

This is total nonsense. You can't hard-code the key that's used for
the scan, because we need to be able to support more than one parallel
operator beneath the same funnel. For example:

Append
-> Partial Seq Scan
-> Partial Seq Scan

Okay, but I think the same can be achieved with this as well. The basic
idea is that each worker will work on one planned statement at a time. In
the above case there will be two different planned statements, and they
will store partial seq scan related information in two different locations
in the toc, although the key (PARALLEL_KEY_SCAN) would be the same. I think
this will be quite similar to what we are already doing for response
queues. The worker will work on one of those keys based on the planned
statement which it chooses to execute. I have explained this in somewhat
more detail in one of my previous mails [1].

Each partial sequential scan needs to have a *separate* key, which
will need to be stored in either the Plan or the PlanState or both
(not sure exactly). Each partial seq scan needs to get assigned a
unique key there in the master, probably starting from 0 or 100 or
something and counting up, and then this code needs to extract that
value and use it to look up the correct data for that scan.

In that case also, multiple workers can work on the same key, assuming
that in your above example multiple workers will be required to execute
each partial seq scan. In this case we might need to see how to map
instrumentation information for a particular execution.

[1]: /messages/by-id/CAA4eK1LNt6wQBCxKsMj_QC+GahBuwyKWsQn6UL3nWVQ2savzwg@mail.gmail.com

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#329Haribabu Kommi
kommi.haribabu@gmail.com
In reply to: Amit Kapila (#309)
1 attachment(s)
Re: Parallel Seq Scan

On Thu, Sep 3, 2015 at 8:21 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Attached, find the rebased version of patch.

Here are the performance test results:

Query         Selectivity   HashAgg +       HashAgg + parallel seq scan (ms)
              (million)     seqscan (ms)    2 workers   4 workers   8 workers
$1 <= '001'   0.1           16717.00         7086.00     4459.00     2912.00
$1 <= '004'   0.4           17962.00         7410.00     4651.00     2977.00
$1 <= '008'   0.8           18870.00         7849.00     4868.00     3092.00
$1 <= '016'   1.5           21368.00         8645.00     6800.00     3486.00
$1 <= '030'   2.7           24622.00        14796.00    13108.00     9981.00
$1 <= '060'   5.4           31690.00        29839.00    26544.00    23814.00
$1 <= '080'   7.2           37147.00        40485.00    35763.00    32679.00

Table Size - 18GB
Total rows - 40 million

Configuration:
Shared_buffers - 12GB
max_wal_size - 5GB
checkpoint_timeout - 15min
work_mem - 1GB

System:
CPU - 16 core
RAM - 64GB

Query:
SELECT col1, col2,
SUM(col3) AS sum_col3,
SUM(col4) AS sum_col4,
SUM(col5) AS sum_col5,
SUM(col6) AS sum_col6
FROM public.test01
WHERE col1 <= $1 AND
col7 = '01' AND
col8 = '0'
GROUP BY col2,col1;

And also attached perf results for selectivity of 0.1 million and 5.4
million cases for analysis.

Regards,
Hari Babu
Fujitsu Australia

Attachments:

perf_reports.zip (application/zip)

#330Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#318)
Re: Parallel Seq Scan

On Thu, Sep 17, 2015 at 2:28 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Sep 3, 2015 at 6:21 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

+    /*
+     * We expect each worker to populate the BufferUsage structure
+     * allocated by master backend and then master backend will aggregate
+     * all the usage along with it's own, so account it for each worker.
+     */

This also needs improvement. Especially because...

+    /*
+     * We expect each worker to populate the instrumentation structure
+     * allocated by master backend and then master backend will aggregate
+     * all the information, so account it for each worker.
+     */

...it's almost identical to this one.

I think we can combine them and have one comment.

+GetParallelSupportInfo(shm_toc *toc, ParamListInfo *params,

Could this be a static function? Will it really be needed outside this
file?

It is already declared as static, but will add static in function definition
as well.

And is there any use case for letting some of the arguments be NULL?

In earlier versions of patch this API was used from other places, but now
there is no such use, so will change accordingly.

+bool
+ExecParallelBufferUsageAccum(Node *node)
+{
+    if (node == NULL)
+        return false;
+
+    switch (nodeTag(node))
+    {
+        case T_FunnelState:
+            {
+                FinishParallelSetupAndAccumStats((FunnelState*)node);
+                return true;
+            }
+            break;
+        default:
+            break;
+    }
+
+    (void) planstate_tree_walker((Node*)((PlanState *)node)->lefttree, NULL,
+                                 ExecParallelBufferUsageAccum, 0);
+    (void) planstate_tree_walker((Node*)((PlanState *)node)->righttree, NULL,
+                                 ExecParallelBufferUsageAccum, 0);
+    return false;
+}

This seems wacky. I mean, isn't the point of planstate_tree_walker()
that the callback itself doesn't have to handle recursion like this?
And if not, then this wouldn't be adequate anyway, because some
planstate nodes have children that are not in lefttree or righttree
(cf. explain.c).
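
The division of labour being asked for here can be sketched with a toy tree.
Everything below is hypothetical (ToyState, toy_tree_walker and
toy_count_funnels are made-up names, not PostgreSQL's API): the walker knows
every child slot of a node, and the callback only handles the node it is
given and hands child enumeration back to the walker.

```c
#include <assert.h>
#include <stddef.h>

typedef enum ToyTag { TOY_FUNNEL, TOY_APPEND, TOY_SCAN } ToyTag;

#define TOY_MAX_CHILDREN 4

/* Hypothetical state node with any number of children, not just two. */
typedef struct ToyState
{
    ToyTag tag;
    int nchildren;
    struct ToyState *children[TOY_MAX_CHILDREN];
} ToyState;

typedef int (*toy_walker_fn) (ToyState *node, void *context);

/*
 * Calls 'walker' on every child of 'node', whichever slot it lives in.
 * Stops early and returns 1 if the callback returns nonzero.
 */
static int
toy_tree_walker(ToyState *node, toy_walker_fn walker, void *context)
{
    int i;

    assert(walker != NULL);
    for (i = 0; i < node->nchildren; i++)
        if (walker(node->children[i], context))
            return 1;
    return 0;
}

/* Callback: counts funnel nodes and never names a child slot itself. */
static int
toy_count_funnels(ToyState *node, void *context)
{
    int *count = (int *) context;

    if (node == NULL)
        return 0;
    if (node->tag == TOY_FUNNEL)
    {
        (*count)++;
        return 0;           /* handled; siblings are still visited */
    }
    return toy_tree_walker(node, toy_count_funnels, context);
}
```

Because the callback never touches a particular child slot, a node with
three children (an Append, say) is covered without any change to it.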

Will change according to the recent commit for planstate_tree_walker.

+    currentRelation = ExecOpenScanRelation(estate,
+                                           ((SeqScan *) node->ss.ps.plan)->scanrelid,
+                                           eflags);

I can't see how this can possibly be remotely correct. The funnel
node shouldn't be limited to scanning a baserel (cf. fdw_scan_tlist).

This is mainly used for generating tuple descriptor and that tuple
descriptor will be used for forming scanslot, funnel node itself won't
do any scan. However, we can completely eliminate this InitFunnel()
function and use ExecAssignProjectionInfo() instead of
ExecAssignScanProjectionInfo() to form the projection info.

+ buffer_usage_worker = (BufferUsage *)(buffer_usage + (i * sizeof(BufferUsage)));

Cast it to a BufferUsage * first. Then you can use &foo[i] to find
the i'th element.

Do you mean to say that the way code is written won't work?
The BufferUsage structure for each worker is copied into the string
buffer_usage, which I believe can be fetched in the above way.

+    /*
+     * Re-initialize the parallel context and workers to perform
+     * rescan of relation.  We want to gracefully shutdown all the
+     * workers so that they should be able to propagate any error
+     * or other information to master backend before dying.
+     */
+    FinishParallelSetupAndAccumStats(node);

Somehow, this makes me feel like that function is badly named.

I think here comment seems to be slightly misleading, shall we
change the comment as below:

Destroy the parallel context to gracefully shutdown all the
workers so that they should be able to propagate any error
or other information to master backend before dying.

+/*
+ * _readPlanInvalItem
+ */
+static PlanInvalItem *
+_readPlanInvalItem(void)
+{
+    READ_LOCALS(PlanInvalItem);
+
+    READ_INT_FIELD(cacheId);
+    READ_UINT_FIELD(hashValue);
+
+    READ_DONE();
+}

I don't see why we should need to be able to copy PlanInvalItems. In
fact, it seems like a bad idea.

We are not copying PlanInvalItems, so I don't think this is required.
I don't exactly remember why I added this in the first place; one
reason could be that in earlier versions PlanInvalItems might have
been copied. Anyway, I will verify it once more and, if it is not
required, I will remove it.

+#parallel_setup_cost = 0.0  # same scale as above
+#define DEFAULT_PARALLEL_SETUP_COST  0.0

This value is probably a bit on the low side.

How about keeping it as 10.0?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#331Amit Kapila
amit.kapila16@gmail.com
In reply to: Haribabu Kommi (#329)
Re: Parallel Seq Scan

On Fri, Sep 18, 2015 at 1:33 PM, Haribabu Kommi <kommi.haribabu@gmail.com>
wrote:

On Thu, Sep 3, 2015 at 8:21 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

Attached, find the rebased version of patch.

Here are the performance test results:

Query         Selectivity   HashAgg +       HashAgg + parallel seq scan (ms)
              (million)     seqscan (ms)    2 workers   4 workers   8 workers
$1 <= '001'   0.1           16717.00         7086.00     4459.00     2912.00
$1 <= '004'   0.4           17962.00         7410.00     4651.00     2977.00
$1 <= '008'   0.8           18870.00         7849.00     4868.00     3092.00
$1 <= '016'   1.5           21368.00         8645.00     6800.00     3486.00
$1 <= '030'   2.7           24622.00        14796.00    13108.00     9981.00
$1 <= '060'   5.4           31690.00        29839.00    26544.00    23814.00
$1 <= '080'   7.2           37147.00        40485.00    35763.00    32679.00

I think when the selectivity is more than 5 million rows, it probably
should not have selected the Funnel plan. Have you by any chance changed
cpu_tuple_comm_cost? If not, you could try setting parallel_setup_cost
(maybe to 10) and then see whether it still selects the Funnel plan. Is it
possible for you to check the cost difference between the seq scan and
Funnel plans? Hopefully explain or explain analyze should be sufficient.

And also attached perf results for selectivity of 0.1 million and 5.4
million cases for analysis.

I have checked the perf reports, and it seems that when selectivity is
higher, time is being spent in some kernel calls, which could be due to
communication of tuples.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#332Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#327)
Re: Parallel Seq Scan

On Thu, Sep 17, 2015 at 4:44 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Sep 17, 2015 at 2:54 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Here the subplan is generated before the top-level plan, and while
generating the subplan we can't predict whether it is okay to generate it
as a Funnel, because the top-level plan might turn out to be a non-Funnel
plan. Also, if such a subplan is actually an InitPlan, then we are safe (as
we execute InitPlans in the master backend and then pass the result to the
parallel worker) even if the top-level plan is a Funnel. I think the place
where we can catch this is during generation of the Funnel path: basically,
we can check whether any node beneath the Funnel node has a 'filter' or
'targetlist' containing another Funnel node. If so, we have two options to
proceed:
a. Mark such a filter or target list as non-pushable, which will indicate
that it needs to be executed only in the master backend. If we go with this
option, then we have to make the Funnel node capable of evaluating Filter
and Targetlist, which is not a big thing.
b. Don't choose the current path as a Funnel path.

I prefer the second one, as it seems simpler than the first and there
doesn't seem to be much benefit in going with the first.

Any better ideas?

I haven't studied the planner logic in enough detail yet to have a
clear opinion on this. But what I do think is that this is a very
good reason why we should bite the bullet and add outfuncs/readfuncs
support for all Plan nodes. Otherwise, we're going to have to scan
subplans for nodes we're not expecting to see there, which seems
silly. We eventually want to allow all of those nodes in the worker
anyway.

Makes sense to me. There are 39 plan nodes; it seems we have support for
all of them in outfuncs, and we need to add support for most of them
in readfuncs.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#333Robert Haas
robertmhaas@gmail.com
In reply to: Haribabu Kommi (#329)
Re: Parallel Seq Scan

On Fri, Sep 18, 2015 at 4:03 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:

On Thu, Sep 3, 2015 at 8:21 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Attached, find the rebased version of patch.

Here are the performance test results:

Thanks, this is really interesting. I'm very surprised by how much
kernel overhead this shows. I wonder where that's coming from. The
writes to and reads from the shm_mq shouldn't need to touch the kernel
at all except for page faults; that's why I chose this form of IPC.
It could be that the signals which are sent for flow control are
chewing up a lot of cycles, but if that's the problem, it's not very
clear from here. copy_user_generic_string doesn't sound like
something related to signals. And why all the kernel time in
_spin_lock? Maybe perf -g would help us tease out where this kernel
time is coming from.

Some of this may be due to rapid context switching. Suppose the
master process is the bottleneck. Then each worker will fill up the
queue and go to sleep. When the master reads a tuple, the worker has
to wake up and write a tuple, and then it goes back to sleep. This
might be an indication that we need a bigger shm_mq size. I think
that would be worth experimenting with: if we double or quadruple or
increase by 10x the queue size, what happens to performance?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#334Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#328)
Re: Parallel Seq Scan

On Thu, Sep 17, 2015 at 11:44 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Okay, but I think the same can be achieved with this as well. The basic
idea is that each worker will work on one planned statement at a time. In
the above case there will be two different planned statements, and they
will store partial seq scan related information in two different locations
in the toc, although the key (PARALLEL_KEY_SCAN) would be the same. I think
this will be quite similar to what we are already doing for response
queues. The worker will work on one of those keys based on the planned
statement which it chooses to execute. I have explained this in somewhat
more detail in one of my previous mails [1].

shm_toc keys are supposed to be unique. If you added more than one
with the same key, there would be no way to look up the second one. That
was intentional, and I don't want to revise it.

I don't want to have multiple PlannedStmt objects in any case. That
doesn't seem like the right approach. I think passing down an Append
tree with multiple Partial Seq Scan children to be run in order is
simple and clear, and I don't see why we would do it any other way.
The master should be able to generate a plan and then copy the part of
it below the Funnel and send it to the worker. But there's clearly
never more than one PlannedStmt in the master, so where would the
other ones come from in the worker? There's no reason to introduce
that complexity.

Each partial sequential scan needs to have a *separate* key, which
will need to be stored in either the Plan or the PlanState or both
(not sure exactly). Each partial seq scan needs to get assigned a
unique key there in the master, probably starting from 0 or 100 or
something and counting up, and then this code needs to extract that
value and use it to look up the correct data for that scan.

In that case also, multiple workers can work on the same key, assuming
that in your above example multiple workers will be required to execute
each partial seq scan. In this case we might need to see how to map
instrumentation information for a particular execution.

That was discussed on the nearby thread about numbering plan nodes.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#335Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#330)
Re: Parallel Seq Scan

On Fri, Sep 18, 2015 at 6:55 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

+    currentRelation = ExecOpenScanRelation(estate,
+                                           ((SeqScan *) node->ss.ps.plan)->scanrelid,
+                                           eflags);

I can't see how this can possibly be remotely correct. The funnel
node shouldn't be limited to scanning a baserel (cf. fdw_scan_tlist).

This is mainly used for generating tuple descriptor and that tuple
descriptor will be used for forming scanslot, funnel node itself won't
do any scan. However, we can completely eliminate this InitFunnel()
function and use ExecAssignProjectionInfo() instead of
ExecAssignScanProjectionInfo() to form the projection info.

That sounds like a promising approach.

+ buffer_usage_worker = (BufferUsage *)(buffer_usage + (i * sizeof(BufferUsage)));

Cast it to a BufferUsage * first. Then you can use &foo[i] to find
the i'th element.

Do you mean to say that the way code is written won't work?
Values of structure BufferUsage for each worker is copied into string
buffer_usage which I believe could be fetched in above way.

I'm just complaining about the style. If bar is a char*, then these
are all equivalent:

foo = (Quux *) (bar + (i * sizeof(Quux)));

foo = ((Quux *) bar) + i;

foo = &((Quux *) bar)[i];

baz = (Quux *) bar;
foo = &baz[i];
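
The equivalence of these spellings can be checked with a small stand-alone
sketch. Quux below is a made-up struct standing in for BufferUsage, and the
helper names are hypothetical; the buffer passed in is assumed to be suitably
aligned for Quux.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for BufferUsage; only the layout matters here. */
typedef struct Quux
{
    long hits;
    long reads;
} Quux;

/* Style 1: byte arithmetic on the char pointer, then cast. */
static Quux *
nth_by_bytes(char *bar, size_t i)
{
    return (Quux *) (bar + (i * sizeof(Quux)));
}

/* Style 2: cast first, then ordinary pointer arithmetic. */
static Quux *
nth_by_pointer(char *bar, size_t i)
{
    return ((Quux *) bar) + i;
}

/* Style 3: cast once into a typed pointer, then index it. */
static Quux *
nth_by_index(char *bar, size_t i)
{
    Quux *baz = (Quux *) bar;

    return &baz[i];
}

/* Returns 1 if all three styles yield the same address for slots 0..n-1. */
static int
styles_agree(char *bar, size_t n)
{
    size_t i;

    assert(bar != NULL);
    for (i = 0; i < n; i++)
    {
        if (nth_by_bytes(bar, i) != nth_by_pointer(bar, i) ||
            nth_by_pointer(bar, i) != nth_by_index(bar, i))
            return 0;
    }
    return 1;
}
```

All three produce the same address; the typed forms simply leave the
sizeof scaling to the compiler.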

+    /*
+     * Re-initialize the parallel context and workers to perform
+     * rescan of relation.  We want to gracefully shutdown all the
+     * workers so that they should be able to propagate any error
+     * or other information to master backend before dying.
+     */
+    FinishParallelSetupAndAccumStats(node);

Somehow, this makes me feel like that function is badly named.

I think here comment seems to be slightly misleading, shall we
change the comment as below:

Destroy the parallel context to gracefully shutdown all the
workers so that they should be able to propagate any error
or other information to master backend before dying.

Well, why does a function that destroys the parallel context have a
name that starts with FinishParallelSetup? It sounds like it is
tearing things down, not finishing setup.

+#parallel_setup_cost = 0.0  # same scale as above
+#define DEFAULT_PARALLEL_SETUP_COST  0.0

This value is probably a bit on the low side.

How about keeping it as 10.0?

Really? I would have guessed that the correct value was in the tens
of thousands.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#336Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#334)
Re: Parallel Seq Scan

On Fri, Sep 18, 2015 at 12:56 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Sep 17, 2015 at 11:44 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Okay, but I think the same can be achieved with this as well. The basic
idea is that each worker will work on one planned statement at a time. In
the above case there will be two different planned statements, and they
will store partial seq scan related information in two different locations
in the toc, although the key (PARALLEL_KEY_SCAN) would be the same. I think
this will be quite similar to what we are already doing for response
queues. The worker will work on one of those keys based on the planned
statement which it chooses to execute. I have explained this in somewhat
more detail in one of my previous mails [1].

shm_toc keys are supposed to be unique. If you added more than one
with the same key, there would be no way to look up the second one. That
was intentional, and I don't want to revise it.

I don't want to have multiple PlannedStmt objects in any case. That
doesn't seem like the right approach. I think passing down an Append
tree with multiple Partial Seq Scan children to be run in order is
simple and clear, and I don't see why we would do it any other way.
The master should be able to generate a plan and then copy the part of
it below the Funnel and send it to the worker. But there's clearly
never more than one PlannedStmt in the master, so where would the
other ones come from in the worker? There's no reason to introduce
that complexity.

Also, as KaiGai pointed out on the other thread, even if you DID pass
two PlannedStmt nodes to the worker, you still need to know which one
goes with which ParallelHeapScanDesc. If both of the
ParallelHeapScanDesc nodes are stored under the same key, then you
can't do that. That's why, as discussed in the other thread, we need
some way of uniquely identifying a plan node.
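
The problem with a shared key, and the effect of giving each scan its own
key, can be sketched with a toy table of contents. This is an illustration
only: ToyToc and its functions are hypothetical and are not the shm_toc API,
the first-match lookup is an assumption chosen to mirror the behaviour
described above, and the base value 100 is arbitrary.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_TOC_MAX 16
#define TOY_SCAN_KEY_BASE 100   /* arbitrary starting value for scan keys */

typedef struct ToyTocEntry
{
    unsigned long key;
    void *address;
} ToyTocEntry;

typedef struct ToyToc
{
    int nentries;
    ToyTocEntry entries[TOY_TOC_MAX];
} ToyToc;

/* Appends an entry without checking for duplicates. Returns 1 on success. */
static int
toy_toc_insert(ToyToc *toc, unsigned long key, void *address)
{
    assert(toc != NULL);
    if (toc->nentries >= TOY_TOC_MAX)
        return 0;
    toc->entries[toc->nentries].key = key;
    toc->entries[toc->nentries].address = address;
    toc->nentries++;
    return 1;
}

/* Returns the first entry stored under 'key', or NULL if there is none. */
static void *
toy_toc_lookup(const ToyToc *toc, unsigned long key)
{
    int i;

    assert(toc != NULL);
    for (i = 0; i < toc->nentries; i++)
        if (toc->entries[i].key == key)
            return toc->entries[i].address;
    return NULL;
}

/* Derives a distinct key from a per-plan-node number assigned in the master. */
static unsigned long
toy_scan_key(int plan_node_id)
{
    return TOY_SCAN_KEY_BASE + (unsigned long) plan_node_id;
}
```

If two scan descriptors are inserted under the same key, only the first can
ever be found again; with one key per plan node, each scan finds its own
descriptor.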

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#337Haribabu Kommi
kommi.haribabu@gmail.com
In reply to: Amit Kapila (#331)
Re: Parallel Seq Scan

On Fri, Sep 18, 2015 at 9:45 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Sep 18, 2015 at 1:33 PM, Haribabu Kommi <kommi.haribabu@gmail.com>
wrote:

On Thu, Sep 3, 2015 at 8:21 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

Attached, find the rebased version of patch.

Here are the performance test results:

Query         Selectivity   HashAgg +       HashAgg + parallel seq scan (ms)
              (million)     seqscan (ms)    2 workers   4 workers   8 workers
$1 <= '001'   0.1           16717.00         7086.00     4459.00     2912.00
$1 <= '004'   0.4           17962.00         7410.00     4651.00     2977.00
$1 <= '008'   0.8           18870.00         7849.00     4868.00     3092.00
$1 <= '016'   1.5           21368.00         8645.00     6800.00     3486.00
$1 <= '030'   2.7           24622.00        14796.00    13108.00     9981.00
$1 <= '060'   5.4           31690.00        29839.00    26544.00    23814.00
$1 <= '080'   7.2           37147.00        40485.00    35763.00    32679.00

I think when the selectivity is more than 5 million rows, it probably
should not have selected the Funnel plan. Have you by any chance changed
cpu_tuple_comm_cost? If not, you could try setting parallel_setup_cost
(maybe to 10) and then see whether it still selects the Funnel plan. Is it
possible for you to check the cost difference between the seq scan and
Funnel plans? Hopefully explain or explain analyze should be sufficient.

Yes, I changed cpu_tuple_comm_cost to zero to observe how parallel seq scan
performs at high selectivity. I forgot to mention that in the earlier mail.
Overall, the parallel seq scan performance is good.
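
Why a zero cpu_tuple_comm_cost makes the planner pick the Funnel plan at every selectivity can be seen with a toy cost model. This is a sketch under stated assumptions, not the patch's costing code: it assumes the scan cost is simply divided by the number of workers (the simple model described at the start of this thread) and that each returned tuple adds cpu_tuple_comm_cost. The table size and the 0.1 value used below are made-up illustration values.

```python
def seq_scan_cost(pages, tuples, seq_page_cost=1.0, cpu_tuple_cost=0.01):
    """Serial scan cost: I/O for every page plus CPU for every tuple."""
    return pages * seq_page_cost + tuples * cpu_tuple_cost


def funnel_cost(pages, tuples, tuples_returned, workers,
                parallel_setup_cost, cpu_tuple_comm_cost):
    """Assumed model: scan cost split evenly across workers, plus a fixed
    setup cost and a per-tuple cost for passing results to the master."""
    scan = seq_scan_cost(pages, tuples) / workers
    return parallel_setup_cost + scan + tuples_returned * cpu_tuple_comm_cost


# Hypothetical table: 1M pages, 10M tuples, 2 workers, setup cost of 10.
PAGES, TUPLES, WORKERS_USED, SETUP = 1_000_000, 10_000_000, 2, 10.0
serial = seq_scan_cost(PAGES, TUPLES)

for returned in (100_000, 7_200_000):
    for comm_cost in (0.0, 0.1):
        parallel = funnel_cost(PAGES, TUPLES, returned, WORKERS_USED,
                               SETUP, comm_cost)
        choice = "Funnel" if parallel < serial else "Seq Scan"
        print(f"returned={returned:>9} comm_cost={comm_cost}: {choice}")
```

In this model, a communication cost of zero removes the only term that grows with the number of returned tuples, so the Funnel plan looks cheaper even when most of the work is tuple transfer.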

And also attached perf results for selectivity of 0.1 million and 5.4
million cases for analysis.

I have checked the perf reports, and it seems that at higher selectivity
time is being spent in some kernel calls, which could be due to the
communication of tuples.

Yes. Also, at low selectivity, the time spent in the tas and s_lock
functions increases as the number of workers increases. This may be one of
the reasons for the scaling problem.

Regards,
Hari Babu
Fujitsu Australia


#338Haribabu Kommi
kommi.haribabu@gmail.com
In reply to: Robert Haas (#333)
2 attachment(s)
Re: Parallel Seq Scan

On Sat, Sep 19, 2015 at 1:45 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Sep 18, 2015 at 4:03 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:

On Thu, Sep 3, 2015 at 8:21 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Attached, find the rebased version of patch.

Here are the performance test results:

Thanks, this is really interesting. I'm very surprised by how much
kernel overhead this shows. I wonder where that's coming from. The
writes to and reads from the shm_mq shouldn't need to touch the kernel
at all except for page faults; that's why I chose this form of IPC.
It could be that the signals which are sent for flow control are
chewing up a lot of cycles, but if that's the problem, it's not very
clear from here. copy_user_generic_string doesn't sound like
something related to signals. And why all the kernel time in
_spin_lock? Maybe perf -g would help us tease out where this kernel
time is coming from.

The copy_user_generic_string time comes from file read operations.
In my test, I set shared_buffers to 12GB with a table size of 18GB.

To reduce the time spent in copy_user_generic_string, I loaded all the pages
into shared buffers and tested with shared_buffers settings of 12GB and 20GB.

The _spin_lock calls come from the signals that are generated by the workers.
As the tuple queue size increases, the kernel call usage changes.

I have attached the perf reports collected with the -g option for your
reference.

Some of this may be due to rapid context switching. Suppose the
master process is the bottleneck. Then each worker will fill up the
queue and go to sleep. When the master reads a tuple, the worker has
to wake up and write a tuple, and then it goes back to sleep. This
might be an indication that we need a bigger shm_mq size. I think
that would be worth experimenting with: if we double or quadruple or
increase by 10x the queue size, what happens to performance?

I tried multiplying the tuple queue size by factors of 1, 2, 4, 8 and 10
and collected the performance readings. A summary of the results:

- In low selectivity cases, increasing the tuple queue size makes little
difference.

- Up to a selectivity of 1.5 million, the query execution time is ordered
8 workers < 4 workers < 2 workers
with any tuple queue size.

- With a tuple queue multiply factor of 4 (i.e. 4 * tuple queue size), for
selectivity greater than 1.5 million:
4 workers < 2 workers < 8 workers

- With a tuple queue multiply factor of 8 or 10, for selectivity greater
than 1.5 million:
2 workers < 4 workers < 8 workers

- From the above readings, increasing the tuple queue size benefits
configurations with fewer workers more than those with more workers.

- Maybe the tuple queue size can be calculated automatically based on
the selectivity, average tuple width and number of workers.

- When the buffers are loaded into shared_buffers using the prewarm
utility, little scaling is visible as the number of workers increases.

Performance report is attached for your reference.

Apart from the performance, I have the following observations.

Workers are started irrespective of the system load. Suppose the user
configures 16 workers but, because of a sudden increase in system load,
only 2 or 3 CPUs are idle. If a query eligible for parallel seq scan is
executed in this situation, the backend may still start 16 workers. This
can increase overall system usage and may decrease the performance of the
other backend sessions.
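
One possible shape for a load-aware limit is sketched below. This is purely an illustration of the concern, not something the patch does: the function name and the "one idle CPU per unit of load average" rule are assumptions, and a real implementation would need a far more careful notion of available capacity.

```python
import math
import os


def workers_to_launch(configured_degree, cpu_count, load_average):
    """Cap the configured worker count by an estimate of idle CPUs.

    Assumption: each unit of load average occupies roughly one CPU, so the
    idle estimate is cpu_count minus the load average rounded up.
    """
    idle_cpus = max(cpu_count - math.ceil(load_average), 0)
    return min(configured_degree, idle_cpus)


if __name__ == "__main__":
    try:
        load_1min = os.getloadavg()[0]  # not available on every platform
    except (AttributeError, OSError):
        load_1min = 0.0
    cpus = os.cpu_count() or 1
    print(workers_to_launch(16, cpus, load_1min))
```

With 16 configured workers on a 16-CPU machine whose load average is 13.6, this rule would launch only 2 workers; a result of 0 would mean falling back to a plain seq scan.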

If a query has two parallel seq scan plan nodes, how will the workers be
distributed across the two nodes? Currently parallel_seqscan_degree is
applied per plan node. Even if we change that to per query, I think we need
some worker distribution logic, instead of letting a single plan node use
all the workers.
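
As an illustration of what such distribution logic might look like, the sketch below splits a per-query worker budget across scan nodes in proportion to the number of pages each node has to scan, using the largest-remainder method. The function, the node names and the choice of pages as the weight are all assumptions made for this example.

```python
def distribute_workers(total_workers, node_pages):
    """Split total_workers across plan nodes in proportion to their pages.

    node_pages maps a (hypothetical) plan node name to the number of pages
    it scans. Fractional shares are rounded down, and the leftover workers
    go to the nodes with the largest fractional remainders.
    """
    total_pages = sum(node_pages.values())
    if total_workers <= 0 or total_pages <= 0:
        return {node: 0 for node in node_pages}

    shares = {n: total_workers * p / total_pages for n, p in node_pages.items()}
    alloc = {n: int(s) for n, s in shares.items()}

    leftover = total_workers - sum(alloc.values())
    by_remainder = sorted(node_pages, key=lambda n: shares[n] - alloc[n],
                          reverse=True)
    for node in by_remainder[:leftover]:
        alloc[node] += 1
    return alloc


plan = distribute_workers(8, {"scan_a": 3000, "scan_b": 1000})
print(plan)
```

A small node can end up with zero workers under this rule, in which case it would have to run as an ordinary seq scan.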

A SELECT with a LIMIT clause performs worse with parallel seq scan than
with seq scan in some scenarios, because of the very low selectivity. It
would be better to document this, so that users can take the necessary
action for queries with a LIMIT clause.

Regards,
Hari Babu
Fujitsu Australia

Attachments:

Parallel_seqscan_perf_test.xlsx (application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
perf_reports.zip (application/zip)
���EH�`���6NXv��_�ko�w�&�y?���n���-�G�&�����S^��-��Myh��'��Q%�CH�	M�_�|�y���v!�H�Uvt�f���cY�ey�r�!K�kN����u��S7�A������|���gY���5M
�\��{Jpli��R���O�m%H�(���o�_z�����7���\�w���fI�'k����������=�s���*<��?�P���������Pe��������>�I��(�*����m�
���G	�|�%�<���}�.!J�qAv{��w��LI�����*�
���|���M��g��������,[sO�J��@�.)	E9Kw;��_���%a����4(����v�N������k����Yy��=S�=��&�y����g��SJ�/P{���(�����������������w �V�sU}��tM����`���`(q���fai����C���/X���gc`������g���ij��rz,~f��n�+�������.�z���������,|��8��������(�R?�����H�Ax���G���;��~~����Q>���P��@

�|��s��������fG�6�%�/���A��=..���5��`5�@8&���9z�\S=�j7zh���5O�g�Z���"�J���������������~|H�l�oi~�_�����
�&f��a�����������AhIh&�D�V6� E���.BM�T=K�4?�I�Mg$��"TD���W�e�kh��%��GR&r9R�
u�i4F^��4~>��5�g�N��~o=�����8�K|�k���H����}���d_�`��`4�����1���������/g�4��N��@��1_������Vy�|��R�}�g)8:C�a�����w�[X���xu�r�����j08��n��'��O�;�v>�|"����D���r���'��� ��S�g\MM�t��$��j�%�E7���9���i:��qp�_j��o5��8~�����Y�D ��4�'��aW�q"��QU"���
i"W��hC%[��bh��J#2���hk�U%���:R%k��:��[I����Z���l����VnG���Dt���w����2�����������2���V�bjQ�������P���n�@����Z�����v�Jecj����u��P���,�u#���@�i�*wP:[e�����C�[�!Q�*��E��"]�,�6�n�+��z7x6gDd��f�7�+���?���a�������|�K�6��*��	]i�s�	:=*������g)T�K�'C�4�����h���v��H�x ��H��`���v�� ����|#3Id��tz(��z��k6�`���pOe��V$y�_� y�WX�0gq)Xt�!/l��j/	 ����'xvr����M5��U�.�s���Q�}A��m)�'MX� ���L�9�8=���vUI_xU���QK�
]�&�<������#U-gbrK ��� �S���C��Mp!�dw�%h����dm~����'��dL��vZ`E�$��1��tz8hG=~XVy�U����&�fV_�S����bB�C`�YxcVf�Vj�VB�E+�r:��z,�)��f��D��b�U+��f�W����'~j��+�_�ax�}@�-�"�����r�;���z��u�*�IJ�y'��5��!��Om��+MS��
~e����vJ�,Mo�������Mdn�):�d��5���[����Z��Z�bZ����$�\TP%�UP�E�D��4�g�m�����.T�#�^u���O���
Y�W����L������?�O����^	���h�������������h��>uGq�1���.����x����9����d�rH<Y%����!���n��B�����}�A���j������>:e�B��ayd�������h�t��k�tz0��<�]����E05l�~�������������4�tn�^w�3�b/z\�J����w!5�CG�����0&j�t����/�����t���Q�k�b��k�bsz���m�'�97�g&�$���Z&����3,���`JaQR��U]�:<.;<t����R]������:���<jz�R�����������US���t���i�'���x��i�'���x�G���s�	�r������=����bs����x�
p���iGQ��8�����X����um��.�MR�E�$s]O��"������P��i���w~<��1;���P���B��lZ���sh�����tq?v�n�b�&�M�`�|��>3�}\�e.Q_�	}���a"�}���a�^�������fn_
�E�����4���,�e�8�+���ui�3���xB������Y<	�������'K��?���mt��3�H�����0-N�����c�Lg"�+�Md�"�*��DB% 2*��;�9VH�XH�[�</�NK�.����e��y����O��^��o��\c���S"��D�O�P��>%�>%�}J����)K0�s�����9_xz����&���VDZ?��\q!2���I��/�cJ���F���W�tz@�s?��'	��?K���Ie=Gt0	�"�iD�������hbU��c0utT��a(r��"N1�5�9&��h�&#���P(��>��3&�b8�v���'tzD��������j�,��A���������h�����3�3v��n���!���������6e���Gu�h��������O+��Q��p�����\K��K�7{���4C��_x�����+�����&�n�&~<M���>���"���k�c�qTip�Bn�pj&yjg�j��j7��'8�T�����0PU�pY���]���a���mS�/^�Dvt�iy�2��F���vyOGXB��E��L��|�$^"�W��c�N�GT�d�� 9*$�
��zb"�9���A	w�n�]��p7�����K2K�5:��f��r��xUN���x3h#3����u�ooT���y�F�-S���T9@�3UP
�a������<� )�	'�����'2!�"-2!�#��)(�[���,lq������e@��j��,>�X�F��,aq����y%��+.�V�&�GG��xU��(�P� D�x�z��o�!�~������O���I8��_�#.�->v-.J->-N�,.�,�,^n,��X�X��8��8H�g[��e������z�[�8�7_M���k��X��o��>*)�?��/���TO7��[u���p\~�X���fR�y��QX���%���TZR�N��
���m�j�iT���^�u��	�I�c�0��?��4��)	O���Db"Bbe?�S��<G�$7�2�/������CY����nU�������uuy���z�xwk�;�����������O�v���?�#��2�����8����,:�U�����g`~���������"�?	�����8+#�;��3bb�h��vX&�s1#����,�G�	���g��]��?����x��*�����J�������X���c3i{�����G��gQ2�hpRziW�]qW��u�H�;�(�m���Q�zm#���g��-F��>rp*S-�-�:D�����v�y;vp*a,���^D��z��a#��o�z^��A��E��^�U/bR/bU/������V��>��Y0Y�-�"A�z�=��=��Bm��������^�F
�����l�u�x��-�:�6�S9��Y�����Y����uU!KQ��_���.�z�d�~������������Y9�M�e�M������*�*}��	x�84�	�X&q��\:U��BP{��	����oB�7!���u��n��������4Q�f)]j������w��6a7��/����������=|S4~�~�w���w��H>Ag�i��]�2�r�<�L�=�����?Y����:����������x>�h:�p��r;K���y�W,w<�Fa�1���q�4����o=���@��z�yM|� ��T?Ly�K��|mk�Z5�[1$�lz�R��j���SM&Y1N�T�����xb�4�������Z���'��U�%�R;�k�m
���k��H�/�o!i������������p�����|�F�1,^��W�b��_�tsi7�v������=�wuq�5^]�v��G�!�@V���C�KBT��yD��MT��p�'B���W�K��ct����Y��/��2��j%���X�Vb�Z�	�V"��c�o����� ��um�+���s�H?2g"���e9
7�r�|�������X�W���cW�A��	^��!��6����m�s�������~6��j~%�lH
w��J����x`��@1���Kf�����h\.����o7�n������W��n����c�����+��t���\�/��v�����e}Y_]n�lzN>ofc+�;l�%U ����r�T@"�C�a3�Qa�DZ3.��88������M?8�WS
V3�1�M01�]����bV	oK	k@�01*���1�Z�4k������.�Dm1�������R8`s�qt��2G/�t�Gx���/,��	��
���P/,��
���46<����!_�_���r�FSNT�J;+�1�������e��7F�7sb�u"�8�����wz��~[���{��0�sr��t����9�W�fW���W]Q����hxF����$���35@���i��`g4�!`c�E$���/]f�2�7���2z�a�;4A6�H�
�6�o���l��q��K�=����qla����)�����Ln���<�����/���-��~5�Sa���Q��|�fZ�%C(�����LZx1^{��������P�7�����A��*lXK�����F5/�n	�)VT�*����x��f�U��u�V�TO/o����z�
�9���h��}1�����(g0�[\?Z4��w�v�l�L'Gw��+:0�F�C(�u��r1�������'�t��\�f�C�&�)16n
v���5Y�i���>/�i:!��G�g��<+�?�.	~�5&1�!�}��j��PO������KZ6�pt?V!��Yi��F��	�H[b,�AkF�-����d�<��R8/���������*9��!�G|e&�f����!��.��.&3]�f��f�Pf��f��f:���dV�!������� �a#�x�1��b�b��ka5��4�X5��C(�!��Uc�#G�q�����{��������g~R����eV�������Ng��
+qt.��f��n�>2�F#b|��y6)��|T�m�V��Z�D�j�1����o�(�>|Q44��}FB^d����}���O�S������I�m�6�������@���
�B�����xi�a���A�([tJ�yi��C_�0�v���c����$O�?������Rp��gM�$�1n�sa!�i	�3K�Q�NN��
���*Iy��&N���1>�����b�����A+p����{*���������
 z��|�z����O6���z�zO���v,zA�Q>���8�1���>��}�����G��k������{�~�U���u�����o}�A���Y�#��5�em
��J���������Q#KFx�EX�EL��X��M��z�kq�s�Yx1�v�-������7�Rh0������B9���X��N����N����}����D��F#9N�b�4'Q�bBdc����O[�xV���0)A��Fr�$%��g��P�B������������h�Sy���&�7�;=��f�6@6���������&��i�+@�`��h�����p]l�.4���0�������������b��b���sq���8��c;��S�Y=����������v�yu�,=����G�9���y�GB�YY18��d�e��k	X@����T^-`�z����d'�K�[9lvb1��#&��6r����NnpM"���6�b������x�8�[����+-TA���O?3����OC)Vj@u4f��c	X�Y��~�]��.��.�"�.?���M����qdS"j��T=��H\��y���`��:u��a�:w+��$�2$#]~l�#W8R�
�jn��M@�u���4����ht���!��:6Yx���������H�{HH�����{62�l��`O�P�'�{(���=�b����bW;/�e�������l@�|�#��I
>��<Nn����hNn���>���;=s�u���xx8i��!zu��H���p�9��?�wmKN�0�W��#��Mo��p��6��c&�Y��6K�;��c�)�6��H��H�#�H�%��"���[�vyV��xYV�"��;{�#�4	�����v���!�/�\���[��M���Hg���N�J��P�|<�c�[���TV���u�%R�$!G=� KK�qM�;�G��%<���>\~T;�n<�.�����4N��JG'�����7�H�W����`��);�/�	�� H	���^��V��Z�
1���B��BY�kM�.!�K!kF!�@!l;�`�?���n�xo�e����>�U�t���R�F$}lB����m�MT7����z���:+�
�
��G��!�k�Q������$s�"O�>S�>{���'����]���&0�[k�D�v7���~��m-J�L)f?�Fs8����v���_�5m}�lL#a ������
D&�� �Z6~�Pf�Lo�	���������{�������um�zj��%��y[�d��7�������_����`6�9��wZ���z��|l/1_��l��4;��^��)�na�^"�N8h�N����Ro�/��}E�PyOoB�u���$/�<^l@���M�����s�W�{rq����F!�a����[=|�x���n��r%�
�
S04ll}�G�0���l�tM���$���@'�>%��g�?�=
�Ih���
����|~���/���<osg���m��+I0p��cOX�����c<��?�CO��_-��8�(28b�%.]�������2�>]���E�l�8���p8�{x��������iMdq����o5�w�����Tr���u�"e}Bvxa#������s��������a|ju�h��N���8d
dP(m��J�`V�S!��`�Z�L�9�
@dX���,:�����~x�}{�c�]sJ����&#�f���{b�`:������2���	�P�W����ACV��}��fap�0+?%���O	U��P% X	T���p% ��dW���`�����G��XAlI���&��#5��w���`��o�
���
N��4��0(2j}���t��7��H�8��X�#9�7��e=�_�9Y�����5k��rdt�
�i��mQ�o�$/_d��Z��C9W/4���kuJbr�~/�&���"���4^���k��1_JR��+�3����,�Fg���Q�O�������0H�������	�����X�����������x"{�a�8��'�d�����P�?�������e�>�z������q���5��=O.�q��Nu25]#�����t���z���t���4/PEa�D��@\�P]�~��*�*��(
�)2� �����cX:�yCm`�0��m���y��B�������p��c9��z��`�\Y�I0����P�-�U���[���`�5ff����c4������|��S��z���n��s������������0���h��su�]�����:��'��xn�d;��\�p���K>�GG�.����c�h��?�.��(����Pk�j�c���$ a�'a$ a$!a�'a�%a$"a$%a�'a�"a�%ad%a�p�g]P���H�{��)��1�]�1N�������K�\eI-��;����z���U���v��0���'��V��1�j�D������.��s_1m�BF�$rX0-�(cuZO�]�_r8T/��_��_�A9N��h�Cs{e�)��f�q���CMk@���AR�2y����&�8;G)��4�:�d���l���
�0'��F��������_XM=(���,y��1�O��MM����g������,�|��L�����H����%��t�i����������0MN�~M+���Vu��W5=nV�"��)��v�"}�M�-�N�m�in���'�4�����uq�7�yu]l3��XQ�������|��G��#��������:����YN��!�HN���3=�{_�.Ye�o;m���;0qc�j�s�R�=��,�tS���)�ph��6�X�g�n]��x��i����H��*&pq~Y@k
�aL����j���g��H�	a���Y����X�V,f�>Fh�}����k�i�Bq@0�A8 ���8 ;������\:Y��*W��;���"��)�����[|���_�^~�'~���u�F���{9HT��H���[^���E'��Gw��yL���C�ujc�%^��������=�q|Jo�����>�[Qz�`��O�DW+�+����<����������it\kK�z@����)dR�zI����\X��V��e�?���v��f
��,�L���rU�
xt��7���K��-Ub�	Tv��9��2�L�V�/QX�W��sJ��^40)�}��G����tpi.>��������-�m�50��qP"�)	�Y�W��a7]@|��m�=��g�u&1�M���e����Z��C9���g��)S�2<�f�2E�-��Vye�9����)���<NZ���j9��a>��(`g�����Fk��3$��	N�����`�i��cT�������j�ta�pg �1�M����OK�������gW��"k(a%���@�
P�J @	(�%@	(�*��,��`����l�������	'r�������PI)I}n����`��M��?��6�����!�����S�_�f�AX!ikF���#x��`�<��G8��<���gx���8�����i^S��oIyNG���rZo���}��e����3���OW�<�~��s�(U���gM�=��rV�P����ZE�`J�O�*��D�CS*
sS����� }�I�\���-��
g��x�������q ��S%�vY?�EB_N������$���������6�PXX�$F�v'��/��-
���ny��/h������M�_�X6����(��cD���h��������r�AX%"��j�\Yd�4)��y��u��<)���%����*���/�&k�	��V~O�v�m��M�*��?m�~g��1����<��.\$������7�d�x���:�(�a^������k@.�A	���u�LM��S�����`Eo[Ps�����
��������2��h���8����,H�o�R/l����jsVVl�syv��E�&-�f��<�jK�~���KJ��i1����VT]M��<Lze��e����9?Q[�C}E(J��nS�E�;����TF����<��/�����F�������8����I^������W�I�����N�A�S�~������x���&��D,0&��� 0���-3�����Z���?��	P��i�Y��V5�}���g�Zr�IWEy�<��w���%���y�V��S��������&��M�q}����%�@�'OK���`���|�)��\}�U�P�'�{X�T����f�EIE��3�Jj9�O��E�`����ly�V����p��i����
k��������Q��Bk���q<X�I�=n�C�������E��.���+�@s�3{�Z� �����F�O#��t������B������6n��2/y��8�[��Z�-��E�T�D'nlI��m����l�:8C:���@C�i����9��Lo��b�����9�C�A��y�b�����[=�vGOV�K���gp�B5t�Ub��|��.�]���QV���V(�T��N8�R�S�3�{^��Z�jY���������v�	�������2&��gY-:��-�5r>��9�uRm1��F���6A���h!�I��i'i#������0��N�E���X���!��_sY��Q�1;%�4C�����Zi,��N�C/�'��D�F��_���DO�n�K�,��m�L�a�cC��!��0ln6:�
k�� �����GH;�8��Egl�x��w�v�@�����!�K���
�����������������t�<���7�y��q��t�1rDy�@kB�8�����n��w�n{��:�{���1���xEs�z�������k���W�]_r���Zx���;fp�5fP:��G����(��(�tr�+@#��
�����T	�\eL��@�M'��fs����b.���c�E���=��:W3�*O:�i�X*�Y�������?�I����n���s�rT�����(��������GinG�z}���>�r�@#��r� !��7r�9�!A�	��7�N���'�}!"R�����BV���o�!V^8?��(7��d8DY~v=�:=Z�����G}�����
��H��+H�%^z#n.28H���,���f�n!�������E�d��u�t�h�����Xs�N'��cG��Q+�U��������{�I���2���Te��$s��k�	X�J�.R��AY��N�b�!��tr2�i��ZL�d;������0���p����k�$Fg79~U�P�/��~*u�t'�����3�7���K��;��'���d�v�q���@j��C��!�����q�y�Q�#w��Z-jJ�G$+����E��u:9�_��;f,����{�B@�*�Y���:��	k�;�4a����]{����u_�)[��C���:,*��i��O:O�VMY�.L�!������,y��T��`�9k-:��s)O����F�]M�]�u2�fE����;C�<K�X7��x2U��w������r>}��DH��b!xa�A{`O�A'`=�^�Ezw����Y�z�tB�nB��R������t(����!~nR���vUG;V}|�^�V�+�	�wk���x�eh���'_vp��6�Q9�bY�<��}�(.d�>��t�����:�������Q`����up��c��
h�J+d?�,T��JtP��<*nu�~[iO�������;���U�k�2��f���l�bu��R@q)��H\
H.7��D���Um�=@�@X�_�:�����,���D+6��3�tv]E*oy������y���+
_$���X�Q�����9�����ck��f`�����yjO���)�WT]���M:���2��������x�^eA]t ��M[>����`K��Yl?�p�����c(�+�cd��QD��?H~�S5��5�M'@d����p,J�i�`��{��\l�4������h����e��������~w���6%�������\��~�[����t�}=��9�>�48�������H����S��?�T�v������?��YM���c�������GQ�&���_7��l�����/��yqT�]��*:,��
�r�H�"dj;I�z��<M���v�!@�@n�7Pb��G���]��TD	�trL��5��5y��.g�����������k"��E?�*��$dx�'��uA��B��Z�bh\l����-cg9
��8��(��>����B�[�O�B�t����p���tB���:v�?�"?a�<�%�P��O'`qK� p'������N���AYHE'@�[W����lY���3���.:�a3f���D�T���*H�8�
[OW��C5pD>�L��?��T.�G�J���E�\�S��������g�[!�^������#�~����$�Z�����������b��
�&MH���t]7�/�>�C�f��DL�L\��������C#r�?����?;�q��y�[��sa�'����b8s�x2=_g��-�7zS�2�}��l�!����r�M��:TON��m�^��L�L�z��kxr	e��.<�@+����`�l�!��"�R�7B�Ou���W�v(�9���i�*�~\%�*�*���E�[*��#O��5/��K&]�d��~X��a�����.J4����X>o\k��d�L�RX^����:c��1 �g�3���&�xH|�����9�2l�		�kY�^�X��R��<�^��8�T���GD�t������[<�I�/h���?Kv��!��"��p������,�����}H�In�|��������u<R���-S��\(C"�[&r^�+�9N����J*��\�}���Q�N>�*e���'aJ�kv�����P�D$�����x��2�3���q��}U� C��9l��N���������+O"J�����������FGlcB��t��,�kl�AK�)��Tm2���n�B���b���le��`e�Yt����^��
�_�9`^r(��@�	���`��Ue���Y�U��fB&��t�U	�wV�JA�+��_��(7&�G1�IW�D[�1��`��u}Q�@n�]������'���R��c�'��w���}�A��q�i|U�� ���P�7gg�mS��,�Npb �"~���Pk�t$�����������Qlc�Q�uj!u����3@g�&���&�;\���K�>��!��I'{,J7�4����j������Ro�W;��zra�=�g}�?�=I�4����Q���5\��\��R\����/��\[d�����Z��d��?2W�\l)�"'�n���S��da�4�T���ds�_�U��8��X���:��'t��Q����U�G���yy�� ����N�=o0�YZt���^[Ch��a)U�����k&�y���OgN���d�K���^Y�?Kw����p[tB$%+�y�t��*�kIu:m|��e���#[(UU"N���E�U-�n�D&�.4��*&�� 1t����9��k�Z
��K��#2}�l��o�������D�o%���*��!����N�%��-xTu8�jx�W�AtJ0�����4����yQ�lMG�/;pNB�P�"�,B�P@"P<B��P@"�B�����o��^���.Im���PSbZZ��TqB�mW�1�a�*I'4@��S�3�:T7���-B�7��'w��R�h���O�t�b�,�4�&r��NB�'��������s�9�X�k����8�+:o�R/%|�</�U�������I�����>,��������SZNY��(s7+���y����b6h'8�������p�t��<\�,@����Vnt
�|�W#��c���j�]���gm�B;���]��j+�K;_z�B����i�/�i�)o�&��qwy}RdQM��d�����W�x�5:����\nJ�����rY@�j�������&�T���b>�>��O����y9;G_^<���l�������1:���AX�>�2����*'�:��������c�*RpFU]���L�g�G����U��NN��"��P�.]���u'v��q�������4��u�%ef��,DrW�����E��T�3m����i�)�����L�[]��?rh%6wT���7�X��f�o3 ���m�6o�X��W���vs8�����tIi�����c-�L��2�pIr��Ar��v�U�r}�WB��.[��3�\���49��5������&��c�[tZ�����������������;��7��{	�2��r�NL���EZ�/R���kAM:��3���I�?j?��
��W<:J�y�����wk�D��v�<9�������G���E��h�)�%g��X����1��Ch����D�,b\_!_9��KAo���h��L�� n}��(�w$��y,���lI;�m:c�e�%Z�OL�+U4$�����|b5:[;]�F,4*������im�W��UrS���\�V[L��l������q���[����zg���*~��<��-����f0���-:!��%BYW�U �_�>����S�����|,
:q��@��|P��2����t6+DYP7�����Q��v��IG���2������\?_|�j�a=({��x���n�<��sc�x-�I��v3&��{e�d��	�V:_����C�ZF�yK�S���k��I�D�c��2(9���E����[%�#�2�*Tu�}�-:0F���N���1�2���S��s@�5�-h��Y��O������`,�=�b1���u6�u�F�at�i�]���a�&[����o�g�}�u��N�I�U����p���
d<��c�tZ�C\��o��z���{?R_6�]T����J�*U��j�#"�%�j�i����c�`�4i�K��wp�gjTr������
b�
a�G�Oq�G��|��c3oU��P�y��bO���A�u^�2y�V!M��f%������[}�X?y8�~�d���HU�w>m�94�<�t0����}���_�V��(
[%Y��"��d�p���vW?�9��rH9$��n��MK@
��@�6H��Y�Ia�4Md{y�� >�et�{��S��+��x��6J�q5,G�u�Z@�Z�����sn�����8�*"����F<����6p�9wX�S�o(�o����mq���^���NP}������	��J��c�z��"e�MQ���-0
!���X�$$!B�d���3��N�x����:�o�K����8�� w��	m����N�[J����b�4��*����8
nz����}]�
���c�`�9k�p�����K]�����}���k�-�����,��m�A�e��]���b%K��Vn���t~'Pd�J�I7G:�����z�I��O���]��i8�M��3���Zm���e��#P��G��8�C�;`��^��$L�������)��9M�~�v����tt��0BR�����ZR�9F3
Wy%v�W��='Yp%I�����1�D�9����LCF��-�D+"�*�dZ�F�$�1�������;����)���4=�����[���h$��dJ��mj8����#n c��o���sp�����%��mZ�/-��6�2WDV�!g�_�Y��q]|�l�|������R����?Y��C�
�@���s������'�qlJ��Q����&�
�J�����
�a�������I�P3��5
�TUhc��96�B>����K�o�)'����5���#����m�5��J���xO_h�d}��,���������Y���Q�.���i�����P>
k�8gr������!�Y=�d�|)��EE���cM��fH
g[|z���+�X�9��7T��b��<O��o?$/�r#�s�}F�+���������xt1�Nd9=9%��� ����I$� 1<������sQJ���G�g����PK��6G�fQH&��_perf_reports/12GB_preload/ps_2_workers_12GB_preload_2.7_selectivity_8_task_queue_multiplier.txt�]y��4��Oa	��-I��EB�V��r	��M�N�6	9�����sm��,3�?v������������:y�|^{�x4����$
�%/sF.XL�������nC�-�w�YH��Kh�?�e����M����^,��Kk�0�c����fe��	9c�l�5K� 
O�]�l���4�.!g�&^��@s��u��|I�i����o|�z��E������E~dQ����w�����-�h�}��h�|�G�����|����/���6��H�����������MS7��I�>��=�%�w�4yg����&���<���1���i.	dg�,� �����w��Y�6�nc�Q{�xQ�
v�������hT���w�����t������U��.g���6��`�27�������$���
4?}��G��u����_}���_}�	�mTT6����$������A��m��/�������C�����D����(*�
��\|�E9��&��$�Y��Z��f��c	��@�.�	!������W�����|���y9�\�6���C�
�����U����C����~^�B�}��>�F���r��a�����|z�����"����X6�4�tS�{��{�]���k���$�e��������O���OP��}��V=u���kga�<����������x/�td�K�o��da.������/yK^�,H�K����3�q���ab)]�$��;������\��46����o��
�&��v�}��@�tc�F�/�$������o�$
\X/���x�������(S��
�b����.��w��3x�~qE��������������o]����c!K����|9�!Ho�=s��.��(�W?�h���\~����7�x�M����	;��� �u?y��.����Ec��6�XF�,iR0��F���h����y_�+�qzX~��4�g_F,K���B����Xzyp��4��6cEs���1����.A���M��0j#�D�5������ga�4-:����n�90<J����X��>��#DM�_�`8tp�y3$)��������;t�#��`D�BhHX���y�]}���I�C���0�N�V�k��\3����P������'1��-o-�Fa�:���Z;0���� ���Sos������>��P�\h\����{[�Z���#&�lS�
	�
�^��h����������p�S0$
��zJS}�=`�5�����0��B��0,��CV���w���k?�z6m��������V��q�{�k:L#*`[dC������X:}*�9�KVgP\��Eq�������d��
�H�f�w��}��z�wXCI���cPW��H��w��	Lb�A�j�_�`�M9lmP��+�+(��������F�������1�;��z��.J;���_�;�z�w���Gj��;�D��Q�
��{�dnM8�]�/��f���]��m8;�Nl�Q&��0Y�wb�i��@3K��;���w����',�
�uL���}
����c�cDr�~���vDrg�79���
�	X�
��X������>�������.�e��K�F}(z��� �0�nYo�
Q��a����,���~���B�����5
=X��A�-�l6C��(=t���l��8�C�F1��a�� �|Vh��*������`�T
	���w���r+4��&S���EP`�+y�
��<F-����^L
��6���>���&����$���=��l�Da�t*h������U*����_aw��R�$�5/�Z��IU�Eh���&^��v����`M�����(l�,��b2ay6�Ad`��r}������L� ��,��������Gh�e���I��13E^�B��Ft,^�&�S���l����8(�3���Q��&���E��=�ulk�����s������nwtx�6}��m�i�mdt�z���C.�N`�����h�sY�[�,U��\M,�YZ#5�����L��~�6�P�J���m��y�V�O���f��hxh�8c�Y�o�r����'��l�kW+!��
z:������J�$���F{�'�M�h����SI�1�P32�S�*x�����a�B��L-"��[��)�6B�d��U,�5��5��0�!(p�*br�4����0A�b�KZ6�(���8+��d���A�m������Q��`�� �|��n���K�w]"�O������U���`;V9��P (���S|I7�<�f��Y��3DF���$����8��=���h5�5M�<M��/�Q���v*�m7 Q����C�BhH��F����,�k��-sC��8�v6�r��3V�4s�POW��i �FQ��3jo�Q4s�O,P���KSE�����0���*��/5U}i����ko���[c��UA6�f���e/7�m�^nW�*��dt�
�[��P���a2ey�T_RU���a� ��\9�
���RLj-�����[���2uo�b;��J��z�l����������(�&����Fd��Glu-�;�����Sy��7����������N�;1g��DS���#3K[hZ=�Q���r8�����

��s�8W�?�]����z'\���yj�hOm����x��M���
j6��T���RZ{��;�P���[S�f��Y����[�<�4�P��s�� �����y�#AP��y]�q��zf�-$!8� `��E�5���������u=b(�\����D��18��^9]n�\7�.�a����/��e������Q�!���8-����� 44,��}��0���I4��W|�E,��#2vuz>��Vy���@�MD��Hlq��o"�[i�-yW�1��*��QE��%����AG���/B~��EH��2*�.��A�
?r)V��k�x�!�!��o-����
P&��6G��1M3�a�����&�&��/�&gy�t��=3��7��,)]���"������?-<���0A/��GOy31b�BP$��I��	C�B�HL8�j�8J.`�����F
��vx�Wy&A�B�m�1c�Y���;������z�@!.H�j����!4\��1��b��N�pih���K��'���G,B������^�J��;i��#7��
�����~�VaT�K#�nb%���{d���k��LmwN;vvW�7�ix�Gk��������T��l����-�A��;����m��������t/<U.�'N���X�<�_/�:�����-�^� rA�W�L���B$Ft����,W
�n#(0�@V1�F#�	��_�]�T��/����?����!(0j_�J�0j����-�N]����7y,��g��f�v����:�y����L�Pa�6�o�E��L8������)T�4�X���3���2S���
�PMikl��)��+9�����,���%/.�	�w);p/���3F�\���N��E�R����{�����U��>�����`i��b�'�s4W�����R��er._��
`�(I�8;U���1���?Af?U���^�kZa�s�.�t��U'K
\�I�$�GS�$�.�B������=����G��,7�}P�Z^m���A���WuWl6:�k�R2K�9�Q�t����bZ
?�z�g�O+������k��W���Y��O;��
�u���Sv~��R~���9�>.Qa�y��k���)5]Y�8q�s�b��]�Nz��J�cjg��*��-W=B��b(��z�0B=S����z"�[PM#��T3�s���yq&��L�L��Lp�L��� ���:W��k�E�W�@EO=8�A@��kkaZ��������Q+xQs]���^�8�ndl����"8�DN�����V�/r��]�k�(k��(�<�it�?�E�"19��AP�s�(O<�����!�fW�+M�(B�
�e�C
lF�x�����W�[�Uy	�6"���{v���j�m���O{�����.?��4e����q�Chq����`�	\d\~c�9~���"�A����#M-z�)��n��*���T��GPh���SFz`(� �+����"=fw�����1�������+�e��L��.����j�_��>���#�P3���5���[�'�-�f�wO�/yo�F�%��b�(���D-�7���A��,)n����./��7����%3@�4:���-(�8�����/�GM�m$�����$�IaA*����Un1[\7���N,�y��Lr8�3���axV�O-6W_�[)��z��o�W{h�os�
#�E�%�D����������@n�m�����b������_������n~�� �n�mH�Un�1��-=��0���.la����
Dn( 2C9?�P@pCA��
�1��E�f�_^|�K�3:
�K�49�q�t>�IzZR��
��"`�.K5��5�	-�����>��0A�����G�������{x����<H�+�T�H�d�!a���]�I
6��@y��.�b;O��)A��!`=S��q�H.��E�K9r���C�4�C&��D�������2���)��IBR�1��TYI����-��V�Q@D&H�2������A�s7��s�_�:�PL�u����{SK��H��&���xh��Q����\��nn36����
BO������b�]c?�'Aa����~�`�k��I�d�?����k��|�~���A��5iM�AP�0��v�������q�T�(���X��W�`�tF�jGg��J� �dl���g����������0�#���4t��#�����4�V)LC ��l��u�5,%�M����#1D���a�I�r�/�j�4��q��mf����'/��?�]wr����b��WI����U���W�H����9-a���<���A��T�d��m���	��w��(��y�7|l��$���7C=�P�!rzy��e~q��^{�Z/��{�8_bQR���=�X:�G�Dbm���_��.��Vg�jgY��l.���l6+��+iS_�|��QP�/�s��H��Y}������9�%-)A$hQ'HQ'(Q�`���V�j�g�q+�
w���_i��C��Q�Y�?��+�����"v]��[3
��[���sR���DM�]u����<��V>�d;�m�?�����%Dr��,3�w,doj����`g5���5Rz��t9;����oV`9:�1a��}���
h��L���L���K�k�B�;����tx���:��|��0�{)^�,gU�	���Y�NS/4�S\������@X��q66��]���w�c�1]'�G*����#71����!�e��\C�bW�T��k���1�0��P mR�����L�a-��
��%;�/#X:����3S�����
�xk�Gs��W�Kc���H�����\�M8����Bg��B~��p�K4�v�����qwI�{��JFJ�a�uM�zc�V%�v&y>����]"P1|*Ex|��N��*�M����.P��>V�9���U����L����`���1���b8j���B�k}����U�!�GP 3�	��wp��B���<���]W�J�9�����,�!b�����c.����!,M�����&H%]."+��$�g����\x-���K�k����Q<��/�r�_D���W�P:�O��S�!�"C���j'Zq8����
2�U1"JbD����%Fd1j��w9������Z{z7����2j@���L���s��b"���#"�S3�"�:0Jb�� ��y�;6lJ|nZ�F����%��^3x���*_`��2~k����n�E/�����\�E��pN����S�����b���v��P�;�l��m��wBI�n�Zx�<�;���wl�$4�j��_" �,N��mpU���+}��Q*��e��2;^��%k��ykhg�{��B��l�O��6�������l�yT}���Q�O���}������4��wv ��������d�i�q�*n����+���,l}�k���(�[e5�n������B*-�J=rA�R%��:v���-����.�,���������G@^_�X�Pp��0�-F����Q�{��-J/�=�7\���>�+y~��b��p�g�
.�Ax�U�K��TM�f��ET�����.Gd������!G��>�CNlJw�yD�^�rG�#aAo���q������#������'�U�I�d�����G~������������b��'��Da,o���<�Ug�H,57��%b�<��{O�^�1p�e�#y�W�������`#eAw��O��H�����
�"��g9;/��r?�\O �h���g������"i�H�$RX�d��,6����-B����s&�
����������4��xk�\�t9�V#H��
�Qq�(�Wy
~k��6�[+���;���6R�Xn���N�z����Dk��o��������'e#��j�{��_�S���>O��b�;��
�V������w-=N�@������@T��7�x@��"��z�@��N���c;q�������j�Y�3��������,�����L%E�6B p�[�Aei3^_�Q!��F��b�[��
mG���
�������
���
����H��^�QH�m
��uV	�Lz%���>B^s�����������Jr����D�\'i}J-�d��������XE�n
��!�QOo�E������������>�A�q�*o�7�9��i!�Tz�{���T'S����R�k!IZ�Q�I�z(��������_���*�SZ.
���/L��x9�����������x����&c{����k���e��Bo~p��W���ps<�g��
FM�(����m4�{���
��OzW9-RY��/�����L��_#�N�Rh�|L;s��;'�0��4��yl���U�]���^�~%�n�{V6�������x}�Bct���eg�4Q�����u�?wT�@��a��WZb�<r���>���������(nj�+�7�?�k����u����y�='1]��������?h������CbS����C����H�����7�[�����)�/��J���2���$��bt�gt0��������?�=U��^J]�0���k������s���L+�B�����!��D��Q1�������O�<��,��U$."9�]���qst�H��3����D�z�~'���(/<�V���	���I��pFt�����w�#���������?�b|O�F�)t����G��}M�D���:qxb�Y|9����z�j|RL0�Y3w~�;�e����\�ey���x�z.�$��z��d��G���l�E�$�('{���Lc�`�����W���u)N.|.��E�������q������tBB��	�� Y�%�ju�O�r�t���������9sc�������������/�O�#X���/����9;'��F�ac�Z�az��`��|^2$t�:x6z�3����������o�I���z�>2���4#?��������Bg�)_�.y�5�����<��\Z�	�NO�W�A����1�A���@{���f_���
%cY�s��}��)x9Q4�dE����-N����u�
�	�5C�FY��^�!�V���(�b|�^\����L��:��suEz�,Y�b�H�nQ�g~,�^��Z����u)��q��|�������`U�	`��`�U�cB��� ����q��/�mjC�����Z�o�eJk���h��������������Tizp�������6?GK�`���T��}G�����_��L��b���������3���1���"g��[�>��-
�S��������8��WVC��+
���N3����:KN�U���4e�����0���
���m�]��Xw��I@tjp����>�* W3���]hg�����D�9��2��9������T�������,
>�P�^~2�l��)C���IG��4Z�/$������L ��:n��PX0( ���l��H��Xi�%��` !�9��0J�H������0��	��4{��fe�N�H�q�@�����
�����[S�q*��0�[IQv<�WUHg�W4XZ���b'"&d\���;�.�bXw��l`��
��Bs���	
��|>�����x��b�i��x�p{�����m/��r/�O4�����m��>
sz��:����}4Ym%����{@����6H���K)��_�%,�����|D.���c�W������8|������!K8M���Y*����r��d�mY�}����}�����?��+!��x��uX)l�	�uZ��y�\_�'�<�#����#Oj
����$T�����7o?}�J
��;�Q�L��1��}����h��]����^��k�E��^��XO�<H{���1�-,
'~<m��z�*K�R9� @�jM��	G��~��PII�������(]A������7��V��\�-����&��[�n8���(UF�Y��"����7��y��]IYz��\�->���0FC�y��|��>���x8�+m�����YIaV�m�����]Q}��NI��`,N���[6�.O �-a���U�i4�z#����V�o"��E�:�c���,�Y��T��Ze4(�L5���Y�.�P��yJ����&���.�����s�o-.�Y�����6]��RQ
�0�M�C�L�t�����j�T��a�*��'j�y
������T~���)�����/V�1�|�oiJ���:�Q�4%!�B��j}�t(���@�����L���K�_��p��	\=���*�Xb?�,��������Pn��Y������]���BI��zg����T!�_K���61��[�0�F����/�}$f�*J���qP�R�OoG��,J��>�x���(�3�-�R]�����r9K�!Mt�Qz�>�Z�k�������1�����x!_�E����$l
~AKR8��9$IW,���sS���9h�*�����l� ���w�<���;M��s-�%(���,��y�����W����I{�������g�Ej���"�2l�o�WP����P s��8�q��u"�0�i�
��{�f����!��[�r��H�(M��vP���*���_���g\EW)g�tb�ZvVl_�Y�q!��etS��R6e�����'�E���	i���XNjd|��t*KB%��H88������a�v�I�}�G�tMa����2AH�h.��c�%�����O�tk�Tc}�/`���x p��T�����hUJY	F��d[�~���l�)�����z[�UT����q�������S�������)C�*j��g���w�@��d��dT����2���K���Nk���D
�6�]LD�B�k�)��a�]����`�}9#�R;x��_�I���m>���(��c�L:�m!v�;��:7x����JShX.�KTm��.���=D�3s�P�t�b��I��q�D~�����g�(M\�"V����CjS����e�h�G�z-'<�kF����Kb����9��nFL��������%�X'�����,~����������YxC~n���~����h�a<�JhH��������	�tz#(:����w��DzG��PK��6G��M
`��_perf_reports/12GB_preload/ps_2_workers_12GB_preload_5.4_selectivity_1_task_queue_multiplier.txt�k��D�;�b%TR�����C�(O!d9��b.�S?�w<�;��'���]����$����kgfw�e���wY�o�"�!K�G�����|�t�i�#�x��LW5�-�,��5g���_t�1�X�W�����b+�~eX3E7]�+[����mB�
O���Y��g�*!����O�%��A��`Sd��*�Ew�������j���,��_�9_�������������XW�m�Le����V���|����-�M��O�$�Wl��Iz/��MS3���n>m�a���E�~8����My��!��fW��3�4T��<��r9���>X��!��7����I�������7����������������{_�\�}�.x�]{�6�`�q/�T������4�|���'�{0d���{�����(^$�`���
����`xXe�������o�6D	_���
�#6S�f������z*<)����o6ir�|��Y��j��[����hn���1�i�^�1|:.���`�g�?x�3ly~��'+�B�;��.�������P���60��)��>u����Y�Jr/��$]{y#z�m����������6�w�8���~��W���Ze����}H�����6Q����f��N
��QS���ztu5sGr?���} ��������b������U\C`P�����9r]����M�f��W5�5����!�	�A��:������N�b#��&Z��M�d������s���������N����������@ �N�w�M�t��J4)��Kt�Z�������#��	��,1��,��$����y87k�Q��bb�^�@UE����bn��5E�w�����{�uW�<��MW�����)
6������5��b��%b��L��?�(�x���������Y��f\�!��Q���n�U�m���
�^�o�NMX���1��A{T�ji�NP�i{�H���C{g����������'���l����2��2��2��2�����e��e�[U�������A0Z�A��"2�b1���9�-�-*"8[���������<�R��
�N8;�T�\����Vq�s
^;85.�b�T}����R�`��	!-��$RQ#V�T��{�����!���C
����u*c�x����Q��AP�;����av=����,-��R���u]�<K����4��S������F����<��^��r�ts�3��u�����R���N�����OKu��|�E�\���*Ub[������E�]������t0�1O��7��%[�rVZ�J�]�&Kc�`F���(��M��K/��B!Z#����~^�.��#�\�W������O���n�Gt��!��A���dk����(����.��M�;�%��h�\�QC���U�����#E�;U1�Nq���r�B��9���8G���,�`t�%!�d���P��DR�GB���Ii�"!Gh��,1�b�<�k��9��%��qX^�����{\�=*o�q�;�E�/N ���6��h��w�����wQ��?Da"���	����~�l���
x��
-B�y�:�Q�����<S5����,������^@I�����`pR��m5���r?��c� �,�!�[q >����Sgt�q$l�&�M��A$~	��'1Xt��	���>���V���2�=��F`�R�} `$7�z��!��K2\�!���2$����m�[�W
���9�CF�Q{X!�;�YV�	H7����T2O���e*�z�J������W�����KBT��QK�~�CDl@���a��Q{���
_?3��3����������}����g��\��Z9���� ��:�	�N��P��M
$��( ��a@��fP�7���
�&�������^�8�U�|������/\����������Ip5�s5��j&��������L���-%r.i���L���L����%�+�s?��E��o~���V���n8[�$R��������pTf�$�-��a�C����;��{wj��$A9�4"V
"tN�!����f;!�r���l*�������|D���~1v��q"�T5bR�p~��}�{��
bx�����]���M�8�&����v
<���i�1��<���wi� �yt��4�������g)l��0��As(gF��b�jkRMe��h�V�j�
�r���XW+�[���b��b�s�PsU�Sf�I��+�]!$:!���=���_�"�XX��_�uRMR��YR��(���`&h/	�52IU��R���x,N��}��TW�z�@�"%g�p�Q%3�s�i������,O�`�eY���� �S�^��W������C<�M�|������C
!�����1.��>�M��2ZLJ�I�6����x��OB�T"Y;�e�5�����.��H[U&%���Z(A��)~z�<c������a���0un0���[^�o�S0����UE���+ ���=��qf�SpZ3���2/�S6i���z~��s��t~�oY������Y�lQ�
W6n�lT�[�!b��g��pf�}��rxZK(|���98����������������1������R���O��'*b����I��KoJ}]P��u���������4�J��Z0��b5\
�]kXG'�#3
���������*cb]���+�2Xo�r��K\��^Sd6��ez`0q�l���-����IYZ�a��m�,D�|�a+A���~���p"��o"�}��C�y?�|?��~d��Z/�=�I(������k�i�2���c�m��3������ba�\����T��=0]$R-�,B������zzc�zk&��69��������}��M@����rN���R�����G�c�85*��C��J���������l��Y.��C��w:2���N���'�-�6�:s�������a7�����O"��~?W��M>��L��3l�S�����t��y~F�W/��DZ]!$V�A��]��	 R�W�p������8A���h�����v����R���a(�p����kM>|CN����j������Q�ONm8;��}>����^}oa���B�
��s!w'�[��J�d$��p�$���YrD���$�a��� �D�w��D�����pY����-8W�T`�^�����q���5�Y��e����r�.��`�����/��QC�����������l���Z�(��#�>��9w���f��z!L���RpvkL��2��6�+���[��������p.3����dd���,������I�;vV�6Y��>���������X�z�U��{����&�8��&����9Ro[O��}�n�=F�C����X���y��"�&Nk��#k
�����W��������9������YWA��w�w�}+'�t��+���~
s
�?������G��.�s`�Y�dO\D�"���:q���w9�+�]��p�$��Gw�2&��1bP���p����-���ue��6O�����$�[�6]o��@&��L&�J"�+��
�N�f3�	{�]�Ab��zKRr��=�e
�f����t������uh��@�I04P<�����1X��!U�	iY3D��@d���3r�h{����%��#Y�o��B��	�5���Z���O������0f���$u���E\�����#�W%���E�&�{�a�z/ �|��@��^08����v����??����E��kI��B}����q?
�w���u������W�p��
�� ��>]7<I��$���&����	e �y�.�[a�Z$��������s�|���Xe�����a��;&�c�w�95�8�B&x��~Fu60)�5F��d$�QMIF5eH@w��cO�t�,rK<6:�$��L��81�[������w$����'5!7��cD�`b��*��/���~�PzLN�1��ct��$�����Ip5��jF�jF�j&��L������q���r>s�|-����a��DQ��'a���K�/��~qJ���5H�OY:�P�D��� ��fh�~X�����Pp�^�=4���$���V'��6�~\����J_�Do�p{���n��L$��V�[�H�-�z�~�*�i��
d�.�V�j�a�f��,���E�����"���-�\�e�
�c��5���xp�1�#������l���8$�]}!�]}�������f�������S�8K�!�%�>�h>���j3���j(�.�k�~��"��v������v��d�����d�97�
��q�$�&��f1�DD@60�[Ix�A�q<��R��Yr#�Y;u�6j��A�|��|,�t���S���iP!��X�#W���u�ep��*�ph��u�(]����)���%�G�qQ����mQ�V��������������H�����bBH���.QpH�������(��`lT� B��EY��I��g�X��6���&a��uR���o6{���#������Y�K��.	������Fo+�lR�H����'���zD�.�E�s�Sp,u:x�,��$1�C�� �`pb��-�)���/#�.r�E������~^1%�]�)A�BI	jjJ�~K4����4;	����T @��Z.-��fQ��5��K�?^f 	:��u�`Q>��@�7=�
h@�H<��1T�%��j�����O�W���]}�,��0�r�$d�f!1�B��L���a���'\��5i��� -�@]����s�#����6���)�Wx����
7�0'��%����9�����@�xE������,�)���y2v�. ��D��HW�HPG@X$�%6BI�J�[\8�M��Y�1*�Bj&��'���8zbx�����zb�C�M�A ����)�89�O�}Y<��6���g����Xg&�$��|%)��Q����S����!��v6L���*�fi������C�k������k���J�=������f�n�e����$��"�)7�'����'\�a4�B�B����X��G��?)���x��_��n�8���+"��U��l��E��^��Z�����%L�4�=Vpw>;�[c?0���>%��M7��uln����`
�"��h\��)A����CY=�`�hyl�n/x�����4��	���.^�2���X��>g�KuL�g(��xw�SQ/�WY]��C'th����$��Si�-����wz���Q�{�-���k�"4~�2Hm��|yL��L�;HOv�i-p~I:�
I�����H���KEW��NHi��kR�K���4�]J�1��yI����%�������p^�7�K�N�y���8/�{�%��t^b�R�%�ju^��Z���d���d]H�p��5u^r���K4�K4�K4�@����n���yOB�/��/������=�!���wp�E;E�b�O ����Dv���T����B���$A�ZrpR�������{eP���
��{%���W��{�(6[�����+�����W��J�%d��%��%
4�5	��$�^�@wM���Ip�5������k�;Hu�����0h�A-(�Aa	���0���)��OAq
��hE�`����:&��xG3g�.FP(�jP�\����\�nb:�!��A�ej.C�szH������B��A��p�2%(<j@Z@j@z@Z@:@�@�@Z@
@:@��yiW,L��0:�����]����L�&��YVwd6UX�Hf	
!��#4y�v����X4b3F�v�,l+Eb�H��Z���)s��b��'��H�����/�'�s��'������z�w�X�Kl������h&3������
�����<�|���1��#�������{f{9<Q��"i�@��|.(4�d�a@���|��WG�����gY��f��\z���px��Z����w���w��v�O��Q�Df)z�t�R��^���v�H=j��`j�L��D�0���#�G0�`N[�n�>�" Sl��0���a�O�)�K(\���K,��2��9�j��*y���j����i�K�{~�_sN�6��\���2sX�*��V�	#���R��H�2t��pd-��V�x7��R[�!6*c4~<���z:�CWV�a�x-��Z�J��|�]�	6V�"��[����	a�Q�u���Oda���)Y
�x�/���������<{_��4������	�N��dv���%�����$�~JHr���E��8�,�IN4R)�8<j�;Lbq�q<�cipHFFHrD Z3�f�������rC��2��8a�)��e�x���>
�K����^=�z�n��6R�D��O~��j��ZQ��I4[�7�]���&��\���n�2;�����u"���s��?���.c+���~���'�<]������<�}�NR��5
������7�[��TT��*^��mV7���'�b�_�S��a�)�~i����}��7{���S���].���U%�Q}Q��XV����?��9���q���X��Tt�����\����n�w�M�\���r����[���&K3�������]��v
p�������v�Y��aK����`e������ ^�(��}4��gU�(��y��#�Z6�
�����SQ	J���:V�~z���1�x��E�%��n�\Z�	�)��u�}\�U]f�Y�@"�8G�}���l��m��
	��H�@,A�Olk����k�;�=`!D�`�g�����i��
�����5��!62ApS�NM�,�	tH&PM�� d8�>���@m�������=
[Binary attachment data omitted: compressed zip entry perf_reports/12GB_preload/ps_2_workers_12GB_preload_5.4_selectivity_8_task_queue_multiplier.txt]
[Binary attachment data omitted: compressed zip entry perf_reports/12GB_preload/ps_4_workers_12GB_preload_0.1_selectivity_1_task_queue_multiplier.txt]
�
n�}����d1�)m��
�K�y����7�bdG�.y�:��R�k�h��	��!�E��A���w;�,�!��,��������Q��b�){��?��� 'n���"e/u�8�^&�Z���Y�/ys�������v���h��u�8B�3�n���*?4�)
qhcQ�@�����9������\u{��c��s
���pR�K)z��4��s�CUaP+}���$��8������daB���)
E�=?�(A������n��G�m��8}3(?N�"�����;�n~�g������ p�,S���#��w�C��yz|r�Tn_(� ���Y���2%>@z|;�<d����-��p#�!�Au=�i����%I���5l:���\���5�\������k���/)��l���44��P�lS��0�(�^�����E�%����IJ+�k�l�z�'�������Z���(98Yc�Q����%����v�<�l�(3�(�i�<AfN���9����)�:�A���z�
��8J�������(�3C
����t��f��w�q4,YG�7r�c�:�_�kb_
����m	N�,�r�FgB�u����mS1a��)�%��u�R%��\�k
)Zy��ej�<cU>��f�:���]��#�'��g�Mn
������Le���N���M����l����&��.�%yl�Bu�����)��S�w�,4GOY>�S��D1N�2�y�����%g�E�K���HN��'�'����>4�a��\G8Jg����^��ax'��
p(1J��}^�����N�rO����Gj�H���P�'h2���E�pI���
�-!�_������-�
LB-N�n�$��;�e���k�j����-�"8fEu 5P����	j7Q�Fv���d|�Cj�W'X� ����<[d�,�XL����s�L>uw�I�t$l������.���'�F}��:q�?~�3��nr������)����q�_�
z���#�����Ej����yP�/���VB2^G�"X�,'wl�PT�h�"�2.���&�L�F�.NC�V��m�9C��8�IH���	y���"�|�S+gXz(_h�&�>
8kTc
#�\��c��Bv��PJ��^(J��a<����nf�Z$����`a�V�u+��B�L��|OX7���,�����������@g�D'Y�p�������3:y%%~E�l7��%M�w�,I5f|&H��������!N[[x�����YcBr��.���]����a��m��s)����[�4'T�u^w2�<�n����U+������]����B]G�����>�Df���)f^��-�S�9����*�F��I��0k�2��HI�������Y�����+<XJKG*w�8�`����;������� �2q�*��K��<!$2����<r��X�
�������|����8
v�hH-vx[8��Z���C��@7+/�$OD���R��D�A�)��*����BJG�w8x��������4���J�J�R2�����������u�����_l�n��S�8����aq1|N!��i��SgQ��D�%����I���L<c8D����E	=t������T��pNO����]�DM��[JU��%�+��<����kq���e��������?_�	�:�� o--v�O�P�c�P���j��Dy`��y�����7�F$z!���X�T	�$�p3w�.��PJ��BG� �:	�h�����8$�.-'��Eg��B��UOBp1��h�W�^�P���X^�`��������B�N]�"!�r��4D�@�����qg��0OL 2x�9O���Qh�B�<.5x���_D�M�k�y���AP�����J��?8�h=����9�nYX���DS��_G��(����Z��%��=c�F?�H'��21E9=[����Sr�OJgN�l>�G��.@6���p]�i�����L��l����l�q������aCdP��V��V��V��V�G�p�jU;��LeQh�PO��/�V(\G�Q�pY�*qYp�qYHjrY�
sS�
(B�i�\��t��Z�.���O8t�
%\U
~Eh[����
QA;���Hi))	(��&�I%��#�<��T�������m4���H����1�U;��fA@��Ps}�C� �����c�2�bQ;�GnJ����H.5������ �~w�i��1
f�:q~��,�b!lq���Qm��������N
�w?E@]W<�<�@K��������7���&i��du��v��d2�d����wL��n`�W����TlnH�>F1~���x��K����"X��q�F;L����i��!�	x{���h����=j"T�+��m����C�u!���%:�������X!���b'x���Ir<@��1�-������Dwn����;W[��!���������N�Q����%���3j�8��z�U:&~'+r���q	���X��t��CU���E������	�g� ���!�����;���:�����g3J�_AU�%%�UL���q�����91��F�n�+�0'�����CJ�r}��z����m\
��nV��7�ndF�jp���������b���kM�!�{:��nZ�8*<\;����F"�O0��9S��CYI8T=+��aQ��lea����x1��6C�e!�����X��Id
����+��d���F��?�^G�r�di���.����R��S4/��$mL�>UCwc(��C	!�+�V������_9� ��}�D����2�E�����QYz�r\�
��RF���5�yL�-�P}�0���g"�=l?H�J���0S�#��8l�o�v��C3����x6��KUW��*my�9��2L���#�>��hr�{_	b����Y���pe�v-�W���M;�G1T���d�71'��yR���	��G��N�"��Y���=����Z�1����Z�L89��prP2�� e����	'�_��w�s�m��j}��l�����.64p�R��iG]3`��U����r������h����T��68�k�L#K@R��{F���YD�E4�{�[�"�����O >c�~����i����M�����r����y��+�^E,66P���x�x���L�q�|��o'bt��5���3��B�>�����7�+_%1G��0�a]��7|����f�x����5������(kn5!::����/�Z��G^5���i��Q�VI�������8��������?�
	�����s�e���P�0���eMg'q���|�p�b��w���ld`�~�C�Y�����#q�g����(��a!�a�
���:�	'+q�[��(�u�j�9k���q!�[AO�}X;;�
�[^>��8���]0Y���di8>�Q�Y��cH2qq����I,y���(��z��B��n�e
�4�z���%�[7���_�E�+
Q8�����3���.��q{�����Yqd<�����.*g��9}�n?�� {��0?�bA����y��[��cP}9�7W<�������I�R�NR��qz$Qu�=�G�X�>(
����=��z����@��U8��:�I�!��?�e�_��*��N���o�1/����]7.�y�p��D����lx����Z��V�6$O�0� ��WN����C�������$'����_��qR�����~%�����g��)�p�"�;�3��YH�K�l�y.Z�`��j�CK?zXn��N8�h?��=O'���M$��^*�"�=h\|���d�D��9��n[�JS�>n�����Q
�I�����>!���"N[���_���=�����iB�sP�5���6�]U����?F���ox�1W��|[�9�en������x��0�S�������w�+~�C~�d+L��m�|���N�|���X�Cjd/�t�GN��������p&r��k[(g�pn��kj�F����
��g���o��3YU�F�2%��������6���h�p��68�A�����y��O���T���O<�h�g��N�ex�U�G�I��(y���j����/��d!�!x��2\C��w�F��!-0�0
p�)�����o�9��v�\�+<,�T�p|��j%������:���5<���4�x����kZ�5�^�,
�A{��/�h*����=X�y&G�eJ�6�e
�Yq������v�� ���D���'qW�9�C3�gT�k
�?k�^��Tby��Q�l�oD� ���B��j)��������9-�@8t�U��4W��q"�pt���/�]C��J
;����Y���\|��g�(�-�����Z_4��'�B3�����]�6�H���v%�����gM���c"3	,�Q� ���2����	�G���{tu4!�{������t ��Z
�����������1M�_h���#�4�qZ��(
��J�$���8��=���eN�r����L3����(,���8�j[<N���q���`�*����u�.��]0M��(���j�?#O���<�� =d���(K~\��lo�
�z��~��k�����R���%��311�a-3�������0�����e\@J���z,�����7b���3oD+���p�E���&��Y�]Ww1|�7��H�Sp|�:���l[<<&�A��p���U�6��e���u�r�o�n����~sq�H5)�M��D$,EQ�[��D��p<��P
�5(��|�IyN�H��Xy����S����g�u�F����^��?^�������I�j?�$�����'�Y=����a�}���n����,<��I.�����V�q*�}�&���[fA-�����s��C�P���%SU4:��%���s �?m��������4�����$���=��(��C�t�l4D{)���8����H�� PK ��8gUj�3(L����$o��)�Vq)��q1�8>�"����Z���;p<�^�@�t5Z�S��h��
[Binary attachment data omitted. Archive member: perf_reports/12GB_preload/ps_4_workers_12GB_preload_2.7_selectivity_1_task_queue_multiplier.txt]
i��6�2Z���k��q&f$��?�OR������L�Y���Q�U��IU8a�usr�y� 3���3���)5�_���%0�g����Z.������5-���+?Y���WFm�l����j�zt5�/��������?2(��.]�t�]}�K&�"�/@��!��Y#�N��u{�|�����=�eU�/�j{=V�5��PK��6G0ty1�K_perf_reports/12GB_preload/ps_4_workers_12GB_preload_2.7_selectivity_8_task_queue_multiplier.txt�]�n$5��S��"@"C�=���"�]6�%�,O�g��/��&����p_3�jg7�f~�f���?_���\~��,����a�0I#f����*e�����?/?�,��9�C�� N|�1�S����~b#b.�1�y2Z�����|5�LG�s����,f�3d���dQ����#�m�<A#�r�~d�i4]���/����:n��b�����~��_~d������g���x����i|4��7���=����7�O_<�������%�&AB]�c^]s���l<��f����Y��9}���'��D7bV����8������dg��O ������rY���\�����a�������o���I��]Y.���������W�V���������:1#N��;6���1���o�}�������z��������p�m�U6���c#��?>��1l']z-������=(;%|9�^�M����y�}�U������
R��C�Qp5��Of�x��,�3����9��a'��Q�j~��)�71+1�?g��&p|���_��b��w���h�>2��p��0\gcOF��(:��1��wS+�>X���_nj����s��g�O��J:>^OG��<���u]�8~��2��_��)���������bV-�������� !6��G�zK��(g!g3�WX6�4>'1�R��v�s"~��n��@�5t��`�jl�@�����1���j�l���\c��	->��n���F������n���*JaB�k)(B�b���xv��Q��&���t��Am��M3�G��s��Vh�2��C��&N�uX��/''�%�������&�o��u�b��C��d4�b}���u����aTC�I��c%����/EEY�z|��c	M����(na�?����u�����]��������\�v#I��|�[���z����v�����M�g�Aj7��Fx�$����M1s8��
s�z�f�{[�c�d.�:�w�-�M�#$xQ�$�����"�"��R/K��h" ��w��h�����=9��G����#�je���8����V�%��$"�����K"�4VhgID����%_��-�d
�FA�Awd�^0YRw��_A(y,Vu�vi���md��w!�T���L�<��\��l���VH	t�%0`�-������P��b4�7�Lw�*������;��7��m�c���1�w�3�f����/L���`2)gO�����N��=~��X�M��g�V������L�;j����jw"�;��,����^��w,������l���.��G�I������6����F9&��������M�h1Y�[�5�#���/'wSZs��`�������3[�r}O���;������~�U'����]�X����Y�!�1#�B�{I���/�o1Bw�qG�2���I��3G�)�;�` g�%���<;�N�Jh:�~`�����(�6�-��"�%����z����`��	X�#�%����
=���W�>�����q�����jD���q��:�����wh�k��N����	�����y��Zi�|sJ�&{���f��O\�
������PR��7�������G���JA2b;!R>�Jd
�QA1�D?Y�HR�`{R0D:
>�1k�E3�'C��U������:5/������(�(B��������n�LB�M�PrX��
����� �*BAd��"�v1�	�zj�� WG�T��x���ib��S0T
I=t���{x_X�N$�C��P2Y�;i��(R0LZ�
z�	i���MH�>`��9I���t��((�+B/w
K4��uIe���@��9:��2.0T�z\�����6
<>K���F���&k��^=]��� ���#a���a���G�����_S!��)$	�����;<0$J�����-x�������a�F@�8.������]N�=@W3X��K���|i����SK���Ne]0Q�=�����>�'����g��)�$������,	� I��C���R�ru�~K6�.�����|����k�h��I��B{48U�&�
�SA�A���F�H��h1l
Ib�4Ay��@ |��g-U�4h�;Z�2ARAo���M���P����n� �U:�#C�����7��e}D~0L*�� ������=p�_A�%����<���}
��%�B��������^��!�l�"�� ����Ap������O"pE�0�@�x_�������v&K8����h��o��ZR�P�d��;8`X��*����{W%�������(_�[��$D���Z�*F�I����s
�d2$�	��$C�������LiC��h��	��X������9A�^���]M��)�4������|+�o@���N���1oZ5�����!�l�5�/K|����0�%m�����h�G�a�h����J3�Lv��`t���������
Z��A���+�+c��$���7��C�@� Q.�~q�$�2,W$#�l����#�
|62!��A�O}?������@(���'i3�Kd���!H&hm��&�����I��2�?�r�����"���s���)� z���7�}cYh�'��pN�p�K�p�>������sf]�m&ec)7SA�A�c/af��h���;ua��$�$>�%�:��XC����
�i��c%��D�M�������|��A>�5�D�a��?���
�A��:�����2�0H�4����&���CV�M7Z�����x��n��i��Kj���L�c�
2��-k����6�0fe���v��-�K����w<�.��T��4������4x�Hs�of����;�T[����^
K
�Q8��Dh��\3�f���u�I�}D��O/fe��kl�4�>�����&Z3o�S���5�v�$%��z�
S���1�
�{���5SyS�#�f�v�����h��.]%�!MW������������5�^n@h��s	b�%�������I�f� T8a[;zM.A�l�	��)���w&���Qw4�����E-����M=+�9�p{|��U�.���j�#	���6����1����/��v��21�����''��
�S�g��$��t��+��
iW��������Q��O���*%%�1��k��R6�j5a�����T�	%�0���X����\��t��|R�}��`?�ZsJDo���<�����V�0�~�(�����v�g��oW��������v���q���4=��,��3>���.�FK��y��/�AXFQ&���	
A��?��~�~���Q#h�6p�;)���p�U�rB���2������!k ���A%S~/��TI5	�"��*l�|Vr����a��1��q-]Q>�$�O�$�`h�P�������IcT
�=
�K���l��IIu��7db��A[�������+*T�-ES,�EPh�wS�Tx����{�H�=1pP������]"9���x!v��0�+�����|W�I%B�r�WT��I��8��1�dX�-�UtY�~��������z�W�@�/��r�<\�������v2��D��x��v�[Af��:�����hu�8����
m<��r1c_���g�����I���-���&��ZV_��p�<��O�@��m��@6��M	R	�r��?��x��F�[
!�'UMV��% ����x���x������`;����b�@����G7�B��/x0��G�jB����K�j����k��R��f�tKLt�U�G=U����S(7!�����<*���UXE��H��O�����<3M����T���B�n�����2���=�� k�sw�u'�5�A�U%J����=��!��j#�Z-�d���(+�"�N�zG%�
��;��QO{4[���@����>|p�B{����uK����m���-zW��kO��S]/V����4�������0�D�E@�����0�������LS�n5�i���2�!T:Vp����s
k����o���O�*7�"/{\�����u�������2���l���7Cl���
`8��}Z����c�h���j=�k�	�lV���q�����G2����iy�/7��^q��h���.V������py|��E<@����9��6�>}`�)W�����$����@/�-���N�w��/�+���@ukI&��� ����TX�(N������Z-m��]2
��K����_EA6��jV<������J�^���6{$��3�����.�[�Na|��U���{��_Q���������~�8�!�HW��a������[*!v��X��.�0�!|�>�����FE���mD�J�f/y#����,B�����,Y.
�V>�G�>���'������H�T��
�nf����tx�
OEd(��4w@�Y���T~���m	���\���E���^�"f���6G/�N��yR��F�]ko����Z~���d���Y��Ii�}�`�AL~
I��LH�AIp��*y �$�5}'k��V�K�t@�>>��
���u*���!m����x��������q3�R��#hPy�"��i���N��������4�G4�'�h�$kG.D�$(�}�.���I��m�B��^�!rL7�p��:��n��k!��Qs}V��E3��i#�l`c�}j����;)M������PH�8���/�%�<���	�����*R��h���EN/:����e���A�k&�?+?#�E�0�R&G�N�7v��V����}���;h�1���������MRE���.��K���*~"m#<MD�3E[����.C�������;|H\"����Z�r������C�����,�B~�\�t��_�����sWz����P{���`��U��[(ROV��U�����5�
��q�|�dH�a�i�L-Q�f�zbv���u�y�r"?�q�f�����7����PA|��
C�I��������a��������T�l���6������':J���{������[���u� �������[����t5Q�����29����{9	��.�_>����H/���0��PLrb.
+����6����#�;��Q�h��y��ne��L��]f���aJ����_-�Q�7�������8D�j�)2�6P�{ [����=	���K9�9r}Ha�$h�{1�7wG����a��lO�!��] h�=�D�
�����Ak��-�&f�Ga�h�hq\��}�_�����W������P�7��h�1G���:Zo����%�����*���
n(�T�x�2*��'�/?�=�P�Q�R�2��W��#���mk#T��"]��z�1Q`�X���C�'���va�jbr�:Vz��/�2�� �&��p��Y>�][������F)t
�[��6�o����-�Ve��7K��kB}�S6�h{��S�vaS-a � ����'����lO����I�;��Pa�J�S�����
#N"@���m2�w�����%�-��;��O����=&<~x�"�s���+U�A�&y���j���f���Qg�x��W�Ha���Gf)�q�@�
�I����~?�D��h���]��������f�����#4�=�BiWU�v��t�Dc��k|�|x��+M)���_��AS�cdi�pk���n��q������4j,�b������@W���:��f�p���
���o�<LwP�["���P�������}�����
�Vv������zaq9�]�EG�7�������.�n�)�Ji!�n��a��KG"�CK�����u�H�#���\�����I���K
��g��fO���t�u�J�!T�yr�T�y���Q����^r��xO����zWw�x�N�$K���'u4�\
)�NY�M��q8�HlF���m~$o����5��_U\�$�R�t3��:B{���mo:��Jc^N�7���1��&���Y����NF��$<u�W�����0V����
[x�'�����|�����.M�	!�<����{1��@��Y�uI=\~��os�'�^�E�i��^����2������.�s�KO���d�*Y��j�����u�Z�����o��^�!'_��)��i hpy����E�������`��t�*���N!~��� <��
MY��l��u-����G�1N�"��Eu�'~�O�	�8��g|��[S�m����T��8B�<+h��]���$���U�wBE�^�t�3��P���]����������,�i�p����+��A
�]Z�����	���&��&�p�lri������TTZx����A�u7�^�5��Y'03�8�7y��2�5��)p#}�a���G�3�fI`i�����]]
!	���p�V�
���#
��xA>������`�����?���\�7���5����]�2@r�V��$3x/�1O/-�6+�.j�nLd�>��t	�@,w��e��k0��r�E��
B\�1���T�&��X�!i9r.]pp�i�1�=v�#���VQ�4*%o������Hw���!e�d<�1��>�)��`5����kj��X�/��z�pC��G'
����i�[!�!�HY��h�z#V?�����X@�D���9�A��\.�������NKI��X"W���m�M�W,:|��	t��
��r�i��t�)�A��'����
C�.����0o�1�D?kH�yj
���z�a�:;�Dg�i���@q8Le�����rO��5�:�J? gB���#�m�pWlqn��NDE���������W����,�so��F��r���m��H+
I��8���<���CH��^^��t��gqI��XTU=������O��`�w_y�g�W@��uO�w*b�mh��A�+������%=���
Attachment: perf_reports/12GB_preload/ps_4_workers_12GB_preload_5.4_selectivity_1_task_queue_multiplier.txt
�����[h��j��[���=aI��9�Fu]�c�0�W7�N)^�2����o�o����\'�24�1t�����(��8���������:�zzi8���4]c�Y�v.���$S��&���z|�q�w'B�qR|����R�>��2��"D%+����!,F,O����w��;��*����F���qL+����_q����&_{4�`��*�����L!���X���fZ�g�m�'������u��������.^���fP�F�-+ >T�E����?�r�er��ft������PK��6G�.O)CN_perf_reports/12GB_preload/ps_4_workers_12GB_preload_5.4_selectivity_8_task_queue_multiplier.txt�k��4�;�b$��6���$$��S�!�����
�&�<�vy�w�����&3v���b�n��O�yy<���;�Bn�*g1d�Sx^1x��`�`8OM��a���w�en<�p+>�������YX��M��,��r�ii����v�p��5�,_���$K�j1[V��E�G7��B�G���h��Ttg����]����obVD��7i����!���T����_�K��vt�����_}���Q�&_������O�'������Y�a�6Y� �wm����
�����x��GU��L�����Y��1\]��}a������%�\���=DkV<��a+>�O ��Ur-���������#�����>|�ey��(�����b�g�	l9�I��d��=�������_���_~���?|���?���w?~�$�*�
v�\�'P������I�]�M�~��SZ��(��g�f���)���������n�����������A�������S��1\��-���#����y���):/�nB�?.�dQ	���a����m_���[�?��u�����l�|^��X'����S+���$m`�����������=���M�������ok��o���e����%i����
|G�mQ�w�����`k��~���[o{7A���^&�MPq��X����
����b���\���3���f�(��Q(�Y�l��������w'�B�'���gAZmX�DAE���;�s8���9�����K���	�4���l�A�,`����Ss�L~�����p�.6�����?���nX�aq��Z
�6��;5U���m�U�l� ��|P�?e�oYPm*�kE�C�Y$�i�F�2�wJ�Z���H��z�Zqve9P�:��-��m�^�$�W<��;\�d� 
�����7AP�"�J�~��#�����z�9���|�O>�&���������?�-��
��6&�*�u���bPQ)n6���(X>��raL�
�^8G�L�����E?����:�W�������&i���F�Has�C�2|��ZF(� Q��B�u{���k�c��H2�IJOI�,���Sd�:MB��4Y���'w]3���I�w���3�+%	�sfB�	�B��#I�W6�� ������E\3�������hZ�����U��(��"P�d�t� E��	����7�C�g>������n�q�H��|(��=��]
C�m��t*��?���?�	|�"��R/�g����{\�]*\�T*d�NBr�zR6�b2���:�}�r�k�-�U���N�T�2���zh.�9%x�8�U�����u�!i	��HZ,-��IK��%���h	����(Z�^Zj��A���@j?7�T=�!��N��v�eK���[�,�e53ZOy��<s���;[�
V�n�p'\�AY-0�v�C\6,���6��j���1����T\/I�YY&����bP���YbS}�7x����8���z�<L��0���������W��jn��r|�p�?+��}���&\?o��S���s��ZPUU����;����`	�z����/�/j�$���j�����<h���j#�+G���S�|��yq��._��&����Y����d\U��������5cH,z��<�i�����Mf'���Y�I-�$�2��&g�I�m��#�ud��"lh��k�����������=f��4{������'�k��Mk3	�y$�7���i�����/G�DPm���-9�12�=�`�1��������G��lU&��G���� 0_����]�o�q$���n%�>�MN^�u5����U�YO������J~f&bEQT�W&��Np��S��More@{��J����TVl�A���&�95<%J;�$�5DW;���n���r��y�� �\�(4r[��w�U3eT���Z�.����G01_��|	S�%L���
�&�����t*�5��hsv05�O��p2�qF�{�"B��B��o�FJ����H��2�������X���fw��5>Z6�h�Y��g������P@a�U��e|F@_]uu	R�K�]]}u	��%PW�@]]�����{�J^uP��������p����ef���=<��6�I�q�dF�������l�7����t��>��������p����B[5�����!uQH[	�-���|���e���"gY	��0�3��3-����
VU��v��8�8
v������W��ve�8-���o��?�����jF
�.z��C;����C������{s�Z1�G��X���8a���bw��95h�����~��f��8p\k�a����ab@
@���b_��7�����E�G�i�D*.^�u=<�uS�X��|�<�=n���!�}x�awB�W�Z�xeL\ma�cb�����^�2�X����K`biNgL���N&��\aq]v��_��c����(����y�%��g���1��abv�]�}x�]�m���/���`��}�]��x�}x��2��}[��&u��W���uhT��6����E�W&^l��>������8�>VS��-:~=lh���5�2��nGG�����;c2����q���}t�@�����r���Y����.�#��E�-=�,|��L��e��^������]��4��u����_��]���+�37�f������0�{+�b���{�����o	��gkO�m8Gttr�[����n�~��s��GE�|rr�e�<���K��;�#�r�����W~R��~����	g����j�\"����������s���_7��;&�w�c��N�Y��4*�^_�hA�c�pL�������d��b��i!Dg��B��6gw�=��|��j�g:go@'�**���S_��LB���Y����=l��S	4%1���hfxj����ex��m!s�f����t~O*1�X�x'��e+�u�?J�u�}$��$�r����i&��0IB����*gtZB6�3�����V��V���>��}D��~H�p)�
��O(�o�S!P!�'�%��x`
�i��y4�W��$oX�������Zl�Z��f�a�0�����pFV��W�;�Ze��^*S���E1v%�^�b����.~u;��=9���W��T�
-4a�'-�����WL4c	Ltct�P����r`a|`5��;�
�v~�b_On#\����m����z�2���f�;�i������a��!��J� 8��@
�s8�{B,��/q��[u�����Q\�I�H�����2TB��$+Y���-��!���	��.������f�[��i\-������8`$}1t����ff���_#�'W.{����&����
�q6'H%A�&3���$Q*��EM��He4�,RK�JEX��-�\�V�@��<�����R�j�r����(���������������qWo�8�|t�es��	����z<�B<�t<�d<�D�0�����F7a����k .����?��7@�j@�j@�j@�j�L!���xMJ��.E7����	�/��uE9,"\Y�)7H���<��9��6�m���|@�d���IM�
b�.���y
$���4P�g��/�Rp���c�}?3<����2<���o:	�@�@��N@"��/C����Q�@"��G �h�Nb�M6`B
�X��caEc��!�m���0�����o��������G���N���E���m�_v����z�9���|�O>�&������eHb�/��p��+�0�3��.����C�]�%]o���'��;�Jk���
���H
O3��3$E���[�&�1^�Kx	YO�d
@Vh���41|:$��C�f��6�~Z�7��w2>��^s�_%y���y�]E���a�is��`����7�i��lWY�����
���:i&��	���L��
ULA��G1���������d�]:��;��_��U���I�SG�?}/=p��9=[�����R�������+h��\k!c��v%��yg��P���X*��cv��P�n�8G�\r
V��=�!<k�e�����4M���@�_���2�1M�:OB��hB)�(���RNkJ+P	]JS�d
�S�2oz���rm&�7�\��o�%�����9�K'����_�t��7���W�L��z��S&��,o��)���b�>N
O������{���b���MP�����3=~��b�[Gi!:��Z��p�Q@�GD��(��#.LhL�����h�-A+q;(����L,>aB�	S�O�R|�+�0��Dz���!:������	��3�"�C���4M��t�w�aB���:�aO��'V����)�D���*XkuPA}{������p�{ ��5�Ml��U�O�Wj�J�IT5!�~���P���-u��l}��H]�-R��O�&Wd'J��v����%���iHM�
b����������cA�a!�!�TB�I�1�M���`f�	^,�5����n����3�[d
8�[m@2��j���Q�mQC��[-Q���vp<\*I��y�Y����py�~����C�B�s����������b��U�~���fe�mXX�y��
��0{kM"��&L��o�Q��!������0�W��M��#f���=!1J�C����l��D�B��<�YY���U��"��!�a,B��D)N~:���h�}��>p��o�i�@�d�'Z>~G�<������tkN?���AW_��{f����|��V�'����B���/���+�HC�<{��G��8DRH����=�Sx��e���=��G�*TY�)r@_��R"�R%.��U@���dYS�z
���2��!i",d��^�*��
�)����@�}��}��}��f �t�J�]�?�$�m���������H�����7����U�4�G�r8��f�03�y���1i�I����J�"��63�6	�v ���H�F}��,����S��g@���X���\��7��n����X��]�(�U���u���������0����=�	���Ca
�;='s����N���N���F`�����@gx`0<pk,�X��\c��"��E���4�t���������#����w�wM�!��]�	�+�V*tbg���9Lv���L,O����of��qJ��j�}U��~�d�W���v_,����Xo����f��������������s ��r���b��R��Lv+��W��%V�AU���E[c�^��yA��
#�J3�,�^n��z%dl��$x�]/�%��HF?z��*��*��G�����+5�q>���8������sT^EYi����������$1�@��������T���2j���Z����0:�0B3���lf�9{��ERdq��c�����s]VCG�fkN�ZR��.�J]�����I���K|\^���^=�I���~�|�����������=O�yz��{����$t��Hkk�`K�*F�3��)��LA=g�L�;�6=;S��H*%�KI*��-=jz��h���#��G@W���=:zt�����u�Q����Jcp�jRi2�~b���L��a��AM$��H5�z"�D2��dP��+�AK$��H��|�lXjrk��+(���=��R��~p�^���OU�3�hVj�(
!4ev��8�Z�23Pb��
>
+Of���9�'��$l	
[�$�!<�o�mY��b�O���x2z��1�:��V��y(_��5T�L���DS�@=O|�Ll��keHb�$�����,�&��)���|�������(��8��%2)h��29�e��!H���A�V9������
��.�t��A��<�M�hIMz�p�~�S��G3��������L�%Ge*�"L)���`l�D�+��6H�AH�L'��[v���
����oS38���,�e0*�;y,��C��8��f8��*i��4Co��z����/������/;���7��aN3���7k�Y�sj��D��Zsi]��6�7�7�������6�8��$�����:wXm��v��f
@��I<��/2������7�cpl��;����.��
��
���T'��|�^��Z��������������aF�;�V�:/�_v'�(3:�3L�|���nz`MW=�C�),������W�F
�2��#hox`�
�[����_�,�����������r���Y[z�Wg��3�����0)����Pi�(	���W�>��!��YW���gB��7/�s���g��uhQ�^�+�}hVeQ>�RW��=�`?�T�����V���7'���1�a:!u:��vw�{�l��{�O��1�Z���?v,�p6�M0x��/<v�B�W����j�m���@�������s��G���9�&i - ��K��h���F�h��+^�$����s��\���i_�2��x�����o��ckW�������\C��L���R���2�;p��c�)�S����9c�yv,mc��s�x��,n�"�7�3L'�.�{�Z��u�_m?4����"O��V������K���"�{�w?(L����a+u���<�:��{�t�K���.����;4��#�� ��p!�`C
*m�����>�#W�M@����J�8� ~0��	�)*�6��yZ�_v�z�E�����P������xf�7+5�'�84q�^��T~�N�0�6�P:#��	�3�5�d�C�f$+I��d�N��)T�n%�Y4������WP8������������2�"x�n�������v����U��n����_�0 �b���#1���Q���T:W��&�=���k�����g
Lf������c����={�����X�#F �S�����&F��"���Q){��)"�Lb�u�h���n����F`�����cx��I�� ��$2��(8@w����tx�`;'�;��jl�]�*?��We������J����Kp
�Y�S>���C�W�u48�|�,93i�S���h9���{���jW�Go���h>�M�6�5@V��������z*&7E�kG��4<twf8K+����S@�)���{
�=���{O��L=�����~`��V�w��@cV]&�f����s������Ru,.���0�Lp�}��'��BgJ��@@��O�z��3>,L��r$��5�!��=��M��1��g%���������9p�H�����om}1�x�����i��� ��LM��t�u0t�u0tpt�uPu�tpu�u�tPuPu�=���&{�WH �LrT�$})�������1�fD?h~@{�
������]7��/�g���`�
l|,�.[�-��@e��D
�&���_�r�qdS<�d��0��~\o�	�'����������^gr;��d�Q�����'���B����D�_y2{E�'6]��,���L�(�}�X�O���/u��������cS$_����BajY4�$�}��4�<���"{����3V�����45����n�s�f9=�Z���i����N�&<=�Z���AA�WA����`03�Z�p}[c[���������s��������������A�-9r`]`��@�d&Vr�%^9����H��i��G���6��!� ��E�,4�����8Y?���d��/�u`<������(����I[����4�p���(�8L�v����q��bj���@�����>#N�a��#�gRm&
���>�d��p#V�+�I��V��.��|���k�d��Pq������Tw��D�D�_�%�7W�jQ�����-&|�	<����D+��%.��uG&��#t����@6�������}L3~@����������oY���-�M�!,{�'"�~5U���w8��!�!�o���O����[S��g4�$�����l(��~g#��s�����3�W�cj]��
��'[v��gj/�]�����o��n���tB�
M�_�����wu����w����M�����sjfoW�6���`jf?��7��:�C<���j�aH���xD���Z���:-�������	M������@I��%P@(`	�
H��@�D��$Pp(���s5$w��y=`
�k�g�i��H�/
������a�W|��k��gN	74��|��������4iG[V��I�o�e���q��l�)a�1r;������tC��v��hH5(5�)o��+��<�mF�}��z�����C@u7S�o��FH��i���%�G�����{73�k��s"y���XR�0�������'�[�s�����P�W��D����$�F���k�`��>n��<m�^���z)�������_��>	4(������'�d��I����X/�u�\�"F��b410��M<��� i�c��yU��7�W��e�y�Aa�3>���ej��\k!�j�x�m��8�i�C����n��s2�]}4�-X+X��)+��r����y��J�Z�[�$���<�����8�#I��|�-iMHN�O�	���@!!��$$��@����t�l�r[>\5�!���|�Q�/������8����5�U�Le�M�,��`�|\,���`�a��#-�"��;�z��V{����o�G��fo�:�U31��Kmv�`������~����s���tD��AA8��y4�I���3���������lp����V^��#�vX?b~��#��X?��m�;V�k4��Za�p��!W�0`�BOv4���L7�(��}�t4������0�r^�b��
�XyB2��d
��)$SH���La!�BA2�d
������n�\�7����'��'��xR��'�x�h���>���	!��O��U�G�}���Ou8��CY���+`�����
��Upp
\����wPT�v����:(O��j����\�T�7p��{����Z�#�i��"��-�&�e;��������3��b�gt
�B������X����'�ci-��y�Zv���E=D=��zh7|����HP����[4�A��r��,ps����m�w�7L�.	vL�b��eT���D����$veM��.�s��:�3vu�Wow3��#w��C�u�N�����O
��NF�#��BP�U��u4
�T�����]S���(�V���5��!���Z�����c���6�0�(Z��� 9��C�&�����������,j�8��BE�{RnA��_��B���(B��m�3$#kr�5��
9�&������du�[l"\��%x
�/������`vxd'�����3��z{����<��������]
��X�r����V�j��?��/��Y��PP���M�.`����z�i?�����jyt���a
0q��&CN�A�`
��<����"q8�����,3m[�gl��=M��/���q�k�&����a��,�*��X���:UI!�y�6�5f��u��A�8�&�������B�~�lA��H�����W�BR�d5��������<4g;�a���U�9��+t�r2�xH�?/t8:���!��s�f�T"�o?����;?��g�����e��0o9���lH�l+�oy��c#��<�#t��4�#�V
^
���C�"4Hf��V6X�����t�_��)y0e"������D%�E�(���b����y@1 ���/{|����SI
��@Xy��?�U��+7��u���D
E(z�G7�a�0*��<��nggM��7;1e���S��!���?������j^���fG�]�#����#��R��
P
��
�;?����7��~�������d���S��J�������O��������3���q�4f��>~��_���k���/��F��H�����=@�Q�P
�2�,{��*"{���O�����>A�O��k�	\}��>A�Ot�rn:��r���W����������v������y�i@?�����U�K^���V���QGX@�e@��J�g��G)v�#L�<B�@�#T�<� �#\��%x
�� �#�����n��������4�F���\%�=�J�a��j�x�w${��
l�����s2���m�l���t��<#���l�
+|��t���'D@��"�^G^h�|,�v��|xK<gN
Bj�E��L;?�[�{]����g���m�=������KGa���6�z�z��z�pa�2��{q����P�<��z(����c*��Sy�	)*��8b2�-x2��'v���;��'�;W�L�������5	�&d�$P�;��j��a�$s�����H��8s�e3�CY���~��h����n��H�$���P�<i�&�P�d���U�zH���Y�o��]�w_k�,�f?�<W�I?����}���^��v��mw���R/��=����C]=�WU�h����c>�qm�r�}{���M��i@?�0/��c����7�w�v����z�?���YvP8��h��fDz��G��Af����^�M�F���j4�M�F��j4AM��K@	P�(A���O�r�����}���~W�_�'�����_���`P\��*�V�I6H���Z>���E9o9�@�*�+
�u#�*��+���V�q�������$����:{f��!�qY��Emg*�{�}�!A*/���B'_�;5����]���o�����o�����CdM���JC�%��q��N����t����'�5�����M>
!��
;�2��MU��gq�����:�\E�Ym�X�)��d��Z_y�B90�{>���H�`(`M��X����=ArG����n�HY�Z�a��Y����MHW���������e������5��i8.�B�Kh�z�����������m3��XPt!.�aZ�Y��L�\��=�s�Wp(�\����E�����6���b�HU�����f6��vY�,y�Z,���
[attachment data: _perf_reports/12GB_preload/ps_8_workers_12GB_preload_0.1_selectivity_1_task_queue_multiplier.txt (binary archive entry)]
[attachment data: _perf_reports/12GB_preload/ps_8_workers_12GB_preload_2.7_selectivity_1_task_queue_multiplier.txt (binary archive entry)]
[attachment data: _perf_reports/12GB_preload/ps_8_workers_12GB_preload_2.7_selectivity_8_task_queue_multiplier.txt (binary archive entry)]
J>�7���(����m�����Mk�%��u�?�ac�	��f�^��e^����0-��W��!K�^������,��+8_p��� �w��>H�����6���O,,[�~���;�
�)p��_��/�G�BO3�/�|�e,������������(H6<<�ty�����o777
����^������s���%�8a�/iv�S'9��j��-��*��{��uK��no�O�r�m�K?b�4��E���[����N�m�o��	i��[���!�qQK�imZlM��`������~,��4��������B���v�v��>��b�??���K��� �F��C�4�+k]t��{?�y
D�=�E���3�<�Tlj�yc��wI�CS!�
��-Fj���h��V����BZ�T������}�v�!������/��W)������|���W�hDbY�Z��2��s��14�'�Lm�R�z�����y�B���`��	�X��lN�Y��z��E�s�xu�%��jz��y����6&�
?,�l���7���`FS�V��������r�5���| M> '�'P���[+�*N�x$��eq�aX�+"��%=
�ei�W��C��������<��2�������=�!��kt�"�:��Y��k�04�2P�I�%3��^~�������k��.���;�k�zb�3�X"��v������,����K�.�����!�����S�9�9�b�&�"?��a���r��`bh�����;��������L�������q���k��O6a�k��:c��������3�����b��0����C_��\���<f�xq|��P�!^�i�T�&�G���#�����b���Y����+z�g�%@�`i	��XZ4-��EK��% � i	�i	P����T]���_����Y�N���A�B����8��%����+��o���A�����p:�!���f���l6+��y"�}�Y��p���XQ�{�n���{����X�L�
���4!���z����d�	�V���vB�p��/���7/J�|�JpeO�a;���S@�)���s
�9�����jN�rMY�%Tv���,{`~�~-�:c�� �� ��1�u��8��o+�Ve�!�Z�A����PG��v�'g���_�R!��uY�K=6>'�*Z\���'�'��w^�B�w�p`����#iF_�I,�d�i�JFqI�0	uF�lrJNZ�I�>�$+D�nD3����(�W�S�ayK�{,�����j�=�
����ehZ[H�?F����* �Mg?_@0'������5}P<_��_�	�;DQ*���	��
���m�?����	OG�	�`YV��f��Y�P'[��<o�r�a��THTH�IdI�"��t:�S��An	UN	�����E'����e���W��`���I��$<p4g�/Gv������*�P!�����@ks�;���Z`y�����\�a.�;_�Y0��U�<�N���-�,_���@���C&� ���H��rq
���D��� �4h�
D����$�{�1��9�r��^�%�R���(;pN���
�A�$���;^�.(��)�����2.�x<Dyz��@J� ��
��N��U�����������lN��I���T<��e8D���p4��|�<EoW���0q�{ �����-�]_������U�a��#���w����X�z����>e�������q������1�72��w���)�op�6s�ce�f������������!L��5,����'�'�k���.9����d��d��I��k�K0q����Po����}��M�4#j:=y9������'�t%�������;t�iz�X����;�����V�9��7��b�R�#���o� '�����<��������C<T�sCS���
�m���-.Y������,CQ�W%�����i�!�s?z�q�������g�s���O2���;�����}>��Z���w@�B�0Hd+��2QN�K��zp���-D�Y�:�����}�b�*u�R@6��^�z�8�����
����N���(��N�1tHg#
���9����B����N'	�#�A�yD �P��+!D�Cx�S?	�?l(�����'�"
[Binary attachment data omitted. Archive entry: perf_reports/12GB_preload/ps_8_workers_12GB_preload_5.4_selectivity_1_task_queue_multiplier.txt]
����AH�#&���x�7eG?��PK��6Gi&p�XEhJ_perf_reports/12GB_preload/ps_8_workers_12GB_preload_5.4_selectivity_8_task_queue_multiplier.txt�k��4�;�b$t����y�$$��SH ��l���m����.���8�s�63N����v��O<�g<�������6���Z"�<{?��K�m�������mZ.����*�^@��_l�1/�s��g����}1q���!��qx��`��(fp+�2���u�D\���ET��Xs!+���D4�i&������m��7��/�(c��eV���?�?�<��?��'���p<�>D4|��/�j�	6�\d+�_�|�=w��>>��*��B,��^b�9��8�7���d�����x�2���_�q^$pqK���q&�����*�\����sQ>��~)���!��Yz%?�m�Y�_��Wq�C��o���"��*i���j%���sX"�i)�t�<O��7"��h~��G�~�C�C�����}����@���f����x+$���8<���r���}��7������K���_A>[�;�`��>����}7Z.���x���X��M\��������$_,�?=,/�#)�]�.�
�����2�c�����v!>~X�=�=���,��W�(��e|a�m�y���I�����_����������5����z���GS��:��q&�?����<�J1����w���������c��o���k17n����_��������\�Y8����G�������;5��E���	�?�@T�}X�����W��5���*q�^�^e��0��xU"lZ�������z�ZI�J�p��U�����7�|.���^y_����*�X�q4?=.��5,�L�I\dM�7���1�$��(�!v��a!f���,$F������J�HUIT��z#h��~�����;Wa=�g����!s��������a��U�}|�5(�*�WE���7���ju0����?e��@%>��D���@'>��D�C?��D|�&~SmbXWmI5S�8Qjk���RgM
g*9Q+�����20
W�0Z�:�1A3'����.��{9��0��X��
y��E�Zpv8��_d�e����j�
J8��z���>�Rk�Y���$uOT��*����z�z�j��m��uoS�+)��}��2�uK�Dz�4��~L\�7�z���
�s�������z.�@�����eX����B��p�x�>��������������n�9�9x����pT[8�5��wV�`����i���po������l�E�Q�����1bb��������gk�JTe�QUeg�������r��}���;f�9�s�k���?���bZRaS�Ka��aD4\G�j#[>bS�
���P��;�������i��B��k���@�)i
$�����)i
�4M���[=�g�(I�����rZ�l�l1�m���E������n(a���4�pb3k�'A��S����hUq|�j������hf�i{�w�������V�����z�>��}'	���[	+�����l&�Z�6�O:��X(���D��������1��i>k�+�?��	���n=]!#"K�~����Cp�h �&��u�bh)�\
��S����FA���[����x8�I��Ym��!�s!�\H�V�>���8�B�B��G$���p�N�_����DA�5����EtLK/�����J�v1a.����Z�H�'�U�>�uC���s�-nW�����Sm��1Jre�%)���~���b7,�0Zb��J1���Z��X-��F���CS|bLjS��;���B|�f�"������
o\l&�nc��1�5&L��`#��}p���> U�c�b�3@�����h�>��zPY1h5���F2XL���a�,�CD���P��������J�������UX���#w�7%�j��U�#��,k.�����Y�����Y����Pk}](	��
`UCU�����YyB2���:
e��(#�#�*�������"V�h`8���
�Kt�.�����qP��(/:���c�*,����������������::�iBN1���@�'*>9-�OE�m�����D^����-g�:��#7����/
b|111�W� P]���Up�6����m��nV�\����?��C�*���C�����<�������sP9��X�����!$��c����W��[X&�/����N��>�-6�����������u�Aw~D���O�C'o��N��y�6�f���Kn�%j���b	�������1]i��76<
����������!��"LL�WO@X�,������rJ��	W�Z�"=Q���srd2���r}^K�G2J���~�����^������#��>�6%7i��E��l�z�B�f+���cG��:������4����>��D	y���q��=tw�����%���RXe5�*�����t�SAf"NJ�<���%��A��X���D������:L���l�_�h�bSjcQEn��oK;��E�Y"���YM�
�x�+�l��-���C��lO(����a���L�i�Ge�M'��)F�#i��Hn�������C�
l�Y��YoL ��~)�:�����R����j��R�KC��
Dk �PF�u����j��l���k�mE���\	�M��x��k�dKM���D1j:�#�9'L�B�Dy�/r��������O'���T�
�
T4�5p5(i6P�l��l��l��l��lJ/�Q	�jx*,`K�ai}���j�n��cD�@3om�2b����t����Xr������^f�������
�	�?(9?���57�m��Cc�u�q��P)��	6=����Ez+��n�;���E�G�������3��Y�������������:�,p�-��#M����{+�}aU�����S���:o���g��;�'���@*^�Hy
�T��,�8w�J�3������Xo�(\�\�ZeG/7�dg���E%Jn�,F�w���j�2�6�5R������&UV�U�g�l���� �g�{��Q�;���g"�����{U<�����S����7����AB���
,[�gv�-0�1Fx<�]o������S<|���49����T	���	f��c%
1����"Z>Ts���������:�*��Z�IU��E*�$u^��X�����41�l���<U�N+�'�x�\j��t�|a������
����|�s�t�a�x
������f[��*�]/�W�����ft
;/X/��"Mx���&��:@A�4[I����xL��O��$��
��x-��H�mS���O�0�d��O�&|�b0��+�
����@�HeZ-�y���@U)i��"��+_�M��uYcQ�Y7��f�1����8�^�#%k�3�z�>@[�q�T��xU<�,@ ���X�D��~��x@�xl]��>��I��u���H��q2Zh�.�����t��*����0��go��\eNW��fz���6�+�
V�@`�
l��,�L���7�l�F[��k��E���Z<^�x�)�S�����8�(m�'vq �^��{�$t���B����PrQ(x+��5��;C���sr��l���x��m$�H�p��#�EK���P�dC�F�� �8�N
O)�.�8�3N�/���S���wl��GH&�7�~tu��p�/��a��IYBxY���&��E4�q��g���I�^\�gF�$��u�x}T0(���y����6�&2�(D��{�oa�^�nc�]��
5NUU��A�}8;t��E�.0����a�te;���bH_���� �+�>
�c�V+�� S��!�34f�R����i���5sSk�>�m7���G�ihl�T��Bk��|(=�*k��i��i�l���yp��5����
<D�q�]���S-�j���@m��'z����0`@0�0 
��t���z�������N�tl�C��������A��3���0g��8�J���UB��N��mk������w���[���y0�Z'&��}����}��9�v�q���{���yL�/F#s��>��;�O�5�q2nv[���5v~[n���p�V�W����E"��66��c�:8UL_y~��gD���M/_}�����I2��bf���W����$r�cD;0	&�c�����l��9��o��=���F����	Z�b�}��1
���8�-��#�^�����e��g4y�U��X0�F�Gq�o�4�Y,|�:1Q��a=��}�����i��zU�l���zYT�I��78�����S�0���P���Xn�8�V���1�91��[^o��?�bG�}N��9���~�0wUI y�{W��4���
]8�i�.G`82����B������B[����UV��y���$K�5������� %]�K�������q��n�V�cW�cV�cW�cX�cV�cTdY&d\<dVRdQhdT~S�|��v�&�)����������p>T��I�
>�U�G4�*��T2��T2��T��J��h��|L�
?&�6eo������U��t^��J
~L
�T�x@���}���<Y���,tR]���m������]N��rk��=��4_c����������AB�����A"�G+�m��q?���_�t����������������aWE~�[5�r�b[��7m�t�ue��v}yP�����������2=�����V��M�h��-���(��_z+�}�]�v�NZ�n��z{g������X(����UmI�Um������\��:�V��y�o�>Cv��z�o�i_s
7�0b��bM�h��GS��|pno�L��r� ��9$� �ab1tA1m�
�3�F�@�m��C�Fl�����'�_l�|#�����+$���Rf\����+E#f�F��X)�Q4b�h�V����E#6�F"��{�N{��(��8��y���`R������F�(��FA�T�B�i��	�Zd���!-p�h�8��,;"������N<](A�x�KTb��e�M
1���^�leK�BPAj�6��O��i�{_��hCY:8���{��#2Ah�.��Lz�����\pB+��2����+�q� EM�r��(�����K,i��y}7.5i��%$�K7�d�������;�/�D��*�&
��)B7��1�C��ZH8���C�
��
�B�h=C
�34��}�V(C��+��>���&Xv$C��G:6���3�$FzG�'���;cr_� ��A��F��G�E�(�(�Q
�,<�,,�,*DY������P�������-;��f�������w��n�������C �����@>jw6fKcVs��I��x�Ka\3�*�*g��a���8n��\Uj���qM�!&���W?��n�C���1�n`�5(�:���*�P�����Q�0
��C\h���VT_�q����3,�����SZ���K2���������������h$,��������#������V��	�
C�R��/�oR��'9�JR��!48���������

��Qi
"�rj!�q��������C��|�8�-��e�&����qE�����`Z��	�*�8��g��s�goW��H
��E�y��Shr9v�qB���������0�|>u�QB��k���$49i�$��o��	�l�1*����A�N�
�\@���t��<�������sp��t,i��Y���p�b*}����.��o��u�C����;�����vkS�e�t`���n�i����1���:�iz�����`�6��kz�)��i���:�)z����t}�����\�0�/L�k~�q��,��������h[���X�X��(cQxcQXcQT��h�E��E��Ea�E	��J�S�f��������j#T����a-[61��_o�7���k��v^��W�Z.�!Lj�)\>&���B5.�f1��)���E����T�^-����+����*�V��w���?[m�����������i��P\?��C��Z���/�����3D����=(�?4���^�cKh���������@���.F�`���`�{Pg{k��5��$f���0���\�=,�=�e���[?xvU���T�����Rm9t�}���w����({�e�w�����S���]���^D�+�����`�~@���$L��%~�UP�n
�VjNU0-��4+�\����ty��>�w_�4�Iv�S���f��� �o��I0�����;�~����Ea�y����z��Sr60��:��7qe������]5��~S�����T�o*�7��
�M%�M�����BVt�����oW���^����)�������#\zE��i-�������c4@�hU�.�wF~�}7�|��jU?�!��w�g��,�/h�������\��p+$�m������&�/�0 ~�>�+�\�Y9���1�)�{�T�����{1q��9�pbdB��	��Z�. b!T0�AePTP�AdP�Tp�AdP	3�@*�qPx
�:�}��z�w���������"�t�������_]�_�\QE%��5o�<���Jwf��/�I�j��Fu�x��yQOC	d+j�F���
�2	|�?H���e�������E��G��:i�$�xh2l�7�6�r��vx���*'��|"Y��33���,J/��V4���f?�V4���)�h�J4�M���D���E�h�7�f�T�/��;`%-�]���	���)~��*�C�0x��6��M�D0Q�N�0�.���(�,\�������x�SW�4�jb�����N#JJ?��?I�����
��(*�ra��$�I��T_]�%(�s�Z0��1�3�TMV�*�F�kC���2p��8��.E���	%��7�[�\`��
�&]���6��A{Ce�cWx��4 ����z�`6����3@I�YG0�����������u|�����Y��L�����C5�<:�
]
q���7�]9���'�?D^~y�j�ux���3:�5����������n��5��!��O������|�&kaz�y����������
��+��'Ec���g��QH8����4�)��$��QT0(%)O3	<;!	����R�^����U�gp����Y?)���������I��M�gBa�PTL(Z&�	�bBa�PX&�e�7��JY:�zBC��E����R4$(P������ |��4l���/��x���/���J�E+���P/�����G@�!�3e� 
�@tk�
�E6�����!0
zg��1m��en�w�%.e��)�	"Lr��`�Q]��G�v��>{�?�����9�3���Q����|4
|��i��Q������n�_���~i��;jmV��m���v�������"�wkh�Zm3�,�i�i���R0��otK�{y�;���X����&H'��@�V������I��	��cXE�X!�d��D���q�w:Gka��bp�\�&�>d��\U�@.
�\�FT�����#�8��8�f�a$���J��-�y��"
��D>e�PS�	I<l�c��E
�'1����5?	���/>�J9��u�c|e������z|{�R,6�M�VM�:�?,�;{��l@=���]�QFQ�����������������+������d�
x+�Y}���?C\K�$9
���W���b���m�����o�����*�3zP��kAT���Wr�f��b�0��B��J��Gb_�����D"�\���N��$��������"����}�hO�q� ]���*u�?�n�E�����/�u[�'����=�.��O�h��}X�<�q�$��/��P=c������RgX����l�3��N�X�S��.�`��l�i������������������a��k��@t�pD��P�Ko9�n�ILY���������r�_��H�
�����<���>y�|�!�s�'�h�F7��QO�'qS�=�|�bYx�f�Pk��2j��8���sN5*$��4�5�h�"\���(�u��U��6�/�dUN])a�^m���2g��<��� I�f� �1B����cgx����)����C/��A��.��g=��ZZOm�Z�� ���p��"T�H�Q �"UF�3���C�,�b�h���@����)��&�	,�S��Eh������u�+q`�3.������!+cnz�z�fE��uQV�O�*�r
gD��V��
""#�V�x����l(` $���L�0�P�xd�S�6�<2���i5��A*���+�t9��� �Z>M�2�tp��*G�AS�O��������bSi�5�/a��ut@�ju��!$�pZ-D�w�|�8��M��������'||���9_�� �?�J�O�Y��;�������p�V�t6� ���#'�������+O�r,,3��?=y�?=y�#ou�����)��)Z�C���l�G��%gz���Qr%h���!����
r%hW�	�X��A���D�����q]�q�8����$g�'9w<����^���o[g���zr�9s���;E��)r�N�+{�������$���]W��l��Z0l5�H��8��]�A�K.���F�N�]��tYG���l�D���eN�,s"W��\�,s��X���2���e.�K�\.7�99����e.������`z�uL�:sv+���G�_��ib 7L$��,0�z6�q�1�p'�������Y�=u�:����#����|LG"sV���a�%��K[��6�Cr�p�P�a�Ta���Q�E���q���84��h*��{���G���aidyx�����Z��i/���@�
��`7
��`�	��Pb�(�A�<ZF	�Z�Ty	�@�w�L����{��L�},]:_���98#Z�~�o�\�����v������U������}���U�I�� sg�iZ��~����0��<w����vSg�h��B��w�����wO[�r���U��@O��9�0����I	r�����%�K�����,��N����gpF���n��'�6�17��o+�P�1:�P��7V��X*��������T]1��m����n������X,�����n4�������h�I�/q%p�	�}�w���'h�	�}b�>��O��'h��.���o���[5_g�jU~����)����t��[F��8g��b]GK�0If��>b�q?�S*�
t&pF|�e�#"����a���[KA����*Q��<��{K����-d�����)���������_������<�n�C�E�5���.2mwBha���lk���5��
�a��dKN���#��z��'����7x�w���3|�O��B�O��c��8�+������a	��S��)i���c�=<{��G�0�4�<	��$�u�����}������x`�����[DQ������������?��r��
u��H�i��9�����dp�wZ&"0��D�vc+y��n�S���I*����f�����	=/����H�r��E�F����=�������w�r�M�tOx��<�h��D������vC<nX�4 ��j�Af$�>���@�B�����z��jE�>��;feB��0�re�����/;�RV���i�\����Z���f�x�z�*���������-pKKN�gCGE�
�)N\�h��P���7ew�N?�a��f�X�^�����ZE_�X���	�0�#��%<	�P�#��$<��U!���!uuh��T���!��s8��I�j���l
���B���58	�r��2Q��8�����>0P
&Z�Ha��1�(���@��T[���Q"��2����_M�R,}�N��70����-����)�d�'�^]�o�����0��+�I�Cg���`����E*���:"7���HE���6�4�#���I��(fe�bG���Z��-�P/����ca��y�,�W�P�������G�b=0����,����E_��1Y^x���mCv�7u�������������^T��|(5����m���v���R�,K�fu�s8����3{�����v��,,�����#�������z��U+�d*���IyL�� BEK������]�j���3"!�HHO$	����5��l��@���=����s��P�[N�A&r�u�
�
)���A���d!�q
gD'��j��PFQ�J#���3W������+��	���@j]���E�D(���f���W��Gw�M]�@F�FV����L���c����p��#��=���/�H��>����u���m��y>��zW��[���	�3����}�<k�������f8���RW�����s�����g���J��a�\�;%�O�������:��|���4�����7tb��x��5dOhD������=������P��z�"f^m��jcX��6T�Bd������v�'������[���D�%���Mw��/<J�]����^�9^,�����S"/�ls�o�uo<�� ��y9������~�$�<�)��hy��&e6Uc��a`���a��!�����)�]<d�	�I�9�pN ���	���q�b�X�L ��l	�U�%P����	XY�aT��X�)oyX��B��zg]|?F����Z"��D���<�n#�i�6��8i6�;\T�N����]���P�Ehn�M+��M���[c�����h(�����9��a�b�PD4�_C��H�#�4���_�]�}���_�IeR��U��pE�b���xW��/���\�������
��(q%
�F;��/	�y��}��l�Ut7f�X�����������3n*s�(��a�����=��OE"�Y8'xq������n��e�P�b��4��8x]���tEK��{��v)��9;��������Gw0Z�W)/�D��a���M��D��o�]�n�6��y��`;vl��(Z�@��k!(2}il��Kv��KJT$��x�R�\��vwM�e����p8|���s�l��8�y��1�f3w[�
4���Ek��h����I�d�P�������*��,���Va�#p����#>��8�G,�82� ���o�3W?�C�<~�����!R�QK3��.��/�>$Y�p�nV/U��b��b�?+��f��J�	��R4��.��(�yFT���n��7�@�o��@�
<�G(���#x��X�#xdA�SmP�6F*H6���r�Q��A��X�[����r�.g���78�F���r�������s�����
n�q����?����Q��)�[S��8��G27��r.^7�����C?���}/��.Cj���j�`F��� 4 ������g���u����#(l�(�/JI7��^���3L�B�B#���� �Q�P��/4&H��"�e^�_�,�\�A�\�J��[o�t)�HkT���[�\��-8�h��vx�b���N��hmH1mt�R]��k��6�l��6Zn�����;F�&��Q�=�D+|Q7��{�����z8�x-���t��\2�A��#X$��#<�x��m<��I���Fol��|��VW�x�����Ju0�������,�(4<b �C��2����Q�G*/�\=C���U��1�;���������r���.��f��i9����3o4���/_Z$2&���R�:d5�T_[=��H��V��C��g1WF���S7��
`�ll������VF��>�M����6���gV��mla��Q�c==}�����x7]�NF~���Ec?��~��,/�I�O|�'�'�'��W��J��V�����Rx�*<q�m�Q�q��J���G�Z�B�;
r�8�Qc�h�A�%F�v�SE���OW�Gl�B�5
Y�Q���nl�
�U�K�cM�*~�H�hC��;�n��2�]����TFv��A�����Tw���SN%�|�,B%�g��s&���:�n'����x��,zV����B�r_�JC�D�aFgg3o�*1���N���W�\��l.������l����R��G^��S����1���1��nV/R�a�KmD�x��+���������\��y�b���������6U#o1E!D�{�6��6v����_|:�gr#�l���|�����a�Js����A`"L��`"3�
��"S ��T�UK9��{7�:�1[��m����9B���9%pN	�S���XsJ���yN	�S�%�����Gm���7ZA����qLG�rl6�Af���'x�	�}�f���'��8�d�}�f_6�����_�,������@m
��s9=X�\����Y�RSXb�}�08����y�JW������*/�)q(��.Mwq��m\e�"���u:��
=b���G��#h�dc�`���T�}�^�_���A��rzh�z'G�Ae��V�T�a^S5�xK�{nO��HED_�������r8��1(�����F�9��
�����s����!T�I:	�T��w��0���8��*�=ce:�,��M���O�Y>������I�?H�}9�Dd��r����
}�~S���� ��J[������`�{���R����
��L��E��L��tg�J����g�b��]�I���\*��%9���;��
��^��LHG+�g���CJ�������n�����A�r�u|c7���0v#����n�g7��[��J��������Y��	@�����!�,��P�����1���`}��1~����N9�?x��A�;���=��Or�B��1�D�������3W��A���$�������?<�\�P��:)�g\���WY1�V�^Q����z9�p���^�V��?Fi���I�K�b����u"�����������$��{j�����o#1�NS}r�iQ[���"�B|��5�}0�Ne�kB�i���w�
�qx+V7�&	2A28��+
�u����  r�V2�B�1���5ZS�����
��}6
��"C�����.�e�;��I|���0�t�<�����k�H�������>?r�����,��YT���� �~��VZC}�
�d���K`;l��=�T�@&�l���4���K�zw��|�'��B�
X�n� �� �������=�$8	M������(P7��vF���dwyw$���r�!x�.2�,�{����c��K%�oL�
9SjW��Hvaa���H�!���`��'r8?Om0���Q���8��j�wr��{j�,n��;z�����5X�������'?��m��eU�����S��+��u�uG4���#�l���������\���)���y���xl�&>6�;����H�"O)�fm<���S*�ID�����j�nVzm��cOB6|dOMv,eAX�����a4GUR�Zi����J�4�v����g��|^d��[�����[��l���~"�<���d�(q��Im���GU�ID)�o��U�8�habvoK��3����'�t:_�M�fr�x�M��29��;�d���j����S=��NR�8����{���F|Ve���u>��j�o�3�l��~���3 n�L��XDh�,dx��@J�Y�-[���e���+���E�;�7Ip0^��������I�����������G��E��"�h������u��
O���6\��bv����<���"�{`so���k�+h�������adh��\�P���4l`�[:a+����E����c�)�OZt�%J:���B-r����	�N��m��NZ���]���,>�3�:7lV��)����~.pe|��Ip������	V���LRIV�t����&�J���l��(�R SM�>��,G3��O&,�
���%���z���aY�^;�j���f��'T��;^l�2��U�}-������1�_�
lMse��7�
�����|b����@)M�z�����u���>��W.q�~j����f^?A�'������HZ��4�.����u����g0:ee��hqS>?��B�}��`���)��p7V)������������K$��qr�O��Rq�>u]V����w��X��#GUc��J��b��"�������!��D��F{Q�
���� �M���|��B�C���n���,*1����B<e�[mG{�fr��$H;B�mj9LS��`S /�{a�sb+������fU.bq&��HR"�D�6Z��F�g��E�D�c-@��n"�w�N?�}"��bS�JW�>0OL��s�:#@f�V;�_�W�o���N�N���'U�
�.�l*�tw�����:�qw��f��r���d��c~�a�RL�Y�ZF�D��m�����~m��3%<�TD|��n3���~%�R��O���}I�7Y�A�����.BW���������-�GQ;�5��x������a�@��
rC��!9"� ��b!�@��9!G�a���#��s�m���T����\����-������N_Gx�&	O��F���������_9.;��b��3.�\MC'������w�,UG:�)������OAr^����N�G�[��+l�� n��k�Ka�\��i��K�2����|�M;���FM��(K��9.,�!�0~��u�K���-�1��#����E��[G�{g��H�so�+ �R�3��4���4�K��U����H������A�������^+�-r.��h��������eCH&�������,�/Y`UE�L�1jw�1F�bZ3$�w��\
]`�dok"��t����M���VI*C���;��R�{�����M���jo[6��o3�r�S%:_��x�NE�r���������X1(��K(���5�h���T���2`}� �b��c,�|engh��GJ����*�0|�B��������]�
|�	���������w%;O�@��SX� ���H����2���$8iY�O�����3q���@��������+�Q�9fE*f8U���������+_1�P�^_�����Q�F��^�.���h���*��8�����3�k�y��mt[��=�
`1�/z�1�"��1��cn��Pv�Y�����f�l<�F�y#���V���/}M�-M	�	����S�����(�k�(F�o4'�:�6F����`/C;P���2#7V+���
Cc�!��PX	�q�0r��'�#~�uZ�6�Y���8f�;�-{S�>�RO<'���u��:/$'c�I�B�:iLb�`�c�s�3���#��3qo��'H��F�F���/|-�h�E������R���S�n�1g�5�4A�R�1"�2)/�F���
�V��X�@[(����XZ���D���:�C
W��7�s����	t�����O�#�bf����Mx�E�L�d^�S9'�,���{���[U����5���.l4H7|F_C�(��}��O��4||�9GWW�Z����S�+Tz��Co����AR�� lF6�	�8�=���=����_���6��{15h~X������z�Hz��z��z�Pzf���$^�i��=j��:=>���wG�H�u��h�_�o������%8��h�����xk�}�~'����i��b����:x;���Y�[c�w��;J|�_:3�]=���[��C�)x��MN"��|���c\�WX�@���!��p�����Q�$�<���]��U�uS�]����n0q&�#�N�6��q<�����d\j���fH����e���mE
�2!
9��?���'h{���_+63�&����?��	^��n��_=�{��K�b:������g+���� K�I�E�y_�<������*z>��xi,�	Y8ln�����$����i�/�A��f�&|�+�
GB�A�w�����5�D�����I�r}KX�>+�7�8��nsU�Y�����ds��{���e��&I*���/}��h=�����,vMWJr�6C+N{���e-���(����;��8�C>�`���hgXIdC�)�?i�_�k�
�C1,xZ�tr	q�Dg��I�lT��Y!b�k$��	
k����'����(��
��u��%��,��@�������C���/&&Jj}������>[���o��GEsxe�1�AFP^���IQ���r���w
�7QjE�%��|���R�E����RG��I}9A��IQ:��N��A��4,��*�����,��Z��v�r�'�!&�
�	E��}�r�P��@�fA�j�N�]�>���S�9>+����ZJ,hf����IC
m���L	@���!FF#�����xm�D�?<��L���)�L���8���Xiboq�]ZsX
�H�����#��h��|�0���������u'Ba��^x�T�O1���9�b�xAW�]q�]>�[�%=>�A��\��YM1� �����gn�gN�fV~_�{;�K-a�E��������J�{�+���PuS�d0V�������9�����<4�p��x_���M�����hr���{vfi����<�A+40��f���|jO�i��;*4c�.�����:B8	%�}�~���*�W^��Sl���6��O����9�/����#���O"����'����?���T\�@�J�UZ�E��}�_<���S���X�wGghy]���7��|�I�rkq��R�$>�KBz�e:{�2�@F���ms�����9/���}�W�El�0M�6>�O����9�����d�p6�3�O�[r�Et��X�*�9s�e0>3�_���W��x��?�����i��D���[nTI&�V�����p�(#����D�B��imj��B�j%�(I�"��a�O��������`� A��"x�S
�C?�����s�o�����#���n�}wn�G��_t�{b��|k>��F�2,��
|�	V���w���	i���%b������u�/�
�,�||{�{����gF������y	}^+��ORT�R?�M\1I����?B���9-�)�C�'C���h[Z>Y������`}.�4�JF����|��t�0�Y1]A����E�N�����T�?�H��p�E-�{��`Pa}�3-VE
��6Z���V�FXi4��(g�Z��9X%�����I����I7����$0��h%�m�����5�[G@Q;��Zx��;3�:�&����cC���o������� �gM����H`��M��^�v~P��	0����R#?�< �X1����I�_�y�������]�������������L�Z�R����g
?�xcHh���=��9��j�g�nL�&)�Km+����%��P>���Ga*YC�@��bub8�sr�^��gr��Z�[d��^���9�'L��T�h�|�dF��t���TD��p������Z=43��R��p��\P*��A�&;fS/)s�j�L��{�)��Y���A��k|g���>c������<�1B{L������2}�|,�;���_�x���q��	Q���i���T��;)�W�{�(?���W�\e�^�	�M7[�X�
K��oP�T~��~O��YL��P�fZ���n�Uq]��PK��6Gperf_reports/20GB_preload/PK��6G�+����D_perf_reports/20GB_preload/ps_0_workers_20GB_preload_0.1_selectivity_1_task_queue_multiplier.txt�]{��4��Oa	��-I��)���u�/!���
�&!I�[��q�W���x�p�F���N~�����c�u�^u��:�xZl3��$~���
v-RfY���;5�5Mf���I�"���e<�_,�1�Y&"�s��5�O���t�L�����h>y����
g�"����0���;��j��+x����
B�y�6�Q�����G~����
~�E���������b��$��_>��=�]�����>�X�|������K>���+���m��^@i�<b�I�{�~n��m;6��P^��+N�l���U�S>n&�$����]�l9�������8�����w�E"����j���$������f�������pS�^�Y��*���u��w�m��0n���C���YS�'����=}�B��/�z���_}�� ��5_E�m��v����a~���o����/>�]��$|�����]6�}�����7v�������K���M��Y�r����6P��tV�|"w#���L6������K��j���
6|\�oVI���`����bq{wy's9��OK�
���;��?N~b�
���B�]�"��P�_|{y�w�k�/�}���8�����9�;XJ+�E�;��`(5c�4�,�x���m�n	�E����l>�/�4�()\_I�qY�e	�F�a3[`�������4���aq��?��h+��A��m/n?��..��a?@���\x���������d9�����S�������uw-�X��S}���"4DLt;<})�/���x{���E���/�kQ�jD�!T���Zw�����E�u�5�Eh�X6��*��P��H��#!;4	Y��?0d5BCbfaI�"���p�l7I��bHA��������W�uJ�|*���BK����h���FhI�-���}�x�x�l����=����UO���E<��gsp�0L�"4��h����L�F%B��J�I&^t{`,���n	�R����y���/���,�P�!4l�w���H��]������lN 4����|y�Fp�ZS,�t
�[�anP��0�����4���0�p��kFTM���@w�R?N�k��<
��<�� 
���m��b��6G���>/�{\!�LfX&���U,GR�1O��d'�&=�����_����iTfs
�����HW���,��`����a��o�,���a8����^.�gj�����
b��[-�(�rP1L{>5��c��!?�97�IF���C�0��X�Q|g��� fx���s��
2��t<������rM�w�e�{�,��,�,,K�4�L��7M*�i��5%�,��`�|AE��"0� ���������\���>�34=`��(`��ts9�=:
L������M����jI�p����Z���8s�&�83 W�X-�
bM}�nXV��"w�s�=��7���-�e
kA���eZd_����C�[��O���p�������YnM�^����M�l���t/���'?��4z�_Jo"�!�����d����0|2�'���+����5�E@�s���jY���'����d*�2V#�x+���-�����s�>����&;P0�Z	���?%Kga��v�3��M{�\����d�������e��E~�YWS���&�m����,��L=����"��_:�p[r7�Y��������3�2d�;����4G��|z�o����,H�:Lx2&6z~�Q<��;^��"����O��Z����	���F�B��^E������EN-��j��������%��O�&N�Y��%��P�!q��3E�e�3��0�ye������#�t4�7^d<�y9�x�?�,f���:GF���C����A7�$.K0�z5�`��d�y��_c���:dV��lN �����>���LWI%ZuT#�t�:�K���`��
	]'2�����!�"�4�uQ�60����hE�s7i�n�x\��,�0�#�t���D	���M	��B��G�t�gYr�D���p�G��kbZ�<�/=����
��<�k��X�:��ROcHQ\W�+r9�@��'n��@&��r��n�"D"�<�7�0z��H�"�h���vO�H���L������Z�yZ?��Q"���(��7�O���>�����:�������qhp���3Y���L��1�� {'�����	�E�I��{-x��9�$����SwSn�.�,k��+
���-T��E@<����jp�Cv����/+�� ����X��	�����gN�hn	^���m<H�wo����3���,����ss0�<��~	��s��_���t��`���&,�Z<9�`jT�fm����uE�Zf;���1�k5v�OE��v8�R���@.+�u��
2,�N��f�-�.����?G���(d��.�Z*�7\��<���_^%9��
���
u���Wv���/[�PV��R9�y3�m�Z�|+X�r�����J�'Af��9MB)�����x����^��N�����y�?]�����z������=��za�V>��\��� ������'��}rW8�q��X�S��d�>��q�	=�;s��pwopg�{�����ka�=�}&�?�=z�f��������W�	���"L3xU������rLyg�=�=y��g)��B	�/p����%�5���Es�r��Gh��'|��Z�Wk��>��U�U��j=��G ��r���Ir�M��}�����~_��2�ys"��>B��~�G?�wK���q!�����J�[��0.\7�Q�A�=�����c������ ���F�=�S�X7�?9�	���`*"J{D�����K��b�}|~��1����<���V�q�j����J�v�A��r8SJ���1��15�cj��Td�
���A9���[9��F�C��3w�k��
d	���x��FK��9�@��e'rA���UT�@����u�(�f��9
���5�����l7�#���b+�)*5�AB����g�~�������)�S����W��2���5m�5��tA�s(����}�6<��j�����U�Lq/;�fI����z����:l�}C�<��#����m�h�,A��C��O���}������"�|�^�`��T������5������~/�D�@�������S�K���	B��r�C����t'�_���M��9pf����c�e��f���d�����WJ�g��k��
����y�����*��������������osu��K�q�dN���dUG+��-�G�Zm1BD�"�Jg�U��m��t�������J�k���i�����4q�T����[#�S���d�b��Y��Pp�N�h,S�X��y1��1UMc8�b�:��!���U����=N����v[>~����_s���<��x�Gl���xW1�`X��hHo���8�*���h��`��)(u�0$$e���dn��������f�^Fq33%#��xea�vWe�J�Q#���32i]�m��"�U��:a�`w���ZY��W�K��R�9���w�o
�����v>�Wi�_.��v&V�3��#{z4������
���������
~D_[����6�;�Y�!=����"��3<?x�bNH�Q/S����?�oP��Z��
Q�����YF�#F�#�/GLW��Xr����� D�u�������5���'8�t�1�|*��a��P������������sh��/���6a&P���3����=��3St���3�b����+FF5)�����F�����{(
~��X[��U����8��pR��p���L\g�j�����t�<���5QKX�y�����c�O����i:x�-)x�	�mc���d�2��f?O! ��}�"N~��a,��VJ�=Zdo�A?~���a����j��/������
����v��Q���q8|e��B�v���1����UG1��r�FiN�������H{�����=U'j������Do|�1T���0�c��W����rVj�N��>�������Hp}����2m�*�]B2/w�>}z�;������e,��bE
g����I���9D������������M�%0<�J�Q"s�P�T��-_W^8�v�q��b%�\��}����4����
�%t����mm��^Q5u��n����A��rX�U��7H�L6r�O]���=���0���*�q���Wl��F��vN04�w��
��r!�.�Pm��V�����X�9�{x����}�����+U���O�q�������y���e�aMI[�BK�����3z����K�����"�{�GgX��:b��-aS�`�5�%���S���
Q����wE8yn�^�.�Y#��_����S�u-J��3$�N����"8�������(6�{��d�:��eO8�4���H����]{�YF���4w�I(�t5!E��4��/��_y����}1���*�
i�@T�����%�V�����P��
��D�u2���.A��[����� H�5�)�����'nh��3?���E����f��V|��_��l���J��-�!�EU30:�!�H�n�^��2����Lat����,*ER�S����O���~���mU!5Ku�B�^�&�v�C~5�*�h��=�i���E!�F�0Mf��aVlI{�u����%��u�DQY�a�C �)��S�	��"�>��>B0��t���\-�c[>,�s�[>��7Sw_R��|�=J@����^C�A ����i�*;���������Q�������Y�H�z��o��-9z����
�����W���g�e���w�K �j���"Go�i��W�_��"�[��&��d���L�����^S3RS3��f�M��} �(���8��*w(1��������z9�(G��E��k_T4a�T���i0��Bw��2^B>�F�����0��*��5c����^��opu+�jZ��fD���R��g��h7�h��(Z[�f�����
��-�J_�7/��\��2�	_`�K��Wi���s��w�(������3PR["Pb<`����e�k,�8@K��A�QR�����'��%������=�!��`���K>S����X��j����/�Q������Ha��
��Ue�
�p��K���ZZ�r��C��m���\)����E�P�=��s�����x�POsy��!��:����������v
���4�� l�����S�T%|�1����>��XH� W�m���.R�.LQB^N��M�S�KW�%/x������o�l5�p�@�Q���_��`�t�2#h������|���3_����t�&-Q5��������|���9�*��G�����1L�������Gv�l�����#7P��5F+��Qy����J�9�������'���T��iGX����ov�e@A�OlR�oR�kRy���+h^o�f[��y���4�q.��)��*W���N����'g^F�������LM�j��`��Ai�������OT�I��p#��.*'
z��E
W#��hY��f:���(��
_$�e�B� X��vJ������9z"`N�����wAU�as�@P�=}aL���H�f%_��d��^S�V�]�����D ._�8����a�5+��Yw��ox��B�5eP�wWp��j5A/�Y��U}�P���@���x���)R{���:	dr/����{&2op���F{40L
��B�
-���M�ZA�AIC?��a�=���T# ��#����s��M�JT>���?:T#�����	n������� _
a0����?lX�W{W��4����H\@B���;���f�5nkH�`'�>~=c�i��if��6=�T;����e{I���ov:��K��E"�&tVM&�j���$VM�:W����������&H_4!x�t��np�� ��t)d�J~ex�N��s��f%����Dr\��4H����5� 4��vM�e�\�8�}��O4�:I�E�$�FTd�����^j�~����*kEd�3�]���w��r^��,�q�m<Ov����f�������j���K�+�Q���j�]V�ycX�-�|��8Pg)�5\�9�G����.�t����:�1�%U�����fn��C��\�QVk�����n�����Z���q�F[`��L�y*�����f`��u2����(C4�����D�� ,7n�&N����� >=GhHj4����Z��M��Q<U����!�h	�9$������'B<��!pi����#�I
&�0!�@� �����o�>$�o�A��K��T�Iu�%|�cW� �iv6�S�=��1����E��*��<�t%��z���'��7O��Z6ci���t+4����Fc�\X�}>���4��Q���1V&��g���8����CM�����62�6�^u��J�5Q��g]"jz���Y(������i���i�#���7vk#Y�����W�"��)�\����!��TsvwW�JO<����I�)����'�2������}�����A�X&�L�^i��!�������U����=�jz	"(��;�m���9_CETw��F� k](��9?�
���AFi/iZ�F(d8�
Tk��~k�V�)nx�h#P�
����W�$��6eS(&�-�Z7��JY�|6OP��K�(������/�6��}����� ��������v����|~�$3�����w�b^���R�=0��Ylfu����kHG������i�o7�*=�"���""N�Iw�I���:�G�*:i@OL:�*x����o���"xh�fcyo���g���D��j�_&D?Db)4
L�T����1���$���-��u<��GL�n�*��U���>l�L��q����`xxu��O�z�>yv��T���0E]�uh������K�Z��R��0h�'8A���h����0���uu�����V�f������n�	",���M��U��V��t�!��cJ�tw
QL�-�����"KzH��c�����`�+\��FM�*x�{�t������b���J��G��.��l%�w�������P��>�x(�7u������,m�`�����j~d�-�E�W�sDvz��}m�iHor�`.m�C����s�q�v8>���jc�F��?�!&���P����lj���c�/�~$���������@��.���[�%�Y����>�E��e!�����Gj��3e{N���nl��f
)�_��a�C�����-��;���Ws�����=Q��;�z���f�BO!�u���I��������P�rFkF�a��M
\�����G�U��Wo��$�B�$�����2��=0<��;�I�����;���xU+9�E�>u\)�>W�B��l7�8SF�
`J�����*>���O�)T�&;��q���nyR|Q��Q/S���'����`����t�=�t�0~PK��6GKL��8�_perf_reports/20GB_preload/ps_0_workers_20GB_preload_2.7_selectivity_1_task_queue_multiplier.txt�]{��4��Oa	!@�B�w����x�7�SEN���k����.���8I�����eh���6����3��xb�L��>/�L��7%OI��E��pr���������l�X&u��MQ�9[q�ak������|�Y%~����'��mX_z�]�E�%����-/�E�,k�<�\��Ln�d����zS��"�����#�e���7�%�U_?�k�|��������O���<s�8�k��4|���~ooZ�p��<�~s�����W��fK������=���8��{��U�qzsS�o�����qK�eJ�\�'k�������5�l��m��'K^�A����j�A�"��������jm���dA��}��9/��W�Q����|�o�5�[T<Z�������X>�h~�����}A{E�|���~���d�gE��,^�7�z����6$]T�%���������8H$|�b�54�[�>!E�5��mc�
�5?$��}���eqg��qB�������|w�YJyZ�V,���?W7LH��/<�����~K@7���k���<�{Q�
��B�z��������O���#���l���X����>�����r/��������'@�������r��o�
�|��S���3����E�<+�UT��k/ahL ::��:X:����2JV��K:D�������|i<g�eu��~C��?�5��Q�6�F��x���������'���EfX_.��6Y�K��~(n���O�������E���/ nXuU\a�o��&j~�e�����L[��h��]?:������~D��X_x����4��gn?3��u�����>��;�|���6yR���5�@ ?/�x]I>�!���gc���D3�N�Jk��ca�����5��rt	�b�e�0�������m���gb)��d$�0�Bt,��{X�-�T<L$�(Qb1�@�*(�m���X��;R��/,jk��u�>Z���Eiq��
xpK���cMWt%Y���%[N���.������!R�
�a�F07�-8B�H_�q�"R"EdD����)"-RDR���H
��n;UP�T�.79����ZK�����`	���*a���-D�����Z_r]�u�Y�L��BLDy*�� B�Z:0_���,M[����/�������C�����Y�vE�~�a��r*��X�Y����WJ���vGu*&�G����c�G���}0���O�������0����hn���nZL��e�FN�)�b �b�K���]T�����xC��=���B`-�c�+��e�"ZL��3[L�0�����Fz��Mqc�_�b��@���)�'2�O���H����C=�#BAN����T�=*�*�b�yG.�AIzD���������~ 0Q^��;V~����%3t�Bt�L#tU)=-�1Jb��=��eY�/c�����������EY�- ����U"���c	2�[���;~�eV
��yd�`|��H��z��E��{-���Z@�Hm�����@/$4q��(�7��e���������%���� ��Y��N����
R�ab���%�-D���N�����B�����u����UB��*U���'�����;�]��4N�p
�?�GZ�B/���*�}��/��@K��qw*���@Tl�Q��2��~�I�|,1p���.��d�10lW^T��QV:�	
�ft�g9�k�,&�2����B�.��o������u}�SB�.M 4���IOJ`�t$�]���t��������
v���L����E��z �������~JW���\����?���>rn��75������%=G�E��oV�\$K��j�����C �ED3���'&�N*^�aV���\�jn�5!���Wk*��_�_?�M��o�b aa���3(��\�����[�+���7�Kw�3BK�(�T	cGC�'��������`4���2	#DZ���xI�"2�%���gpU(B��U4�e�t�`����M�wJ ���\~>&���H��a�?S/�>sTL���&�l���9��o_�$�3x=[]_����8���$x��2��yt��RA����2��%�L��m��s3�tB�%T����h���e�� ��3O$�]j�����P]J=�1uQX�C��qRN]]�,x������I��������(�a���*4tK��k�L[\z�.�e���W �g!�FI��uQb/H����^�����X�h�8a�o�(7O%���84�l]�v3�VC;H�6��\_���t]��q}��o��T(�c��%u-�k��N�hk3MO�8Q�63���=�j;	�3n��m����j9����Yf{���;>�v�y�9��������[n%���`3v_��L�$x�3�Va<8�����e���Rk�Y���B�M���N�=m���v��&��
fk�'�
��
�Ihjt��o��3m�h����l������Z�j���C��m�e�����}�IL�%���'~p�dZ�>��X�v���y�
�X?�C���b�k&��Fff���S;�\9�`�[+�:�������v���-J��Ty,��d�����\}���,?U���z(��=dQ���������+Kn�P��D[�2�I���c�������(���j.��@_=3��C|O%LbK9�6<����}���h���L���Q�������D=0�E��O��P?��(�{H��~��K�}�-�&�������zwhZ+8�:CL_V,;�X�!��*�l'����1}#�z��<�%��P���
k�'�� ��-����?$��A�#�V�9�Od>������X[�`a�\|����7��E�J�X{�,a������S8��9�����7��i]�"_�{i<�D��������[�'�(by��=�,�1���}���E���b���[}?�M���|���K>�#���)%3�M;�R2��ll��F�=�W��d����$���b�(tMS��+R*r�V��S�AS�K0�����9��{I�e������$Y[�)���$���=;u��i���do��=������Wl�1�8�v��M��`'	�O]��i�G�`"��?^u�i�=;#RDO[�5#?��H�`2y��I���#IO]�T���G��:����D`��s7O�������K���I[�X�D���g�#���E���uP��<mxb�����j�
�1��V������Etj;|Q������������=m���x����������NlY�����i!���0IE[�z�<�,6�8�����7��{3* aE��s�D)f��Z�V�
����7�h�t ���h���_����2/,�"�~��=#8H#^(�">����x�0�\f���5y����6g�&��N���T��6;n�'86�&�M�9��IT��8G������&PDcc�DO�Fo��4�olQ����]��X�kND�G����vgo+8�q4�e}��1�8����.�M�1}��6��]C�kV���9S
>?,��s�LW9[W7���0|&tv�S���Ey;����w/9�^�-I�����sb ��TW�t	��w��1pA������l�_�����w8V"'�����y0��Y��[��B��9�r�W`����Q��3���������TZ'_t��tB���������A�$,��U�E��!~Q"1@ F����o���'Ik/��b�Vt�C���_��_�n���&�^�������,2�,�s��
�RH��G������u% ���e�Y��j����*��Y\���-7��W�������.���f��yq����"y�PP
�Q�����Lj7T1�~R���k��()V�"?)�?���P���yR�e��\
�����������2ND�u?�����������:6K|0q*��#w��xo	Rq��������j���};�d�U=Tp���J7~�f	v^�I0:�k]	���;�.N9.6���;"U������s��-�Z��
�W������q9�]��)m�I���w.��q��O\W��5��x{V��T�9���u'����mp�lV� (�n$�y�����K6�h!�!�S�/�|�TEV/�_P����j�����q�M�������X"����0�g����� ��������I[*��T/'dW����H������Q���
�X�\������`t�y�4'�x5^�
��@����[��N���������
�����T��5[���/%Ai����&��c�jr`�����&�%�����3]B7���TK�'#�����@��y�7'����ld�IB���#�z��Y_�*t
bT��<W�b1��0�7��s��]��|�Xu(:.�l����������(�e���=#��D���TC�w/�D����(��0S��!��3/��)��2zNT���CZ\�����SA��e��0�� 2�]������������S�dv ��W�
�#�
rnQ�5R��^�e�g�E�rb;c[����zGX[$�i�p/q)1*U�L��I�hd�"=���ht2w�,j�V����.w��i�b��~)��|�����T��0tf B��[0�A�n�#�$�U	ChB��Z��N/a�4���!��,�0�|����&���`�1�H�����[5��!�B�(z�^a���T��:��^�p��� �,m�E�?�MWl����!31P
��&Y}�%�	��wEw_��]T||B�
�_l`����r^�;`I����� �@������D�-4�@�����E�� �I��8�q��h��.;�S����7QV+1�.�v��|�<��f��}w�D�����
�U��GK�CL�s�oDrk�Z>N�,y�SQ�dB�����@��aq�Gs	Ei��t1m��1�c!l=#�o���\���X3�"p�n`�7B�y���'s��y�iF���c�6'��bF�F���@]ZZ��M#lZ,|��h�\d��T~	����*���d���B.��� �d�l�T��N'R�N�;�Hw:��t"���h��C*���Q���5��`���-�8�����'�&�7��b`���}Z�7��9��e0,�!�37^��:�.��gn?s� ��S4|~q�}��)t)}
���@��;S�uX1� k��2*�v�)7.�-t�^xJs�*l���v�mw:�6i�9o���8j]-����s��d�?
;��C`e����t��zG� V�T��@d1a+i`�W]��iT�	e�xN�<�{���V)z9FtY�Nnf= 9c1@h���0�bxL Ft�6�I��]��G<�-Bj�����Wd'�����=�	�6 L`-�BL���'W��sx�V��0GR��ib�����\�	6�2���w�
$�&����J1p@������f��7�[O-��R+?�uYl���O ��8����m��{��cBs=~�`���)E�Sc�-�3�ij"��D���tS��&2MM�6u�����q��������1��Doq�k����d��� VU���.�Z�n��c�M4���m���gw5���-,�)�P���"�O��ND��L�y73s��Ocf���w�hiG>�.���&
^�b���!4hlw0����! 4Y�����d3#BXW�������%��@���Tm�2�4���`�#��B3�~�_��J7��������!�����O��Cs��PD�4m�� :�"}�2u��1��lH,�8^|t
� ��s���o=
��T��K���B�]r�sBC�K^/��A��.�{ Kl!am6k�vh���wC���K���@�N�����<����r6�gj��!B�f��)�1�B/�w�<V��S;wab��~���"�O�A����PTmC�Y�
�>9����*��FOe\n������^�o��6���_��{��31pB�l���_pT<tjB���y�T�7��� O!>����P��}��-G��@ �������m(�W��W���N���L/��������!��M����W���I"m6��'HG�ut�����K5�td�
72Z���J��.ds��UTZ�K����!���Y�LP�@���wqqEw������i����A��"����DM7���d4�o���q���l��dSt�#*�	f��^��&�p�6��i�9��
kM�Z>E�+���g_�gU��^N�,[�^~�E~a���
��sw	!	!	!	!	~����S�w�U2���8����7I4����tH��vI"'�(���R����)��k�����p�}/���
|��Q�S���������6\����!F������2���J�t�D�]c���r�X�s,\;�����C/�K�'f�\��K���Q�M��i�4��mm!��p�e�:����zd��Y�$LS�,��eU:2/���%��b��f$?�����~�F��[K�n�$
�a�P��������
���x5������F��y��7��&�
�p�]�n���R�I�s��T��������BgS7���] #>��C� :'�z0���,z��vJ���1�j]�9��|���>/_,Hz���4'vg-=������W������P�4�	�^7D�0�f�iR�^il�������0�	�B�)���v(���_�Ru�l�qa0@"0@"0@"@"���Yu)�����75=B���������1��e�5��60���y40�"��Y�W��
�MW�������#�j��A��1�l�0�|9�4M�>dE����!��?��"�y�*��]�G��#�[�L�t������e��7Vg�`��X��V��,�*����2�SF@v��22��i.$I/�Sd��}�=�g8*"�������@>��K���y�'��������/��G�Y/�����a�����7�Gz��M�A{��W6������y�W4z��jt�)�������9q��F4�2��Nd���H���D&���R�3�z'H�������d0?�un���x����yr{��k�{�=x��g�K�&0��/u��&xI[#C���!w�*E��$E1I����NeAP���V�P�M�=q'�A�������7���x�'��`�G����]��%D���"��
�ic��o37=L?����z����r�l��<��Mj4������ �)�z@��1������fpuu\Ji���6k��wf`+����`�0������k���g�c��6�D�nB�^2���t,����7��c����&@�Z��^$K|�2�qlF��Y�9���Q>��h��$��cUUg�O�m B�h�]S/&�s��x��W�\I�l������C(S��t���n��2Zw��	l7G�pr�g2��x'*�9F�n�"�}a��X���Uv���";��('�����8i.v8���;v�a���M�s��8���;���z(47����I:����h�TP��C:���p��aR�i5�"/M��i�	��b�Cd����_�$zFYh��){��MM��K�N�z�;�B�=���/��/t��k���n'�
���H����'�t��~V��j�c��O�I���S;][�+G���_������knQ�1�D��@���LU��r��I���W&��	�Y��������Do��n[4&f�%�p���x�+P�/}h��o��'��h��'���z��i��i�����$��-�Zjy��],�f���,����I���tp��x�LF�f&k
\8��3^=���������������������K�o�ge{4lc�)�n��2d�����q���f���w~�x��$$4UGUm|3�^���=��,����?6��u����U�B�������D�s��8�~:e~�O_�%h;��w�S��$���D�d����@�]nG.����Ak���u��R
�{�F�s��	��9K�1���t)Vj�q��*���07D9�3�dQ5������]6Z�;T
0�s�D�#���c��p���g��8�a�*8'�3�s�6�����c��"F��7B<�xh�Cb����,~g�����>�K��.\��G\^�I8��Zj+�f1��)?u�5R���X!���d8�jEvb������1�_���^D|}�g��K��kp�C����������n����6)�BC���� (O(�yLP�:�y=5P*��>�a��Y�����X.�R�1S���]��S�zgl���a���p3\�����c-d�j>��i��%F���	� ���e�j�
�q"#���hO.����so
A�@�RH�p��^���5��G�n>?�M���6��p
.W�V�8o���
v>�����V���N�#��now���2Lo�� ��T|��`MM��:��gL�����EJ��X�${�r$B�6z��	6Gy�ah.&Y�~*S��]o�|����=v�$���
0��S.���uH���_�oP@�-����]��F��
y0��Es:�/Q��k�`��?�/Qg�(�P�h$�����|�:���|�b&01��}E=��pRY1�� Xm7���6����n$������y�U]�����^����|��O�[��K�8��WFS�V+��V���,`�U*�fK=�2�d�G��-��ow���R��$�E��V2��z���A���l��xV���Q*CWX �@����/
B��h��f�W6z�!(�����Y,H���r!o��q,��,Hw���y�������n��_��`�AG�#���2��zU?��Q�r�6^?���N���|����,�\��p�TYbE��s��K��@�����\��&v��uRY�����<��� D��ny]Hy!���4�:��?�
E��Jc�\b��es�0������P6��z���[E<8l���D�b���N[���I���M�|3���`��K�����R3��(.��|�:�f�d����k(~9m�xVu�����i|e1�O��OxD���D��x���"�k�x2�4����f-oF*�{��Jg$��G,,zFB~�x�#X���<u����g6���D��+��Qk��Z���0����*��'	}h(.��$�ae���b�O]1�A�/��e�Tju{L�? ��J�J��o��/����K�7�G�5j�uf0�g�<�!$NmG�0m5r^����-9Yo�{)wJ<��E����`\�M��rkD	��Q�
����N��$��;��>�K�-� ��Q������\�c���h��rV�B��b�E"bSV�W����a)� $P
Q���B$��G����r"v��9�t�G,���uZuK��w&�G���������v�z9o��ky~��M�����j�r����k�����]g�_�@*��e��I��'UC�zB�+N,3����Z�"�����7X��7�k����M�t�wm�k���6	���lZ$�����S��#�����C����� E�4�V�|iG�����l�� ���^<�L1#(I�������<�G�F4WZ�����x �l�4�P�N�����5	�a���D� �S�-���^�����)����%�V��[�������s-���?n��6�����{o����#Z�o���n��f��A���@a+��):Jg�e���h
I�	�m*��Rp�b�5'�[1�!��e���(���r��-�H\$�2����h��c�E�.�~���l�)u�P��n2�x�_��-�<������,�������:�J)�����a��v�����E���m�X���^P�a'�\��M��?��--ZTu~��>�9�L�3�b��1\d��D��B�����!��r��T�[:�v�,CBF�JS3\1���������rd,_���Tj�Gm���������
�+0/U��?�����wjo�h����1��y5M2z�������K��n���\>Y��r��������@QX�O\�4L�W[��������tQ�U�R�t~����I��:e����f���k�������<E��^���h6V�[f�R�d����1�I�:��Ux��I����T��;��i�;��B	N{O $�|���4q�a�$�{��zf��������n�<���������2|R��bea�0C���o���_D�EA��*�=��6��mh/6�������|WP>!����>�-�k��Y�&����o4�Rb��|��"�@�N�_3�����B���Y5~�0�{()�Zn����Tt�Q��s��V~������W����;��E��=K�����V��K������W2�(��c���G0E%
�<�%����4��/t������S���I.��[%+���)��0P���-fz�����?�=�z%��rFmS�����p�6&/U%G#����`��lU$Y����Uy)9���`��t��5AU��v�-�oe�'8�iX�9���g�Nn��C����c6SV"~@|w����8X�49���f|��0�j	A������{�t�c���<��x_��A}�C�:} ����y����Q���@�q�&@�OJ���6y����
�+�o �A��W��NV5�F�H����x���4IU8�����OO�����=�����������������[�~�m~��v�)������'������%�]����9���q�]���O�,���(���u!����]
�Tn�Qc����&x�/�c���n"_~��<h�5��<��b�L���X�;�R}��A���X-1���N3�
� ��I3-�����U����aoiS
y�u���[:B� �V���@�i����&�vBp�����vaE�dE��9+S\n>�U��A��A�A0�A��A��A�\'Y�c0E:{�2J�<#���Z����N��a@rX��u	�	�4/;}��r�b���g=d���4�D_�y���+�5)Ma��,PJ~��x�H��(&OXR����	K��A����]�>�#c��Qu��<����b�%���n�~&��`?�tz�Q��*�
Ur�O�����7{��M[S�(S ���l��C��5+US,��05�lF(<���_&\g�T�e�u
�02����.�������@������wb7\$M��9���o����}o�_n�^�]WpS4Ge�O�XD�p�7����z�
!��W��a�YM���S#���>�d�%��#}A��U=��^��K�V!�@��.�{A�.�(|BZ,�/�tV�m���o�I�����������b����PWG`k��~g��cqPxL����q�_�dz(<�|-�����{���&
n/�&�D�"�T0'|���u'��947�X�=�65�z����oR�eW��b�
#^��
7J_�N���+���oxr���9�6��C�^�e��9��J�7�O���K@c��&1���������cLN��[���R&������^{�Ya��Hp
gt�����R=��HR�=.��p��b[)�Q�Z�U��AM�@Mk�w�3*��#rA!K�_��4�������l(��_+YR���S�vS>!�-JA���|�� �I~by������Km$���?ZB(J�}��2?��uD*���;���E
�\RX�E����b���_� l��P�l�)l�59�\��g�t�
w�����f�t����$�	&�XP\��'�7KJ9X�g�;Q��c��w%Y��gR�qeO�?�iy
��KS�A�B�#6�5�@�Ms���
���P03�`%e+�^r�RT�o���d�}�Cj�I���������-<kp�������~�-��$NM�b���������<�f���#lF�����wX�t��u�����#J�������+�K��j����?D����������e_�������1��aW+���d8�7]~	9���H�Gq���G���3��f@������ 8�Ol�
XT���i.g����������V�'��3�<��D3��������X��p��tp����������u"�|�'
�+����a��?�d��Z�������J�NJ���Thj�ZcUG�Fw�I-$���K4��>�<����s�(���x�{��F��G!;�ugfV�P��
��IF��<��4�4T���|��.s{�Q{
�%-z��y�f���������|dj�8N�#�_K�g���&��T8�RJ���h(�B�P����K%��d�G�v������Fnon�8�	���c�X%(�FC��j^e 0����n�8s�(�C�����+�yi�����o�D����KlQx���"2S�����w��8���
����[vZK��q(���;6�&Lu��8��I����y��(n�(ny����d��~%�EV�4��Q����a���3e2����>OA�Uq%���ML��&���Z���)4f�1.
�p`�e
C5�]Wx�� xA'K�X���yx�[���s�M ��&���9_�>�G����.��w�R������7n��fV��D��I+��N�0�# tx�2��	��O�m������d�eto����c�����9����p������������s!%����i������
�2(�t�[Q�;��K�'��[V����	lw=��Vn��	o�_��>�,�(�83u����m���-Pc�hoKY����SC2��'^��'
[Binary attachment data omitted (zip archive). Readable member name: ..._perf_reports/20GB_preload/ps_0_workers_20GB_preload_5.4_selectivity_1_task_queue_multiplier.txt]
[Binary attachment data omitted (zip archive). Readable member name: ..._perf_reports/20GB_preload/ps_2_workers_20GB_preload_0.1_selectivity_1_task_queue_multiplier.txt]
[Binary attachment data omitted (zip archive). Readable member name: ..._perf_reports/20GB_preload/ps_2_workers_20GB_preload_2.7_selectivity_1_task_queue_multiplier.txt]
[Binary attachment data omitted (zip archive). Readable member name: ..._perf_reports/20GB_preload/ps_2_workers_20GB_preload_2.7_selectivity_8_task_queue_multiplier.txt]
f�$��xU�)vc��G��r��(��@�GB&��V�R��Qq�W���~�=G��sz{h�nC���3����dE
2���.%��f��ej��0^)����p��;���
���a3)��A>U������=h�T"x}�U-8-#�p��|u�]6g���Oo�����;��P�������k��)0���*dB����6'w�O�!q������`}��(��O�~[f^NJ�e�XN�hx�S�B�hV���;�N��^
X��Cjj��N��[7����*���db��1�]?e�;���������I���$%��y
��[!��\
������,f$����<�s���+>>�������:1�>�g�e����!�s��{mvr�N��5����������oC������������4=�!P;kNkw@������������V�K8��x�i#Lb^;c���x�fn{���6���w�+�2~�����L�p��� ��J\a$��/��g|IE������%���8�8T��B^�3i���\,����sv������I0?�R� �S����*�$K��s����l�� ���!�N��`U��'��jD�q*��-����I����G��b�rdA&�p�qX�e����	���W������=O3���	X�4|�����Z��/;@(�s���q�
O�eO��F������	��.�,[?=��5��F,��������2���
mQ��Zg�O�=��d�����Ne�hu�	l{��Z�+ �P�e��AyH)l8��i5�15�k�q�k\X������^���Z!��<�X���J��;�.�f��$�3D!-����r���/3V���+�Wv��(�{�U'N�V��V~L�����;+_�S���C�X�@!� oir�$�����#t�`u�D� ����]�����$���(0��H���ZkW		����d$�W��-�v���
(*���XHP#,�d5S1��c}o�9�|���^��
[��bh�dH���dr\�|)���[\����?�(w��4�luE�${��X�������.���I�Z�R��	��`��#Qm�a�s"����f�z��~\�'_-�������x�H0j���L�[�}�g3.Nt��`e�t E�w@|'��v�Cw���}�P%A�����v����M�1x�����{�|B�X:�
3c�xQ��~��P�B��;�+V�L�����������g	�xp��^��_,GL�����n�v�]lG��`De�| ��/B��w�t���|$��p���0�{.H
�.�G��UG��	�}5k�bxt!��
$?����\�
�m5&��sM�o��(�bV2��}m�����<"�<�Z�K������l�.���a*����"��C=�9p�"�_���wZ��L��� ��+v�s�9��I�@���IG�����x{u����9!P�v=9�C��-��?9u�]�6�Z��+ s�X!�(e�Bl��s���W�������^H�\\.3�~2	'��)(���m��`j#V�{�<-�n�a�u��*/�tk	�����x���z
�(���m�w��V��F��_��s�.���z�Dr�(QA���~8L�����-���Zv��g��t4d�I����
>1�A�G��Q�j����Q������vS�X��^B�6�Av`<��\|�c����
w���������H��'!��3z�KB�O_�9�A��eD���^�-��gM���F����A5��	�]��w����i�<AX����=�S�6�
���.��s\eX�RYd������"��wB���@�^���Y����;��"����6g�_b����O�3k����W������]x'��`Z
n�j|�p���QNT�#�Ouc�������aY����Z	_����"���*�(
�h�V��a�������Je������JM�5&�wL�|�2����*����m����p���|�~v�liB[��8|.s��$Lo}A2�Q��������[����������A��d���O������SW{� ����<������<����r-l�s��,�L*�$=�~�l�`�;������4ji�Us9�����*����e�mv�W�ez�ev����[@Pl%*[�eg���td�!%�������|K��C��c���l�Mb��;@���W�M��P�e%��\:���"e�2K(��h����-�K�),��jp
)���+~�P�\�� 6)�H�M���P�x��X;�
�~=1f��O�L�[�}|��8Fpv�]���
�����W�D|�����Y{L��_p�oM�mA�]���+w��_v�
�uu0.?2��K��o,@�]�Nj�i��.�"�e�-;1������~�P"�Y��F��3N�0v�S���I!�"���)�a��sA;��pMy���n�"b!�7E|�cMV;B���!b6��VE3���"B�Rk�A0Ld�T�1��Bj66��eT]y!|��u����I\��U���]���#���
������L���	�5nR#�#S����l�A�������0}�[���tP'�"��P���J��	D���O���M�q���YBV�mW�lcK=�5��T�����WL��f'�(�RV��At��#���[��Nq-�V��8(f�F�e��~���pnT�Y����m��A�!�*fx�G(4u0��������J�N�a��	����W��1|#���W�K$�������l\��<��y4��R������vblY�s&�i+E���,�>��Q�j�x;��x����g��|�#�7�G���M��~�����>����9����F	��m$�w5z1��2I����"���~�ym�-XyO������	��^~n�]��y�t�E�GpeM��\�����1\����`����>�d�<j�"�����-���X�����3��.�*��F��-��.cb��^�sq ���&H��5���,d_�E2I��zfB]�A�!�n����L2�z�&w�"��������1��	xA.p�k�Z���pv�X&n��*����i��mw����p+%J�=���
��vP{:�GLT���PT��0lO�`��3�k��E� [�g.����������>��SW	O�r2U��\�D�1�/�>`���v��Z��4��ez���8�PK��6G�V�8TS�	_perf_reports/20GB_preload/ps_2_workers_20GB_preload_5.4_selectivity_1_task_queue_multiplier.txt�]{��D��OQ����-�k.1�u>�3��1�fwF`�]=����������x;�tEw�~]�]��&��\o�	a�)�yA�>���1��7`�`8�L��a��6����2
�1<�`#��x3+ �WqP����h3�bf{�9�L-^9������)7q���8/�,�M�E�b{�9�<���{	in���JR|*x���jw�Dq����2^���;�S����G������u��}T��?���:Sa�O�t����O������_��V���Y~'�w,��,���E�u�������]$�����q��\\��\�L��i�)��|��pO������S�t�\����o��Ws�5�
W>>���]�yz�UT����m\}7���K��O6��I�����F5?{��������������/���gH�eV�t�X�Oa��~�9�
DI�Yw�|_�������_^�
�3��/ [6��V]Xoa����0���o�M��j�<�u<�of:F%�lpWq�G�z������*����oqX�zq�^d+|��\p��}��{T�4s�`��y���/����v�V�����X=�����$�WYx=�6oyy�����HN��!��c�����,�	
(����Q�w~�������$lR�E�qO�"�L�A`��W�~��'��q
������8_��'�e�m7����j�o�h(]qW�"m�2^�a��/�O?��J��f�H��k_�CT~�W���B?���r��B�C�>�Dq�������Z��z-�D����_�����]_���}��S��y�b��~/�|��[��ir_�����F����*(��^�!����W~u�&X�����G-H�Z���utC�]�2w5{~�bLm�p�����y�.#V!���^y>�f.��D��	��T�wfX���z��`�X�0\�@*X�,����J!���"���a����_�@k�0jX�������4.������$-�.��4�l��kp�u~�oq� �J�x����.�����8�|/��A��$.�������9�t��+f�;��&����~�(���W�����q$A�Z�^�����_�����]~.���R�KN����&���n"�z�~�92=�����f��/N�$e�V]�2Hrb��dAX&7A9�n��nvD5���)7�������*#u��7L���<�/|L�7�����_�=�o\��������'BV�^���r�U�������0��#zT����MN^jfkVej.����u��
�.�4����0*X�����4�$���n4��4��4���*���-�/kb+�M|]�?���`X/�4���U&v�/����j�K�n��Zg�R��kThG��� #5^��(�R!.6/@e�����er@}tU��]���A&�(����(|��[H�}�H%����m�*�����V�y���l	�z��$�U�"���NM��6��$��0��I���0�"�G��G?�Q�U�}�a��-�C����>�U1�H/_�������`�X���j�3-�U�SO���u����&a��jR���I�� ���<+��<�d�$��h�����l7�@^����	<�
������0[�I�{#`�����p�es�����EO����Q�Ik/�B:��s���MJvF�Q��]�Q�R�{p����9
T��AqTRD�
�������~�������].�f��x��/J]
k?	R>��(��������B�*���%^���Dy��U���?[.�B|\�*���LQ�����v��*�.��qF��@i"�z��c!X�E�� �k�Z�!d>!^q�J�����d��*�1���6>3�H4���?(�8(,��:�z��<Oi�`�	W�)	_�[�����<���;on�=���/��x��x�LM���l�PVX`��~I��C�K�<�%~6Z�fI��e��	�g��hksC�!��"9�_�9T��G5H���	T����j���P
��>���j��:Q}���q=�O����
zPS�:�F��;�jVD��&�6��	lM�h"/��%�i�����������Y1�D;�a��^�uk���<D���I(0 �
�>��g�E��e�}h�`$ �r���[�����Q�1��V�v����
\����t.1v*KEpW?�_�%��Qr�^��	��&�n@51O_��X�����nQ���n������q�m�
oa������a���yQ�~����hs�sP�/�����E\V���Q�D�Us���wW��1{�@=���Z���7�R,�v���N�^�{���!lu[J�S_'��&3�����G�}'�����:`���C��}��:MAK��M����
����i���=:����q#��0y��i���[�a=,�$��w�9a����l��{��)n�+fO�=�P!��EE~�(`�������&���,0��*�l����&Gr8~��p���������TYMAORL�q33��oy
��j����X�@�|�W>�*������o��z4�=W���'	)���!���l�4��:��w�k�.�2��(9����M��&`z�m(8�����xQ�y"a��-R����6�g��(��� ?�����3L����y3����P=�zDTWs�s�|Q��fx�����D-o�#M�t�b�Z�����H�BEs�Ay���eQ�����kP�Z�Ys�1b�j�'���5�xr�e�����2wz%iR����/�l�������)dX�l��L��82	�S���H����*b@x_4g[|������������`�`��%.���C��k�h�����n���n����n��g��������5Gl�A�qr�c6$,"Z�_��yS��L�Y�������b��>s��[�U@�.�Nq$,�y����2���9��	��O<He?P�L�N��D�T��S�����
�kE��E��E������vva���t�������/M��S�$�0���1�D�����h������ASnp�+�@����;��g���&�_S�uS��S������v���4��q�I�����]r�j��`�V�*��Q���}�����0��QY�@�4~��j���Q�^P�O7������BD�~�<��6Ec'K������:�	�&����i���uv��ZPx.d���9h�r�i�C>,�_����*�����y=FK�z{x�6�9���a:�c�������������wr�QZeS7���Mc�LGa����:9�O�|��G�����`��������a����0p�������k��0\u��Hh>�Q�"�������2�AFj�$MQ*l�B\l^��nLs:��{�z	-[��[��-(�`����Bx
��'�]�KS�Tl>��r�|�;���(s�U����h�#�D�~���F��N<�n�#?�N�f��E���W�V�o7+�U6�6_f��vS����s�~�EB~�� �I��h�>��D�#1>����D�";?�#?��b��Dtq�?������%LL�������P��Y�#����
��9��V�4����ye,�����F��9m,4���1FHc��7M������|���5��W�T
T��
|��Ye��N5DJ�P{l�A�������So���,������r�L;p��AjNd'��?{�)s���wr\�	�8���
��� �h��*�X��5�	�x��M<'08�H�	������^�S���	�0T�	���l��`K!J�������N]�U��d��J�n����S��Fm`�VEkJ���?��xh��Z ,�X�����R�Bg���>�z]=	�T?x��?
���07�i�����N.EW8,f����������w8���#��"w���T�A!���Vnu[J!�[ww(�+�~�|&��TC�l��o*V�l���d#zDm*��U�\�yV�ry��&b6'�+����Q/R����#y�H)����1��c�\�=�q�A�DE�k;����(�u8.3�H���E��vgp��S����y��9F����>��QeM���5��h2s�k2������N�p�sh2�\�U;�y4���
���a�����a�Z�^����3��=O��mLE�����2���2<�&n�L��Bn7:"��#��25�l�e�
������������h��YzR��nqu4�3�k=�5���]��gA�����8>�&����Y,C�[}z<�g��?��FF�����[�k��,,%4i���e��������a���S�m�y�������� p�sC%��]C�U
2%]o@��W�=��<D7A���%&�������m9��������;�1J�����R-9�N������{R*��Y2'��k7D���2������|�����oG�U|��X<����T\�Y������0~\�x�����y7D��iCD>�Q�"�������2�AFj�$MQ*l�B\l^��,{�Qz��?������-+
�W�'�@��^����W��v������-����������3��&��������\m��6��Z��I	���G�L�pw����^UW�HSWJ��~���,�2�bOp, ,���eq���6(L��Lky��v��]=3�_X�T�#���"�a$���"�gP 5b����m]��^o����n���fP���K������������	~��O�����>��P���K�l
��,��[X��GG�T����?�-�u��"\g
+�u��"\t�e��T��}������������f����n�����z���^m��S'�-%A���R�9O��%��*TFH4 VU���^���`6+�&)A'�%����<R���P�9l%�ts=w��$�9>0�9�$���G�$��`w�(YEK0:������,�H�Kb������`^��fT��]�A��u�D&��
��Y�?������!��f~P����������L�
T#,��G���0����IF��V��Q7���d���V	L��5$�d�;�x'�I�w��N�	�;]/~�h>��j�������]������W:��;��V�j�)_�,�?��������}{Un�wX}v �x��nX��r*:�jw�6[����X:�~��"p�[��)�J�����v2���`��mT��	�%-�q��*W#�x���.p�[x���`����#������M�������'9#�|��TL�#
%��D�*Z��q��v�dAF:^S���4����?)#�N��Y���BP7�d"�q�h��~��W��+������P�}:�yM�$��tX�S��!�a����d&r�N���0����|K8�����)�*Y�<�r�n�lv{$��Q�+�Aefx�k���Z#�D#�]Fc,���8)F*�'��h	u�;Dx��c�&���h�^�u��!��+��s��
����1N]4����</�C;;$��6w_O���p��i��-��,���a��:�B��L���L��7o(������d������C/C4��<������~e���+�{���_��^�}���r+�Py,�����D�q�,��)����sJ�9%��4���S
�)�����D0p�W��S3�5|����a��"��D�o#k!bN>q'���O��'���	�|bN>�'�X���5����Y{X{]���`����PR�<����]g��8����y���H�Y������bJ�+F�0����^�4)t'�����6u�;Dx��c�)�8b�l<�!��E2������5��x�fq��1��T����
��	���t�$f(
Yix�����8f���Y����:�<���������V*���QM����!e��������V�;w$4���e��	
0PP��<�b��{*`,8 �n?H��
��W�d����db�8�y�H�m�D����Kl|�%�<p?6}�����F�z��������2��Z���v��iU7aA�����!��^.����\x���\x�)^.�F��|\n[Rsuk�c���65R�Fz�������y_-�n�8�S�x?��p�4���R-����O '	�"	2 	�!	�!	�i�n��kw��w�y����z�y���B����x���:�djx[�4�������&~`).<��7If�M��&_�Fb��(/7������������+���  �1A@\��A@L�A  &(b������Gh.��ZW����5��Z0}@��<�l>�����I��a�izW�n��
t������U]�1��4[~"����r���4�6�='b_��<��	T����Q?��o��Y�}�9P��7x.�Q�Y{$����������J�������%�Q�6H?4�z���L���,�7�����������52.�}�f����#�v���"�2�CJ�ja�+c1�����`Bb/�t����_a�D�&��M��)2?#<�nx�I�E`{�����?��I2��~G�^�������4����<��s�-Lp�Y��T[}~�.%�:��<�Cw�*����im���Nk�O�a����R�5��������
�Z���w�[�E_�5��[��]�Q�S��P�hE���x�0����k
����x��g1L��1�^,
����P����{�d.?��O���yc�=_.��?U�jxzw8��y6��VxyO��=��1A
��%�����+P�M�g��\���p"&�<d�`�^���$4hzMr�Cp�Cr�CtDp*> 8���^j"�5M���	uB�Z(97Go^��Y=�����j���X7q��L�����u��W+�|@��:�E�r��X3�,P'�Q{b()�����tT��������H��g`B6� �8��O%����(��YEb��
�3��|��'��] f[_�
��6�x�l����hm��6�v�;��b�2�O!�������y���f���H4Jq�hl�&u��y@�4Lne���[X�F� N)Y#*R2��H���<N���4e�(�!�FP$^���x��F��#�6l�����;`T���x����)�~������H�1�EAC�V��|��^�5��uZ��V{G,^}^l�o\�����^~�M���Z��s�v��?w�t��N�"�#l�r�!:]/���b�������/��j��{|��o����u+Y�V���t[�J�����o���L�����^�SF��f�~8�\�>��;W��c0�O����8	v����I��&�����1��`�x�ZT�1mk0X+
d���|�@t����X���HK���(P��N�\��V�%\�;�\��}�E������.)f��4M��-���W��_��gZ�p����[�����m����"�
��Mn��f��1,���1��b���c��i�)�
Hgj�����:������f(Z�����|�� ��kD�%�8L4�75n���nI��� +�)N"�0`�2Z���E��������b�Lf^�{o��r��m�������"��)���=wn��f�����
OhS��������^������P)��c���_6������I������_���~�|��IN�ARx�w[G?��x�������j������]�����'�[��S���r�)���
�h�O-_�A�-���a��}PW���*&�6�� ��I[�c�Ox����xa\M�P�0J��Qt�_��:{������AGOzIa��=�,�i���������z����x5Z$ya��R�2bs�w��l�i���^x�dk�>����c&�(K[���=v���k�����L��U5�9�eY��9�$�����T� ����#<�Bx�f2�	��v�%q]a0���v�#�#�7�SW�s�M��
���\���Dq��TL�#
%��D�*Z��q��v�4�xILQ���\aO��6��u���o��]YhV�m����2���������!�$�_�;;�`�x�N������B~���x;��q��#���wm����Mz"�6��[��AA�a���������������hG�%�mST��MW�^��H��H��H��h(?EC������?1��;����o :��.��8:��������U��"��'�2�<)1�llR]�8��`	zxYL��0	�$Y�Fx�3��Bsu$�������x}�20���
���q�D�G��V����XZ����8z�	`M2X���5	`M|X��$�5�aM\X��$�5�aM,X��1���+[vu�����^Z���b�!OfSy��J�9^�H�2�{����l��-��|r8�M����r1�
��>��?_O�&q�i����l�k�	��m��,{��61�~lVt�@S����UQq
��U?$��ueP��~���J�����
�	��x�K"�9����&0����� ���`
�-#�2^%�p;���_
��W�%�o�"��B{�o	�48��g��Ac��p!��+�G��eD�A^��Z��M5VI���n
�p���lBF�RX�F�.K1R2�?R2,#R2������S�H��a�Pp;�7/�;/�?/,#�����9��z�+c����h��Y����+x?0�A�� ��PiRM�@���)Ke�^�W������l����u��A]��I��u��f��[�7�,��|��/�|8��c���@`�����sF���)����%8�:�J22����od�?��&&O���-�����(4.�e�J�"'��%��Y�%	]lB��F,t�MC�"��t�c���&�qS����>�a�'^��R>H��a��O�4��n	aY�O�>����y%t+�X���TF�b��%�!����B�������|?��(`�v�;f����2j���b�8R������6���_��h�s��t7%P�4]
*��#���b����b���b6_��<���������?���������f�Y�|��s/�9��. ~�]:6�Na��������T.���$n��v�>'���=�6�����S�}8�^/g�C�[M���,V-[��a<f���
g����#tN��9Mnw��K��g������Qt��}*��#��%T�A
~�/��$��}��K<K=��Y����]��
�<��(��>�������[�.>9�S�Ua����l�x\w(5M��3@.��>P��"�)�Ze,+&\5�X�MXwI��U�L�8(����J���.u$>��>�(�&���@����I�n_��'�W�n��y�c���hhG�@C������a+
q�yaW�,���&D������&2���%lI
[R�������
���/V81�L9E�)�����`��Q����'cF����2x���e�Y>?��7���vj?LJz��%�X@��<��#	���AMN�qF&I���0*�xo�5x
!
G62e �U]���fc�
.����+��%�R���W-Z��R%���TkR/D�n�~�b9}��I����%�6�wrz�L���f�{���A�7U�	W��|��m����j��`���Q���O�]6�/0�|N=���+m��<�|�}*�}�O��9��5��k:m�S�N�B�����T������|K�F'r��T\}�������Q}zrz��\u'���6=���.���u��U�L���������B��f{a�i��s����"�N�o���|�����A�t����$�%_�#�`:����ygi�]�z��n5�
���r�z\��q��o^�r��/���_\���/�����D��<HAu�r��,�r���P8�27��N��h��yU_���5��N�����Mhl/���g����l��o�I���E���>G��r������n�sQM���g��.������}h��C�����[��o?�������P��b�v���$o�G����R����0e�&�����4��3*�CP�=�������/�J\�v�7��D�����]������(��z`!�"[+��a;���!h9�sT������L�$5"|�M���VU4�������-
��$�q��������:,'
���C�����;\��F�z^��������`��c�i3K�	��&P��*�%�������<Ff������'�:�Bu�f[
�l����j����jt|�A�F>���&���f
S0�wA�E�U�Ka�Y3�c`x���&;��$�	�_�T���Q��fi����+�����(�9�
\_�/�p�9�s����t��y��[�>}9=}*n4�Z�������U��������)S� �F������������{;���c9��N����rxNlRvU���(z'���w
�������f�(�J��j�(�J��n��2P
��J����;������^,h��0�J^F&�"��0����,x��d���#X�s�)c'`�l�����M��.K���%9G��Y���T�\�v�����Jt:���-��z��E�XV�����HM����8���>��O��4���>�F������4���'�\�_�,<_��/+�;�-M5	�d$���HH���`]�H3�������|
��+�Sk�_~�@�,\����X	���=ueW���u�e������@+�^N_'����A9xR���SG��,��o�C��10�����������������O��e�(�+�������Z+L3�^�4��>���f���>
�r6��
����i����lWS����K�hE��3�N������,���������b����
F��oZ����;��fem.mIM��9=}���	=.Un����Y��s*��}�t���{�-�z��W�+�;�*M����c�E��������L�+��3avG��jw��vG��hw��vG,�#��(lw�y�NA�u�n��������}���;�RX�:��C/��f�(�H�i���c��&]u5����.��,�s`b �0D�hBM �	E4��&�!�p%��h�6P����s�� �E�E�����U=u����l�����r��s3]����s�9�9*�>�q��l��6�Y�yp�~on�gia>m���/kH@�%�s|���w������c��~�U{�����4�����7.�	�����0�@��g|�g� ���ggo������p���s�^q��P<"D��qN��r�~�E��'��2e#Rgx��io'�c2����*b^R4��d&s9�K4������w%���P�W�&[B��m��:���0dYj��Y~����Kjx�c%<�T��!E#^S�9w�%��2\w�#��%	H���4]�&;.��I��������K'����)o7@G*��.����`�r�U����0P���2*2��lUi�5���.��U��e+[A&�l�t�B�g������3����c����>F^�#��&��/�r�X��&q==^
=H�#�w7F'�GfV��V��V���>�`k�����>�l�����.p`6u��hS/v�6udDnS,�������\Z�����7���=O�
/s����o�+z��KJ�e���@�3��p���WQW("���.��t�i��.Z���2����]��������l/vs���$"n\	:�p��7������N����� �b&�������8�B�+�^8z�O���\��
�}�7f�A,''9�?0����������xx�M�aY:Fj���C�p@�
K�-^����K��?|� P�~.����D����p�����Z9���dA�~�J8�����R9�m�����VwG�5��9
.�j�������}N�����8��U��\d1xk��G��#��Nm��/��<]_{��y�����Z����;u1�����X�k������S%����������_R+���Q*����� ����X��[3�c�!oy�%m_t�V������\���a+rX��pOQ��������\)-�38
��f�I��j@]���fs#W�V�L��n[jIr����[��3t�tx�Ji�#�����T/p�+��V�-�������j��C�uS7�m�2�m�S�������fdFG�����OaGen���M�j��7&e��F���s���7��&���@pn��M,pn��� p�m����k�$'���Or,�"�U�� &��T���N�3�+9�������	K�kv�"
����,-� }�{b�s
����rL������,�{�#���.l7r���7G�����w�������\l%��g�Hq�������T_��$I�,!��EV$3�l���f�H�3T���w�������]�������!&��6���0kI��$�.h	���=mA�p�F,�H-���)|\ ���z���HM�W�mu����Z�iPwPc5����Ei�A=$����$�|����MP'��iQ��R6i��m��y���$�J[�0���B�MF����*��VH���H�,*��l�&��V�~YK��Mj?��WH���z��$��������G\����I�����eaV�J�ro]�
��*���_'��Z;�� ��J$Em-e���=����[)�W���
������������3?��sT��B�$+�����^J�F�5p�ujo�2��K�oBk�����uOx�/��u`-$���X!d���]�m���'���������?���h+��M�0���six_���)V��aA����!�c��}�*�.�k7��aL��q4PLGx�Ux)]��m���hY}0�#����b��@��$��X:H!�mU!�*:8���mUVes��S�	t�8C9�a.J���d<�s_u6���5�����P-E��xh�
]��'�k�Zj�&�z���|OxjlHp��M!X��w�����f��\����Z�J��n<��h@r��Z��������*�����s���k�#��q�^��D8��&�Dz0&��U�����+G�����J���SfC���l��|o�����m�8����<��@$���=�DH�4]�L�'�L�����Py=�"������~�S�t�Aspj���r.eTT��p�����N�FW
���2T�`<��GI������T�W:-r�0��?����'t�	�}����_@��W�~Vo�/�����+g�G�kF���~�n>�����/���<���M?m����]SlH6�1���`�Z&�Z�����,�myN�	?�re���@��;qG(����#w���X�#w��A�#�U����m������^3�N3��;��v������:./A�k<R�Sox���,�oo�~O�����cCv��M�eul�>��q6��
��
���.�=�y7���������	��z��}7������������]��Z����X��\������	*��4g@grzE�As����9���ctW�1Ox�'��M���|�$������M�45����9�+�������e���tqq�?�������cu���M�]�����?*�$�P�����w<���r�
up�:�exR������b���Ot�����T)�}$���[�O1��������A2��28���q��O��
5S.&��J�����lv�u���K9�
%U�������������F(�$x�P�<B�G0�A�#x���#=�-��Ew�:�u�X��(�c�������L������l*��sQow2x]M��p��f+�nH6t.��\\r�G�*�d8T�2��J��,����d�K�����A������}V,k�>�aB��P��E�bs��f��3 ��lJ����K����r���;VV�
]�[�XY%�����(A\J��y���m����
]��������P��5�;�8��_CY���i��|�*���4���v�Q������+�-*���8�t���Xo��*C^�'�1���iJk�t&��k����������I0w���������\	:�������v�9��%_d���e9�v��@�!������@DF�~��s��F���p��v���)�l���f��vO����+|=
���M3�QCA�B�L�!Tp����`Om��eR�Y(����vu}6������w���_E���<=���VGzQ(3LHb�����Pln����q����MI6;��n<���u�/�\74�|��l�u�&�����n�t�uC3����dJ	�S�X� .%��9@Tq�Ym����\�[�x�)\���-C��T*�G��iLLMj
����c�d��J�@bOmp��E�d�~���S�'��=����}�W�/�{��:�7���}�Ch��	����������"+���3����@��U���1V8��������`(��.U���.�.��!��Ms�xbz�[��e��IKh�|l�5��>pNC�
��f�����'<���:�.�Y�{s�wg��%�������Ag{�1��	Dg
�*l~�'�Q��;b���zU>�����C������������I��q�x������C�z�j��*���^)����'�t2<M�{����#�y�����l$���X�����)�k9����P��}���8+����������)i:��7���OSV�\��,n`T4�e������X��3b����E��a���INu��[����������O,>��}�����
�mUI5�����=��i���������47n��������j�w|k'�Gf��v����(2���%G���L|IQ�%[2?��6��n$&? ���7��s���?'2���5:'��`4��<�^�C��(z�^���b�C<�G(���#x� �<b�@��xOQ��z#6��o2�ASk>'����~����2�M��A��W�;u��+�7=`FH�� 4 8;���}�e�g4����y����R�7�{>�J���7���M���"y~$>f:,}�W"�1-+��g���������):�2���ST�����J��SX�����a���x#V�����vW�<k'�����6l/�����c8��z��8���M�r��RX�.��/������/�wH��R���s�d���N�[�_���f'�f'�N6����������5�.k�/kb-k�.k�.����7��H��Cp������F6g��S�� ��#�@{��;��	?���
O{|�������AWx�I�a�1����"4F��������������.��8)�g�S�Q��*Z��t������w����������v�L���k#���?ij"�����;5�������3/D�4N��b�QV6�U��CY�����m&�D 9d���C(rF��!9�#�X�!9dFA�����@G��{��N�m�� �/"X���y��8�q;��`n���@V2�|lr�������O���y��0fG�7�d%�l9�1�c���omt��K��� 0&b��@0�L�I���������������:H��0�W���#ox�w�{n�p���$�$����:��u�%G^=�Z�$q%�I�(Uf.������y�d�|*rq��X.E���u,H��\5>���/��t��7��Lt�9y����g���*�}��w_��g��\MZ�(��&}
��>�Sq��7W�B���S���!�4+0/���.#���|���I�1zd���������j8��J%L�����?�wI(~�+�7qr'�Z���V>��Qh.�S�*��{�w�Bdq�f�"�v�G�c>m"�W��,y�4�SR"��:�	��*�+���DC���Z=�"+�����E�C�7��ZA�
 ��M����[�����������"K5dY�g���Az/�(H�{����?�?<�7�
/��������wl�;�]$��������K6_��,N��=��Dr���C��{e�&�:H�,��^�{�1�����>��� �T�Z�j�@��T��Y�����c%���W^�L�)}&�6{"�`�O������|�C�N��Y>��:�����f��2���l
I�:�F�"�%��J}d�	=b���G��#h�$u����I�M�������*E5~6��Bi�����|���!�Vk�R��`�U��[%�
B����2L��A��|���I�du��kU�0#��[���z�Ko<57�*:"IB	D(�%��"�`��P�J8B��PJf����-���}�]��{�����
���Hd�����*J��qa�Bm)����`��p�2�m���N
�'��a:5����*+5��A�p.m�}����e�J�<�X�A��_��|0�V�bt�N����/��8kcfa�x&�m����}��;l���� [���H�����O��m�� �V���x�Q|�%���&����:e�
��D<v�s:/�{?��8���5{`��s?��''J�1���,�/�	����y�46a�0�M��&��&��&������3�Y)�����]k`�������T��MsM�"�1����pj�����G�X{��cv��C����z�-��t(W��a�2rR���d�����/'wT�`Y�v
��N.�X�0�m�� �������H��x�:��w�`D���[%8h�2�o����>P���V��y������F����	����"�f����!en����ZS}`����YE�o6�K��M����-����A�t���}��L���O�B.��
d'�_�v+y�]��Mg�N�J���V���������/h�������zA}�"���u�GU���S�����f���j:��)�����c>��;�����
��\�����	�2%��zg�i2�b��4mx��~c�<�rl����g���0�AI�JE���TN�(�iR1����.R�`�3q��cx��^�����Q�i��9�\�3-;���0�eG���������(��*#uTu���?�M;�	fa9���2��I,�k�A ��]��x(T��^G��W�d�7���Yy���B�D8�1�D| H��X@"HdA@����*�����:�&n�u�[�����V��x����*��Y;����Wh\��7�������p^k���*Zm��MWu��3��*:��/�^7�� ��)�W�A�u��;��x�[��<��+�~2|+������q�.�{����5�q���q�~����9�@+��|��\�7�m8�y�Ao;6�*�:�fs�>6y-�vL�����i~J���~����/��K\��N�������y��9<�/9���\��d%���TZ�`yd� 3���ARe1�p)tKdlv�b��k���@/����<����h��s�f�^?��(�O�P7�2������=[��������?�8fF����,��v<.�]���8��.�[�_�5��Yq����+�M����N�	�����7��Iu�A5�d�����fM�l~�I�1��{P�(������J>�����$��O%f:�t�l���=c`�����pmq�q���B<)uu��]���a#z��^�����;��~�o%����;��w���v��umtp'��] ����*����w2~���pP��l*���kS�ZOu��lz��]�cFa�|����5���4:����Ys�.�-A:V�U�}?��{U��m��.�va��;�N9��1�I�?���.�#�v�z�������n� A�"�����zcn�O���P����cs�2�����"��cE����@&6�1������(e�^��+4��E��N��BK>M�Flb)�}����tPXS��'?^+���,so4�z����������^neL����;`Cg���J���"�LS�s�\�����g���������ZyZ�Q�JekW���6��t���
��uq
�hX,T�{yp!�c
o)�Q9���Z��N�����2�3��ji�����l��i�aP��L�`Z�i��LgST�]�k��.\�X��
���7���J��P�+��{����h��ls��o����/��Q����n��ue����T%f�|��E�(��7^y����A�`��5��#�1�*-JO�w��t��"Wc��b�"A���<�Z�/,n���x1�d��da�5���&��kb�rMnf�����-��rM�&�3�	��41\4Mo�=��d7�	*�>A���U�ss�..N��?���%'��z��(1�4��#�M���f���d���j|\�i#�>+�@>.n�uzd�:��Z���ij|\�M�MSl����[OZ3���j�{����*����p����Z]��b�3�����n> �dx>����-�_:�MG�/�:�V���lo� �~d>w��
����I��B�m�dy�(5���s�:������1W�$Y	����T^^����<��������N�n^	�UD�giH-�"����'Y��M{##����{
)�v���QpRs	�luN� 8�LV=:�NcA�X�y,���S�y�Z�Aj'�%�K0[��'�p1y'��sL}q^I���K��5�F���\�v�U��fk��p���&�zu~1�C/�O����J:s� ��s2��E���.�<������!*�b,�N���(A����KE�/\"+�N���_�D��zuw8������\�.���)�h$&4�y�1]�T����5���%���F� dOQh�2�����
��+	������F���*q�\U���u����>	%���c��k|��)�d�b�����*�C�~Y�d%R�O��'s_oVM�6N���e�%z{,��(s*���������)����>��_	���_�����K����[W�h����t�XP�@7�a��
C6�t���Cj}����q���W�*���� ��"�d>YN1&�?]����T�>��/�U�6��:�>]�in�l�/�%��2���MI4R��u���D���<�p|��
�/QNLl �8�����y������76�&I���6�n[%}aIor��[��
0��lqM<\�6��k��q���|\3m��&�	�5�qM��:�am9j��Nf��o{G�+5���H!V����\��(�f:�n$�����v^��M�a�N}xOzq�R���q���a����U�s��.@U��h��#���4�������tSNyA����3�8���y��C?����d����O��f9�K������
 ,E1��P�em3IS~���B�\d!��J1��t_a=rh����/���G��UW�E��E��.�|�
w�T�yH�T�O�/���H�Y�M��'��v���N_?H���}�����N�lCI�����=?>:<�>����+>n�\��v=�a
x����gw�:�1�5���/����z
������HG�����<��P�
�(�Hp�]iH%/�2��OT:��*Y��N2-��s���N�����A�C���d�xl�>�y��?{����?S����\�AU]
���(�QO�z���]����{�U*������VH�f���>����@Z������1O���W��u������/�vbDM�;����\�Q�����Q�����o�H��9�Ul���'b�ln�,�X(����
�������7�Kl�gm�����R$h�_�S(���������I\���cb�����o�$a���d�NjE����p������H���Y}����B{���47k:�w�u�[<�����v�m������}:���u���t�7�y����v�6��Ks�Qo�o=�[)Pgy��Yk�l�961���x��h��&�8i����	N��>0��+���!�@\��F/�/�����E�����!���_��������u���Z&{�����:���2v+����Y�oS�`>�,W���s�Gd.�M�Wa\r>w�xuO9���T�����8;�����A�E�����]��2�� J��)�������#�.���
Q
3�5��l�CxHhVW��I<�~���)��\���tq-/�����h�n�����!w����i8_������Nt���9zUi5������^qI�������]�[�V<rEc�_fT��������ET�D�n��������,��J1�7��Wh	KC3N��
������[9�����[���5�#���~���nK����'��'ME�h���������2{tz�m����0�A����Rb���.����w�t�-0�t��!��ojE��m
���+����U���E��j�����D�S"MpT�&�w^�,{I�P&�+K�*�B�;�`����zD�6vY���������L�B
r�
IZ�:�:�*Y(~�BrN��h�<g2&:�� W�6��aSU�6(���h�t$6@T3�� �b�!�_E`#=4����q�M'cx�O����u��a�w�,H$$-����cp9�2@j�tM���Rz�`�NiL
��r?aL��=�z���A%��P�c=���uu
n_�������A�o�r�Lc������S+�b�IR3���`Bs+���xtK�o��������v�8�l�,�t�#e��X\��S-�w~��M5�snpp�uE��V��`��P��K��)uR�:��U����(�`�����t�Bw��X���|��K�'��qf�L+�B��&�5F��h���@��U!���b�Dt��F�j��'�j���dU^N����/u)��'o�Q}:�����e�G���H�1H9�������j������?�44�B'e2���z����#��xALb�������K��
m���F�Ap��ZZwm�M����J��P�W��+i���5�x�����|�Lz�|n����C�GZ�&9$%���l�7Gk��>�g	FK�=$6)#9sc��A�b�G��n�20�G;KmS.����l����g���p���Z�������>!]paa���l��`���)�4f4�#[������5�!���������$��=X|d��r�0t�u���[�o�&w�d�0���Kk��P?�'��/��
�<�d�%�����*��i�PK��6G���^RA�_perf_reports/20GB_preload/ps_2_workers_20GB_preload_5.4_selectivity_8_task_queue_multiplier.txt�]{��4��Oa	!@bC^mR$$`y
���B�����m�t�����y6i��{����Mc��g����~��]��^&>M�C�Go�o�<e	1Mb,����t�����|wq�Gt��[�&������l�h�6��f�7���LK35�[j�����!g��
y`i��h^-`��J�����5�(��C0wa���G�@�]������O���^��u��#������7�r���0���?
(��@�(���
{��o�8�;�g�8}����m��k���x�}Pbz���o�����sS��i@n��&!�c��X�C��)�&���c�$L�����hn����'��hV������Jx���Y�yG��V<o�������>��?�� ��?x�������o��������GF����t�co����'�6$�dG�r���]\��#��)�'�,ok�)�7e��zl�W!��~|�_�I������_\�Z��J�����pw����x���u���Q��_�a~ND����:�AZ�H��~|}��;�
�\�BH��MY��k8���L�]�{�������������������>���Y��b1�����n��0����4��2oS��������� �?��n�U��6�������?A��\6��-��FX'>��_3 ��SC��Pv���15�����r	��4?y$�iJI��+�f���X�[����Q�v-�h����-m�����Mz�+�����N��p�Xhxt���4���S"1���E�1��7\���<�6!��|w{{|�A2L�fb�}���_���f��S�qX4����������s�y)Z�	�f�N�k�����N���]���~�����'oq=��^O+I�������5cq���f�1��5��D:�c�y���y�g�&���c�4����������(0^rZ4���<���S�Y�X8��v������'�1p�t�<b,�G�4�kC�����j��~�M�<��}2���
0�MvO��8��N	���a`�����O�d�N����F�v��Utj(K���^t��4�=���}�=�)�����J�T���5{�JKyd��q�
��c���!a���=� �a�6N��4h����C��?����&��(}_o�8�f��J{��e�F�M�M|x�F���EoP��������^�k0��"��� ����s
$���{gy
$0�!�����!����u�,��6Y�/^��W��R:������{u��c��p��t���w7�ux���]^E�!q�	'p��D�{:�*#�W�
��T��M�/�g��6������c���}�u8���q�>w�^W�l�{_�����[��u���6��AVo�\�k������][�=�c^������n |2V~�[��<������!0�����yN�����b�N
�B����|��-d�����G�%���������P�����uBv�c�t�4��~�����FY�9���,A0�F���t�g�s������Bsu�4���u��\�#���7�j�)����a�/m/�������c���PI���������SCq��lR6�cJA�������n�d2	02�i�L,�����?/K�HdmgIC��wM�.���f�^���<}��������e/�a��������1���w,1����(��'�+�kr�C��M�C"��>�%�$���j����g����E��v���+h����z>e�{�E�}���r�����c����:��z
b*}�e7������W/�����"�.��W����Y��|����GXd/��>����v���J;:�T
���e<���=I��f-��D���4�S�$��=�&�r
���2&��p��'9N��yT4�OA(<��e`j��{��Ub\C��_� �UHM����)w���8��
��Ju�K�����aV���n���??y��Z/W����� /���W�Is��V��`(@�K=3�t_��K�
�iR��K��^9	�T&L��}M���\���E���`���Dg�m:
)�AO���1p�4�����4�'=�*�	�:Z�7�z�A��Sa����e�w�l��?���Sa��-~2�~���2?���.Fmi�e_��h����������!L{���������.����BM���1�m�h<O#�a��������&�_�T�$�:���{��N
e���`�}�E�c�t�4��
���5o��{/�`������Ky�|ka��!���n���|������<����o��t�}��������+�A���=���0�
���v_����^�Hl�*�F�Nn:%����Pl���QTu���4X$=��lX�1X*:
�^��������{�>{��ZT����!�t�W���G�s��4t@�IZ>s�e���������`B�������m�F����z�i���|���4?�Ak��,(�c���� C�{��Q��v4�����=� �:������������N0���4���r��8V�&�4U���Q��0h:tj<�
;�;�'�|�xJ:2S����3W]���0�;���sC�T0�����'��K���	��Qp��S��������X�&���Id}�L��eZ(f}�L������f�9��D����D.ZCw��2iR�>Ypk��D�G�����L
�z&S-R@�7S3;tZx�cfN������h)<=:���V�r#('Q]\���9l�B?�n�.�(O��b����\rH DX���Py��f' ��0��H�G+w��#m��r�nk��z����~R� ��Lk"���PQ�m�y�"1u.hx�{cf�\��^fv����=���V�,x�lW�����Lm��k�0�z�"2��v*��m�3�2�P�e��1�yXe��X]J;�M)�� ���������]N[H+	�S'h��V2�,#�m����+|>�\;I���g����2I���&�L���I��d�mR��Y���bm�GJ_���-�2�	�����v�	r�,����a���_�Q�/��\F�9-Q��c���)N}t�JG���e=F��<T���!E�F�Wmz(|�twtXfb�D���%%�����]H2>^H
�^H��_A'A�I��z���W�O��7�+v�
�b+#Ue'0�M��ksw��o�(�fQuZ�uTt�>����t"1����t������o	�7�
������h��{M�"V�����d�+F�;����i�dL�������$���H<�x�x�x"%���:+<����n	H \���?����m:
�"U������-��8T�'�N������8<2���l�s�:���=���t*8�n~�A!	������,{vu���s
_H��S����`}�a�E��'��������b��%v7���yd�Z�}���������yx��}�8��w���&�+���-��%�
K��5,�%#q�.\*��sR��t��gz}�����M9�R>��Q
d@|dC���4(�]PL�N�	}T��	s��&�Iu�Zw��a>���jW7g��{�oX%�;T��'"p���wm�8��]w\$/���1p�\*8�Q�s����-0�e�X�Y�J�(��#�PE�f���x�����?z�TO�)B��*�E����q�s�"!�|��B���08&jLyq� ���$e��� \5��w^��u�83�������K���K��DeRJ0�PT���L&�b|�+!�$<>��L$�g"3>K���nl9���s[���sy���hG�s�9Jw�R<Hyg�~%�_	�~%R�+��_���J��+��_���J�����1������<��aTO���mV��wC�N�wr�m���i�"�>���4��]��������3�|����*h������-z�_~�
�>���tQ�\����r�t������!b������\s�����q�tWt�,:Q6�Em~aC_��5��r��
.���F �j��>!��Y ���(��0��]1M�g��(�wr!R�N� �ss~A���p!V��0����F��<�
��M�5*�b��Of�_Fng��S�z|C�\	��O�f����_$|��u���vQ/`w����Q1x*��jC���P�$�JEU�������.�������RQV*z���7}���N��4}	�dq���I"�xh��R�#y1��E���D�����Z1w�\��P��~yy�?�H<	@"��]�?x���}��pQ����Oi�)���<e����L
���I�n�2����$�_��JS>���������K���-�s�4��6�\eXpM����g�`�t�|�d����JV�W�k�DC~LL�K�����d�p�Uu���`�X!>v��c������1R>v�7�	��'R&=���	�x'(��`�t���eX�p��%�Qw�����D�H�����x���x��>�;	��F$����������R��H��RXWF�x�N�Y��fy����B���;=��uD��M�<���&r�k5���Ga0�y���C���O�>Noc��l������>P��1�7�A�t(������������{Ba�w�e���B$� �,���0�ztL�S�=��0�@�r]Qx�d0
�AF�:�,�@"e����8N[s��z84Y���.\�ppx����p��x�r��M@v
u��a���rz�1�Zt
�{��l�x�m���FG���j�*��
�J
m�l���:��pY�����E�69G��^4���_�^��������!K���?P�]��0f�1,z�Fq�.T���	V�T��Q��s�T�k_���b�W�6T�0ga���t�J%�7�2#��Vke*����l����������,�e�*��,U"�z�2�?�e�FY#���]*���X���(7K-�q�[u��8�YH�b���
��!�Fa>0����:
8�q��&"��qp
����/)L���)�����4�.� �\��~���ip���	���c���qZQ���X�b�"P�e����_,H] �
�p+^,���q�b
��	*P��G�f�y>�^"l^&V^*@~����C���(��D~���*�.0�������9��X�v����
jx���J����Q�5*jGY]B��%i���TZ/�"�����2��?Yzr����r���]���@���pm�����nQH�"A�,$>�vbg0����U�0���L�U���Z�E����6��S���i���n�M�
/���6�t�k�����e�K������y/���0g����.����O�B5Ey��.���z�G�z�X>��;u�!��vCe��MG���I�>$���d�Cry�������	�u��$�S2�p��D�`��F�p,�F�p|�G�h���E��\����@�����0P��"u+,�5�k�o�������NR��M��D����^m��!T�M;�|'���6�V���I�&nM�j69����~B�<�����Mn���Kfk���E��N�I���j�Q���.�����
u{�#]2#�<��pt���:[�7/O����)��8E���C�qAI��N���5�][~wg������f�~����0����	mSh����o�9�&���O���B�)���J{Y2`�d�3�����	��V�?�<sX;��a}����%��;�U�����T����5;�U�|�v��4��������Z�������bwxKO�&t���L�@IQIQ�+�X�Fr)��P�v��C�\�P�v��v����u����R���bp�E���Y#����:�t��V3���1�HF������SKy�U��Q����1�#l:�-�[v��/���`�V����S��<�-{�D�2p�hz���b��.����h�1���E�_����a���oU)0?���K��,��t8Z�!����9,���#m
���7Z��PUW!�Qn���~�A���u���o���M�m���>�Jd�9}�������9S�E�]�
K��M9�P�KS�
V9�X��[�J^��;�S�:����b��Tiph�NGu6�����F���������FCSD�%!�?C53,�.IG���]X���Mr�(7������&��BQS��|(q��u���v���w��p��\�s�"F(
;'Um-g&���Q��.����H6�*� ���P��*�����T2Ufny���^-�h[D��]��n;�y6|������\�x���������7����l�/i��$����XJ�7	�����.p��M"Z�Vg���9�\]\�ap��j92�s�$����ch��k������Hw����Zh��n��� [�k���48m�4���=p4�hh
�.&a^����28L�(�t���Ur���*���H���#5���a8rdmP�a��`�#%�r����v�%e�#9bH�����-�aHWi)�������������}��6|���^��:�Syb�����1u4!���n?���>�����i8���.�
h-��9�f���?5�G�w�������|�x� ��$�����;�{On�z�Y���z����y�����%�w���������;�������i���,�z����;����=��M�1��c�.5u�����pe�VKT`�:����w��Q�K�����A�/�%BZj���3vz
��4�`���w[7�B8]
>�_�����zb	���E;����
���&x�����'CQ��;�8�I���0�$eY(��l���P��A��msT;��>T�	�$}�7��M������o4������`���-4v7���
r�No\���Pj�l_�Y�`�j�@���J8���������ShJ
�k�'���Z<��a�
3h�g���.�C"���X�����oI�h����s~#`���0��}�N��o��h�n+���,�a4����@���/W���,������{���,���)@����n\�6o��������������������������b��\N�W���?�f�ja�����0��t4	�()�M1��>HM�'������cne�iz���Q�~EL����A���~��B�q�_T�=S<o�M�B���-��������q<%��h�2�<\�\��[�	�'@��Y����!�������-[�u(RWw��]�9h�x@�!��P��,
��5��0����\������_��������U�5$w����8�::���O����~��m�����/�3��T��&$o[�e��o�Ox�d^��{�b�&��&��&��M
�M����7)��"��"�uF�K����P��k��.O���eBP�~�F�pA���q��F ;k�TqPB��CJ�[G�5o����{�G���Q@m@N�����:Fi����h���H�A���v�:A�+���f�X�{������bu�s���&'A#��h��2z�;�]�C,!-gu��ha���������y(����0�����D��|^�/�S�%�������q�V��*���Q/�X���f����K~n�S��U�B=�b�W�V&�W<�XRqC�(�
M^CS��J���g�f����%����
����A�����d��M2|�&9�I.c4��R�^��D7�)���b�nx�K7�����fQ��S�Qp\{F2��R���+UN���Z��4RFV�*/Ik�u=��@_���:���FwL���/iq��c����\@Rr*�jdJ�K� 	���.��a���e]�k�$,���j_�t�E��9���/�^,��4:m�4��>q4�hl
R����������0g^���)�P��������?{�s9��|��9��*���Gy�������.�����@�|J�e������S$��OI{(�{�0�d/�e*�Y2���0����{a�.-�{u�yY���r���M��� �������k��L�S_�����na6�W��H���AV�� �mqP��8�Z	��8�u10�>��������C����S�g��.�|C�G�����sJiO��~��#�~;�y�U����QX�(&�(V�(�m�6
k����
���"�3������������O�f�IgA]��]��@]���]���a-,��k��ZxXkaaG���O��gw��o�`8	^��o.6���y�^��}e�j{����_�/�J�/��o����6��0`��k���Jr���#%:�m�|�RR7�}�N<K�������X+��M��r�`�Wx����*|�/�N����"C1�8:w>��
������:}�{��g�_<{����[�G�`�����}��e*s��/l	����/L	����3h��Ko��Kj��VY�l�o������X�J��w���`}�[O���r=�xQ�����|7�l��}Z=6���Y�C�gI�6��\�<���'�qV_���r����6��?Muf>]|G�s��?!���Yp�E����7��{:E@����&��@�0�MbXC]d������t���u�.��r�_|�L/��l����oKx�,|������G�(�A�U���4�E�wiU�P��r��4�Ic1��3���i��V����?R���<���#Ey�
���(y��j�-�5U�����
�Wi�S����5�E�5��5���5�����G�5�f��I�-�xkr�}kr�kj����'��"dN]I3u!���S�������Y��k��w�<.���Y�B��}���E��U��pV'�A@���A2w���#�j���G����J<���V�;!��W���������N#(�+t��\��
��K�$B��p�PD0�!p�#h#|S#�NF��E�=������S�,	}�Rz��Y
>�+u�w�V�S0"�a8��Q��=���
A�&+a��
����jp`���m:���l���=_s�W���	�	b��2Av� H&�0K�Y���P�x�GYAF���C>��N��1w��/�iI����7���*�l�&]��[�X�75��=[�RE�W5�i1��6Y����Nu!�_l��� ��K��M���d�V{���u��;��YNV����������J�Z��|�(
�4N�O;��K>�c�
�U�$��@	��lt!WTl�@Ee�+�c���-�������y��������_������YE*��ff3�-�Y��a����b\�j�l�>�,�e�D:�+�(yZ�������Z{/�'��7����8C�B��`9��8$��������n,;x�U�_\��N���i�?�9#��WJd%p��(uP��k���h�TF��:���|>�=�9�q��� UG#����U#)��6����#;�������O��XI���N��0cWY��/��E�a����G��5��
W~N�?����m:�\��B���=���f\���N����!*�}$&m��<�Re�+�.��*|]�z=
}�����J�3������y�d����t ���R�ly�<���A,�8s��#�#`�.����ILa�4�$C_��[p��6���$����a�U����
����A(�I����A1����&� f'�������
:�Ad��l�������1���^���"��'��{��**C��W�����?����[����p���:��kC��{��pd3������W)�E���$kpv�wc��n�a��ay��A�z7��wxN�n�3�
d���!��{v����t42��+�� ����/z'����X��VC���1�s`��@�����U��@\^�����r��ITa�CU�3\�P���b�OaP-T������q�@��y�s��~M�lM�VZh�,F�,f�,F{,�'��)F���o��i�~��n��}Y+��Z���TR��c���s)r�p��	0G���y������-id�M�|���\�E���]�����C>�W������&�q^�I�:���vi�G��^!�T��@��T��A���
D�r����}�����_.���%Z�/
S�	i�5��Jk?,��[�b�m>M�,�����~��uq�����l�%j�
m�Xow���6�YJX{j���Y[��@�����<���%��G�:��>)�������BY����U�;����:���r�c�rG���>���"xS����B���8O��,E��7@>_���H�S�������*������P|Z{f�t��R��$�;�e�|��/��}��(���=���^E�	�;��P�#P��;���@�#H��;b��rGv�#H�jB��d�%�p��|��6�#9��peh�?5����	����4�7�>aM�x8UK���"����7��$]��~����<.�R��|:�>/�yT��u��b�;�R����2���mW<6#1���|�>���FW�eZDy'\Vu{�_�x�~+�����[�i����K7^EU|�6v3�S����^��f�Y?���=7��"W�}|U���C�w�\�0
Q�\���8I��x���S>�����Nw�z��?y;k5a��0��&�=0��!��������a�u��y���`��}�,7����.+T����U��q�O���J=U����]�e�g:[����n:(���k�t�{T��\����sW��sC��E�%�z��������&j�mz�����Q�����5h�T����8�K:�i�H��2t	��B<��&����23������\����=���<�sW������%�`:I(��a:I0����t�8��A'������7��{S~�)���G�7���i����7	��xl/���i�'P��O7��m�%�]��0c���W����u����"+���m�_!�i�<�;��������������v@�h�>�1��� �^���;�?+9��:m�������Mk,_0���G�������k��{�_��[���K
_j|g�
����(u����e�����5�D���0*�:j�T�s$[;�d'(@p�`�R��x)H�����������	Y|P�O?��pmuJS�����o�:�����t�3C����9��V�7a��X���]���8��{��"g����x�����p
S.��`�����J����41��&���EX���&������L%����M1_s��Jz"���+1l*a��@�I��$�Nj	��0V�l����i���9,���R8T�ZQ���GF��po�e����z����J.��]��n:�,�i�'��9��Y�0��qg���
�+��P�?�1���E0����Z��P�}�k���]M��[�#e���zmG�����
m�mA6Js�joa��,L��@��z0A5=��E���W�oM�
IG(00Y��8�(|�(���R?��4�e��U�hH����O�H�B/�o��x��>>P�B����������r)'�w�� ��(��Z@N��XjLl�$Q|��
���1Q;�I�'D��.���~�I?���
]W
5������Z6�f�k���U��M9oa�k�yZ%�N7[�c����)��Y{�3���s���b�bi�9_��[r?�g���+��&a^X��)"gk�U�s��(:m�o�a��On�|�y4N���#��+��s�_���)-ud��x�wK�0�va��b��j���M�����:�����qV,B�f�u�;
/d�}�Z���I�!s�iM�Bo�L��5h���4C4������n�4���	!��-�\�@�J\�P�M����p�=���xg�M���x[���E'����x<3�y��x��U)�@�!�e���m�/�|'�_�6_��5�R~S*#�qR���A.�bG ��,����I>��u���2����:#�u����pU��vW[?���(�+4��r�l5I%����j�ss��9�vTD}�����)I��%&D;���i1���Sm�Xf;!���{O�&&Kr��@u�`�y
�]���_�@v�8PIg0�WU�Y�Q�I�]�K�Qq�_[IE|#��H��%���@�Hm<�2�WhB��Q��H�	��u���J���'�c����c��)w��4 �H��I����9C��,^"���W��+V]��n����"kaT����^,�u�v�y��MT 6B�1�cW��mF��?.�x��h�qz7�!��0b��jRxl��{���?���uU��#�_'��3)@��v����e}�i
�Sc���k8}R(��4���S.�+�n�.�.�mnNs����&���k��V9��vT�H�����@*����M��K��K��;[��f��o����>N�u�t[�yF*b>'E�h�N]8�}�H����}����UK�%�P-�����f5������3R/���jh�%�m�k%x~*�T�����\a�09'J�	4�e��M]*|�De�
���t_��^�q���#Wh��E�zn�G����������/uSS��-�:�`��5�M���F��V�
����f�^M��U�����3�>��;g�>?of�E����7�����L
�n����&4�����Y�=9���s�����j������ ���lg�G��w�J��n?���Tj�������	��&��CH�q1�i����=��r���s��Np�L����z�{}lPU���M/fLY$�U�3���%��)�zj�n�{���I�>�a�!��0�������5��|��S��).Bj�20�oY��E��T��.>F�
���������VR�h��O��v>9q���g�9���M�%���=\m7�gu���
�-���c�]�ec���n����E�?�4J����^������s����R����;�Q4�5�3�����E�)���JJ������@f� ���'%kC����#Y�	6�g �b�(u,
D��]b��^�QV>L��Y���%��4��6���t�ih
u^a�5�����������"]	�	�H�1�~u�@��R����������l�r�Q�}��1wYij�&����|RE�N��Z3�Y�f	z~dl���&��%IP�R4)�U
�(@� K�HRt)�Q
�/@����>[���r�hQ���p�- �.<�z%�j-���R�o�<�+������	8q�U�����p��Ds���9�i�*;�yS6�T����*=�l�������?wD~����$��H�qD>�*aH���3�?m�[YPb7�{T�#se�Kb�rto���9F�������"�L���7�-���T����8���z�<���2��H=x�NX=A�
x����P�l�yx�C:VB-|d�K�G��;!j�#��[`���q�g$��j�����-��h�.��b:�#9�������?wO�Y�%��;b��g�z=�/�������p;-��"��{w�f��m��B���|d���G�s��R����T�
�8�Ai
��(���(�)�\���U��9#1	������a\B�x�����1��}w�c�R]�'�xv�^�����d��#��l�p^�������L�OD��8`�|����i�}:�����������#����09�V�������B\��������4��a��[����>���Y+J��#��0���N}\���e���L8�n��X�����Nw�&4VW0�$�����`��Y�{�S��i� �sn��M��h�>y3F�k��7���������I�=LI�60'���&�
~��|h k�<[$]||~�qS�L���2[e����^}�	�����xF�_�R
���s�st��;���9?H��M�]����8�
����'`B]I��qm������H�S���Q�a�g�]�3�1��TB��\�Xq��6���6��UZ������c�����������,��X�Uekn��^W��|9s�	a2Mz({ �
�|aBY��������).DS�}p&���'3�I����_�2�1�!�yk���'vS^�^E�h���Z�u�jw�N(}
��d^H��m�+Nz��5|�������F����#�[��*B��K�����x����B�0"n�����7�����v�����o�2��~U����
<!�&�0���m�`���&��Vsk�9N��}c��������x@e< 2P��D������x@d<�3������ -<AS���~�}>gF����_�v86��������z<��u�)�)��{d����g}���7�p�� ���P����=��[xKgQ�t�<�R��]'�R���|��q�����_������Q�������_�8o�����p�Y�f�U��62{�����s�\��7<�-���!�l<����N�����c�][(����^����`�t����p!|\+�����j�)�����8�I��6OR,]_{W�o��9>Y)��s�"�Q�H���Q�a��w�sJJ�	l�l�����&PW��7x 5wP�%��9p?d\L
�q2NOm��OQ5~u�]�}L�o�|l����m;��xy���.zS��D��a|u/���Q�x�i����S����R�h
D*��C���@�P r(�8�
,"���C���p�=�X������=�~��X��D�jb�\���u��dv�puY�V
��Z���!0���EL�"o�,����5X�j��lD���5Avq0����~r��������3?k~�x�[$�A�):�GA�1�H��;��Kt�����cz�Eo��������.��Vi�8������"��(X��yt_��w���*��Q��zX}!����)s���g��V�9����,)��p�yk��Z�@�
P�^��-�l���K��T�e����.d	uxV�=���e��w��6�v���[�Q�QU��0�����@�<3Zfe�g��(^��+��
�9�<����5��>�H;�i�KC�|��KH�E���9���AM���_|�4�����2�#��
D2����E�����#s^���	�	��dm��z�C�F�y�K��L*�q��E�g�4�o�y9����L��B}r��.8�r�qMH?�:;?�!@st�@E���z��Y��3���#��?���<��mS��������������:N���x���Biu��[k�\��{�l(D,<����-�Bx�����k����K�E������8(�WN/{�!�*�
� �#����m��[3G��k�? ��Pr�p4L���1���DRG�u,OK�4�q���:���@�o�	1Gn�Dvm"_��u�.'���	�RQSgeW����N��6�1'����[�'q��+�qKy�Q��e��GS������'v�.<�\������E��m�z�B.��XGi���v�Kz
������l-�|9Jo%*�V�i��M�����\�Fq���T."!�����d��q-zz�$�t��d�4��JE!<��jj�wa�xF���!�x��^������}kVwxNGM�i�(7u9����Zf�>9�xyI��7���:l�;��5�v�i���2�:��pA����0�-�Mf��5��u<�y�kg�� ��0��s�U�xG��wt�5�������U�����r�O�e��~k+)~�����v�.�ek7$ �Yn�g,��:��`-������j�:���yx �G���R�h�Z<������#���{)��P+��\��=��Cn�����:�����x����������n�`�|�h�|��\=���U���9$��^4�8y����t��������<�%��	�@�fY�es�F���sB�=�����w��������s+�r�$s"|'�-�{���������o�m�������X~7!foW{J��0����[�$�*���N<sS������!��}?�-6�G��F$k��v���&��m�q�������<��,��2g����Y��VX�	s�1J�.H�eX����
�cR5�E��J���y�������:�D��I�x^����xinBg%��L��W�8�3�O�m�5@�E�j����9���:�+k2J� �]�'�*�qn���D��C���Jj�������d=�\e�+��xTr��\���U�r�,how^���m�`oN���������P'�OSGx�,��_�D���7�����[���dDx�������Z�8M�L��g����3��'/�N�z�P��Pi���S�����0��!�>I\{=W�(<���2}D��y<���������:�V'uwg������c<�x�m�*M����H���H"����E�������1�����n�2S_w`r�z#�>Z �Mh�)'��c<�7L�I���N�5J1o*+0y���OR�����]W��v�r��2Y~�����B\]����w�����$�p��n��\�uh_�g�����R��A
<��l�^,�0m���PK��6G�6����_perf_reports/20GB_preload/ps_4_workers_20GB_preload_0.1_selectivity_1_task_queue_multiplier.txt�]mo�4�~����]6���IH@)P�q�-oB��&�64���������=���'.w�5�����=3���{���x��Y��1wH<'�)'W<"�I�����|�$���B��P$�p���H�0�d(H�}��<m�f��y2�.F�dd��?�/ftfA���.�����`o���W�,�o�dUC�Q*����v��>����������
?����\~H~�a �?����MO��tL���1���9�s�%R�?C������&a�|���0~��g�eX��"��I��)8}�����|�=n��0v����Dd9���(�5���~�}.�I����cb��������/���,�{��pS�]���q��)���s����qL"��	N�(��9��������Wg�~~vI�����������g�n��5[������'��:��D������/>�k��WlA�@�_��-������g�	;L��X������d1�,�����j^���p�BN������qu����\����:�6����G�A�u�����}��hnA��c.���~%_q]��pW,���q�����F�gM���_''';��rB��x�%����D���N��?�/-FS4�;�$��&��c8����t43�$�&��noh"[�cHt���
7Tpi_�k/����;��\��|Q��v.���G��*�;s�����*Z�u�-LY�P�_���7 �t����7�]�#�����D#bkN�8
8
���i��5c�esv�����v�%��|^�D����/hMF�	���|��.���T��h�n�x��Y�T40���-/���a�m604J|Ea:�R���oB��S�U��%�&��a
�������/lP��R�x�����M�pu�. ����#�k���(��ba�U>�w)���1,J|Ea�B��#hO��?��&$���c�;���4*�&)�5��V�*	c~]; X5� 2-��}��yd�a�Gc�!�������r���������|�6�&��!�
�o�H���(���"��h
U$!>uX�nh�rG��|�h�y����,�0|�)�%����������H��@����C}�#����bd�����u��z��~���U���S�������n��A������G�����g���xl-��6����J����3�Z��iL�kY���R����M�J��ri�Vb-��~����|h%�eQ�a��������uk�K66t+q�0�P���`)k�p{�[����fb,��b�[�Kc�[	���v��w2���X,��:\�2[:��&��:Y[o&��v�l�0�>^9+W[����`��6��{i�r2�2����1Y
Ek��������l>���6&+�a�2����)������v���L���t=��|d.�e=���/����0�~�+M0ev}�N^��A���bf�g��M���Ykm���5�)�^��O�s�5���E�&��]�m�N�S/���A9��_�\�K,h��
T�+
t���
����8��;NW	
}x
:r�������5	tT��p�J@6�P�Y1$:��ZD^dY`��J%�p��5�\�,,��s�B�9�> ���k>h#��3�W�>%��`��K�\��X�K����� ��)����������������F�:���_�
����&�e��R�q��t^��s�uM�Y��JZW��F��V�mx�:����L�s�u�nd����@kK�4�����DQ�����d2��!���d��U�8���%0d
���a��������Aw�O��`'���N��q�N
B�J_����fJe�xQ!�E
�e�����s�>��������<�I�4d�t4+����dw2aL�&����
`�ier�}b��B�T�K�c`�m���Z`5����9�6f=�M&��eo�yg��|���1��{������[Zt�zO;q�-O�\=�=f�2��7������/���dzf��4H7r�)��tS\RV����e	�X����}�	�
��d@vP�?��0$�x�{(J���JP�o���7}����h�;�v��r��!���������������k��X��������� �u�V��`t�8�4M�e��+sj�qp����������I{W��F[m ���2�u�<���Y�gP��c��)[C�5�B�]����: �n](�P"�_�B����Q'�l1���&��5��va�� 2���k���^@0}��L}dz�|�T-��:�lf�I��|v�+J6��G<�k�NVx"*��/Xb�vCd��D>���u�<�I��`P�D��������k���5�D�2��[��' ��D���u���E��M��LE������H�jTA�8��*�<�M���u4�S�*�So����MB�5�ew����M��sq�`�k�>��(
b���:-r�^;K�oCgH�o��#rc�}���-����h�<�/]m�����i��`
dsl.��h9�"����G���[T	������@�����=�;�c ���98f�)��p��"t/~�A�=L��<d-'+���7�3�K[�VYQ;E��?���Ni��@.x� �_(VR����6[���$���T���]����������A�q�F	�k4,
���Q*�EK>��������Uj��+���r~��N��E
 ���E�l�����(��1������������<�^�Gg��Y�-U���h��7��fQ���rB��qH��C`��=����������>zv�"��
2�����(����(	��4d��G���u��vg����7�4��.�B���'�14:x
.����,�K���S���$"u-�N�SQ�<�H�$�����^y�L2������f�Qh�\����%US��S1;��F���A�tw���?v�e��_K��D�*:������iT-	��"���d��ST��e�R�j<��Y��\����������b�����wZl�z�rtKA�G��Q�p�0�����Z�RuP��5k7\�w���"�M<��B+��:D�5�D�B5*��?�j�����R��g�Q�+]�k/�a����OX���l�P5��fT��*�V� �m}�
��f�A������h���:��\d ����(M2���E)���w����?� �Y�����kp�_�I���p���a~:V�U����c�G-W%�����I�AC^ �F}C��k����w)Oy^�|y�D�A�N1;{�2M�Q�����m<fD�G����i�o�\���.{�(�+b�IuA	�t�F����
�o�\
������!0��lgb��x
Y�:�[�����kp��Te�[y����0��C�����O�L"�h���
��N0%���J�U�3?��V��;}���!#mtUG=}�q�d�ZW��_�5u��]$�k�P��k�`3�;^�fv�E
���G�oSF�j�f�y�<.�)������9 Z�Oe�e��Y)�hHm=�>�[����-�����}�{�>���V\��]i���Z�^��k������#�,x����b�������n����?P�1�^�B��j#�am7^'j�9�?v�;Rb����C
��Q�q<
2@�Q?~�ZQ�f���(�fo(���"�f����J�{Pc�z��Ne�T)t�E3���xQ_vY��5�
Ut���p�/.��7�������a�m9\��I�$�1��Nb%��<
7����@'N���:�J/���b>[���*;��.,�q}[C�a�0���T���4�C���5 �����>���<"!�a���������j�V����z�2������tQV����������\��Bz����|mR�"�]
��e����l_�>7]]G�ef��� ��9���+�s��0���0U����}�b����OqC��^������E����=,Q�*O�/���K�9|
��<��������5Z��LZ�2��/j��i�;�V_������'�7�B~�,r�N��L~9�0M�K�b���V��?������.����z�A7���h�,V���2��#��u�[{P�X�m/��|Z>��!���sGm��(�����wY��(��"����%z�i��R���	���v�����t���"RV��,��z�(^(C� !�B�)���7�{�T�>�o��d�E*&+�?L"`���=z�,^E
5F��J9�$e�G�x��\t��������ni���Lw+t]�*�]��=�@�j���A>�B��V���5d�)�<|8��y�i�����Kz�ud>��|Sf�c������'�x
1��
����s��&��)R���
_���u��_G�h��^]�0z�4E\wv�OY�o��o,�������;�M����,g��C2�M�m�6���<<a;K������1<[x��V��S�<+b������0���������|��d��l��-�^B������W����>�%��>����E��6~�qQ ���>kG�{���-+	`�/����/+���������iDv��;R���00Fk��'��`�_�i���_���
*�g3���;���j���h?���u&@87��gH���!�m��%}��zX`{T�sX{��?���B��y:j^�i�|���1Q����M����3��b����V����<��o��_
R)>n<��B����"������O�t�!�]>%�>�Y�x����$RO�|B�q����GR��C.zt�=������������Y��N:���/}i��~QI�|�`�(�Y�f��������G)}����sQy��z9����i���Ll�C�d).c�.�^|�8�E�>�d�S��B�=��)� ���N�(�7�ik,XwM����tQ�7�� ����.$YJ^�5��[�;��W3���gp���=�l%��x�����}"3g�<�����Y��K��#B�'7��?�:�PK��6G��=
�F�&_perf_reports/20GB_preload/ps_4_workers_20GB_preload_2.7_selectivity_1_task_queue_multiplier.txt����4�?O1�HlH�-�!�S ��l���m����.��3N�6����(|�?v��O��<3�_�7����G�jS���u�f#�k��������>��cO<�w��U���Z~p����B,ET���������Y��r,������.�\�b��(�<��k%�zs�-�"���{Y�7%��L3�<|�G�r�~��2��e�X���+���3���/����+��l��MD#�?���m�4�&�l�����������Z�U���X�����w����\�{[�]%
N�m����4{�~�B�y���
\�!\���^d����
���(_��q-?��B�g��F>?�?M�?:���!^��%~�}�E��('���f#���Wa��������4����3�?��{����
q��o>���O>��H�E���z)^�
���G86���z=�}���o�(?|��8,��k�������Vx��"�7���h�.�����l>�'��k8�#���(�w��*���i��6������"��Z�~\]�K��j
?���O�G����{��8�7�([O��G�'(�y&b�������{���8�tuuu
��N����
K!�;�5�n����h����S����������+���g�~H������z�uX����#N�O�A�
�2�������{� ����.�������|jM|���~�����4+���u}k�����=z���K�.Z.��U�S���'�z��H�**�Z?���
��������`��.D�'V�*�7E��*�*��D�Y�Tlj�eU}�W�7Y�$�s)�*n[����D]�V+���B)�{���fM�}�.��:MT���2�u��Tb���\>�(kXRA2�Z�����| N>P'������}��5�@�|PO>�&�'[��f>WS}(��7R�}Ui�HE��w����"Alu��#�U�u�M�A�;�����������\�Y����d��e�)�=�r��_�R�_j�Lez�H\����B}���r�'���_��_���qX�E?�������19\�l��T�4k�Z���9�3��:��\�������#SW��H�Jd��W��Q��7Qa�����	��h�_��������=�"vptS�F5��������@��FF�(��a�E�5��k�Vg�F;��u|�-�{�Rs�"J{���nO�(����R�#��y��DL-G�#
W���Y�2 �Kg����Q	���6!l��B�JPg4�4�Gb#��V�I[T9�
 ��HC�Cq �	����U�������J�a!��G�D��p�x�s.^�QQVE�:��p�\d�h��JiFq�Y�� ���%�K��
�vEa1c�>�������=���"W���_�
g�����0����]�%Z����/J���,����Q���:�@�S �)���s
�9���zN�4��=�9��zQ���u	��/gvp�8\�W��Mx�Y,DQG�Bl(�{���f�s{��&��M���)�mZGN1�Y�*������N������[C$%����}>��T.�X�����G�:��$�X��,�5���J������3����&�P�<��V�l�Kgx���������DA����{$���Y�+�k��Y#�%,�%��e�z�2��rG�����3��I�]��m����w]��!����C�������nl������u��OT��/BB�DQl�|@����	EL)�W50]�W���% ���� �%�TQ�a�E	���y����������5�Y��/q2T�Dx��1��Y�9��f��4�Z�y��7�
��P�V��k��7a�����(�+�06�D��q�_,���>�*����o
������b����^��B�~`(j�!���v@����]����+7�/�h
�I��H����-G�|9
�Q�/GAc9
:�Q�/G����(���7=�ed������N`y&zs�=�D*<�����=��f�����dd���/���}=k���o>���U�a���/�����v��);k�-�h�ZT����)c�Gp��T6�vI�83k��fLMg������3�Pmi����pxo?���i����7���0#�z���
c�����i[�,�=��O���c����#���N��c_�$����d���'����D"���%G��DT��S[�8����� �t���0� f��P���3Qc"u����L,iK(ae"��rAFkvr�fh�=r��Uj��Y�B� r&���1���1n����N���A���f��
��oa�#�]���Ue�E3
�1��yJ����	E8cF!A�(�����m�Q8{�;u�F��[��=E�\"?��&�l#-�e���ed~~��V���R���RJ@�J	�L-�;+4�X�&������dB����bL������2@Y�����a�u��hh�]=([���):�w<�r0�0�:R��kaNmL�d5a7H��<<���>�lz�Z����u�c��2�N�i$��������$��R>�rx��d�.����5�4d>�e>h�|����gx�2<����4�;�%������!��������t��el�9�3��V��My�f:�������)C�i+����%�2"h �������=�����Y�0`@0P�L�>^e�v�(Dgb����(��������(A�jN�-^+9R��q���63����Nd�z����n��^�Qz�j@X�q���.�-c��ob�|���8^9
��:�?��0�Op�{��&�-	���Qn����q]G7����D�0���W�u���d�/��{��:���������aU����I�����i��~��������g�*^�^�$�H�j�H"�-r
��(z�s��'d�>��!}U��������G�f=�`�P�(�.�?Cg����[�\�.���������~��~��x?��,&����|@N�%��Rk�)��5����e@���=��j��u��<�@�| M>�'X�����i�e��X�s�����L}�<�^}�s@�}����3�k-Twpv�������s^P>�%$3�CH����q`A=�@����}�M&+)���fiv�IN��m���*b�7�oc�i�=���Mp������������w���(fy�����Vv��.�����>����`+2{8s�#Ays�;�G��6~`����"�D���@C��l~����A���N��Qo'^����y����<�w�5��A��q���1��f\���&�l��l~���Sk��l��g�p`�p��p��p0��w([Ob)�"�
M�����>Zn�Z[�S��d���O�
��`��7�A�����?������k���V]U|�,�\M\�p��.����c'�����9���C7>���$�����,[
�p�2�p�'�Rw��������`��,O��v32����q�����#	<$���a:�:�G�]�{q�9���`���ww��9(~9���C�=��S��o2Tfy�U��Y�y��f��}�z��qa�����������8�Y��L����������t�����K[����	L,'0X*��xF�K��Tc�#�3�	L]g,�s�XM�
F��=p_��l��htz��#��N�#�m�|�?Q��<Q���C�|�n�V>rva+��Z%���9�"�����o�A��%Zwr)�����;���&`Z����Bo���<lW�<0����
C�*��(�����C����4��x;��1��I\(��2�����6��L�D,�bV��*��@!���lL��!��d� �)�#�i�R��&���7&��I~J�!�7&��.���
100��_�x��[��?��-5����$cZ�pi%�����,�S	G��6~@��6��
LS	�p
:�2�@���A,��(
�]f��}`d��G�xt�o����:3��)��S����Z)�
�+Ww�x�X��S�1�Z	}�B�j�4#�0���	����!����4�@X`
,�����+&\�>4�\Onk���<9{�>)��NA���l,��|�5��s�������)��}5�����{P���<�$q��#����>��%����OOg�&�6���1���p����k�`�_��A�A2���5qOxj�������X����u>���&����T.���9������G��/3���$���Y�\��5;=�`���#���������/�5�0�'C��'�r����dp=�C������P��AWs��u<8����_�V��	����x"3��x�.L�3���E�����2�������}��[!�V��o>:x)�'d������O��p�6�����������I<Y.}�\��������.��</[f��d�2(�e������x���7��`1���zdxn����ex�!�A���D�{���4De6l�6�s	v ^b[��@y�� ��8MiJs�����@��������y��g�������x_6��N�{W�.��
����C�t�y�H�����R�g*�L%��b����L���xT�`W��[��R%�Q��~u��as
�6���
�e0C	��p���R'._��Q�P�i�h�4�0�gI�<6Q��7���u�6g�:���O���)�
p�v�C�~�����W�c'���-��q�����M����A�p*��>�"��Qd�iwCZ��dC��6J>Bd�����g���d~|���:C���V�[�� !u?7�����t4�Y>��*��K�6��l
�b�@�me�l�,�j��=��H~'�J�0����cp�B��x�O[��RiN-�'��T3��,�6������}q6���Cgp�q~;���y���!?<;H�
���)���l45�:����"��
�\M�* ��`s|1��>@A>y��[�<��c�Mr�W$��bH�_9��W�4������Kt�f/���S��Fh�z���h�f����\���6���6��:'E^�n@����^L�xDAHurv�nt�v�����F&r����n�'1�j������W���2Oi���
�������B�*B�s������-&W;�1���i��lZx����7�Os�)��Z�1��������w:�s����i���y���B���M�����s�'��f����a�-�	o��8����]�C��g������<��b�Y,���;����>
���E�,�6���/���g5cg��v����7�M��~��<_���k����#�/���f��yBcv�7!��:1M��q�5� ���j����dc�l
27,������ib!��h��&6i �4���M3�4�P�������8��)�����W�����
�$d���a6���@���^p
��h�����<V=}2��S�|4��W�|C���f�Eq�>�M��S,cu��\��&���F�I�~�Y���[y8l����KH�����!!� !1 !^�)T����X�8t������M-;���f�1S�o��LB� �Z��%YV��	<Z�a�
|>�&�dh'x$xt�c�������	<���j���]-����vc�f�>���3��V���K�2u*��){��iX�X�x�lq�0�K����Y���B[X�-&�-V�-��Jh+����%P8�Xa8m:@<!��b��b��	���|:@�ixa�A��E��n��-�����ew����M�A��/^���,�f��L��L��uQu�����f��8�'�4��aC�j����E�6 �1�Ie�F	�y�k�������zaM�!N�l�w�2dQ�6����w�A��8,8F�hx��(��[)cWa��0�:��L��t��y6.,I�M��z�t#	�n?�C�{�x��AV�?���D���x!I=�H*��<����b9�"<��>�A�"��1�	�c�"1����b�Dr1��>����2N��3T4p7	���&���)<���7:FZ�H���Z��f���U���!�x��l:����c/����XX"�I��u�������������u��=}�����'����C�D�,��L|}I�}H'i\�r��KM���{x��U#�~_uP2�L�6��IS�Mc���n���6��4c���m�!/N��Q�{����v�v���Rz����������[��w�<�v�y5`;��X}8`I����}T#P�W�.86�F����Q\�P|$P\tO|�N�N\dM�L�L"P-q�*��'q%qP",�����J���D��%���&�"H2�n^�I��wb��8p',�:�?�����iE_~uFG�L)����R�H��3��a���(U�0U�(1U�0GU�8]U�(sU���!����T��E������t��������mDr
���J�[����e����s�$��XzT��Q��G%���8��f��c�4�U�,hh[V�#h��.�`7B[���:,O�eyJ4�S�[�����D�G>��BI�����{����RT8��A%�>��p+����7�F�
�f��!����c<w�-�{��8`�0V�.��j������nal��V

^���r���u���a���br�.q���38I#�C-�O���>�%,,}�HXXz������Ta�O�
K����Z�b2���KK��[�,������J2��w�c�3�����������������]���u�(d~���_w�B��I���~��Jb�m2lm��7����7�++��WZ�z�a�r�����T��V�;���n{���s�bn$���us����k�����|����������r_�;&H�y��c����V���p!HFV�/<���������Om��6�Ra��e	���g�b`��310�0��� �y2F)p�3����#���3����u�+io�\7�����1�[b3+�eK����n��n���[�r<c����S�'M
Wb��t+�Y>��KQ��SPEkTe�zW�F��!�9��Jy@�c��?_�y6�N��7a""�Vfu�����$/u�������
?� ������(�������b������\�_VPY�V/����E���t]��s���l5�o�4��[{��o������,A��O������z[������?���g��8]�
�LT��*S����T`e*�2H�
�L�����l!�{�����%�a���U���kO�@>��e��l6�������B��XT0	��?Q�����w�{�w���75e��B@�t�s���~�8���nvo�����;��:�!0PK�c�Y)�k:��,��]��t�j����K�4�y*Cv�6.�����fW���k~*��_5��V�����)#��L���
��������d%'6i	JS�4+����lp>'�'��I���A�SU����@��8��lt�;�v�4"��iD�y��������M3�W�f�������|������<�]|���f�q��8����U�9�����Cm�G���8����`��0.�f��
&T(�����<b�C<F�A/*@�\�����C���-�K~��ZG6���K�&���*"�]}Z�g���z����#5��p�iVT=�.�,Lz8z6��������������������q�����5�:��0>�RB��rF��J���N��R���V��[�i�5N��.@a�1�Z��[�b�\�b�]����;���k�v����iy�]�_�;���i����h,��f��'���N�&�q���G��vj����Cz�}`:{���������*�������d;;j��)��N�,�h<�A�?���k�yj��y���sj�N�?
N��.�OU�i����B��/5���9��X�K���P�K�����K��������������������������z
��Q 0Y'�GX:d�E�DKEm��-U���S�6�Z�.�[����Q�6�Rsk�c<q�=H�l"�F2k �d�4Ic94>�����,[�s���	����I�&��B~���
a�x��A�BG�&�a(�T�vG�hUh��F�B��5V���	�p�3V��q��&������������f��Q�����N<�W��X�������7I���]h �.|
!��,�BL�<A�VF���������8'�5�����Z����1��[���j[u��H	���xF��M�'=�^�g�.�.��|/���������Q(=�3�R�9[�=@z:<�']�a8��j���w�7�c����&���Z|�(�hrq(_��{y9&@����G�����q_�N�G��}������U	�G��Lv���0���H��V���X��o#+%9���a]W��X�Ek�X�1c!�B��H�_[�*���(4
xI	2*)��+��jA��2���B3d�ZK
A:�&�\H�UR?d��
��q6�CGp��
���*�Q�}5�� ������L?�
8$��A���eP@3( P
x��AA���b���E�P
E~<lvCIQK-��E�p���C4A��u�~�����9|�k�c�x�l�b���e��������]�r�/��Q?���*�>/�2!�IT��P���FJ��#�{Mh_������f�	5RU��) ��
�HVLuFt%X�����p5`�L�8}�������RZ��}���I������ FZ�x,�i��"Qm�����"iK�b3�Ft7g�h�����UO|L=��+c1
�/C�V�"c�VjSE"���������xN��sBM>��'D�Uq�E-~��E��-�>7�����Y���v�Wq%����'e����A]_��b���!���y��M^�V��$���� ����J�?���@!�	��"��C]	��Q�`DQ������-,*���
����O>���y	��������8�c0�;f1w��7*���H��)���{
�=}�j�|���=Zs���=��2D���x��`�����Lb�{�+�"^���k(	�������?�C�xJ/����X�>:��A]��eV�k1d�j�;�r���>�cA��n*Ut@�g�g���>3$P�9���2��??����rI���j�����H��A�|r���;�e�#��{re����v������Fp^����NNw�UO�$@b@2	��PL��R&g�����������)��B*�X�5d�����k4�3���g�B���+�)����r����&r�[
��:������4q9�7kmx�r��/�8�������z�y�F��xFbNOM�����>�>;*N�$1�'!.fi�Q.�O]�u%N\�v��y�P�����D�f�@�I(<zc��T���{�8#��gQn����z�����|�o>�6��������Q���i��^�X�Jo��!��~��	���C`P�.�X�Z6�rZcI�!���,q��]������-��T�8��9�Cs^���Uy���
G92��d<�	9�	9�	�1NI������y@v�f5�F�3������	��������pz������u��$\{
���hy	NM'��jQ�r��lz����!��7��]��#�F��x(m�d���Y8`�G��hM�u%����(&Z��u#�R������x�E�I��D�)�����H�u=��8�$����x�e<@3 P�x����@�x�b< )k����w��~]@
e}�g$��Z��u���*��Hy���������6<A��9-�����x,�b2�����)�4��6��b�u1�HC�"���tv�l������7��4�Or���Y@6o[�PP��:4I���t���7q�
�r��[w����,��:����������K��/O�O[�������>J�Yq������qo��j}'��U���fu���|9�+�%��5���I�E�k������]�oN�hK�y�U\�l]�pgB�	q&�\H����TN������R%�i4��PY�XLY�,$*�T���"�r��;�{����
��b�p�c�)�e6+�2��s���X���vX��kxc��{a�N�����'��_�x��]xac�GKeyLS;N��9��>���F�� ^�jAi����^�h�3�:�hH�[]3�wE�9�7�������B��e,�0��4?����@�O�6?1��qr�v1*�ZI���*�z+*�`��6�tv5�^����o���N�;M.aR��!�r�����}l���%�e�uV�!�N�-������3zN����oe\;����(c^��W�8���W�ZD�fK�x�(+�3OKY��x�����{�j�I�h�^������v�9��f>r�gBU�g�B#r(I�3��2�p��/K(��q���a7�<t�ez�Y*Q�W���<�C�Z�t�,
]���t.HY����?���A����!p��f	�wh�|T���Ul1�BW1������X2w�����4!�}WG�0�����=����\{�7���-�X��&/����2���G��P���W{������p��Hy4�o��f�e�11�sW1���3�����vsl��sjFT=�!����qW�����6�gI�%�W��]�� y= ��Hp����t�|�u�sz�*�5��(7��7R��&_�U+�������0��y<�����f@�1{�Y�u�0~����0K�7���jm�d�$�E����:���������U�Yc)�2��"��E��b�qV�o4��������"��R�El�uI_sk,Q����Ah��B����w�l�i/��r���
����������^f��ge\����ZEkk$����o�%�ydO���$����80���:K�+��m���PMC_n��(T����e���B�j���j2ht[�fe������r��%���g��|@~ �?P����c[�z��^��B��2:!�����:dQ�w.����)�����������9 mBe�L��Ku~�BeO�
0K3y����N}p��������8����OE�l�������3����x,:6���	S4=�����*b*��C[''@���5K�>T�U����������U�����"P��j�NFx��l�5��v�7e�����4/���%�~#?(N[�2��*�>��D���N���euhn��1�c�����1?��[F4=�C5�B���E��E,0�Gv9���B�����%�m �+}�1,����r��UY\�+������2�(sq����H����$�1���Z �zA��0o2���A&"~������Z��e���O���/Hq�����O���1�~�A����
�>Oc7�4�#~�1���}����[�c��������O=*���
S]���U���N����-��t���g���x����B;{��N��� �x��m��b�����_�+Xm7a�s�RN��6��aM��E�3n�P�����P����*:c���������8������
�w�[|���*b��B�&A��\J���#@_	{�2�e9�VAw�x2�����l0T�c
�by���Fb��I��4:�&yq�<�����<Yx�y�#u��xc��T����V�m�b���"|���Q�K����>���h�M�\~��t��[�j�<W���E�{8�fSw+�)fv�������<N>.��{;��-��[�\�(4|(�g�z/�I��T#����
�!@�@! H	�B@f! Hd� ������Q�~��il�w�/��N>=��
�0�n���&I�u���\:��7�7{~1��}pH�QJ5IF���cL���m�K�;�z�w{^s`��$c������0+�hEFB����@�sdt�����U�S@<���3�/��W�t�'�����]��tA�{��V�y�K
k��G�J>������v�<ozm>���:6�[����e[I���Z<��|�1��MF����C������R�n�ZO��t=�����I�z��$XO��'���p�A"�A�� �� HyP�������.<�Mf2xx��p�@c��?�~�V]������v�$�~��]
����Q���`���w�w����qV�+��NH���p��?f�I=:O����I�E���$x�P�<B�G0�A�#x$��#3����P�������4u7��<�,���U���TZ'"��.��c�6K�,si��*>���������c���J,_��Z�s)�����;�g�?OFeeg����5��[��er�jd����7o��;�N��i/�c'���� ��������Y�V�}��Iv�b��
�t����2`r�s�ku���u�*.���ik��� �|z��
�_��.�SM������2�!�gQ������#t{�Sot�����A���;o2�����>b�T�����ne����:�O��+��&�Y�A;�X��92��$���b��vLI}.��d����*�.����8�Nr� �x��x��x�\<H$$���%�{+A�E?y�0�qD�0�Z	+C)
M������],���8�$�B���?�g��H|q6��#j���>��u� �]�]�!H��!K�2A�����T[x��b���KA����dI������}��2�Y��U��'������#�(��"��q�H������/����Q������E�����7��j��?[����� Y�)���b5���d~��{f���gs��� _~�oq�o�0Y�N6n��Ge�\��,;�����p�d>��=����\���GQ��`~��#��qI���,4$�hH��I�����l����A"� �x�T<�*�-����%(�H��c��2�����!�����������Ti$��1�����m�9����;��z\0$[{���{��&_rm`
��uT�c�������}�2�����x���uZ�ekA@�4�8~x����s�G��R��w@6@c�������� �8����`"���D���Z5���x+����+`���\E���l��c�x��4F�$K�TX���T�X�P���2,��U�����Qu>M�.l��`�x��Ks<7�w��W<����E6�L8?)�B���V���L��w���6/��2m�����h�����`������'N��y�\��}��'51V��O�P�`t�;oP>m$���5�gPq�Lq��8��ojn7���#o8������F��z	|%QH��>���/���:l;�"���|�^�l�������7�_��6����t�p�5��I��m>�d��b�A���r�Z]sM.:W�SO�����M;�F��3��� ^�
`���\���0�s���&^bco�/pV���:�����s��
�l��w�E���,�'+-�:g��� �nCuw��R�����E��2�L'�-��TK�En��,��s����e�J��	C(�%�"�P��PJB	G(�J B��P�zs�]���y���&�b�x��M�&Y����D�l��"�a6�<�P�J�-��$��`eS�������y���c]K�.�����O����k1k��������i��#K�YD�����:������c��h�Z�@as,�����7�=���}����I<q4}�8�h�"�D��M,'9O��<Z��
�{�h���6*���2�5�{����R{��R{DA-�	����7�	��o���F��+@;]h�-5d?ey�	���co���(��1������j���r���}Tf�)^�2�1�����l���L�o��-�3%%����7����c��������X�Y�T�K�e������sq�,JXW-]��C�^��� _�|��$\M����H�F��?)_<vm0�d��>���jHIf=	7��IG�[D����\���>j�9�,CK,{o~�%�������f�����+�=��i������s�4_�������|�&�d��w������J|m1%��S����t��������� ~PE���*
W��T��,�����E���|�:z��/�J����b�eY����IUw��ik��7aq�����V����"��b�vV���v�F�pe�Y�lW�'"����e�$�iO��?�mN��;��n���io�>���7�X�i�"�J�b�*���e��5U[���,�,��[]>���_�.����<I����M��Me�*�`(���/���P�0K ���,R�99��������qU��k�k.�
c2�~~�4L�V�3x�O?uz$
�4�+$Z��@���4�S�f4����x2���|��c�3����,������t�P]�s���MW��I����a�g c��7���������Q��n/��{���9F��P�G}�_�6�k~M�,�\����p�y}�	��
2�2.��d;IJw�',���{�?�H�>�FW=����T/��������T��\�]����?T�Z9���Z����|/��V�J�
1��W�P���!�g�3o6w�	�\�*���l�[
�Lze�������y{��*c�~��D	�p�o��Q�/�*�w~��S����<�4�v*H�8�!���7Ha!���|��?]I����O�N��(����N��e�c'�eN�U������L+A����]��&;��I��5	�&���t;5�d��xj�.���:|<��qO����B=��o�WAl��W��3���>�����/.�\~�u�q��1'����TB�BT�uo7�<"L�5�>����.�4���@��4�v���tr�KM�I�������K��D��'`T�G5��A����.*A�J���hQ	\T2/*A��I�t���V��rQE@�������GC�HVC��
l���^�'�����<%�Mk�h�(e:S�>���X*o��*P::�R}d��!v��i���yaX�HQ���P�A�����n���%���p8t�g�Y�
�X��-�	�'|��0�����S��{9^�*�/R4�B���O��t��e�*�-�������a+���}��d�wj�*��)5���>���b���Y�A|��e���P	{�@�R�w�����\��B�5cUmc�\W Eb�������\'8prCo���h�I"h(M2KF� !i���Z�?}6�y] �M:���r��X�f�����P�������{��|c��:��%�����b���<����>.��D�����|��E����������y]]!��IG����v��[T���a/|�_���<X�
b����t�a��J�����
_���.)�V��Fh�X�H=A�T���=�o��y6��/������Z=��6��%�>�����dP��3��+��b.v�?o�W���}����"h�_,-��E��"h��n8��H��J�h��O���;m��Du�s%�hJ>���n��jwxI�S�C� <N�d<|�LS�{�C�y8��<�H�a���q��Oe��Jq`�h]{���T5�}���B.U��$����P9WI:���d�@�>0'X�0���5�>�9�	'_@�]6{$�=g���G��i�-��0��H]T��	��	��QC����10��n���Cc�O�������a|�WD���!n�@���FY�u�����c�{��]�&m���c[sA"��4~�8�c#��}��������Is��	��!��{5/t�K�i��9�}�(���y���q��3M�������|Z:���U�m�@�b���h�7�4
%R�8N�l&v�C�v��DK!M
!L����2R�^������i����E�Mu�9_D�[HW��j����8}���"�_��S��� �	��R~�AB�o�<�2�]�c[����vUhdA�Wl
*~4Q�3��N�������FEo��l[#�����Q�S.�c�J��\�	
U�o
�.1f���j+�)I����H�wC�L���xnR������H�d��4o~�ev2�*�P�������k��������^#�!9��:�cX\�	��2��2�gA��I�sg��xt9��L�����huc%�?x<*���lH��d,$us_���~��*�����3��=�x�K��[j�z�B�)��d�Q��1q���=��?V��(Ft��>8%[�E�������|�����vDQy�W����?������$������n��L��s&�	��TC���3���(�y��7����s�5�4�s����p�����������8��!2l�ps1!�]!R\�l\y������8��7�0����(H�x��tC�,MK���Rd����D�d��?�����#��F������,<��Pt���5?E��j��4�	g)2 ���)���B��P���O�7�a���2�oF/��7/�Fd;@�v���\yH��
�����^N�9��N����&9I��^�=�@����\]�������P~�
��A���4�����3IG�:o��-6�������I�����F��ci�+����y}��SWP.�2Ig2�C�Km�&�?*)����%�D���+`���\x���4��yxdg�����ot�rM���*����>�-��DeLR�>U6N�,��Y4|�&����M��k��D�k�eRR'�O}��&Z��x��i���@���v���������
��YkcZt2�,���K*���5�9��T]������:N�a���)i����h+����@��[:�������g�i�%�����+�
~��a��Y�#���'�|z���*�U4,�q=�����?t9���y�j���!!�_k�3�����<���!B���KL���2��m��R��\R)O�)Rg�R���X�LX�3)f8h��7�LM�.N�?�Y�t�4�T�4�l
>�f�����)��C Cu���|���9t������/���+����P�yXb��-'����@��T?��{�u�rl^�U��z-��i]��2��k��+*:�9M��b���ANs�k-�i�xGg8Q��F
��Wi��`y4^���:������<ek����������U'g��������;�g����s>������m=*�hnn�s�����PEL������`R��{.`���F����O�=�NnT�%D��-$�>!@�� ����c��UlV�8|��-��4zG>��h3�'�_�����j�}���-���c�e�o8gB��'E�R
^��E>������eb�LnY3����Y����������v'B1�]�N(~�����������*��qD+�����$�O[�����'���u�5rk*�������(��,9����1�t[�)�U$�V�|��o/_z��YZ�+�g3(�t��<[��~=�oD�3t|D��|��<}fO�Og�9@��5k�]o�k6w�Z:�4|��7����%�7U��1����>A�Q�P����z�����J���+�sO�g����Zym���F��pH���	
���qxeo�o�@?��7z]'�e��M������2��<��b5���r��^�#���Y�s������P���1����9St�����]T����;���K�����#]�G`$20u�y�7������A6���M���t<f�VrS���$��2��Q�[f���@�:�z-���o���������tf�'#�#_.�I��S�����c��6Dk.H�����������LF�q���$R�K��b���L�s������$R�? ��Ev_��
PK��6Gl�X?��_perf_reports/20GB_preload/ps_4_workers_20GB_preload_2.7_selectivity_8_task_queue_multiplier.txt�]{��4��Oa	U��I���D9�����%�"o��
�M�<�w<�;��yg7�x�Q��G{������3c�}�����>�i��sH�B^���`!1Mb�?���sb�����8����O
�~b1����ljsmb^LfK��h����v��[�)�Y�!7,���?ZVs�:�B
�;(Y"���1��\n��
u��7��a�
_�	�>|���>�����jv1��t�����B���?�J�U�b~
�Y�������\M��zd��At����Sc:]N�������>N������=n�� r���\�d��(�n��@��)?%�����!I�B�UH����[��m��h~5�����,�%|/�]���}�U��MY��xHB`���r����������
4��z���s��z����o�~�q�M��4]{�!I��/C���C��U�����#\�J"����Y�?��`#����>�W�v����0��[�#(n����\����)��Q�r�������;�����wf'd���n�<���4��~$/W���j�-���������x���05��������o\F�����W��W�����@?W��?;�..V3M_<��>K^�������+KV�Sm5��������B[M��Y���T�{Ab9lD{+IAZ�K��� 63�0�lv4�Y1�*�z�&;+���z)�ds�������[g1�
�����j���vgL����k�e��<����wF�P�0gX_3����&n�qY����K�~����o�����U�
�R� HL�Z8�n��,
Z�
C�� x��j���+�x��]S�����o�M|�����9��T���6���D�
��D��1��l6��,��a�F����Y��%��g����	�c��9~w#��Vv�-��q,��+!����K����$��+u�����no�_Y�o[�����2�"��P2��5�#��,��/�`�����`&5�/g�d��?����FqY�>l\B<o�P��WXN`���E�mQ�N���6��W&��B����r�mm}�}����~�HD<f� Hm
����ua+��SW�������kh�IG�o�b8�.���Q����M����&3m�y\�����4s�f2������6�����kk�����0��S�c��E�:�=&:������I�����a�p���c����~��k;�7��s?O;1�L���JW��,�<Cgs�{�j���G�1���&Kzb������}L�k[�O�����g&���11;������q?-�e�6��=�{�Mtm����{�mum>����qK���\WWK�	v]�����"�7��ZC(y��n����so�H��}��un�o3�n���
:B�����po�h�/S�N �<k��p��rt� ��A��K��n��h�:BE��i�qC���@���$�~����6��%�A(y���E���x��j�&BIEG/��oib��W0T
������'�����Op������K������u�G F���M������!T�R����n�F��� ���������ZD�&��Gm��>��
ZfC�_��	������r������%�5^$A��������Y��EP`[�n��+F9BEA�Rx�����
k�)�A����3~R��-���m8A�/j���PAL�
���6�p"M��
����/��C�@P �u��t{I���
�D��L�&8����'�>������w�k{n����B�b�2��#j�93�yd������M��5��(H�[H�KC�M�=�5s$ �F,,���%L��e�y�%�B�f��&S+nK+�M�P�@O��^��y�C�[|-�
8Y�*ZhC���������*u��.E�M�d2".��@7���K��z��6�����W~A>�E&��G�g`�K�����u\�v�����;O a�����x�>�����J�Vi!_!TT�
�������F�Pi (0���}J[���@(H�������5��54����zI��
��a2�h�#���1t�S/B��Y��[�L�+Fu""I��.pF��%�	zD����~�8u�����=s	+�����f6?�Ax��UX
���%<�!��0�=N��0�����%��
�|�V�����>(B�m�4�����T��|����[��B���B��D�����h��L��
�I�x*7T24K�����R���v��B@�f+��]��K�)8��>�T�5�;��K�[�@(H��v�1lY�����D�.N��0BEk�i�[��K���ip��2��qFp�b�V�O��������&�0� �tFDB���-�x��aa��g�B�
v�i3����e�D���'�g4J\0����bC���@�\
��FI�h�#����k:~��Q������4�4�CHCvR��<�\LlS������(����TQ�Sc�V���T�43C�y��t2e�(�����^O�L�`k}�QF���VEa����3���*�T���A���4�[�X�S��de�(�r���C(��4s�]#h��g�y����/|������|�`3"9��_������'��6�k'O�D��B�
mF��&�����>i"(P�l���7�3Znl�;�B�e�	yx�yv���u|�`2���������0L�
J#��^��A����g<����J�
�6�0X�i2�����fdw���6BEf�_��l���lj�+�L���^�?�sr����^��b�t���A���k�E�\a��C#�@P`��
�N��K=��o�C*�0{��>�
������b�������m.�qWY�A�B���c\Z��a|��%���!8U8�jt3��.�~��rA�GQ&Gpb�4��� �#�4a'������S�g��vx�2()@�$lJUY=/��z,�>*i�p�&�� ���?�#�9����]w��He�v�@��U��ar8�J��9�[�b�8��A8�����>�3SDW��n��[�i��|7�n&�n&�n&�n�>���J?O��9����5�����KN?7�\�G0n�wc���]&�Q2�T���7��D3[��e��c�V��Br�As3b%��5n�b��=
[Attachment archive entry: perf_reports/20GB_preload/ps_4_workers_20GB_preload_5.4_selectivity_1_task_queue_multiplier.txt (compressed binary data, not reproduced)]
[Attachment archive entry: perf_reports/20GB_preload/ps_4_workers_20GB_preload_5.4_selectivity_8_task_queue_multiplier.txt (compressed binary data, not reproduced)]
j|,~�n��mo�"oa9�`7������coo$I�b^�Lij2��sf`.� ������8�����!I����G�<�)u��N ��E|��v�s�C�MH��=�����I���^����j�U����g��a��Z\%aVX��4�GI���f}������J�Yx5~+���P�l3�R%X*��9<}���i}4��$�os��������L��,�&W�����W���$�JS+�]g���CC-������#d�E@h��M���`������a�Z����� E�W�@9tmt8��.8�D���!��i\+�3��yY,Ora�W�39`s�e�e������;�wS���V�'�iT ��bg���M���N��:�O6��O<{4<������a�����fyW����S�|Za�7�grXm�q��T���[/�����
����r���L���3#�w�! ���U�/��4Z���l}�Q�1���q�N�0��#|R�M����S�V�N����U(�b`�|��XzQ�XZ�/�u�D����r�GW��}Nu>W�����EL<IxY��\�����+T9�����	Zx�a-�~�2 {1�����2�_�7rv�{���t}Gx4s)��/G�K��������J�\
'rb�������i��#!$�b�!	��Fi���6	N���/i�8�L\��f,/���3�<���/*������K�u�+U5U��W��Z���\b|{P�D���T�0�
�S�4p+���\�,����/�7*>�G�����vN� hB���9n���������x����DE0���[���S�������_PE������9#�i�O��?�O2���{Jm���>{�X
%IGI�@�
�C������K�DC�2H�}���c���{&K�>%��By�~H����1�k�=��k*C���������w��>.�-q2l����=|.���X��K]���I[��E��V��z<=k�{��x���E�[o���|�2�`R����l�b��$i/�O���.�q�C�a��j���!Ee":���/�4l�:B�����
���j��x�Gj�89]t��������������3�}Mc<i��Eg�P+j��L�%hb.�.f_
]V05��C�oz�*����}[�g�?:n�m����E��M�mf��}<������2�2����g:_����4��E*����>�����'���g�/F:������g�h�n57��M>?��K��h]9k�t���g��}b4��w�X��q������k��kVS����uk������39m������=;I�M���^uX�7�4�Xq��;C��~8K�\>]m�c����
��>E��_d���,�H~m���B
>��v���^�ha�t����?�U��6��G��`PK]��b&$���	�Dm�@7I����(I�7CE����p^����)�v�9����`J�M�:DCz���_C|85�jx�nh�pZ�U��.EC���N�S��$s�|���4��VS+���@�J�h%���Z	<��Q����(EH���4@�����""f��~��q;7$����!�/�f���'q�m�DQ-�3���v�������^!��:m��5{@�=���f�����n����/y���9��-�BS��"^
�UU]w�N��=�|���X-�s��*1�����@[uV�6�P�}���h�#k��?t���T����~��q�AZ��9S�N� tSB��U���o����w�PDK�v��H�
�7PKS��\�B���8�R��>P��E,�BJ������	PC�C�!���K?����yh�:kb-6�IS��Y,��
Zes��[��������W!�p�@���9��*
��.T
���\��9�*Q�����������97tA6&�V�X����+��?"�A$�t��?4�HV��>M0�u?`���H��$��q��;�5>����k���Sh���A���-��n!�{��) �	h���/ �x2t~���1j��BP��\��_����R7��@�?��A�sy��=��h���S�Z_�S���f��v����0s���.^�����X��.�a���
�n�8[��m�Z��=V��/B�#���8hi�v�8"�����Fo�B���*<�d�P��rT�*Le�^(=b%
[��������^�P�M�N��_�F�d�f_���{���?��+�M�?l7��<a%;���G�/�MF^���~=s1�!��O�{�D\D
�Ld��y���7�
��*�S�X��hH�8���F]>!u�r���7�x����k�=��@&��s��s�f��:��������(��,����N���������UV����^������P�HqsMU��G;z>!��*�V1N��\M�����qH���HT-j��k�����������B�!�����(O��e�����M�T����L9����b���]�N�/���{��c���e��1�l$@7�Fd,����|�KO�M�M�Qx��Oh�=�p��������L�h�<�O�s)��r�������H����O_���
�����c���"�X�1�eh�}����$�uMb���������L��3~~�%�%���{u"�=m>]�W���H����T�t�z�=/�2��<�qD�s�}5l�6�{�� e��"��3���L\������y�������:
�:�w���Br	�]��B��}�D���
�C���[�]���_���}���<6�RV��IR!�q�6l����Mg�O@Y�`&p7�������[-��F�_�>������6�$�b��S��i��G�\�1�f'�����R��}`��7=>�s?@������7��K���,q��%U[���h{�<������"��F#-n�U�'��o��3�}-��P� �'��G�[������9�I],�����'Am���:��3�->�C �O���T����~��x64�XH��e��rL5$�\:G��&���r[����%����4w��F��U���
���vH�����F�.�Aap7�k��4�z?��nJ�D�_Sd��Vl���T��S��U!K�)�s>�}�~>�������o��+Ge;����������
>�W=[�V�wqb���/���MX�����v��G���o�#m�Z|�k�?����(�k��n���IlU|������y��O�����������r&���!="*r������:#�#2��x;���Y
P*�o��E��T7E�^X���LV������E�\U���������=:��b�,I���`�+���@����1`|��9��W�B�n+�~���"�gh�=Q��h���|����f��9���E���,��8�V~�
��]|��M�pC\���������,s8P}�G�����b�n�������q�N����?�+�PJ"���t�!W��]7�������glW��A}'�^�
X~�snDMp�m�����-
���g$�^�t��e��<��*]����:l
[Binary attachment data omitted (compressed zip entry): perf_reports/20GB_preload/ps_8_workers_20GB_preload_0.1_selectivity_1_task_queue_multiplier.txt]
[Binary attachment data omitted (compressed zip entry): perf_reports/20GB_preload/ps_8_workers_20GB_preload_2.7_selectivity_1_task_queue_multiplier.txt]
[Binary attachment data omitted (compressed zip entry): perf_reports/20GB_preload/ps_8_workers_20GB_preload_2.7_selectivity_8_task_queue_multiplier.txt]
o�=�����BU�����������\v@W��<|Y��������sp}�;���(������~�d�F~'�=�8�f�3D���?�6.�u�������kw�P��,ni����|DA(6��jP��O�*N�X�i����������?���J��c�`�X��)�����6�$:���zm9]tE���N��z>J�*�#Y�ES��������~r@�}U���e0r��Q{�������<����?"/�K������J��z���u���r����K�L���P9��)<4�N<��*N��X��ys�?��o��H�54G�-9Z�JJ/iB�pZ��3�0��j�������t)<�������������B�D�Bva*#�����mn��� �*�#����=O.���{������4�M��{{� �������\���,�h����w��t=���F8��P�S�F��s�P��C��NWy-�Z��������qn���Vk��F�lj�����.fh<)<u��m�dJL��A�*��H�;J�oPK��6Gn��fG�{_perf_reports/20GB_preload/ps_8_workers_20GB_preload_5.4_selectivity_1_task_queue_multiplier.txt�]{o�D��O1:�����$$^'@<q ��\g��&v�vJ���3k'����ui���h�dw����ogg��.|���y�dQ/K1�"?.�p]p�n�"�����]U�'s/ Y�.~YTP��H*��k����y��r=���,�nFa��r!�)\�����3�5������%������bYa1gY���n�L��l�����*��_��������(r��������6��1#�����6S6�,_�|�������~�<�_��Nf0�����}����W������L.����,��y�R�E9��8[@��i���kL�T�G���3Q=��v!?��!-�iv!�o�>9���Gq��b|$~^�e��Q6�����h>;�a���*g���l�������_�����?�X_�������/~�,�mM'�3����g_a��$���v��������q����u2_`���J������k+�=L���"-����d�(����xd;�q4r^��]�d��y������2�����w��@�^����>�Z]�{wx���=����=X`{^���U���YW�lV������'�
0]z)&q�TW�o�����w>�����Y��I��=���lZA�E`��*F��E����T��q]�$W"^.:�R�U���#]�]��� 0-���"nst���J���t�T��N\,�����,�r��l6���/]u[�2mw�Z��4�u��_�q#��*�f��KM�i| 6>P����R������@l|�o| 5>o�6�g����e���k9�UG$�j�&��tq�W7m��0�P�gy�����(�~/���#�Za��W$����vpI:�Y��t���o{'�I�h���|��q��a������xc+���i�'�����#.A3.��i\�i����,�&GW�du�Y�%))��h������������]n_�����}�E���c+t�e���~tZ�������;��������YQ�1-�y\/���0��@��L���4��a��k�2��a���ukt�>ux+�Q��$/����uX�w@�\���5����'�������=���v�d��I��,I��:�{���P�M�O�bBo_��u	��D]�.Y���K@�%���t	����$]���U�kR]&�%����d�e�|q���[��U;�04�o�
��o&Qh;��:��&��y�v�;�v-;*��3���w�gI;�Pbsmx�q[�f�yu9��o������U\����b��Q9�r9��=�F3����$T�u �?�l��{�X:�9V4�b	�+_�E��q����N����a*,��V�����0z��\��*�"���:���#uqq^���T��"��M7��7n���{}�5��������r:e�YK�{$�P�#=XTf��7�y)odMCM����m���fCC@^�N<��O��#�&�L�
��B�J��@�lV�devV jg�����J���*��G�jLp�ogD���l��	AT��-��s��������Lu�G2z|�I=�,�GH:��MSC0�y)�����[c����>-W7'�Qk2�J�/D.J2���AM���LA�x���!��t�j��NG_4�.J#Wb5�I��*����*����m������Y��P��|�1�#�j��m�G*���Sb���<M�N'����J�T
*�]�hk�3���Lhbv�#�<wc~6Zj���S)��U��J�c��xGN�r�� /eQu�����A�r��8T�l_�������`*0>�P����0�8
�M����%��$N�+���clk9w%	��h����$
�#���;����w��:v<+v��0��z���l�
Q�q2�{�8�/� ��n�
<p� ���X���`eag0���O����Y����wD�st6����MV�|����\|����T�[�q3�-]���4M��'�U����].&����RQUq�����:K�Z�jdM�SJY.������p�EY.���j���N�P�����mG���]��H�����{�@���<��e�t�-��{ou�@c+{^;�9�N���D�HG*;/�
,��KF�X�-��\s'��0w���	���00w��-�-kXv��(
�1��`�"f@��p�D�I��%�=�C��B�"�5��Y�����d|���A�h�tp�!R�:����X���'��@����
���x���������z@�nO��cwp�.	����
�x�;&�H��r5�w~�����OPp~�����O�:?A�#���]&��I�v���U��cMOg�f^�<7������/�
�;��*\`s�j��l��\
z����R!w���"��0�9��T���MQ|����>�18��k)e_w��8�FM�@�h�\��%�a��ob��Y�h������]{do����������K�Y;����z7������&#>"�pz�F�0 V�W�*lu�����?����e�DP78���K��[�'r����n����)���:��Y�c3�9���K�p��MR��G0�z-gS�~m
~��A'��:����-��T�����R��b�U�@�X��X U��:E���E��I��DA�D����/������t�����8�g�u�e71�i����r;����l
��r�H�l��Mu9�i!;	�;)wz��b:I8����O�v������m��2�t��"������$���6%�P�C�;�3~���Fxb�Vp��, i����S����A�.��;�;Jv�I���B��f��<7��������Yl�V���k-71<y�'�*m\@��U�����J��~�U�&i��Z
+-5k���(�����*����Q�D����}��T��,��)� Z��`���
��hXv�5����|���o���ly��Q[Zh�8���Vv�	=J���ap�a��)����l5<�,p������^n�x��
<	h���lE��E�@o_
L|&&&>���T2��k��q
�N��x���q�wp��g�O����p���L�Y����� �
���8�S���j�Hk|��s�3������Lj�, �+�����y���!9:����Y�� �=����#��/�k8 X�7S����@J*L�D?
�����.����#<�`3�&�&~�#�����d`��a��S���K�g���@a�c<���v�r{��DY�~5*���2������bA�D�YY�<���	��[�r�52g��^&�^����
�l�w�Y�8��|'����=����n1�����4U$�a��$#L���m#�%��4��[��'�{65C9���\\nPW_+����>��5���k����z��.;�������2du���JB7a�v�X�[(T���jB7@������R����Ln�".w^&�e����uk-G�(���q��V���s��-gS�^Ew��i����f����v�Na���a���d[t��S��+��G ���m{���Sr�S�q�Y�c�B�@\�����a����L����9*��m���� ���
x��"����=���CE�
<N�C�P��1S�9Z�~�2%
�1�\����;�}H6
9� ����|dSqK+�TX�R1�Cj|�@EC�x�x���`����m��
���]U����T��/�L��8!���dr���Z�`y��Y���e���������)]�u���N�*yZzO$������R�5��1z���d���B�V�@R��T�R�4U�S�JX�M�j\�L�
�#c6/�(Z�^��e�2�!�k=xx<"x`���
�_k���]C�O�~����Ws��tF�ag,@P�~2���%���P*,i����
5�h��!��i/�66���&�k�5��^��]�k�o?�i�{>��IUl;=��������=�� h���k�����]2�����j;�2����K2��}Fd���3��!�6�]�J>DI�vo��g�R�]����~������l+
�W����6��}7nt�� ����p���R�B��oE����Y4���|����GRQ�g%�h%��z�8z�Kc������j�O>����\��96���j��o��(��{�E�c���J��e%oW�YIV��p����Z�O��\��*����Z����#����C5����z������s������F�+i��;�a9�1���y~�7:ZI���g%y3��;��x%MF���5ut�Z�sc��9��]��,TcY��s��7*�����c��Z��byX?�)��JO�7�Q�k��t�;=��I��D��������G~;R3����Y�j�����'�o���ZJ�Zv�?->}]���������K:�\��.��@"�\/���&Z�%�V��)�3�>��*�!D{��%n�:�s����"���+�]�Y�����r��oW9u
�\9u���E��/^��L�D�>�g����(5I6�R��h���D���'��f����5=�v	y�� �v	)B�K�!�]�
�v�c��CeF���������3�.�z��;�.���:$�����C�)$�:�����:{����������i��U���K
)r��P	�|~�h�����b�e���J.3������b����w<bL�|��oq���b��a��2(m���m_��OH��m��<�_�5����x:|��'��I�������tUd�� 
��}�
Q9�;�O`����s���]�!UC( �.v*7u���,`H�����`�����������j��i[��?�>�Q�#YU^�������%���!Z��8��,L��0��8�8�868{��9�([�5����y_9vYQC��s�T��UE�U��yE��{�6��0e�������r���� Qf��
��Y���`���?:D`������89`���Ar,QHI�@�S$�@�!�@C&�@c
4yef�b�B?3|@x�
��a��n����,W�Q��Q������U<e��:�D��8,�����/#"���Fq��>�(|u05��_j6�>0���MX/5�X���O6!��c������@�����1��k����W��� J�D��(]���s24��ZyZhL���F�B�hA�X#��<7m2��0 O.�6���"a��25VI����>�V�P�:�-V�M
�|����$K�;��y(�?�3�����Q����
�u�3�<�oX���|����\E�T���a�U��S�h��m�.A��]�F�4u�v�c���D�Y��,((s�p4R4���)��Y(�]�]gor������xy1��x0�f���%��K����4�w0IcMT�.�}��t����Qh��'��t���^QN���d��������}t�n�2[�R�e�$R��"J���L
X�Qr5 NHRh���&M��a�M��hB���M��X9`c��ff��fi��fc����6VNjV���:�����V��OqH|&�w$��D�@���-�d��
��'�l���V�g�|�����d�w/+����=���g��p�nhb������a��0N��m���s��p�Ku4!/���FyfIFY�;&��.LK�AK�_@K���K��
�b���;���)g��b�r�8�AD��������ye�7�8���U�|�c��&S��h*f�x�����m���~4��h��Q�����,�
��S��d�Uf�kFh7�'���v�����L���Zm@�,R0��t������&?�55�����Nq+��a��/w���#����p�1��wB�1.���U�X)b\��v�1.��o��m�)��r������#���A�Ms����k{��EEP��W��R��U��U�"�/�,Jb\�O�:��
��(�aCQ��E#AFH�#$$	|��x�H<@*�`�p���n��>��D�����xJ���^��0�+4�UF6�8����ZZd��jS�
)��D��4%���<uH���G�v4�00�<lgngn���s�;C�gY9�����2M�WS���~�Yh
����>�7�j�+������'����6������������r�Xk�.=�n�=L�-�t~-������,��8Gp�2��PF��Z�m��
���3LQ�"��@��@'�<�	��
�2|C�5y��R���#��s�o�C���Aq|$E�$$a,�VD��]��"�Ovvy�R�2{���W�du7�n��Nn����mY�����BTuR+`�,k�Z"oS�P���^��8����L���
��ow��������DZ�����$���LizL;�\�y�K������IE��1����{������OBY�c
����,I��Cf��nR�s��\���u{|;h�H:�95����/���
c���.tGQ�Bsj�6�w��kQ�EIA��6NP&��������%u��:v���N ���"c�Y�I�@6�!F��zc�"��E� 1�o,�X��Rc��"��E�����
|���v���������j��k:+���M?��j�v;0�n�;+d�:�d���ku���x�b(J���,�TX �!j�d>zJ�1�;��$��a�5���;@qqY��\RU���c��P��,���H�>x�1�B�)�8K8<����o+�q*���,�$X�?Z������g��6xP{��6������M��I��E���c��\P��B�Kr��6S�������&�]Q�$:B������	U���5c���
O+�4�����`h��[����(a���_���X��![>�	@I�D"�`{�]��"�x��
h|ex��������������r�c��&3��b|v���~`�W�}J���ybu=��Q��)��0E���]Y��0�+�����xD	�xA�
m����4�����I���c(`?Y���I����WHx�A4Mpd�a'�4�q��O�GkL���]q�ss�
@j9eQH���/�9��tx�g�W�������c�i3�&{E��+�=����55���w�=����$�*�U!�U�3��3'%sL�2!K���)`��=��]�����t}$HLz�������:�|���f�����43�2
�NQ���NDf��'�PvZN;L�v��H�3��)�o��Zm@��U��M������v�	��S0��N��O;��m����S�]�Nq^;u*�:�<�J2Nv��vj���N����N��S�h&U4�H�I7����v
���S(��)��
����QC��I#������E*��C���e*�����"���4'"I&�L�I�%n�+�o�2rzb6tr�;�ZJ�rkOK����mj	MD���N���%g`q�w0<�k��%��[�z#��O���R��
>o��d�h��f%[E��XsMD�Z��6�9��
����a�h�=�1��J0<��v�U��5���}V�n�����:�Tdj�����"KO�����yl;������c�p^�<c7��7�qHl��T%��+����M�%���
��0�Bq8�W(����aw��`���fm����v��j��j���'�j-qrk���/��I2k�����i,|w[�I���L�����1�m�c���+N
�3a�#l�%��te��f��dg'�j���2U&9)���A���Q�v2y�a�N�a�8b�1rt&��m�\
a�^,�4�"��;�p�yXF����P:84�����m�}�}��n)�1������������l���I�Dta���&l�t���N��h�"�q��
5j�%��j����u���M�fq���r����ww�w��Z�)l���tp��kfd���1'Sl�f>��^������Oc����_6����?l����7u	�.4Q������]Q��F�H�H�����?���6r����1G*y��v�Gg{�\{�@b�����y��g.��Et/W�����ZWG����[j�T��/^]b,P��BR%!7Ur�k9�Tu�p�Z~,��o���`�P�s�+NL��uU�C\�::��/�,
��C�$)�*����(9����|);��J�~9�\�V����s�N��6��j
���$�N����:�	c����T����b�X�?]}_����V��D��q������B�������x��4���8;�4��$,�����7u�QYx����a����(i�~����U9���XMSoxt�x����z����k�\���|]��j���4��������o�,���$	�
�E���!.���;Ya�����yb�Pk���9�������*�� �*�/U��}�\m�Su��/8���uU����{]E�������1�W5i����w+=~t���X�"Sj��ZFl�����rK�Q�����t,�i��'*w���y$k�i78���9��\�J�v-����������u�N��7X�����rg�ru���&����Y��!#����+b�0�PBJ @	(�%��p����%�j�����t��������P������w�P���V���'�x9�cR>���s9��T"�\y�2T�j��RU���n]Z�4C:����lV|~;+����5;�tz�Lv��f�&C����f��@����}c���n�p�w?���ywa�SlJZ��@�W�4E�����oi{>."B�EK���]�>����I�����z�&.�#�������s��S7=���"/���9������w�Nk���v����4�H�_U�id�4����D[�4�-*��zk�1�V����_r$6���F����X���������E�������8kT��h
\���,r����0'���3;:��\��0���Ia�M9?���u[�r����sR*	��:~�d����JCi���~���g"�F�$�
�,�+�J�^)+��J6�$9��?B����/��FO��^�O����<��g�K����>��O����������s��;�kUKm�F�tzf�kD�����;x 6��D�L�J�
3���]�z�,:{��h��7KY�������n��X0����g��T����r;�s�)`W��r�I%9J���X�F<�a�5��&��L����3�a"B�P���'t�	^}W���'|�����>�W������������e3kK��T��N��IY������#��a�jG�;grWB�+tR BWB��
v9�!u��x�M�d �qJ���j0Y�OC�����C�?=M?)�c�$U�z�(�y�K��bw�8O�c#�P����"�/����B2���%8�����.gt�NH9�\>g��D��� !$���`��@B8H�ABz����`�z�����,|����@@�a��,A�?Q��O�b���+�}[(��]�)��V)��w�&��C�L}7�f��`U�t�!|����.5�����&�]��b�@�,k�rG�i��@$���QI*�������]I�z�g/��y�"}���RI�~1n��Pf�	���d���U�9#����5�w�`��Yf���-���XMA��X�t�4m�MwF�"N�9N�?7�T19�b�B�u����108&���X$�q2�S�&��z�����.���'�0X/�Z������ sW��� �!�
�T�}\�����w���>j�/`M�'��twu����OD�"f��[���������������(E	�s��[(7L���@��Z_�:���g�=S�#��-�`�!��%@���D�s*����"]7.�5�/p�\�\9���KD�N��qz�����$��D~ak��7oB��A=����;��r��Z�u;���!���7�D���-'�M�a
`���&6���jb���PM�������V�Z1T�\�}Fj�^H�s|S��P�do��$����We�>��w�w]W�r�i_}���������l~m9-�+�����r��������a�y�w�J��Z��/�����q�9����qK�g�������]�O:�va�'��%������K���q�y�]5�hr���H���j���b �w�^�}0�>���1$�0�&P�	�h%�P�&X�	�h�$�p����%��������_m��YS�U����9�c���e]������6?�t������I*p�{"�y����Vs�z]��:���T��<��_��/H��c\q�����eS������l]�?�����:=g�^�o��E�AWG��9�aA	E&�����gI��q�����������.�W�]{�!��Q�b��c�.�����<q�.�G��fv��j+3��9������a�a�k9�������������awO`��a'����/OR�$�3?����m �+��WB�d�~k��@���@QZY�u�+9��M�~|I],���gHo�M��M���<s�p8��wl*�"��Z����@��;qG(����#w���X�#w��A����9c�{L�]U����.@e�t,�����2��E�i.������H��u�R.�4����X~�����|����]����_���]r� t#N�p��3�����;���e��E�/�!d��Ha�>t��UU���J=�T��)�����m����b�������"�3�����(��,T���������������UDA����Ov?���C�)��)k��X��^|&e%o>-������|�=��W���7�ix����L��lvEb����1uN�)=���j�8���y��l9�:���Uh��A����������j�[n~o8�R/���=��SF�V�Z:(�����~#�@f�C����=v���v������N������������#��5�2��.�������;#�`�Y��>���t��t,���o�E������v���?�&��w���T����2B����7��F��xPVKb�8������o2��6���c�{���JK������P)�.���;i��pwM������oq��z����t����)����N����f G��\�3;k��6��������jH�g� �s#��+3�t���kyn/�v���t)|_L��b�3,D�����-D0u%K^��D"
]}/=V���<_,<���9W��No���?�^����>v�a�f"f�FT��������,}q�v�>aGMdDX�t0���"5����ccNw��mS+
,�VE�h\=?t��.����V��`�s�a�/�j�p�&�H(��i�D�FE�&�*���7&���^@xs@��w�O8^}A��YF�d�1>����t�y�!�mR����Y���&-�hR��}���z��&��P�����H�f��%�F��G���vA�J�s*5��E8����_�Y��	����/�]���=DP6a�#x��@�
<��G ���#���G��
@S7����������N�����Q5���X�g\��e!��L�W�I�'t�I��\x���}_�e������?��rX�=����aEQ�����:<�*����<A�\�����/��Dx�u������]��	�X18��aN3&�3����v6��c���9&l���a�<���8�y���'/E<�/4�m2�C�� #�)�bA�� .�.#k���U�������:��C���p��b��@+�(.0b4��8��<.�y�(�'�c|��(�vp��J;P�Z���I�1�cj�k2�d�����P������B)�{�\�z�q;���_��;�I�z�.m.C��X���.�#~::P�>�-��?�����l���F=�P��k?g~�����|x�w���=�&3��{�//�n<�}^�Mr{�����_�g��.4��a�c� �"6���bA`zW@�� ���A0��n��a��
����7��N���������_��!��k�2����������AF�A��A|� �xW<�+�\-K��alc/y���U���mJ��R
�!��`h��@��0
���[�4��H?4����i"Q��/R�*:����G������Z~���M�#���Ah@0py�/�w��x�V���9���~����C��R�S�.��������|��N&��D,0&��� 0K�7���OU4������/��<��7�5��2N�����g�"�@�|+��7z��\���`�������N�/���j	C(�%�"�P��PJB	G(�J BI�P�����J��K��o��*���
�,��k����z���0|���x�^~����dC���N�
�G��u��d����M?�J���G���}L_�A���=D������x'�Y����s'���'As�\5g�T_�4���������kTm��T#V�t�8�C����a�kC� �������0;���9��|�od	��>��
c�a`@x��mV�V�ds����\�e#����^L;8u���-i����0�V*	�d�'�	r��~%�d�z2����]R<m���2K�q�KB�xD�i�8���`uM��&H1����KHz��*5pp�T����R
c��|��H���d�F��t����JShG�&����mr������]�_��Oo$|�5����D,��Ta��IQG���M2����/@a$���\�����9�pB�P1p�����T�T@�uQ1:�)-���y�0V_�������������5Y>&���'�R�n%Hw������C����Ui�?��l���$��U��VE�W���D������q����e�y{�U?j��_��<�
2�S��P��z�?f�WI��
�>��p�*G�#1�#��#���p�r'7;�S�.����l_M��g���ID�^�.2c������A@���1}�)�1����7M����|X�v�"��}A�{f"�Z:�MS�������=�������	?r����(���o��Q������G�mI�KR~������=J��/���I)� ���$W���X9���t,��~�|q�X������\,Mg�Sep��?��2��qJ���|�W��]n�b�D��������#f�R�TV�������#)�0{B�FJVk���rg�0�X:�a8*��\q�-+6����2\�����t<��?�z�&"T��U2RR+�P������;����o��Z���!��}Lp�=��Ki�eS������W�~��a������:0=)i�H�%���F<I��F�S�M���W��,�SO~�/���_�udJ��
���.�lM�z�m�;�&[mH����iH���p���9h�0Ha8u�>��l���z��W������F��m���"�U�2��42x���b��R��u�Z#�w�X�e:�5 �����l[�Q?8�������|������3�Q78<���P�@a5�+���g�e�����c�ODd�8�� ���������|��]g�U����:�e��3��.���B>���+K?��b�O��0q4�:��k��R��H=h�=�����
/��>��i�8����[��DL��-2X��J_�}�l�J�>�'9X�$z��+j�{`���#�����4Z�'�Gz1#H��L�z1����C:&�`�L��t��m,���X=�3��p����`�G�YC��@����x(�+C���m8Sn�2 Uv0^������o)�}x.��05���]�����iF�Y��4�3��]�i���Y���A��G��&UU2(PCA�����{�x�N�_��Y�N6����M&3V�uZ}��`���3��� G�g�� ������}U����b���z�����OT���l��k!YM�M��8��Kt�+�y���,�E����(I'�T&����=�U��������T��R�P$�"��87
��0��i���/f���Vw�??���p��g���<�7�]"��2/4{�M���K�E�����2���ZC#t�������z��]r�o���v��w��e����������U�Q�L�f ��+�|�*u����i��\SC�gZ"C#A��D�A��"D�E/������>'$2 ��
.Q���3�7��o�ir�ir�59������p#�������?���&�V�#��L-�-�-q-i��{��z�e��/�^��'��]1�a���S:6��S=�xB���?
�W��(�Q�i���E9P�%��Pe��{��:}L���z_.��S�=��*VR'}"%���O_:��C�sV�#p���v
��`�����+����5L�������V�/�d�6�����	�Y�}�l�4=$]�������2�U��URV����ub��\����`�6l%P��_�YV���hp�<�c��IeD$���r�����<��04J����~Uz��l�%��u��5�c���I������P����N���g����X ����q�2}1
M7,��Y[��%��p������\�	nA�@�&k��oA� �u��&�SyP'��L���H�	���h9�E�uWh���X�)`���+�}���|�H	$@0n�@7��*�6��z������#]�d���L�|�q�+w6��gZ�Vmpak����������'��!�]D�E|��N���yf0�>NB��C�z�{O/�S��2c�-%�
/`ae�����!��G�d��S��X�]�����l��ZG�O��6dr��Y���SZn4�E{7�	eib"o�_$q��]I����E;�7�7����L"��uR������@A�db_./��_��K��eB{_����V�}�2m�3�x��e
7�G*g����93r��iE#_HP��@}!�BE#��E@�"pd�YdY�,I��. @�e�F�q$�C�t ��]��xW)4��a�,U*�O�%�p ��AM�����dzW�����up�8_�v�h
^��������d�"3	��������c@9�9}�aB����&���������Um�1��B�s���fp�)���U.�����N�����ik�^���.T������yVqL�Y�������!K�W��iUU�)���|
I�q&�A��%���y��j�&I���B�i)"*�d�����b7}fo7���P��X&{g�5��������������|q��~�T������d5������<2��'��!C��_���{P�_8��Q�_r�X���_��N�;�>������&y8xg���E��&X�-V���[��M�6p�Cn��(.�� ���h�����5��dj��f�Sx�����
f�{�������)����0�7t�[���v�u��g5��m�G��Xa��5BH8�@��l��V��m�l��.��4�	�DC�' �[������a����7��\H2 ���q�
'�iT��h6����WC�e��m�,����F���U��g�
����A{��;"��;�F������O[7��;��`-����v!\A�
�d0�����v��i����������6��������}PE���G�'	%��V����7�����7���dU������H���vh�y��$B�Z��=b������
�>�w�7\�:w��\�W
r�M��)|��c8���?S�y���b��b�����'�]E����hByMPg�	�w4!\za��������-l	MB�B�uG�N0�����
�F�� aQ�9���Ah��xps�pk��N�S�y,&+,7,��}�l>P��������;�qn��Q>���A�����O����U��c��<��X�&���$a�A����B��Uub�C
k��%E����		g�1��A�5����U_�u6O��Q>�U���u�m��4����������|Y� �r7���C�����!5:��-l�]��y��yF���!�$�SdH����vb+E�'���[m��C�	��>�A�R�����:��F��6e�x����!� �1��5w����#�*�A���B�S��#�2���{]�Np��,��w��j�����/�@��Z��c����|�������I�P���~��Y�E�|_�&O�ou�i�I������Bu�"�415�Y�"���]di�������0����o�
"V�����iP�p�"����]���� ^'����4������S���4qy����?{���P�=\�E���,b���&�W'������,���yk�"��)��(`��+8o:eW����`�g������i���B�'��e������,��g+��6|����w�/�M��=S�p2-��4"m�H�����t�<���X=rW���e�������u��k���\ex>��3i�U��f#|<��4�w��H�(��8bf{���(k�9#�o���`e��J���6O�t^S
�[�h��Te|d"Yq�����@+����
���C6r�o��@C���G��x��nL��Vq�����g�<4�{�I����.��Pp�!MP���
X9�(!0� dR9�(!�������k�����<�s^��G���V8Z�7a��@hm����H8��L7E����N\f�~���F.9�Nfm>���8,L�E����C����#�����
��kr';aNv���{jBa��A��A@�h�p6�����2��v5���!���+���g��������m��vnk�A��T���*��M��)M>{���]�	/w�6�����(�n���h�}e!&�c�.�[�i^��U�s�3���B���M�4x}!���t���-/1��A�u��f�����������|���ikAWbm���r�Q��4��
��T���i�?���%�����t�G�|��dUP�l�-���)���1f%|�;bq���S��l�JDP�M������(�/����7K��%g��<L��w���vF=>z��<��Siq|�>g����	�B&T5�q��$z
{|�������4����lfd�T)���U ~Q�B������R���.N
[Binary attachment data omitted. Archive entry: perf_reports/20GB_preload/ps_8_workers_20GB_preload_5.4_selectivity_8_task_queue_multiplier.txt]
|E�C�i�V_� yL��V����[�n��
w�"��n�p[9�[(�\�S���
z��/Q����*#�~R�$�?PK?��6G
$perf_reports/
 �!*���!*��������PK?��6G$+perf_reports/12GB_preload/
 ��*����*��vd�)��PK?��6G0rr�e7�_$ cperf_reports/12GB_preload/ps_0_workers_12GB_preload_0.1_selectivity_1_task_queue_multiplier.txt
 y5*����*����*��PK?��6G����*��_$ E8perf_reports/12GB_preload/ps_0_workers_12GB_preload_2.7_selectivity_1_task_queue_multiplier.txt
 ��*��5v*��5v*��PK?��6G���]�.�_$ �cperf_reports/12GB_preload/ps_0_workers_12GB_preload_5.4_selectivity_8_task_queue_multiplier.txt
 ��*���d*���d*��PK?��6GHu��$ay_$ "�perf_reports/12GB_preload/ps_2_workers_12GB_preload_0.1_selectivity_1_task_queue_multiplier.txt
 
*�����)�����)��PK?��6G��,��I��_$ d�perf_reports/12GB_preload/ps_2_workers_12GB_preload_2.7_selectivity_1_task_queue_multiplier.txt
 ��*���A�)���A�)��PK?��6G�fQH&��_$ eperf_reports/12GB_preload/ps_2_workers_12GB_preload_2.7_selectivity_8_task_queue_multiplier.txt
 2.�)��+	�)��+	�)��PK?��6G��M
`��_$ *)perf_reports/12GB_preload/ps_2_workers_12GB_preload_5.4_selectivity_1_task_queue_multiplier.txt
 ���)��e[�)��e[�)��PK?��6G*jH��_$ ��perf_reports/12GB_preload/ps_2_workers_12GB_preload_5.4_selectivity_8_task_queue_multiplier.txt
 ~��)��/��)��/��)��PK?��6G���U%s�_$ D�perf_reports/12GB_preload/ps_4_workers_12GB_preload_0.1_selectivity_1_task_queue_multiplier.txt
 ���)��(��)��(��)��PK?��6GB�A��PH^_$ �perf_reports/12GB_preload/ps_4_workers_12GB_preload_2.7_selectivity_1_task_queue_multiplier.txt
 ���)��u�)��u�)��PK?��6G0ty1�K_$ {Iperf_reports/12GB_preload/ps_4_workers_12GB_preload_2.7_selectivity_8_task_queue_multiplier.txt
 �$�)��b�)��b�)��PK?��6G#>���[J4_$ �eperf_reports/12GB_preload/ps_4_workers_12GB_preload_5.4_selectivity_1_task_queue_multiplier.txt
 �(�)����)����)��PK?��6G�.O)CN_$ ��perf_reports/12GB_preload/ps_4_workers_12GB_preload_5.4_selectivity_8_task_queue_multiplier.txt
 ���)�����)�����)��PK?��6G&?�� �I_$ �perf_reports/12GB_preload/ps_8_workers_12GB_preload_0.1_selectivity_1_task_queue_multiplier.txt
 )V�)��"1�)��"1�)��PK?��6G�z��I�/_$ �&perf_reports/12GB_preload/ps_8_workers_12GB_preload_2.7_selectivity_1_task_queue_multiplier.txt
 �)���o�)���o�)��PK?��6Gck�J:Ud_$ �pperf_reports/12GB_preload/ps_8_workers_12GB_preload_2.7_selectivity_8_task_queue_multiplier.txt
 c��)�����)�����)��PK?��6GM��
�R��_$ ��perf_reports/12GB_preload/ps_8_workers_12GB_preload_5.4_selectivity_1_task_queue_multiplier.txt
 g��)����)����)��PK?��6Gi&p�XEhJ_$ ��perf_reports/12GB_preload/ps_8_workers_12GB_preload_5.4_selectivity_8_task_queue_multiplier.txt
 �J�)���w�)���w�)��PK?��6G$�Dperf_reports/20GB_preload/
 ��*����*���!*��PK?��6G�+����D_$ �Dperf_reports/20GB_preload/ps_0_workers_20GB_preload_0.1_selectivity_1_task_queue_multiplier.txt
 ��*����*����*��PK?��6GKL��8�_$ 
cperf_reports/20GB_preload/ps_0_workers_20GB_preload_2.7_selectivity_1_task_queue_multiplier.txt
 *��y�*��y�*��PK?��6G|b5�/y1_$ @�perf_reports/20GB_preload/ps_0_workers_20GB_preload_5.4_selectivity_1_task_queue_multiplier.txt
 �*����*����*��PK?��6G��s��V�_$ ��perf_reports/20GB_preload/ps_2_workers_20GB_preload_0.1_selectivity_1_task_queue_multiplier.txt
 ��*��vv*��vv*��PK?��6G�o��@��_$ ��perf_reports/20GB_preload/ps_2_workers_20GB_preload_2.7_selectivity_1_task_queue_multiplier.txt
 �*����*����*��PK?��6G��q��~ _$ %perf_reports/20GB_preload/ps_2_workers_20GB_preload_2.7_selectivity_8_task_queue_multiplier.txt
 �*���z*���z*��PK?��6G�V�8TS�	_$ ]Bperf_reports/20GB_preload/ps_2_workers_20GB_preload_5.4_selectivity_1_task_queue_multiplier.txt
 ��*���*���*��PK?��6G���^RA�_$ .�perf_reports/20GB_preload/ps_2_workers_20GB_preload_5.4_selectivity_8_task_queue_multiplier.txt
 �*���*���*��PK?��6G�6����_$ ��perf_reports/20GB_preload/ps_4_workers_20GB_preload_0.1_selectivity_1_task_queue_multiplier.txt
 H�*����*����*��PK?��6G��=
�F�&_$ 9�perf_reports/20GB_preload/ps_4_workers_20GB_preload_2.7_selectivity_1_task_queue_multiplier.txt
 �~*��H*��H*��PK?��6Gl�X?��_$ Q3perf_reports/20GB_preload/ps_4_workers_20GB_preload_2.7_selectivity_8_task_queue_multiplier.txt
 �
*��#
*��#
*��PK?��6G�U��Qg�_$ 
Kperf_reports/20GB_preload/ps_4_workers_20GB_preload_5.4_selectivity_1_task_queue_multiplier.txt
 :q
*����
*����
*��PK?��6G��v�G�T_$ 2�perf_reports/20GB_preload/ps_4_workers_20GB_preload_5.4_selectivity_8_task_queue_multiplier.txt
 ��*��[w
*��[w
*��PK?��6G���fl�}_$ 7�perf_reports/20GB_preload/ps_8_workers_20GB_preload_0.1_selectivity_1_task_queue_multiplier.txt
 7)
*��0	*��0	*��PK?��6G]�r��<��_$  �perf_reports/20GB_preload/ps_8_workers_20GB_preload_2.7_selectivity_1_task_queue_multiplier.txt
 �
*����*����*��PK?��6GK���::�_$ 3perf_reports/20GB_preload/ps_8_workers_20GB_preload_2.7_selectivity_8_task_queue_multiplier.txt
 �*����*����*��PK?��6Gn��fG�{_$ inperf_reports/20GB_preload/ps_8_workers_20GB_preload_5.4_selectivity_1_task_queue_multiplier.txt
 �B*���*���*��PK?��6G'��2PEU_$ L�perf_reports/20GB_preload/ps_8_workers_20GB_preload_5.4_selectivity_8_task_queue_multiplier.txt
 �D*���!*���!*��PK''�
#339Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#332)
2 attachment(s)
Re: Parallel Seq Scan

On Fri, Sep 18, 2015 at 5:31 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Thu, Sep 17, 2015 at 4:44 PM, Robert Haas <robertmhaas@gmail.com>
wrote:

I haven't studied the planner logic in enough detail yet to have a
clear opinion on this. But what I do think is that this is a very
good reason why we should bite the bullet and add outfuncs/readfuncs
support for all Plan nodes. Otherwise, we're going to have to scan
subplans for nodes we're not expecting to see there, which seems
silly. We eventually want to allow all of those nodes in the worker
anyway.

Makes sense to me. There are 39 plan nodes, and it seems we already have
support for all of them in outfuncs, but we need to add it for most of
them in readfuncs.

The attached patch (read_funcs_v1.patch) adds read support for all the plan
nodes and for other nodes that a worker could require (like SubPlan), except
for the CustomScan node. CustomScan has a TextOutCustomScan callback but no
corresponding Read function pointer. We could add support for that, but I am
not sure CustomScan needs to be passed to a worker in the near future, so I
am leaving it out for now.
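For anyone not familiar with the read-side pattern, here is a minimal
standalone sketch of reading a labelled, fixed-length column array from a
serialized string. It is a toy, not the code in the patch: TokCursor,
next_token and read_uint_cols are hypothetical names made up for this
illustration, the tokenizer only splits on spaces, and unlike the patch the
helper skips the field label itself.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical cursor; stands in for the tokenizer state pg_strtok keeps. */
typedef struct TokCursor
{
	const char *pos;
} TokCursor;

/* Return the next space-delimited token and its length, or NULL at end. */
static const char *
next_token(TokCursor *cur, int *len)
{
	const char *start;

	while (*cur->pos == ' ')
		cur->pos++;
	if (*cur->pos == '\0')
	{
		*len = 0;
		return NULL;
	}
	start = cur->pos;
	while (*cur->pos != '\0' && *cur->pos != ' ')
		cur->pos++;
	*len = (int) (cur->pos - start);
	return start;
}

/* Read ":label v1 ... vN": skip the label, then convert numCols values. */
static unsigned int *
read_uint_cols(TokCursor *cur, int numCols)
{
	int			len;
	int			i;
	unsigned int *vals;

	assert(numCols > 0);
	(void) next_token(cur, &len);	/* skip :label */
	vals = malloc(numCols * sizeof(unsigned int));
	if (vals == NULL)
		return NULL;
	for (i = 0; i < numCols; i++)
	{
		const char *tok = next_token(cur, &len);

		/* strtoul stops at the delimiter that follows the token */
		vals[i] = (tok != NULL) ? (unsigned int) strtoul(tok, NULL, 10) : 0;
	}
	return vals;
}
```

Calling read_uint_cols on a cursor positioned at ":sortOperators 97 521 1058"
with numCols = 3 yields a three-element array and leaves the cursor at the
next field label. The caller owns the array and has to free it.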

To verify the patch, I have done two things. First, I added an elog to
each of the newly supported read funcs. Second, in the planner, I ran
nodeToString and stringToNode on planned_stmt and then used the
newly generated planned_stmt for further execution. After making these
changes, I ran make check-world and ensured that it covers all the
newly added nodes.
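The idea behind that round-trip check can be sketched outside the server
roughly as follows. This is a standalone toy under stated assumptions, not
the nodeToString/stringToNode code: ToyPlan and the toy_* functions are
hypothetical names, and the string format only loosely imitates the real
one.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a plan node; not a PostgreSQL structure. */
typedef struct ToyPlan
{
	int			plan_width;
	long		numGroups;
	int			parallelOK;		/* 0 or 1 */
} ToyPlan;

/* Write the node as "{TOYPLAN :plan_width N :numGroups N :parallelOK b}". */
static void
toy_node_to_string(const ToyPlan *node, char *buf, size_t buflen)
{
	assert(buflen > 0);
	snprintf(buf, buflen,
			 "{TOYPLAN :plan_width %d :numGroups %ld :parallelOK %s}",
			 node->plan_width, node->numGroups,
			 node->parallelOK ? "true" : "false");
}

/* Parse the string back into a node; returns 1 on success, 0 on failure. */
static int
toy_string_to_node(const char *str, ToyPlan *out)
{
	char		flag[8];

	if (sscanf(str,
			   "{TOYPLAN :plan_width %d :numGroups %ld :parallelOK %7[a-z]}",
			   &out->plan_width, &out->numGroups, flag) != 3)
		return 0;
	out->parallelOK = (flag[0] == 't');
	return 1;
}

/* Serialize, read back, serialize again, and compare the two strings. */
static int
toy_round_trip_ok(const ToyPlan *orig)
{
	char		first[128];
	char		second[128];
	ToyPlan		copy;

	toy_node_to_string(orig, first, sizeof(first));
	if (!toy_string_to_node(first, &copy))
		return 0;
	toy_node_to_string(&copy, second, sizeof(second));
	return strcmp(first, second) == 0;
}
```

Comparing the two serialized strings is a cheap way to notice a field that
the writer emits but the reader drops. It would not catch a field that both
sides omit.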

Note that, as we don't populate funcids in expressions during read, they
have to be filled in afterwards by traversing the tree and updating the
different expressions based on node type. The attached patch (read_funcs_test_v1)
contains the changes required for testing the patch. I am not sure
what to do about some of the ForeignScan fields (fdw_private) when
updating the funcid, as the data in those expressions could be FDW-specific.
This is only for testing, so it doesn't matter much, but the same will be
required to support reading a ForeignScan node in a worker.
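The fix-up pass described above amounts to a tree walk that fills in a
derived field wherever the reader left it unset. A minimal standalone
sketch, assuming a binary expression tree and a made-up operator-to-function
table: ToyExpr, toy_lookup_funcid and toy_fix_funcids are hypothetical names,
and the ids are invented. In the server the lookup would go to the catalogs
instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical expression node; opfuncid is derived and is 0 after a read. */
typedef struct ToyExpr
{
	unsigned int opno;			/* operator id, present in the string */
	unsigned int opfuncid;		/* function id, not serialized */
	struct ToyExpr *left;
	struct ToyExpr *right;
} ToyExpr;

/* Made-up operator -> function mapping; returns 0 if the operator is unknown. */
static unsigned int
toy_lookup_funcid(unsigned int opno)
{
	static const unsigned int map[][2] = {
		{1001, 2001}, {1002, 2002}, {1003, 2003}
	};
	size_t		i;

	for (i = 0; i < sizeof(map) / sizeof(map[0]); i++)
	{
		if (map[i][0] == opno)
			return map[i][1];
	}
	return 0;
}

/* Walk the tree and fill in missing function ids; returns how many were set. */
static int
toy_fix_funcids(ToyExpr *node)
{
	int			fixed = 0;

	if (node == NULL)
		return 0;
	assert(node->opno != 0);	/* every serialized node carries an operator */
	if (node->opfuncid == 0)
	{
		node->opfuncid = toy_lookup_funcid(node->opno);
		if (node->opfuncid != 0)
			fixed = 1;
	}
	return fixed + toy_fix_funcids(node->left) + toy_fix_funcids(node->right);
}
```

Running the walk a second time on the same tree sets nothing, so it is safe
to call on a tree that has already been fixed up. Data whose structure the
walker does not know (the fdw_private case) is exactly what such a pass
cannot handle.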

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

read_funcs_v1.patch (application/octet-stream)
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index df55b76..badb2d8 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -11,8 +11,8 @@
  *	  src/backend/nodes/readfuncs.c
  *
  * NOTES
- *	  Path and Plan nodes do not have any readfuncs support, because we
- *	  never have occasion to read them in.  (There was once code here that
+ *	  Path nodes do not have any readfuncs support, because we never
+ *	  have occasion to read them in.  (There was once code here that
  *	  claimed to read them, but it was broken as well as unused.)  We
  *	  never read executor state trees, either.
  *
@@ -29,6 +29,7 @@
 #include <math.h>
 
 #include "nodes/parsenodes.h"
+#include "nodes/plannodes.h"
 #include "nodes/readfuncs.h"
 
 
@@ -67,6 +68,12 @@
 	token = pg_strtok(&length);		/* get field value */ \
 	local_node->fldname = atoui(token)
 
+/* Read a long integer field (anything written as ":fldname %ld") */
+#define READ_LONG_FIELD(fldname) \
+	token = pg_strtok(&length);		/* skip :fldname */ \
+	token = pg_strtok(&length);		/* get field value */ \
+	local_node->fldname = atol(token)
+
 /* Read an OID field (don't hard-wire assumption that OID is same as uint) */
 #define READ_OID_FIELD(fldname) \
 	token = pg_strtok(&length);		/* skip :fldname */ \
@@ -144,6 +151,10 @@
 
 
 static Datum readDatum(bool typbyval);
+static AttrNumber *readAttrNumberCols(int numCols);
+static Oid *readOidCols(int numCols);
+static int *readIntCols(int numCols);
+static bool *readBoolCols(int numCols);
 
 /*
  * _readBitmapset
@@ -1367,6 +1378,949 @@ _readTableSampleClause(void)
 	READ_DONE();
 }
 
+/*
+ * _readDefElem
+ */
+static DefElem *
+_readDefElem(void)
+{
+	READ_LOCALS(DefElem);
+
+	READ_STRING_FIELD(defnamespace);
+	READ_STRING_FIELD(defname);
+	READ_NODE_FIELD(arg);
+	READ_ENUM_FIELD(defaction, DefElemAction);
+
+	READ_DONE();
+}
+
+/*
+ * _readPlannedStmt
+ */
+static PlannedStmt *
+_readPlannedStmt(void)
+{
+	READ_LOCALS(PlannedStmt);
+
+	READ_ENUM_FIELD(commandType, CmdType);
+	READ_UINT_FIELD(queryId);
+	READ_BOOL_FIELD(hasReturning);
+	READ_BOOL_FIELD(hasModifyingCTE);
+	READ_BOOL_FIELD(canSetTag);
+	READ_BOOL_FIELD(transientPlan);
+	READ_NODE_FIELD(planTree);
+	READ_NODE_FIELD(rtable);
+	READ_NODE_FIELD(resultRelations);
+	READ_NODE_FIELD(utilityStmt);
+	READ_NODE_FIELD(subplans);
+	READ_BITMAPSET_FIELD(rewindPlanIDs);
+	READ_NODE_FIELD(rowMarks);
+	READ_NODE_FIELD(relationOids);
+	READ_NODE_FIELD(invalItems);
+	READ_INT_FIELD(nParamExec);
+	READ_BOOL_FIELD(hasRowSecurity);
+	READ_BOOL_FIELD(parallelModeNeeded);
+
+	READ_DONE();
+}
+
+/*
+ * _readPlan
+ */
+static Plan *
+_readPlan(void)
+{
+	READ_LOCALS(Plan);
+
+	READ_FLOAT_FIELD(startup_cost);
+	READ_FLOAT_FIELD(total_cost);
+	READ_FLOAT_FIELD(plan_rows);
+	READ_INT_FIELD(plan_width);
+	READ_NODE_FIELD(targetlist);
+	READ_NODE_FIELD(qual);
+	READ_NODE_FIELD(lefttree);
+	READ_NODE_FIELD(righttree);
+	READ_NODE_FIELD(initPlan);
+	READ_BITMAPSET_FIELD(extParam);
+	READ_BITMAPSET_FIELD(allParam);
+
+	READ_DONE();
+}
+
+/*
+ * ReadCommonPlan
+ *	Assign the basic stuff of all nodes that inherit from Plan
+ */
+static void
+ReadCommonPlan(Plan *node)
+{
+	Plan	   *local_plan;
+
+	local_plan = _readPlan();
+	node->startup_cost = local_plan->startup_cost;
+	node->total_cost = local_plan->total_cost;
+	node->plan_rows = local_plan->plan_rows;
+	node->plan_width = local_plan->plan_width;
+	node->targetlist = local_plan->targetlist;
+	node->qual = local_plan->qual;
+	node->lefttree = local_plan->lefttree;
+	node->righttree = local_plan->righttree;
+	node->initPlan = local_plan->initPlan;
+	node->extParam = local_plan->extParam;
+	node->allParam = local_plan->allParam;
+}
+
+/*
+ * _readResult
+ */
+static Result *
+_readResult(void)
+{
+	READ_LOCALS(Result);
+
+	ReadCommonPlan((Plan *) local_node);
+
+	READ_NODE_FIELD(resconstantqual);
+
+	READ_DONE();
+}
+
+/*
+ * _readModifyTable
+ */
+static ModifyTable *
+_readModifyTable(void)
+{
+	READ_LOCALS(ModifyTable);
+
+	ReadCommonPlan((Plan *) local_node);
+
+	READ_ENUM_FIELD(operation, CmdType);
+	READ_BOOL_FIELD(canSetTag);
+	READ_UINT_FIELD(nominalRelation);
+	READ_NODE_FIELD(resultRelations);
+	READ_INT_FIELD(resultRelIndex);
+	READ_NODE_FIELD(plans);
+	READ_NODE_FIELD(withCheckOptionLists);
+	READ_NODE_FIELD(returningLists);
+	READ_NODE_FIELD(fdwPrivLists);
+	READ_NODE_FIELD(rowMarks);
+	READ_INT_FIELD(epqParam);
+	READ_ENUM_FIELD(onConflictAction, OnConflictAction);
+	READ_NODE_FIELD(arbiterIndexes);
+	READ_NODE_FIELD(onConflictSet);
+	READ_NODE_FIELD(onConflictWhere);
+	READ_UINT_FIELD(exclRelRTI);
+	READ_NODE_FIELD(exclRelTlist);
+
+	READ_DONE();
+}
+
+/*
+ * _readAppend
+ */
+static Append *
+_readAppend(void)
+{
+	READ_LOCALS(Append);
+
+	ReadCommonPlan((Plan *) local_node);
+
+	READ_NODE_FIELD(appendplans);
+
+	READ_DONE();
+}
+
+/*
+ * _readMergeAppend
+ */
+static MergeAppend *
+_readMergeAppend(void)
+{
+	int			tokenLength;
+
+	READ_LOCALS(MergeAppend);
+
+	ReadCommonPlan((Plan *) local_node);
+
+	READ_NODE_FIELD(mergeplans);
+	READ_INT_FIELD(numCols);
+
+	token = pg_strtok(&tokenLength);	/* skip :sortColIdx */
+	if (local_node->numCols)
+		local_node->sortColIdx = readAttrNumberCols(local_node->numCols);
+
+	token = pg_strtok(&tokenLength);	/* skip :sortOperators */
+	if (local_node->numCols)
+		local_node->sortOperators = readOidCols(local_node->numCols);
+
+	token = pg_strtok(&tokenLength);	/* skip :collations */
+	if (local_node->numCols)
+		local_node->collations = readOidCols(local_node->numCols);
+
+	token = pg_strtok(&tokenLength);	/* skip :nullsFirst */
+	if (local_node->numCols)
+		local_node->nullsFirst = readBoolCols(local_node->numCols);
+
+	READ_DONE();
+}
+
+/*
+ * _readRecursiveUnion
+ */
+static RecursiveUnion *
+_readRecursiveUnion(void)
+{
+	int			tokenLength;
+
+	READ_LOCALS(RecursiveUnion);
+
+	ReadCommonPlan((Plan *) local_node);
+
+	READ_INT_FIELD(wtParam);
+	READ_INT_FIELD(numCols);
+
+	token = pg_strtok(&tokenLength);	/* skip :dupColIdx */
+	if (local_node->numCols)
+		local_node->dupColIdx = readAttrNumberCols(local_node->numCols);
+
+	token = pg_strtok(&tokenLength);	/* skip :dupOperators */
+	if (local_node->numCols)
+		local_node->dupOperators = readOidCols(local_node->numCols);
+
+	READ_LONG_FIELD(numGroups);
+
+	READ_DONE();
+}
+
+/*
+ * _readBitmapAnd
+ */
+static BitmapAnd *
+_readBitmapAnd(void)
+{
+	READ_LOCALS(BitmapAnd);
+
+	ReadCommonPlan((Plan *) local_node);
+
+	READ_NODE_FIELD(bitmapplans);
+
+	READ_DONE();
+}
+
+/*
+ * _readBitmapOr
+ */
+static BitmapOr *
+_readBitmapOr(void)
+{
+	READ_LOCALS(BitmapOr);
+
+	ReadCommonPlan((Plan *) local_node);
+
+	READ_NODE_FIELD(bitmapplans);
+
+	READ_DONE();
+}
+
+/*
+ * _readScan
+ */
+static Scan *
+_readScan(void)
+{
+	READ_LOCALS(Scan);
+
+	ReadCommonPlan((Plan *) local_node);
+
+	READ_UINT_FIELD(scanrelid);
+
+	READ_DONE();
+}
+
+/*
+ * ReadCommonScan
+ *	Assign the basic stuff of all nodes that inherit from Scan
+ */
+static void
+ReadCommonScan(Scan *node)
+{
+	Scan	   *local_scan;
+
+	local_scan = _readScan();
+
+	node->plan.startup_cost = local_scan->plan.startup_cost;
+	node->plan.total_cost = local_scan->plan.total_cost;
+	node->plan.plan_rows = local_scan->plan.plan_rows;
+	node->plan.plan_width = local_scan->plan.plan_width;
+	node->plan.targetlist = local_scan->plan.targetlist;
+	node->plan.qual = local_scan->plan.qual;
+	node->plan.lefttree = local_scan->plan.lefttree;
+	node->plan.righttree = local_scan->plan.righttree;
+	node->plan.initPlan = local_scan->plan.initPlan;
+	node->plan.extParam = local_scan->plan.extParam;
+	node->plan.allParam = local_scan->plan.allParam;
+
+	node->scanrelid = local_scan->scanrelid;
+}
+
+/*
+ * _readSeqScan
+ */
+static SeqScan *
+_readSeqScan(void)
+{
+	READ_LOCALS_NO_FIELDS(SeqScan);
+
+	ReadCommonScan((Scan *) local_node);
+
+	READ_DONE();
+}
+
+/*
+ * _readSampleScan
+ */
+static SampleScan *
+_readSampleScan(void)
+{
+	READ_LOCALS(SampleScan);
+
+	ReadCommonScan((Scan *) local_node);
+
+	READ_NODE_FIELD(tablesample);
+
+	READ_DONE();
+}
+
+/*
+ * _readIndexScan
+ */
+static IndexScan *
+_readIndexScan(void)
+{
+	READ_LOCALS(IndexScan);
+
+	ReadCommonScan((Scan *) local_node);
+
+	READ_OID_FIELD(indexid);
+	READ_NODE_FIELD(indexqual);
+	READ_NODE_FIELD(indexqualorig);
+	READ_NODE_FIELD(indexorderby);
+	READ_NODE_FIELD(indexorderbyorig);
+	READ_NODE_FIELD(indexorderbyops);
+	READ_ENUM_FIELD(indexorderdir, ScanDirection);
+
+	READ_DONE();
+}
+
+/*
+ * _readIndexOnlyScan
+ */
+static IndexOnlyScan *
+_readIndexOnlyScan(void)
+{
+	READ_LOCALS(IndexOnlyScan);
+
+	ReadCommonScan((Scan *) local_node);
+
+	READ_OID_FIELD(indexid);
+	READ_NODE_FIELD(indexqual);
+	READ_NODE_FIELD(indexorderby);
+	READ_NODE_FIELD(indextlist);
+	READ_ENUM_FIELD(indexorderdir, ScanDirection);
+
+	READ_DONE();
+}
+
+/*
+ * _readBitmapIndexScan
+ */
+static BitmapIndexScan *
+_readBitmapIndexScan(void)
+{
+	READ_LOCALS(BitmapIndexScan);
+
+	ReadCommonScan((Scan *) local_node);
+
+	READ_OID_FIELD(indexid);
+	READ_NODE_FIELD(indexqual);
+	READ_NODE_FIELD(indexqualorig);
+
+	READ_DONE();
+}
+
+/*
+ * _readBitmapHeapScan
+ */
+static BitmapHeapScan *
+_readBitmapHeapScan(void)
+{
+	READ_LOCALS(BitmapHeapScan);
+
+	ReadCommonScan((Scan *) local_node);
+
+	READ_NODE_FIELD(bitmapqualorig);
+
+	READ_DONE();
+}
+
+/*
+ * _readTidScan
+ */
+static TidScan *
+_readTidScan(void)
+{
+	READ_LOCALS(TidScan);
+
+	ReadCommonScan((Scan *) local_node);
+
+	READ_NODE_FIELD(tidquals);
+
+	READ_DONE();
+}
+
+/*
+ * _readSubqueryScan
+ */
+static SubqueryScan *
+_readSubqueryScan(void)
+{
+	READ_LOCALS(SubqueryScan);
+
+	ReadCommonScan((Scan *) local_node);
+
+	READ_NODE_FIELD(subplan);
+
+	READ_DONE();
+}
+
+/*
+ * _readFunctionScan
+ */
+static FunctionScan *
+_readFunctionScan(void)
+{
+	READ_LOCALS(FunctionScan);
+
+	ReadCommonScan((Scan *) local_node);
+
+	READ_NODE_FIELD(functions);
+	READ_BOOL_FIELD(funcordinality);
+
+	READ_DONE();
+}
+
+/*
+ * _readValuesScan
+ */
+static ValuesScan *
+_readValuesScan(void)
+{
+	READ_LOCALS(ValuesScan);
+
+	ReadCommonScan((Scan *) local_node);
+
+	READ_NODE_FIELD(values_lists);
+
+	READ_DONE();
+}
+
+/*
+ * _readCteScan
+ */
+static CteScan *
+_readCteScan(void)
+{
+	READ_LOCALS(CteScan);
+
+	ReadCommonScan((Scan *) local_node);
+
+	READ_INT_FIELD(ctePlanId);
+	READ_INT_FIELD(cteParam);
+
+	READ_DONE();
+}
+
+/*
+ * _readWorkTableScan
+ */
+static WorkTableScan *
+_readWorkTableScan(void)
+{
+	READ_LOCALS(WorkTableScan);
+
+	ReadCommonScan((Scan *) local_node);
+
+	READ_INT_FIELD(wtParam);
+
+	READ_DONE();
+}
+
+/*
+ * _readForeignScan
+ */
+static ForeignScan *
+_readForeignScan(void)
+{
+	READ_LOCALS(ForeignScan);
+
+	ReadCommonScan((Scan *) local_node);
+
+	READ_OID_FIELD(fs_server);
+	READ_NODE_FIELD(fdw_exprs);
+	READ_NODE_FIELD(fdw_private);
+	READ_NODE_FIELD(fdw_scan_tlist);
+	READ_BITMAPSET_FIELD(fs_relids);
+	READ_BOOL_FIELD(fsSystemCol);
+
+	READ_DONE();
+}
+
+/*
+ * _readJoin
+ */
+static Join *
+_readJoin(void)
+{
+	READ_LOCALS(Join);
+
+	ReadCommonPlan((Plan *) local_node);
+
+	READ_ENUM_FIELD(jointype, JoinType);
+	READ_NODE_FIELD(joinqual);
+
+	READ_DONE();
+}
+
+/*
+ * _ReadCommonJoin
+ *	Assign the basic stuff of all nodes that inherit from Join
+ */
+static void
+_ReadCommonJoin(Join *node)
+{
+	Join	   *local_join;
+
+	local_join = _readJoin();
+
+	node->plan.startup_cost = local_join->plan.startup_cost;
+	node->plan.total_cost = local_join->plan.total_cost;
+	node->plan.plan_rows = local_join->plan.plan_rows;
+	node->plan.plan_width = local_join->plan.plan_width;
+	node->plan.targetlist = local_join->plan.targetlist;
+	node->plan.qual = local_join->plan.qual;
+	node->plan.lefttree = local_join->plan.lefttree;
+	node->plan.righttree = local_join->plan.righttree;
+	node->plan.initPlan = local_join->plan.initPlan;
+	node->plan.extParam = local_join->plan.extParam;
+	node->plan.allParam = local_join->plan.allParam;
+
+	node->jointype = local_join->jointype;
+	node->joinqual = local_join->joinqual;
+}
+
+/*
+ * _readNestLoop
+ */
+static NestLoop *
+_readNestLoop(void)
+{
+	READ_LOCALS(NestLoop);
+
+	_ReadCommonJoin((Join *) local_node);
+
+	READ_NODE_FIELD(nestParams);
+
+	READ_DONE();
+}
+
+/*
+ * _readMergeJoin
+ */
+static MergeJoin *
+_readMergeJoin(void)
+{
+	int			numCols;
+	int			tokenLength;
+
+	READ_LOCALS(MergeJoin);
+
+	_ReadCommonJoin((Join *) local_node);
+
+	READ_NODE_FIELD(mergeclauses);
+
+	numCols = list_length(local_node->mergeclauses);
+
+	token = pg_strtok(&tokenLength);	/* skip :mergeFamilies */
+	if (numCols)
+		local_node->mergeFamilies = readOidCols(numCols);
+
+	token = pg_strtok(&tokenLength);	/* skip :mergeCollations */
+	if (numCols)
+		local_node->mergeCollations = readOidCols(numCols);
+
+	token = pg_strtok(&tokenLength);	/* skip :mergeStrategies */
+	if (numCols)
+		local_node->mergeStrategies = readIntCols(numCols);
+
+	token = pg_strtok(&tokenLength);	/* skip :mergeNullsFirst */
+	if (numCols)
+		local_node->mergeNullsFirst = readBoolCols(numCols);
+
+	READ_DONE();
+}
+
+/*
+ * _readHashJoin
+ */
+static HashJoin *
+_readHashJoin(void)
+{
+	READ_LOCALS(HashJoin);
+
+	_ReadCommonJoin((Join *) local_node);
+
+	READ_NODE_FIELD(hashclauses);
+
+	READ_DONE();
+}
+
+/*
+ * _readMaterial
+ */
+static Material *
+_readMaterial(void)
+{
+	READ_LOCALS_NO_FIELDS(Material);
+
+	ReadCommonPlan((Plan *) local_node);
+
+	READ_DONE();
+}
+
+/*
+ * _readSort
+ */
+static Sort *
+_readSort(void)
+{
+	int			tokenLength;
+
+	READ_LOCALS(Sort);
+
+	ReadCommonPlan((Plan *) local_node);
+
+	READ_INT_FIELD(numCols);
+
+	token = pg_strtok(&tokenLength);	/* skip :sortColIdx */
+	if (local_node->numCols)
+		local_node->sortColIdx = readAttrNumberCols(local_node->numCols);
+
+	token = pg_strtok(&tokenLength);	/* skip :sortOperators */
+	if (local_node->numCols)
+		local_node->sortOperators = readOidCols(local_node->numCols);
+
+	token = pg_strtok(&tokenLength);	/* skip :collations */
+	if (local_node->numCols)
+		local_node->collations = readOidCols(local_node->numCols);
+
+	token = pg_strtok(&tokenLength);	/* skip :nullsFirst */
+	if (local_node->numCols)
+		local_node->nullsFirst = readBoolCols(local_node->numCols);
+
+	READ_DONE();
+}
+
+/*
+ * _readGroup
+ */
+static Group *
+_readGroup(void)
+{
+	int			tokenLength;
+
+	READ_LOCALS(Group);
+
+	ReadCommonPlan((Plan *) local_node);
+
+	READ_INT_FIELD(numCols);
+
+	token = pg_strtok(&tokenLength);	/* skip :grpColIdx */
+	if (local_node->numCols)
+		local_node->grpColIdx = readAttrNumberCols(local_node->numCols);
+
+	token = pg_strtok(&tokenLength);	/* skip :grpOperators */
+	if (local_node->numCols)
+		local_node->grpOperators = readOidCols(local_node->numCols);
+
+	READ_DONE();
+}
+
+/*
+ * _readAgg
+ */
+static Agg *
+_readAgg(void)
+{
+	int			tokenLength;
+
+	READ_LOCALS(Agg);
+
+	ReadCommonPlan((Plan *) local_node);
+
+	READ_ENUM_FIELD(aggstrategy, AggStrategy);
+	READ_INT_FIELD(numCols);
+
+	token = pg_strtok(&tokenLength);	/* skip :grpColIdx */
+	if (local_node->numCols)
+		local_node->grpColIdx = readAttrNumberCols(local_node->numCols);
+
+	token = pg_strtok(&tokenLength);	/* skip :grpOperators */
+	if (local_node->numCols)
+		local_node->grpOperators = readOidCols(local_node->numCols);
+
+	READ_LONG_FIELD(numGroups);
+
+	READ_NODE_FIELD(groupingSets);
+	READ_NODE_FIELD(chain);
+
+	READ_DONE();
+}
+
+/*
+ * _readWindowAgg
+ */
+static WindowAgg *
+_readWindowAgg(void)
+{
+	int			tokenLength;
+
+	READ_LOCALS(WindowAgg);
+
+	ReadCommonPlan((Plan *) local_node);
+
+	READ_UINT_FIELD(winref);
+	READ_INT_FIELD(partNumCols);
+
+	token = pg_strtok(&tokenLength);	/* skip :partColIdx */
+	if (local_node->partNumCols)
+		local_node->partColIdx = readAttrNumberCols(local_node->partNumCols);
+
+	token = pg_strtok(&tokenLength);	/* skip :partOperators */
+	if (local_node->partNumCols)
+		local_node->partOperators = readOidCols(local_node->partNumCols);
+
+	READ_INT_FIELD(ordNumCols);
+
+	token = pg_strtok(&tokenLength);	/* skip :ordColIdx */
+	if (local_node->ordNumCols)
+		local_node->ordColIdx = readAttrNumberCols(local_node->ordNumCols);
+
+	token = pg_strtok(&tokenLength);	/* skip :ordOperators */
+	if (local_node->ordNumCols)
+		local_node->ordOperators = readOidCols(local_node->ordNumCols);
+
+	READ_INT_FIELD(frameOptions);
+	READ_NODE_FIELD(startOffset);
+	READ_NODE_FIELD(endOffset);
+
+	READ_DONE();
+}
+
+/*
+ * _readUnique
+ */
+static Unique *
+_readUnique(void)
+{
+	int			tokenLength;
+
+	READ_LOCALS(Unique);
+
+	ReadCommonPlan((Plan *) local_node);
+
+	READ_INT_FIELD(numCols);
+
+	token = pg_strtok(&tokenLength);	/* skip :uniqColIdx */
+	if (local_node->numCols)
+		local_node->uniqColIdx = readAttrNumberCols(local_node->numCols);
+
+	token = pg_strtok(&tokenLength);	/* skip :uniqOperators */
+	if (local_node->numCols)
+		local_node->uniqOperators = readOidCols(local_node->numCols);
+
+	READ_DONE();
+}
+
+/*
+ * _readHash
+ */
+static Hash *
+_readHash(void)
+{
+	READ_LOCALS(Hash);
+
+	ReadCommonPlan((Plan *) local_node);
+
+	READ_OID_FIELD(skewTable);
+	READ_INT_FIELD(skewColumn);
+	READ_BOOL_FIELD(skewInherit);
+	READ_OID_FIELD(skewColType);
+	READ_INT_FIELD(skewColTypmod);
+
+	READ_DONE();
+}
+
+/*
+ * _readSetOp
+ */
+static SetOp *
+_readSetOp(void)
+{
+	int			tokenLength;
+
+	READ_LOCALS(SetOp);
+
+	ReadCommonPlan((Plan *) local_node);
+
+	READ_ENUM_FIELD(cmd, SetOpCmd);
+	READ_ENUM_FIELD(strategy, SetOpStrategy);
+	READ_INT_FIELD(numCols);
+
+	token = pg_strtok(&tokenLength);	/* skip :dupColIdx */
+	if (local_node->numCols)
+		local_node->dupColIdx = readAttrNumberCols(local_node->numCols);
+
+	token = pg_strtok(&tokenLength);	/* skip :dupOperators */
+	if (local_node->numCols)
+		local_node->dupOperators = readOidCols(local_node->numCols);
+
+	READ_INT_FIELD(flagColIdx);
+	READ_INT_FIELD(firstFlag);
+	READ_LONG_FIELD(numGroups);
+
+	READ_DONE();
+}
+
+/*
+ * _readLockRows
+ */
+static LockRows *
+_readLockRows(void)
+{
+	READ_LOCALS(LockRows);
+
+	ReadCommonPlan((Plan *) local_node);
+
+	READ_NODE_FIELD(rowMarks);
+	READ_INT_FIELD(epqParam);
+
+	READ_DONE();
+}
+
+/*
+ * _readLimit
+ */
+static Limit *
+_readLimit(void)
+{
+	READ_LOCALS(Limit);
+
+	ReadCommonPlan((Plan *) local_node);
+
+	READ_NODE_FIELD(limitOffset);
+	READ_NODE_FIELD(limitCount);
+
+	READ_DONE();
+}
+
+/*
+ * _readNestLoopParam
+ */
+static NestLoopParam *
+_readNestLoopParam(void)
+{
+	READ_LOCALS(NestLoopParam);
+
+	READ_INT_FIELD(paramno);
+	READ_NODE_FIELD(paramval);
+
+	READ_DONE();
+}
+
+/*
+ * _readPlanRowMark
+ */
+static PlanRowMark *
+_readPlanRowMark(void)
+{
+	READ_LOCALS(PlanRowMark);
+
+	READ_UINT_FIELD(rti);
+	READ_UINT_FIELD(prti);
+	READ_UINT_FIELD(rowmarkId);
+	READ_ENUM_FIELD(markType, RowMarkType);
+	READ_INT_FIELD(allMarkTypes);
+	READ_ENUM_FIELD(strength, LockClauseStrength);
+	READ_ENUM_FIELD(waitPolicy, LockWaitPolicy);
+	READ_BOOL_FIELD(isParent);
+
+	READ_DONE();
+}
+
+/*
+ * _readPlanInvalItem
+ */
+static PlanInvalItem *
+_readPlanInvalItem(void)
+{
+	READ_LOCALS(PlanInvalItem);
+
+	READ_INT_FIELD(cacheId);
+	READ_UINT_FIELD(hashValue);
+
+	READ_DONE();
+}
+
+/*
+ * _readSubPlan
+ */
+static SubPlan *
+_readSubPlan(void)
+{
+	READ_LOCALS(SubPlan);
+
+	READ_ENUM_FIELD(subLinkType, SubLinkType);
+	READ_NODE_FIELD(testexpr);
+	READ_NODE_FIELD(paramIds);
+	READ_INT_FIELD(plan_id);
+	READ_STRING_FIELD(plan_name);
+	READ_OID_FIELD(firstColType);
+	READ_INT_FIELD(firstColTypmod);
+	READ_OID_FIELD(firstColCollation);
+	READ_BOOL_FIELD(useHashTable);
+	READ_BOOL_FIELD(unknownEqFalse);
+	READ_NODE_FIELD(setParam);
+	READ_NODE_FIELD(parParam);
+	READ_NODE_FIELD(args);
+	READ_FLOAT_FIELD(startup_cost);
+	READ_FLOAT_FIELD(per_call_cost);
+
+	READ_DONE();
+}
+
+/*
+ * _readAlternativeSubPlan
+ */
+static AlternativeSubPlan *
+_readAlternativeSubPlan(void)
+{
+	READ_LOCALS(AlternativeSubPlan);
+
+	READ_NODE_FIELD(subplans);
+
+	READ_DONE();
+}
 
 /*
  * parseNodeString
@@ -1504,8 +2458,94 @@ parseNodeString(void)
 		return_value = _readTableSampleClause();
 	else if (MATCH("NOTIFY", 6))
 		return_value = _readNotifyStmt();
+	else if (MATCH("DEFELEM", 7))
+		return_value = _readDefElem();
 	else if (MATCH("DECLARECURSOR", 13))
 		return_value = _readDeclareCursorStmt();
+	else if (MATCH("PLANNEDSTMT", 11))
+		return_value = _readPlannedStmt();
+	else if (MATCH("PLAN", 4))
+		return_value = _readPlan();
+	else if (MATCH("RESULT", 6))
+		return_value = _readResult();
+	else if (MATCH("MODIFYTABLE", 11))
+		return_value = _readModifyTable();
+	else if (MATCH("APPEND", 6))
+		return_value = _readAppend();
+	else if (MATCH("MERGEAPPEND", 11))
+		return_value = _readMergeAppend();
+	else if (MATCH("RECURSIVEUNION", 14))
+		return_value = _readRecursiveUnion();
+	else if (MATCH("BITMAPAND", 9))
+		return_value = _readBitmapAnd();
+	else if (MATCH("BITMAPOR", 8))
+		return_value = _readBitmapOr();
+	else if (MATCH("SCAN", 4))
+		return_value = _readScan();
+	else if (MATCH("SEQSCAN", 7))
+		return_value = _readSeqScan();
+	else if (MATCH("SAMPLESCAN", 10))
+		return_value = _readSampleScan();
+	else if (MATCH("INDEXSCAN", 9))
+		return_value = _readIndexScan();
+	else if (MATCH("INDEXONLYSCAN", 13))
+		return_value = _readIndexOnlyScan();
+	else if (MATCH("BITMAPINDEXSCAN", 15))
+		return_value = _readBitmapIndexScan();
+	else if (MATCH("BITMAPHEAPSCAN", 14))
+		return_value = _readBitmapHeapScan();
+	else if (MATCH("TIDSCAN", 7))
+		return_value = _readTidScan();
+	else if (MATCH("SUBQUERYSCAN", 12))
+		return_value = _readSubqueryScan();
+	else if (MATCH("FUNCTIONSCAN", 12))
+		return_value = _readFunctionScan();
+	else if (MATCH("VALUESSCAN", 10))
+		return_value = _readValuesScan();
+	else if (MATCH("CTESCAN", 7))
+		return_value = _readCteScan();
+	else if (MATCH("WORKTABLESCAN", 13))
+		return_value = _readWorkTableScan();
+	else if (MATCH("FOREIGNSCAN", 11))
+		return_value = _readForeignScan();
+	else if (MATCH("JOIN", 4))
+		return_value = _readJoin();
+	else if (MATCH("NESTLOOP", 8))
+		return_value = _readNestLoop();
+	else if (MATCH("MERGEJOIN", 9))
+		return_value = _readMergeJoin();
+	else if (MATCH("HASHJOIN", 8))
+		return_value = _readHashJoin();
+	else if (MATCH("MATERIAL", 8))
+		return_value = _readMaterial();
+	else if (MATCH("SORT", 4))
+		return_value = _readSort();
+	else if (MATCH("GROUP", 5))
+		return_value = _readGroup();
+	else if (MATCH("AGG", 3))
+		return_value = _readAgg();
+	else if (MATCH("WINDOWAGG", 9))
+		return_value = _readWindowAgg();
+	else if (MATCH("UNIQUE", 6))
+		return_value = _readUnique();
+	else if (MATCH("HASH", 4))
+		return_value = _readHash();
+	else if (MATCH("SETOP", 5))
+		return_value = _readSetOp();
+	else if (MATCH("LOCKROWS", 8))
+		return_value = _readLockRows();
+	else if (MATCH("LIMIT", 5))
+		return_value = _readLimit();
+	else if (MATCH("NESTLOOPPARAM", 13))
+		return_value = _readNestLoopParam();
+	else if (MATCH("PLANROWMARK", 11))
+		return_value = _readPlanRowMark();
+	else if (MATCH("PLANINVALITEM", 13))
+		return_value = _readPlanInvalItem();
+	else if (MATCH("SUBPLAN", 7))
+		return_value = _readSubPlan();
+	else if (MATCH("ALTERNATIVESUBPLAN", 18))
+		return_value = _readAlternativeSubPlan();
 	else
 	{
 		elog(ERROR, "badly formatted node string \"%.32s\"...", token);
@@ -1576,3 +2616,87 @@ readDatum(bool typbyval)
 
 	return res;
 }
+
+/*
+ * readAttrNumberCols
+ */
+static AttrNumber *
+readAttrNumberCols(int numCols)
+{
+	int			tokenLength,
+				i;
+	char	   *token;
+	AttrNumber *attr_vals;
+
+	attr_vals = (AttrNumber *) palloc(numCols * sizeof(Oid));
+	for (i = 0; i < numCols; i++)
+	{
+		token = pg_strtok(&tokenLength);
+		attr_vals[i] = atoi(token);
+	}
+
+	return attr_vals;
+}
+
+/*
+ * readOidCols
+ */
+static Oid *
+readOidCols(int numCols)
+{
+	int			tokenLength,
+				i;
+	char	   *token;
+	Oid		   *oid_vals;
+
+	oid_vals = (Oid *) palloc(numCols * sizeof(Oid));
+	for (i = 0; i < numCols; i++)
+	{
+		token = pg_strtok(&tokenLength);
+		oid_vals[i] = atooid(token);
+	}
+
+	return oid_vals;
+}
+
+/*
+ * readIntCols
+ */
+static int *
+readIntCols(int numCols)
+{
+	int			tokenLength,
+				i;
+	char	   *token;
+	int		   *int_vals;
+
+	int_vals = (int *) palloc(numCols * sizeof(int));
+	for (i = 0; i < numCols; i++)
+	{
+		token = pg_strtok(&tokenLength);
+		int_vals[i] = atoi(token);
+	}
+
+	return int_vals;
+}
+
+/*
+ * readBoolCols
+ */
+static bool *
+readBoolCols(int numCols)
+{
+	int			tokenLength,
+				i;
+	char	   *token;
+	bool	   *bool_vals;
+
+	bool_vals = (bool *) palloc(numCols * sizeof(bool));
+	for (i = 0; i < numCols; i++)
+	{
+		token = pg_strtok(&tokenLength);
+		bool_vals[i] = strtobool(token);
+	}
+
+	return bool_vals;
+}
Attachment: read_funcs_test_v1.patch (application/octet-stream)
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index badb2d8..e0d244e 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1478,10 +1478,21 @@ _readResult(void)
 {
 	READ_LOCALS(Result);
 
+ 	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("Check Result Node")));
+
+
 	ReadCommonPlan((Plan *) local_node);
 
 	READ_NODE_FIELD(resconstantqual);
 
+ 	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("Result Node verified")));
+
 	READ_DONE();
 }
 
@@ -1493,6 +1504,11 @@ _readModifyTable(void)
 {
 	READ_LOCALS(ModifyTable);
 
+ 	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("Check Modify Table Node")));
+
 	ReadCommonPlan((Plan *) local_node);
 
 	READ_ENUM_FIELD(operation, CmdType);
@@ -1513,6 +1529,11 @@ _readModifyTable(void)
 	READ_UINT_FIELD(exclRelRTI);
 	READ_NODE_FIELD(exclRelTlist);
 
+	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("Modify Table verified")));
+
 	READ_DONE();
 }
 
@@ -1524,10 +1545,20 @@ _readAppend(void)
 {
 	READ_LOCALS(Append);
 
+ 	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("Check Append Node")));
+
 	ReadCommonPlan((Plan *) local_node);
 
 	READ_NODE_FIELD(appendplans);
 
+	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("Append Node verified")));
+
 	READ_DONE();
 }
 
@@ -1541,6 +1572,11 @@ _readMergeAppend(void)
 
 	READ_LOCALS(MergeAppend);
 
+ 	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("Check MergeAppend Node")));
+
 	ReadCommonPlan((Plan *) local_node);
 
 	READ_NODE_FIELD(mergeplans);
@@ -1562,6 +1598,11 @@ _readMergeAppend(void)
 	if (local_node->numCols)
 		local_node->nullsFirst = readBoolCols(local_node->numCols);
 
+	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("MergeAppend Node verified")));
+
 	READ_DONE();
 }
 
@@ -1575,6 +1616,11 @@ _readRecursiveUnion(void)
 
 	READ_LOCALS(RecursiveUnion);
 
+ 	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("Check RecursiveUnion Node")));
+
 	ReadCommonPlan((Plan *) local_node);
 
 	READ_INT_FIELD(wtParam);
@@ -1590,6 +1636,11 @@ _readRecursiveUnion(void)
 
 	READ_LONG_FIELD(numGroups);
 
+	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("RecursiveUnion Node verified")));
+
 	READ_DONE();
 }
 
@@ -1601,10 +1652,20 @@ _readBitmapAnd(void)
 {
 	READ_LOCALS(BitmapAnd);
 
+ 	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("Check BitmapAnd Node")));
+
 	ReadCommonPlan((Plan *) local_node);
 
 	READ_NODE_FIELD(bitmapplans);
 
+	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("BitmapAnd Node verified")));
+
 	READ_DONE();
 }
 
@@ -1616,10 +1677,20 @@ _readBitmapOr(void)
 {
 	READ_LOCALS(BitmapOr);
 
+	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("Check BitmapOr Node")));
+
 	ReadCommonPlan((Plan *) local_node);
 
 	READ_NODE_FIELD(bitmapplans);
 
+	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("BitmapOr Node verified")));
+
 	READ_DONE();
 }
 
@@ -1672,8 +1743,18 @@ _readSeqScan(void)
 {
 	READ_LOCALS_NO_FIELDS(SeqScan);
 
+	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("Check SeqScan Node")));
+
 	ReadCommonScan((Scan *) local_node);
 
+ 	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("SeqScan Node verified")));
+
 	READ_DONE();
 }
 
@@ -1685,10 +1766,20 @@ _readSampleScan(void)
 {
 	READ_LOCALS(SampleScan);
 
+ 	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("Check SampleScan Node")));
+
 	ReadCommonScan((Scan *) local_node);
 
 	READ_NODE_FIELD(tablesample);
 
+ 	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("SampleScan Node verified")));
+
 	READ_DONE();
 }
 
@@ -1700,6 +1791,11 @@ _readIndexScan(void)
 {
 	READ_LOCALS(IndexScan);
 
+ 	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("Check IndexScan Node")));
+
 	ReadCommonScan((Scan *) local_node);
 
 	READ_OID_FIELD(indexid);
@@ -1710,6 +1806,11 @@ _readIndexScan(void)
 	READ_NODE_FIELD(indexorderbyops);
 	READ_ENUM_FIELD(indexorderdir, ScanDirection);
 
+	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("IndexScan Node verified")));
+
 	READ_DONE();
 }
 
@@ -1721,6 +1822,11 @@ _readIndexOnlyScan(void)
 {
 	READ_LOCALS(IndexOnlyScan);
 
+ 	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("Check IndexOnlyScan Node")));
+
 	ReadCommonScan((Scan *) local_node);
 
 	READ_OID_FIELD(indexid);
@@ -1729,6 +1835,11 @@ _readIndexOnlyScan(void)
 	READ_NODE_FIELD(indextlist);
 	READ_ENUM_FIELD(indexorderdir, ScanDirection);
 
+	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("IndexOnlyScan Node verified")));
+
 	READ_DONE();
 }
 
@@ -1740,12 +1851,22 @@ _readBitmapIndexScan(void)
 {
 	READ_LOCALS(BitmapIndexScan);
 
+ 	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("Check BitmapIndexScan Node")));
+
 	ReadCommonScan((Scan *) local_node);
 
 	READ_OID_FIELD(indexid);
 	READ_NODE_FIELD(indexqual);
 	READ_NODE_FIELD(indexqualorig);
 
+ 	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("BitmapIndexScan Node verified")));
+
 	READ_DONE();
 }
 
@@ -1757,10 +1878,20 @@ _readBitmapHeapScan(void)
 {
 	READ_LOCALS(BitmapHeapScan);
 
+	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("Check BitmapHeapScan Node")));
+
 	ReadCommonScan((Scan *) local_node);
 
 	READ_NODE_FIELD(bitmapqualorig);
 
+ 	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("BitmapHeapScan Node verified")));
+
 	READ_DONE();
 }
 
@@ -1772,10 +1903,20 @@ _readTidScan(void)
 {
 	READ_LOCALS(TidScan);
 
+ 	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("Check TidScan Node")));
+
 	ReadCommonScan((Scan *) local_node);
 
 	READ_NODE_FIELD(tidquals);
 
+ 	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("TidScan Node verified")));
+
 	READ_DONE();
 }
 
@@ -1787,10 +1928,20 @@ _readSubqueryScan(void)
 {
 	READ_LOCALS(SubqueryScan);
 
+ 	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("Check SubqueryScan Node")));
+
 	ReadCommonScan((Scan *) local_node);
 
 	READ_NODE_FIELD(subplan);
 
+ 	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("SubqueryScan Node verified")));
+
 	READ_DONE();
 }
 
@@ -1802,11 +1953,21 @@ _readFunctionScan(void)
 {
 	READ_LOCALS(FunctionScan);
 
+ 	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("Check FunctionScan Node")));
+
 	ReadCommonScan((Scan *) local_node);
 
 	READ_NODE_FIELD(functions);
 	READ_BOOL_FIELD(funcordinality);
 
+ 	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("FunctionScan Node verified")));
+
 	READ_DONE();
 }
 
@@ -1818,10 +1979,20 @@ _readValuesScan(void)
 {
 	READ_LOCALS(ValuesScan);
 
+ 	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("Check ValuesScan Node")));
+
 	ReadCommonScan((Scan *) local_node);
 
 	READ_NODE_FIELD(values_lists);
 
+ 	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("ValuesScan Node verified")));
+
 	READ_DONE();
 }
 
@@ -1833,11 +2004,21 @@ _readCteScan(void)
 {
 	READ_LOCALS(CteScan);
 
+ 	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("Check CteScan Node")));
+
 	ReadCommonScan((Scan *) local_node);
 
 	READ_INT_FIELD(ctePlanId);
 	READ_INT_FIELD(cteParam);
 
+ 	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("CteScan Node verified")));
+
 	READ_DONE();
 }
 
@@ -1849,10 +2030,20 @@ _readWorkTableScan(void)
 {
 	READ_LOCALS(WorkTableScan);
 
+ 	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("Check WorkTableScan Node")));
+
 	ReadCommonScan((Scan *) local_node);
 
 	READ_INT_FIELD(wtParam);
 
+ 	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("WorkTableScan Node verified")));
+
 	READ_DONE();
 }
 
@@ -1864,6 +2055,11 @@ _readForeignScan(void)
 {
 	READ_LOCALS(ForeignScan);
 
+ 	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("Check ForeignScan Node")));
+
 	ReadCommonScan((Scan *) local_node);
 
 	READ_OID_FIELD(fs_server);
@@ -1873,6 +2069,11 @@ _readForeignScan(void)
 	READ_BITMAPSET_FIELD(fs_relids);
 	READ_BOOL_FIELD(fsSystemCol);
 
+	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("ForeignScan Node verified")));
+
 	READ_DONE();
 }
 
@@ -1927,10 +2128,20 @@ _readNestLoop(void)
 {
 	READ_LOCALS(NestLoop);
 
+	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("Check NestLoop Node")));
+
 	_ReadCommonJoin((Join *) local_node);
 
 	READ_NODE_FIELD(nestParams);
 
+	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("NestLoop Node verified")));
+
 	READ_DONE();
 }
 
@@ -1945,6 +2156,11 @@ _readMergeJoin(void)
 
 	READ_LOCALS(MergeJoin);
 
+	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("Check MergeJoin Node")));
+
 	_ReadCommonJoin((Join *) local_node);
 
 	READ_NODE_FIELD(mergeclauses);
@@ -1967,6 +2183,11 @@ _readMergeJoin(void)
 	if (numCols)
 		local_node->mergeNullsFirst = readBoolCols(numCols);
 
+	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("MergeJoin Node verified")));
+
 	READ_DONE();
 }
 
@@ -1978,10 +2199,20 @@ _readHashJoin(void)
 {
 	READ_LOCALS(HashJoin);
 
+	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("Check HashJoin Node")));
+
 	_ReadCommonJoin((Join *) local_node);
 
 	READ_NODE_FIELD(hashclauses);
 
+	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("HashJoin Node verified")));
+
 	READ_DONE();
 }
 
@@ -1992,6 +2223,11 @@ static Material *
 _readMaterial(void)
 {
 	READ_LOCALS_NO_FIELDS(Material);
+	
+	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("Check Material Node")));
 
 	ReadCommonPlan((Plan *) local_node);
 
@@ -2008,6 +2244,11 @@ _readSort(void)
 
 	READ_LOCALS(Sort);
 
+	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("Check Sort Node")));
+
 	ReadCommonPlan((Plan *) local_node);
 
 	READ_INT_FIELD(numCols);
@@ -2028,6 +2269,11 @@ _readSort(void)
 	if (local_node->numCols)
 		local_node->nullsFirst = readBoolCols(local_node->numCols);
 
+	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("Sort Node verified")));
+
 	READ_DONE();
 }
 
@@ -2041,6 +2287,11 @@ _readGroup(void)
 
 	READ_LOCALS(Group);
 
+	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("Check Group Node")));
+
 	ReadCommonPlan((Plan *) local_node);
 
 	READ_INT_FIELD(numCols);
@@ -2053,6 +2304,11 @@ _readGroup(void)
 	if (local_node->numCols)
 		local_node->grpOperators = readOidCols(local_node->numCols);
 
+	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("Group Node verified")));
+
 	READ_DONE();
 }
 
@@ -2066,6 +2322,11 @@ _readAgg(void)
 
 	READ_LOCALS(Agg);
 
+	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("Check Agg Node")));
+
 	ReadCommonPlan((Plan *) local_node);
 
 	READ_ENUM_FIELD(aggstrategy, AggStrategy);
@@ -2084,6 +2345,11 @@ _readAgg(void)
 	READ_NODE_FIELD(groupingSets);
 	READ_NODE_FIELD(chain);
 
+	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("Agg Node verified")));
+
 	READ_DONE();
 }
 
@@ -2097,6 +2363,11 @@ _readWindowAgg(void)
 
 	READ_LOCALS(WindowAgg);
 
+	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("Check WindowAgg Node")));
+
 	ReadCommonPlan((Plan *) local_node);
 
 	READ_UINT_FIELD(winref);
@@ -2124,6 +2395,11 @@ _readWindowAgg(void)
 	READ_NODE_FIELD(startOffset);
 	READ_NODE_FIELD(endOffset);
 
+	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("WindowAgg Node verified")));
+
 	READ_DONE();
 }
 
@@ -2137,6 +2413,11 @@ _readUnique(void)
 
 	READ_LOCALS(Unique);
 
+	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("Check Unique Node")));
+
 	ReadCommonPlan((Plan *) local_node);
 
 	READ_INT_FIELD(numCols);
@@ -2149,6 +2430,11 @@ _readUnique(void)
 	if (local_node->numCols)
 		local_node->uniqOperators = readOidCols(local_node->numCols);
 
+	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("Unique Node verified")));
+
 	READ_DONE();
 }
 
@@ -2160,6 +2446,11 @@ _readHash(void)
 {
 	READ_LOCALS(Hash);
 
+	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("Check Hash Node")));
+
 	ReadCommonPlan((Plan *) local_node);
 
 	READ_OID_FIELD(skewTable);
@@ -2168,6 +2459,11 @@ _readHash(void)
 	READ_OID_FIELD(skewColType);
 	READ_INT_FIELD(skewColTypmod);
 
+	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("Hash Node verified")));
+
 	READ_DONE();
 }
 
@@ -2181,6 +2477,11 @@ _readSetOp(void)
 
 	READ_LOCALS(SetOp);
 
+	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("Check SetOp Node")));
+
 	ReadCommonPlan((Plan *) local_node);
 
 	READ_ENUM_FIELD(cmd, SetOpCmd);
@@ -2199,6 +2500,11 @@ _readSetOp(void)
 	READ_INT_FIELD(firstFlag);
 	READ_LONG_FIELD(numGroups);
 
+	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("SetOp Node verified")));
+
 	READ_DONE();
 }
 
@@ -2210,11 +2516,21 @@ _readLockRows(void)
 {
 	READ_LOCALS(LockRows);
 
+	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("Check LockRows Node")));
+
 	ReadCommonPlan((Plan *) local_node);
 
 	READ_NODE_FIELD(rowMarks);
 	READ_INT_FIELD(epqParam);
 
+	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("LockRows Node verified")));
+
 	READ_DONE();
 }
 
@@ -2226,11 +2542,21 @@ _readLimit(void)
 {
 	READ_LOCALS(Limit);
 
+	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("Check Limit Node")));
+
 	ReadCommonPlan((Plan *) local_node);
 
 	READ_NODE_FIELD(limitOffset);
 	READ_NODE_FIELD(limitCount);
 
+	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("Limit Node verified")));
+
 	READ_DONE();
 }
 
@@ -2242,9 +2568,19 @@ _readNestLoopParam(void)
 {
 	READ_LOCALS(NestLoopParam);
 
+ 	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("Check NestLoopParam Node")));
+
 	READ_INT_FIELD(paramno);
 	READ_NODE_FIELD(paramval);
 
+ 	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("NestLoopParam Node verified")));
+
 	READ_DONE();
 }
 
@@ -2256,6 +2592,11 @@ _readPlanRowMark(void)
 {
 	READ_LOCALS(PlanRowMark);
 
+ 	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("Check PlanRowMark Node")));
+
 	READ_UINT_FIELD(rti);
 	READ_UINT_FIELD(prti);
 	READ_UINT_FIELD(rowmarkId);
@@ -2265,6 +2606,11 @@ _readPlanRowMark(void)
 	READ_ENUM_FIELD(waitPolicy, LockWaitPolicy);
 	READ_BOOL_FIELD(isParent);
 
+ 	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("PlanRowMark Node verified")));
+
 	READ_DONE();
 }
 
@@ -2276,9 +2622,19 @@ _readPlanInvalItem(void)
 {
 	READ_LOCALS(PlanInvalItem);
 
+ 	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("Check PlanInvalItem Node")));
+
 	READ_INT_FIELD(cacheId);
 	READ_UINT_FIELD(hashValue);
 
+ 	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("PlanInvalItem Node verified")));
+
 	READ_DONE();
 }
 
@@ -2290,6 +2646,11 @@ _readSubPlan(void)
 {
 	READ_LOCALS(SubPlan);
 
+	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("Check SubPlan Node")));
+
 	READ_ENUM_FIELD(subLinkType, SubLinkType);
 	READ_NODE_FIELD(testexpr);
 	READ_NODE_FIELD(paramIds);
@@ -2306,6 +2667,11 @@ _readSubPlan(void)
 	READ_FLOAT_FIELD(startup_cost);
 	READ_FLOAT_FIELD(per_call_cost);
 
+	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("SubPlan Node verified")));
+
 	READ_DONE();
 }
 
@@ -2317,8 +2683,18 @@ _readAlternativeSubPlan(void)
 {
 	READ_LOCALS(AlternativeSubPlan);
 
+	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("Check AlternativeSubPlan Node")));
+
 	READ_NODE_FIELD(subplans);
 
+	ereport(LOG,
+			(errhidestmt(true),
+			 errhidecontext(true),
+			 errmsg("AlternativeSubPlan Node verified")));
+
 	READ_DONE();
 }
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 06be922..570786c 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -135,6 +135,172 @@ static Plan *build_grouping_chain(PlannerInfo *root,
 					 long numGroups,
 					 Plan *result_plan);
 
+
+/*
+ * fix_node_funcids
+ *		Set the opfuncid (procedure OID) in an OpExpr node,
+ *		for plan tree.
+ *
+ * We need it mainly to fix the opfuncid in nodes of plantree
+ * after reading the planned statement by worker backend.
+ */
+static void
+fix_node_funcids(Plan *node)
+{
+	ListCell   *temp;
+
+	/*
+	 * do nothing when we get to the end of a leaf on tree.
+	 */
+	if (node == NULL)
+		return;
+
+	fix_opfuncids((Node*) node->qual);
+	fix_opfuncids((Node*) node->targetlist);
+
+	switch (nodeTag(node))
+	{
+		case T_Result:
+			fix_opfuncids((Node*) (((Result *) node)->resconstantqual));
+			break;
+		case T_ModifyTable:
+			foreach(temp, (List *) ((ModifyTable *) node)->plans)
+				fix_node_funcids((Plan *) lfirst(temp));
+
+			fix_opfuncids((Node*) (((ModifyTable *) node)->withCheckOptionLists));
+			fix_opfuncids((Node*) (((ModifyTable *) node)->returningLists));
+			/*
+			 * we should fix funcids for fdwPrivLists, but it is not
+			 * clear what kind of expressions it can contain.
+			 */
+			fix_opfuncids((Node*) (((ModifyTable *) node)->onConflictSet));
+			fix_opfuncids((Node*) (((ModifyTable *) node)->onConflictWhere));
+			fix_opfuncids((Node*) (((ModifyTable *) node)->exclRelTlist));
+			break;
+		case T_Append:
+			foreach(temp, (List *) ((Append *) node)->appendplans)
+				fix_node_funcids((Plan *) lfirst(temp));
+			break;
+		case T_MergeAppend:
+			foreach(temp, (List *) ((MergeAppend *) node)->mergeplans)
+				fix_node_funcids((Plan *) lfirst(temp));
+			break;
+		case T_RecursiveUnion:
+			break;
+		case T_BitmapAnd:
+			foreach(temp, (List *) ((BitmapAnd *) node)->bitmapplans)
+				fix_node_funcids((Plan *) lfirst(temp));
+			break;
+		case T_BitmapOr:
+			foreach(temp, (List *) ((BitmapOr *) node)->bitmapplans)
+				fix_node_funcids((Plan *) lfirst(temp));
+			break;
+		case T_Scan:
+			break;
+		case T_SeqScan:
+			break;
+		case T_SampleScan:
+			fix_opfuncids((Node*) (((SampleScan *) node)->tablesample));
+			break;
+		case T_IndexScan:
+			fix_opfuncids((Node*) (((IndexScan *) node)->indexqual));
+			fix_opfuncids((Node*) (((IndexScan *) node)->indexqualorig));
+			fix_opfuncids((Node*) (((IndexScan *) node)->indexorderby));
+			fix_opfuncids((Node*) (((IndexScan *) node)->indexorderbyorig));
+			break;
+		case T_IndexOnlyScan:
+			fix_opfuncids((Node*) (((IndexOnlyScan *) node)->indexqual));
+			fix_opfuncids((Node*) (((IndexOnlyScan *) node)->indexorderby));
+			fix_opfuncids((Node*) (((IndexOnlyScan *) node)->indextlist));
+			break;
+		case T_BitmapIndexScan:
+			fix_opfuncids((Node*) (((BitmapIndexScan *) node)->indexqual));
+			fix_opfuncids((Node*) (((BitmapIndexScan *) node)->indexqualorig));
+			break;
+		case T_BitmapHeapScan:
+			fix_opfuncids((Node*) (((BitmapHeapScan *) node)->bitmapqualorig));
+			break;
+		case T_TidScan:
+			fix_opfuncids((Node*) (((TidScan *) node)->tidquals));
+			break;
+		case T_SubqueryScan:
+			fix_node_funcids((Plan *) ((SubqueryScan *) node)->subplan);
+			break;
+		case T_FunctionScan:
+			fix_opfuncids((Node*) (((FunctionScan *) node)->functions));
+			break;
+		case T_ValuesScan:
+			fix_opfuncids((Node*) (((ValuesScan *) node)->values_lists));
+			break;
+		case T_CteScan:
+			break;
+		case T_WorkTableScan:
+			break;
+		case T_ForeignScan:
+			fix_opfuncids((Node*) (((ForeignScan *) node)->fdw_exprs));
+			/*
+			 * we should fix funcids for fdw_private, but it is not
+			 * clear what kind of expressions it can contain.
+			 */
+			fix_opfuncids((Node*) (((ForeignScan *) node)->fdw_scan_tlist));
+			break;
+		case T_Join:
+			fix_opfuncids((Node*) (((Join *) node)->joinqual));
+			break;
+		case T_NestLoop:
+			fix_opfuncids((Node*) (((NestLoop *) node)->join.joinqual));
+			foreach(temp, (List *) ((NestLoop*) node)->nestParams)
+				fix_opfuncids((Node*) ((NestLoopParam *) lfirst(temp))->paramval);
+			break;
+		case T_MergeJoin:
+			fix_opfuncids((Node*) (((MergeJoin *) node)->join.joinqual));
+			fix_opfuncids((Node*) (((MergeJoin *) node)->mergeclauses));
+			break;
+		case T_HashJoin:
+			fix_opfuncids((Node*) (((HashJoin *) node)->join.joinqual));
+			fix_opfuncids((Node*) (((HashJoin *) node)->hashclauses));
+			break;
+		case T_Material:
+			break;
+		case T_Sort:
+			break;
+		case T_Group:
+			break;
+		case T_Agg:
+			foreach(temp, (List *) ((Agg *) node)->chain)
+				fix_node_funcids((Plan *) lfirst(temp));
+			break;
+		case T_WindowAgg:
+			fix_opfuncids((Node*) (((WindowAgg *) node)->startOffset));
+			fix_opfuncids((Node*) (((WindowAgg *) node)->endOffset));
+			break;
+		case T_Unique:
+			break;
+		case T_Hash:
+			break;
+		case T_SetOp:
+			break;
+		case T_LockRows:
+			break;
+		case T_Limit:
+			fix_opfuncids((Node*) (((Limit *) node)->limitOffset));
+			fix_opfuncids((Node*) (((Limit *) node)->limitCount));
+			break;
+		case T_SubPlan:
+			fix_opfuncids((Node*) ((SubPlan *) node));
+			break;
+		case T_AlternativeSubPlan:
+			fix_opfuncids((Node*) ((AlternativeSubPlan *) node));
+			break;
+		default:
+			elog(ERROR, "unrecognized node type: %d", (int) nodeTag(node));
+			break;
+	}
+
+	fix_node_funcids(node->lefttree);
+	fix_node_funcids(node->righttree);
+}
+
 /*****************************************************************************
  *
  *	   Query optimizer entry point
@@ -152,12 +318,25 @@ PlannedStmt *
 planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
 {
 	PlannedStmt *result;
+	PlannedStmt *verify_plannedstmt;
+	char		*plannedstmt_str;
+	ListCell   *temp;
 
 	if (planner_hook)
 		result = (*planner_hook) (parse, cursorOptions, boundParams);
 	else
 		result = standard_planner(parse, cursorOptions, boundParams);
-	return result;
+
+	plannedstmt_str = nodeToString(result);
+
+	verify_plannedstmt = (PlannedStmt *) stringToNode(plannedstmt_str);
+
+	fix_node_funcids(verify_plannedstmt->planTree);
+
+	foreach(temp, (List *) verify_plannedstmt->subplans)
+		fix_node_funcids((Plan *) lfirst(temp));
+
+	return verify_plannedstmt;
 }
 
 PlannedStmt *
#340Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#339)
Re: Parallel Seq Scan

On Tue, Sep 22, 2015 at 3:21 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Attached patch (read_funcs_v1.patch) contains support for all the plan
and other nodes (like SubPlan which could be required for worker) except
CustomScan node.

It looks like you need to update the top-of-file comment for outfuncs.c.

Doesn't _readCommonPlan() leak? I think we should avoid that.
_readCommonScan() and _readJoin() are worse: they leak multiple
objects. It should be simple enough to avoid this: just have your
helper function take a Plan * as argument and then use
READ_TEMP_LOCALS() rather than READ_LOCALS(). Then the caller can use
READ_LOCALS, call the helper to fill in all the Plan fields, and then
read the other stuff itself.

Instead of passing the Plan down by casting, how about passing
&local_node->plan? And similarly for scans and joins.
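A minimal sketch of the shape suggested above, using toy structs rather than the real plannodes.h definitions. The names read_common_plan, read_scan and next_number are hypothetical stand-ins, not functions from the patch. The point is only that the helper fills a caller-owned Plan and allocates nothing itself, so there is nothing to leak, and that the caller can pass the embedded member instead of casting.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-ins for the PostgreSQL structs; only the embedding matters. */
typedef struct Plan
{
	double		startup_cost;
	double		total_cost;
} Plan;

typedef struct Scan
{
	Plan		plan;			/* embedded "base" part of the node */
	unsigned	scanrelid;
} Scan;

/* Hypothetical token source: a cursor into a string of numbers. */
static const char *cursor;

static double
next_number(void)
{
	char	   *end;
	double		val = strtod(cursor, &end);

	cursor = end;
	return val;
}

/* The helper fills a caller-owned Plan and allocates nothing. */
static void
read_common_plan(Plan *plan)
{
	assert(plan != NULL);
	plan->startup_cost = next_number();
	plan->total_cost = next_number();
}

/* The caller allocates the whole node once, then reads its own fields. */
static Scan *
read_scan(const char *input)
{
	Scan	   *node = calloc(1, sizeof(Scan));

	if (node == NULL)
		return NULL;
	cursor = input;
	read_common_plan(&node->plan);	/* embedded member, no cast */
	node->scanrelid = (unsigned) next_number();
	return node;
}
```

With this layout the only allocation is the one made by the outermost reader, which is also the object it returns.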

readAttrNumberCols uses sizeof(Oid) instead of sizeof(AttrNumber).
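For illustration, a standalone sketch of sizing the array by its own element type. It assumes AttrNumber is a 16-bit integer as in PostgreSQL, and read_attr_number_cols is a hypothetical toy that parses from a plain string instead of using pg_strtok. Writing sizeof(*vals) ties the allocation to the pointer's element type, so it cannot drift from the declaration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef int16_t AttrNumber;		/* assumption: mirrors PostgreSQL's int16 */

/*
 * Parse numCols whitespace-separated integers from input into a new array.
 * The allocation is sized by the element type of the result pointer.
 */
static AttrNumber *
read_attr_number_cols(const char *input, int numCols)
{
	AttrNumber *vals;
	char	   *end;
	int			i;

	assert(numCols > 0);
	vals = malloc(numCols * sizeof(*vals));
	if (vals == NULL)
		return NULL;
	for (i = 0; i < numCols; i++)
	{
		vals[i] = (AttrNumber) strtol(input, &end, 10);
		input = end;
	}
	return vals;
}
```

Since an Oid is wider than an AttrNumber, the sizeof(Oid) version over-allocates rather than overflows, but the mismatch is still misleading to readers.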

I still don't understand why we need to handle PlanInvalItem.
Actually, come to think of it, I'm not sure we need PlannedStmt
either. Let's leave those out; they seem like trouble.

I think it would be worth doing something like this:

#define READ_ATTRNUMBER_ARRAY(fldname, len) \
	token = pg_strtok(&length); \
	local_node->fldname = readAttrNumberCols(len);

And similarly for READ_OID_ARRAY, READ_BOOL_ARRAY, READ_INT_ARRAY.
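A rough standalone sketch of how such an array macro could behave, shown for the int case. Everything here is a toy: next_token stands in for pg_strtok, and ToySort, read_int_cols and read_toy_sort are hypothetical names. The macro first consumes the ":fldname" label and then reads len values into the field; it is wrapped in do/while (0) so that it acts as a single statement.

```c
#include <assert.h>
#include <stdlib.h>

static char *stream;			/* toy input cursor, stands in for pg_strtok state */

/* Return the next space-delimited token, or NULL at end of input. */
static char *
next_token(void)
{
	char	   *tok;

	while (*stream == ' ')
		stream++;
	if (*stream == '\0')
		return NULL;
	tok = stream;
	while (*stream != ' ' && *stream != '\0')
		stream++;
	if (*stream == ' ')
		*stream++ = '\0';		/* terminate the token in place */
	return tok;
}

/* Read numCols integers; returns NULL when there is nothing to read. */
static int *
read_int_cols(int numCols)
{
	int		   *vals;
	int			i;

	if (numCols <= 0)
		return NULL;
	vals = malloc(numCols * sizeof(*vals));
	assert(vals != NULL);
	for (i = 0; i < numCols; i++)
	{
		char	   *tok = next_token();

		vals[i] = tok ? atoi(tok) : 0;
	}
	return vals;
}

/* Skip the ":fldname" label, then read len values into the field. */
#define READ_INT_ARRAY(node, fldname, len) \
	do { \
		(void) next_token(); \
		(node)->fldname = read_int_cols(len); \
	} while (0)

typedef struct ToySort
{
	int			numCols;
	int		   *sortColIdx;
} ToySort;

/* Parse ":numCols N :sortColIdx v1 ... vN" from a writable buffer. */
static void
read_toy_sort(ToySort *node, char *input)
{
	char	   *tok;

	stream = input;
	(void) next_token();		/* skip the ":numCols" label */
	tok = next_token();
	node->numCols = tok ? atoi(tok) : 0;
	READ_INT_ARRAY(node, sortColIdx, node->numCols);
}
```

Centralizing the label skip in the macro keeps each _readXXX function to one line per array field, which is the main attraction of the suggestion.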

In general these routines are in the same order as plannodes.h, which
is good. But _readNestLoopParam is out of place. Can we move it just
after _readNestLoop?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#341Kouhei Kaigai
kaigai@ak.jp.nec.com
In reply to: Robert Haas (#340)
Re: Parallel Seq Scan

On Thu, Sep 17, 2015 at 4:44 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I haven't studied the planner logic in enough detail yet to have a
clear opinion on this. But what I do think is that this is a very
good reason why we should bite the bullet and add outfuncs/readfuncs
support for all Plan nodes. Otherwise, we're going to have to scan
subplans for nodes we're not expecting to see there, which seems
silly. We eventually want to allow all of those nodes in the worker
anyway.

Makes sense to me. There are 39 plan nodes; it seems we have
support for all of them in outfuncs and need to add it for most of them
in readfuncs.

Attached patch (read_funcs_v1.patch) contains support for all the plan
and other nodes (like SubPlan which could be required for worker) except
CustomScan node. CustomScan contains TextOutCustomScan and doesn't
contain corresponding Read function pointer, we could add the support for
same, but I am not sure if CustomScan is required to be passed to worker
in near future, so I am leaving it for now.

Oh... I did exactly the same job a few days ago.
https://github.com/kaigai/sepgsql/blob/readfuncs/src/backend/nodes/readfuncs.c

Regarding the CustomScan node, I'd like to run it in a worker process as soon
as possible once it is supported. I'm highly motivated.

Andres raised a related topic a few weeks ago:
/messages/by-id/20150825181933.GA19326@awork2.anarazel.de

Here are two issues:

* How to reproduce the "methods" pointer in another process. The extension may
not be loaded via shared_preload_libraries.
-> One solution is to provide a pair of library and symbol name of the method
table, instead of the pointer. I think it is a reasonable idea.

* How to treat the additional output of TextOutCustomScan.
-> There are two solutions. (1) Mark TextOutCustomScan as an obsolete callback;
however, that still leaves Andres's concern, because we need to form/deform
private data to keep it copyObject-safe. (2) Add a TextReadCustomScan (and
NodeEqualCustomScan?) callback to process the private fields.
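To make the first idea concrete, here is a toy sketch of shipping a (library, symbol) pair instead of a raw pointer and resolving it again on the receiving side. All names (ToyMethods, MethodRef, resolve_methods, the "demo_lib" entry) are hypothetical, and the static registry is only a stand-in for a real dynamic-symbol lookup in the worker.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy method table; a real one would carry the scan callbacks. */
typedef struct ToyMethods
{
	const char *custom_name;
	int			(*exec) (int);
} ToyMethods;

static int
double_it(int x)
{
	return 2 * x;
}

static const ToyMethods demo_methods = {"demo", double_it};

/* What gets serialized: names, which stay valid across processes. */
typedef struct MethodRef
{
	const char *library;
	const char *symbol;
} MethodRef;

/* Stand-in for a per-process symbol lookup. */
typedef struct RegistryEntry
{
	MethodRef	ref;
	const ToyMethods *methods;
} RegistryEntry;

static const RegistryEntry registry[] = {
	{{"demo_lib", "demo_methods"}, &demo_methods},
};

/* Resolve a name pair to this process's own pointer, or NULL if unknown. */
static const ToyMethods *
resolve_methods(const MethodRef *ref)
{
	size_t		i;

	assert(ref != NULL);
	for (i = 0; i < sizeof(registry) / sizeof(registry[0]); i++)
	{
		if (strcmp(registry[i].ref.library, ref->library) == 0 &&
			strcmp(registry[i].ref.symbol, ref->symbol) == 0)
			return registry[i].methods;
	}
	return NULL;
}
```

The pointer value itself never crosses the process boundary; each side looks up its own address from the names, so differing load addresses do not matter.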

To verify the patch, I have done 2 things: first I added elog calls to
the newly supported read funcs, and then in the planner I used
nodeToString and stringToNode on the planned_stmt and used the
newly generated planned_stmt for further execution. After making these
changes, I ran make check-world and ensured that it covers all the
newly added nodes.
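The round-trip idea can be illustrated in miniature with a toy node. This is only a sketch: ToyLimit, toy_out, toy_read and toy_round_trip_ok are hypothetical, and the text format merely imitates the look of nodeToString output. The check serializes a node, parses the text back into a fresh copy, and compares the fields.

```c
#include <assert.h>
#include <stdio.h>

typedef struct ToyLimit
{
	int			offset;
	int			count;
} ToyLimit;

/* Write the node as text; returns the snprintf result. */
static int
toy_out(const ToyLimit *n, char *buf, size_t len)
{
	return snprintf(buf, len, "{LIMIT :offset %d :count %d}",
					n->offset, n->count);
}

/* Parse the text form back into a node; returns 1 on success. */
static int
toy_read(const char *buf, ToyLimit *n)
{
	return sscanf(buf, "{LIMIT :offset %d :count %d}",
				  &n->offset, &n->count) == 2;
}

/* Serialize, re-read, and compare: 1 if the copy matches the original. */
static int
toy_round_trip_ok(const ToyLimit *n)
{
	char		buf[64];
	ToyLimit	copy;
	int			len;

	assert(n != NULL);
	len = toy_out(n, buf, sizeof(buf));
	if (len < 0 || (size_t) len >= sizeof(buf))
		return 0;
	if (!toy_read(buf, &copy))
		return 0;
	return copy.offset == n->offset && copy.count == n->count;
}
```

Running the regression suite on the re-read plan, as described above, is the same check applied to every node type that the tests happen to produce.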

Note that, as we don't populate funcids in expressions during read, they
have to be updated by traversing the tree and updating the different
expressions based on node type. Attached patch (read_funcs_test_v1)
contains the changes required for testing the patch. I am not very sure
what to do about some of the ForeignScan fields (fdw_private) in order
to update the funcid, as the data in those expressions could be FDW specific.
This is anyway for test, so it doesn't matter much, but the same will be
required to support read of the ForeignScan node by a worker.

By interface contract, it is the role of the FDW driver to put only nodes that
are safe for copyObject in the fdw_exprs and fdw_private fields. As long as the
FDW driver does not violate this, fdw_exprs and fdw_private should be
reproducible on the worker side. (Of course, we cannot guarantee that nobody
keeps a local pointer in the private field...)
Sorry, I cannot understand the sentence about funcid population. It seems to me
that funcid is written out as-is, and _readFuncExpr() does nothing special...?

Robert Haas said:

I think it would be worth doing something like this:

#define READ_ATTRNUMBER_ARRAY(fldname, len) \
	token = pg_strtok(&length); \
	local_node->fldname = readAttrNumberCols(len);

And similarly for READ_OID_ARRAY, READ_BOOL_ARRAY, READ_INT_ARRAY.

Like this?
https://github.com/kaigai/sepgsql/blob/readfuncs/src/backend/nodes/readfuncs.c#L133

I think outfuncs.c could also use a similar macro to centralize the array
format. Actually, most boolean arrays are written using booltostr(); only
_outMergeJoin() uses the "%d" format to write booleans as integers.
That is a bit inconsistent.
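A small sketch of why the output format matters once a reader exists. It assumes the reader decides truth by testing whether the token starts with 't'; the two macros below are local stand-ins written for this example, and the round_trip_* helpers are hypothetical. A boolean written as text survives the round trip, while one written with "%d" comes back false either way.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Local stand-ins for the writer and reader conventions (assumptions). */
#define booltostr(x)  ((x) ? "true" : "false")
#define strtobool(x)  ((*(x) == 't') ? true : false)

/* Write as "true"/"false", then read back. */
static bool
round_trip_as_text(bool v)
{
	return strtobool(booltostr(v));
}

/* Write as "1"/"0" via "%d", then read back with the same reader. */
static bool
round_trip_as_int(bool v)
{
	char		buf[4];
	int			n = snprintf(buf, sizeof(buf), "%d", (int) v);

	assert(n > 0 && (size_t) n < sizeof(buf));
	return strtobool(buf);
}
```

Under that assumption, any array written with "%d" needs either a matching integer reader or a switch to the text form on the output side.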

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


#342Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#340)
Re: Parallel Seq Scan

On Wed, Sep 23, 2015 at 5:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Sep 22, 2015 at 3:21 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

readAttrNumberCols uses sizeof(Oid) instead of sizeof(AttrNumber).

I still don't understand why we need to handle PlanInvalItem.

Strictly speaking it is not required; I added it just to maintain consistency
with other similar nodes like PlanRowMark and NestLoopParam.

Actually, come to think of it, I'm not sure we need PlannedStmt
either.

PlannedStmt is needed because we are passing the same from master
to worker for execution and the reason was that Executor interfaces
expect it.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#343Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#342)
Re: Parallel Seq Scan

On Tue, Sep 22, 2015 at 9:18 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

PlannedStmt is needed because we are passing the same from master
to worker for execution and the reason was that Executor interfaces
expect it.

I thought we were passing the Plan and then the worker was constructing
a PlannedStmt around it. If we're passing the PlannedStmt then I
guess we need PlanInvalItem too, since there is a list of those
hanging off of the PlannedStmt.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#344Robert Haas
robertmhaas@gmail.com
In reply to: Kouhei Kaigai (#341)
Re: Parallel Seq Scan

On Tue, Sep 22, 2015 at 9:12 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

Oh... I did exactly duplicated job a few days before.
https://github.com/kaigai/sepgsql/blob/readfuncs/src/backend/nodes/readfuncs.c

Please post the patch here, and clarify that it is under the PostgreSQL license.

Regarding of CustomScan node, I'd like to run on worker process as soon as
possible once it gets supported. I'm highly motivated.

Great.

Andres raised a related topic a few weeks before:
/messages/by-id/20150825181933.GA19326@awork2.anarazel.de

Here are two issues:

* How to reproduce "methods" pointer on another process. Extension may not be
loaded via shared_preload_libraries.

The parallel mode stuff already has code to make sure that the same
libraries that were loaded in the original backend get loaded in the
new one. But that's not going to make the same pointer valid there.

-> One solution is to provide a pair of library and symbol name of the method
table, instead of the pointer. I think it is a reasonable idea.

I agree.
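
A minimal sketch of the name-based idea, under stated assumptions: the struct, the registry, and every identifier below are hypothetical, and a static table stands in for the step where a real implementation would presumably load the named library and resolve the symbol. The point is only that a name can be serialized and resolved again in each process, whereas a raw pointer cannot.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical method table; the real CustomScanMethods has other members. */
typedef struct SketchMethods
{
    const char *name;
    int         (*exec) (int);
} SketchMethods;

static int
double_it(int x)
{
    return 2 * x;
}

static const SketchMethods sketch_double_methods = {"sketch_double", double_it};

/* Per-process registry: each process fills this in when its library loads. */
static const SketchMethods *registry[] = {&sketch_double_methods};

/* Resolve a serialized name to this process's own method table pointer. */
static const SketchMethods *
lookup_methods(const char *name)
{
    size_t      i;

    for (i = 0; i < sizeof(registry) / sizeof(registry[0]); i++)
    {
        if (strcmp(registry[i]->name, name) == 0)
            return registry[i];
    }
    return NULL;                /* unknown name: caller must report an error */
}
```

The writer would emit only the name; the reader calls lookup_methods() and gets a pointer that is valid in its own address space.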

* How to treat additional output of TextOutCustomScan.
-> Here are two solutions. (1) Mark TextOutCustomScan as an obsolete callback,
however, it still makes Andres concern because we need to form/deform private
data for copyObject safe. (2) Add TextReadCustomScan (and NodeEqualCustomScan?)
callback to process private fields.

I don't see how making it obsolete solves anything. Any node that
wants to run in a worker needs to have outfuncs and readfuncs support.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#345Kouhei Kaigai
kaigai@ak.jp.nec.com
In reply to: Robert Haas (#344)
Re: Parallel Seq Scan

On Tue, Sep 22, 2015 at 9:12 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

Oh... I did exactly duplicated job a few days before.
https://github.com/kaigai/sepgsql/blob/readfuncs/src/backend/nodes/readfuncs.c

Please post the patch here, and clarify that it is under the PostgreSQL license.

Of course. I intend to submit it.

Regarding of CustomScan node, I'd like to run on worker process as soon as
possible once it gets supported. I'm highly motivated.

Great.

Andres raised a related topic a few weeks before:
/messages/by-id/20150825181933.GA19326@awork2.anarazel.de

Here are two issues:

* How to reproduce "methods" pointer on another process. Extension may not be
loaded via shared_preload_libraries.

The parallel mode stuff already has code to make sure that the same
libraries that were loaded in the original backend get loaded in the
new one. But that's not going to make the same pointer valid there.

-> One solution is to provide a pair of library and symbol name of the method
table, instead of the pointer. I think it is a reasonable idea.

I agree.

* How to treat additional output of TextOutCustomScan.
-> Here are two solutions. (1) Mark TextOutCustomScan as an obsolete callback,
however, it still makes Andres concern because we need to form/deform private
data for copyObject safe. (2) Add TextReadCustomScan (and NodeEqualCustomScan?)
callback to process private fields.

I don't see how making it obsolete solves anything. Any node that
wants to run in a worker needs to have outfuncs and readfuncs support.

Actually, I'm inclined toward (2) rather than (1).
For (2), we will need two new callbacks, for _copyCustomScan and
_readCustomScan. I'll try to put something together.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


#346Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#340)
2 attachment(s)
Re: Parallel Seq Scan

On Wed, Sep 23, 2015 at 5:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Sep 22, 2015 at 3:21 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Attached patch (read_funcs_v1.patch) contains support for all the plan
and other nodes (like SubPlan which could be required for worker) except
CustomScan node.

It looks like you need to update the top-of-file comment for outfuncs.c.

Updated.

Doesn't _readCommonPlan() leak?

I did it that way to keep the code simple, on the assumption that this would
not be read in a long-lived memory context; however, your suggestion
below makes sense and avoids the leak.

I think we should avoid that.
_readCommonScan() and _readJoin() are worse: they leak multiple
objects. It should be simple enough to avoid this: just have your
helper function take a Plan * as argument and then use
READ_TEMP_LOCALS() rather than READ_LOCALS(). Then the caller can use
READ_LOCALS, call the helper to fill in all the Plan fields, and then
read the other stuff itself.

Changed as per suggestion.

Instead of passing the Plan down by casting, how about passing
&local_node->plan? And similarly for scans and joins.

Changed as per suggestion.
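
The pattern discussed above can be sketched roughly as follows. This is a standalone illustration with hypothetical types and names, not the patch's code: the helper fills the embedded base struct in place and allocates nothing, and the caller allocates the concrete node exactly once and passes the address of its embedded member instead of casting the whole node.

```c
#include <stdlib.h>

typedef struct SketchPlan
{
    double      total_cost;
    int         plan_width;
} SketchPlan;

typedef struct SketchSort
{
    SketchPlan  plan;           /* embedded base, first member */
    int         numCols;
} SketchSort;

static int  sketch_allocs = 0;  /* counts allocations, to show there is one */

/* Helper: fills the common fields in place; allocates nothing. */
static void
sketch_read_common_plan(SketchPlan *plan, double cost, int width)
{
    plan->total_cost = cost;
    plan->plan_width = width;
}

/* Caller: allocates the concrete node once and passes &node->plan down. */
static SketchSort *
sketch_read_sort(double cost, int width, int numCols)
{
    SketchSort *node = calloc(1, sizeof(SketchSort));

    if (node == NULL)
        return NULL;
    sketch_allocs++;
    sketch_read_common_plan(&node->plan, cost, width);
    node->numCols = numCols;
    return node;
}
```

Because &node->plan already has the base type, no cast is needed at the call site, and nothing is allocated and then abandoned.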

readAttrNumberCols uses sizeof(Oid) instead of sizeof(AttrNumber).

Fixed.
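
For illustration only, a small sketch of what the sizeof mix-up costs. It assumes AttrNumber is a 16-bit integer and Oid is an unsigned int; the typedef names here are stand-ins. Sizing the array by the wider type over-allocates rather than overflows, so it is wasteful and misleading rather than a crash.

```c
#include <stddef.h>
#include <stdint.h>

typedef int16_t SketchAttrNumber;   /* stand-in for AttrNumber (assumed 16-bit) */
typedef unsigned int SketchOid;     /* stand-in for Oid (assumed unsigned int) */

/* Bytes requested when the element size is taken from the wrong type. */
static size_t
bytes_with_oid_size(int numCols)
{
    return (size_t) numCols * sizeof(SketchOid);
}

/* Bytes requested when the element size matches the array's element type. */
static size_t
bytes_with_attrnumber_size(int numCols)
{
    return (size_t) numCols * sizeof(SketchAttrNumber);
}
```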

I still don't understand why we need to handle PlanInvalItem.
Actually, come to think of it, I'm not sure we need PlannedStmt
either. Let's leave those out; they seem like trouble.

As discussed below this is required and I haven't changed it.

I think it would be worth doing something like this:

#define READ_ATTRNUMBER_ARRAY(fldname, len) \
token = pg_strtok(&length); \
local_node->fldname = readAttrNumberCols(len);

And similarly for READ_OID_ARRAY, READ_BOOL_ARRAY, READ_INT_ARRAY.

Changed as per suggestion.
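
As a rough sketch of the shape these array macros take: skip the field label, then read a fixed number of value tokens. The tokenizer below is a simplified, hypothetical stand-in for pg_strtok() that only splits on spaces, and malloc() stands in for palloc().

```c
#include <stdlib.h>
#include <string.h>

static const char *sketch_ptr;  /* current read position, tokenizer state */

/* Return the next space-delimited token and its length, or NULL at end. */
static const char *
sketch_strtok(int *length)
{
    const char *start;

    while (*sketch_ptr == ' ')
        sketch_ptr++;
    if (*sketch_ptr == '\0')
        return NULL;
    start = sketch_ptr;
    while (*sketch_ptr != '\0' && *sketch_ptr != ' ')
        sketch_ptr++;
    *length = (int) (sketch_ptr - start);
    return start;
}

/* Read numCols integers; the ":fldname" label must already be consumed. */
static int *
sketch_read_int_cols(int numCols)
{
    int         len;
    int         i;
    int        *vals;

    if (numCols <= 0)
        return NULL;
    vals = malloc(numCols * sizeof(int));
    if (vals == NULL)
        return NULL;
    for (i = 0; i < numCols; i++)
    {
        const char *tok = sketch_strtok(&len);

        /* atoi stops at the space that ends the token */
        vals[i] = (tok != NULL) ? atoi(tok) : 0;
    }
    return vals;
}

/* Mirrors the macro shape: skip the label token, then read the values. */
static int *
sketch_read_labeled_int_array(const char *input, int numCols)
{
    int         len;

    sketch_ptr = input;
    (void) sketch_strtok(&len); /* skip :fldname */
    return sketch_read_int_cols(numCols);
}
```

For example, reading ":sortColIdx 1 2 3" with numCols = 3 yields the array {1, 2, 3}; the element count has to be read before the array, which is why numCols precedes the arrays in the node readers.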

In general these routines are in the same order as plannodes.h, which
is good. But _readNestLoopParam is out of place. Can we move it just
after _readNestLoop?

I have kept them in the order they appear in nodes.h (that way it seems easier
to keep track of whether anything is missed); however, if you want, I can
reorder them as per your suggestion.

Note - The test is changed to verify just the output of readfuncs, with changes
in the planner. I have removed the elog-related changes in readfuncs, as during
the last test I verified that make check-world covers all types of nodes that
are added by the patch.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

read_funcs_v2.patch (application/octet-stream)
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index e1b49d57..db64caa 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -14,8 +14,8 @@
  *	  Every node type that can appear in stored rules' parsetrees *must*
  *	  have an output function defined here (as well as an input function
  *	  in readfuncs.c).  For use in debugging, we also provide output
- *	  functions for nodes that appear in raw parsetrees, path, and plan trees.
- *	  These nodes however need not have input functions.
+ *	  functions for nodes that appear in raw parsetrees and path.  These
+ *	  nodes however need not have input functions.
  *
  *-------------------------------------------------------------------------
  */
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index df55b76..29f3a5b 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -11,8 +11,8 @@
  *	  src/backend/nodes/readfuncs.c
  *
  * NOTES
- *	  Path and Plan nodes do not have any readfuncs support, because we
- *	  never have occasion to read them in.  (There was once code here that
+ *	  Path nodes do not have any readfuncs support, because we never
+ *	  have occasion to read them in.  (There was once code here that
  *	  claimed to read them, but it was broken as well as unused.)  We
  *	  never read executor state trees, either.
  *
@@ -29,6 +29,7 @@
 #include <math.h>
 
 #include "nodes/parsenodes.h"
+#include "nodes/plannodes.h"
 #include "nodes/readfuncs.h"
 
 
@@ -67,6 +68,12 @@
 	token = pg_strtok(&length);		/* get field value */ \
 	local_node->fldname = atoui(token)
 
+/* Read a long integer field (anything written as ":fldname %ld") */
+#define READ_LONG_FIELD(fldname) \
+	token = pg_strtok(&length);		/* skip :fldname */ \
+	token = pg_strtok(&length);		/* get field value */ \
+	local_node->fldname = atol(token)
+
 /* Read an OID field (don't hard-wire assumption that OID is same as uint) */
 #define READ_OID_FIELD(fldname) \
 	token = pg_strtok(&length);		/* skip :fldname */ \
@@ -122,6 +129,26 @@
 	(void) token;				/* in case not used elsewhere */ \
 	local_node->fldname = _readBitmapset()
 
+/* Read an attribute number array */
+#define READ_ATTRNUMBER_ARRAY(fldname, len) \
+	token = pg_strtok(&length);		/* skip :fldname */ \
+	local_node->fldname = readAttrNumberCols(len);
+
+/* Read an oid array */
+#define READ_OID_ARRAY(fldname, len) \
+	token = pg_strtok(&length);		/* skip :fldname */ \
+	local_node->fldname = readOidCols(len);
+
+/* Read an int array */
+#define READ_INT_ARRAY(fldname, len) \
+	token = pg_strtok(&length);		/* skip :fldname */ \
+	local_node->fldname = readIntCols(len);
+
+/* Read a bool array */
+#define READ_BOOL_ARRAY(fldname, len) \
+	token = pg_strtok(&length);		/* skip :fldname */ \
+	local_node->fldname = readBoolCols(len);
+
 /* Routine exit */
 #define READ_DONE() \
 	return local_node
@@ -144,6 +171,10 @@
 
 
 static Datum readDatum(bool typbyval);
+static bool *readBoolCols(int numCols);
+static int *readIntCols(int numCols);
+static Oid *readOidCols(int numCols);
+static AttrNumber *readAttrNumberCols(int numCols);
 
 /*
  * _readBitmapset
@@ -1367,6 +1398,809 @@ _readTableSampleClause(void)
 	READ_DONE();
 }
 
+/*
+ * _readDefElem
+ */
+static DefElem *
+_readDefElem(void)
+{
+	READ_LOCALS(DefElem);
+
+	READ_STRING_FIELD(defnamespace);
+	READ_STRING_FIELD(defname);
+	READ_NODE_FIELD(arg);
+	READ_ENUM_FIELD(defaction, DefElemAction);
+
+	READ_DONE();
+}
+
+/*
+ * _readPlannedStmt
+ */
+static PlannedStmt *
+_readPlannedStmt(void)
+{
+	READ_LOCALS(PlannedStmt);
+
+	READ_ENUM_FIELD(commandType, CmdType);
+	READ_UINT_FIELD(queryId);
+	READ_BOOL_FIELD(hasReturning);
+	READ_BOOL_FIELD(hasModifyingCTE);
+	READ_BOOL_FIELD(canSetTag);
+	READ_BOOL_FIELD(transientPlan);
+	READ_NODE_FIELD(planTree);
+	READ_NODE_FIELD(rtable);
+	READ_NODE_FIELD(resultRelations);
+	READ_NODE_FIELD(utilityStmt);
+	READ_NODE_FIELD(subplans);
+	READ_BITMAPSET_FIELD(rewindPlanIDs);
+	READ_NODE_FIELD(rowMarks);
+	READ_NODE_FIELD(relationOids);
+	READ_NODE_FIELD(invalItems);
+	READ_INT_FIELD(nParamExec);
+	READ_BOOL_FIELD(hasRowSecurity);
+	READ_BOOL_FIELD(parallelModeNeeded);
+
+	READ_DONE();
+}
+
+/*
+ * ReadCommonPlan
+ *	Assign the basic stuff of all nodes that inherit from Plan
+ */
+static void
+ReadCommonPlan(Plan *local_node)
+{
+	READ_TEMP_LOCALS();
+
+	READ_FLOAT_FIELD(startup_cost);
+	READ_FLOAT_FIELD(total_cost);
+	READ_FLOAT_FIELD(plan_rows);
+	READ_INT_FIELD(plan_width);
+	READ_NODE_FIELD(targetlist);
+	READ_NODE_FIELD(qual);
+	READ_NODE_FIELD(lefttree);
+	READ_NODE_FIELD(righttree);
+	READ_NODE_FIELD(initPlan);
+	READ_BITMAPSET_FIELD(extParam);
+	READ_BITMAPSET_FIELD(allParam);
+}
+
+/*
+ * _readPlan
+ */
+static Plan *
+_readPlan(void)
+{
+	READ_LOCALS_NO_FIELDS(Plan);
+
+	ReadCommonPlan(local_node);
+
+	READ_DONE();
+}
+
+/*
+ * _readResult
+ */
+static Result *
+_readResult(void)
+{
+	READ_LOCALS(Result);
+
+	ReadCommonPlan((Plan *) &local_node->plan);
+
+	READ_NODE_FIELD(resconstantqual);
+
+	READ_DONE();
+}
+
+/*
+ * _readModifyTable
+ */
+static ModifyTable *
+_readModifyTable(void)
+{
+	READ_LOCALS(ModifyTable);
+
+	ReadCommonPlan((Plan *) &local_node->plan);
+
+	READ_ENUM_FIELD(operation, CmdType);
+	READ_BOOL_FIELD(canSetTag);
+	READ_UINT_FIELD(nominalRelation);
+	READ_NODE_FIELD(resultRelations);
+	READ_INT_FIELD(resultRelIndex);
+	READ_NODE_FIELD(plans);
+	READ_NODE_FIELD(withCheckOptionLists);
+	READ_NODE_FIELD(returningLists);
+	READ_NODE_FIELD(fdwPrivLists);
+	READ_NODE_FIELD(rowMarks);
+	READ_INT_FIELD(epqParam);
+	READ_ENUM_FIELD(onConflictAction, OnConflictAction);
+	READ_NODE_FIELD(arbiterIndexes);
+	READ_NODE_FIELD(onConflictSet);
+	READ_NODE_FIELD(onConflictWhere);
+	READ_UINT_FIELD(exclRelRTI);
+	READ_NODE_FIELD(exclRelTlist);
+
+	READ_DONE();
+}
+
+/*
+ * _readAppend
+ */
+static Append *
+_readAppend(void)
+{
+	READ_LOCALS(Append);
+
+	ReadCommonPlan((Plan *) &local_node->plan);
+
+	READ_NODE_FIELD(appendplans);
+
+	READ_DONE();
+}
+
+/*
+ * _readMergeAppend
+ */
+static MergeAppend *
+_readMergeAppend(void)
+{
+	READ_LOCALS(MergeAppend);
+
+	ReadCommonPlan((Plan *) &local_node->plan);
+
+	READ_NODE_FIELD(mergeplans);
+	READ_INT_FIELD(numCols);
+	READ_ATTRNUMBER_ARRAY(sortColIdx, local_node->numCols);
+	READ_OID_ARRAY(sortOperators, local_node->numCols);
+	READ_OID_ARRAY(collations, local_node->numCols);
+	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+
+	READ_DONE();
+}
+
+/*
+ * _readRecursiveUnion
+ */
+static RecursiveUnion *
+_readRecursiveUnion(void)
+{
+	READ_LOCALS(RecursiveUnion);
+
+	ReadCommonPlan((Plan *) &local_node->plan);
+
+	READ_INT_FIELD(wtParam);
+	READ_INT_FIELD(numCols);
+	READ_ATTRNUMBER_ARRAY(dupColIdx, local_node->numCols);
+	READ_OID_ARRAY(dupOperators, local_node->numCols);
+	READ_LONG_FIELD(numGroups);
+
+	READ_DONE();
+}
+
+/*
+ * _readBitmapAnd
+ */
+static BitmapAnd *
+_readBitmapAnd(void)
+{
+	READ_LOCALS(BitmapAnd);
+
+	ReadCommonPlan((Plan *) &local_node->plan);
+
+	READ_NODE_FIELD(bitmapplans);
+
+	READ_DONE();
+}
+
+/*
+ * _readBitmapOr
+ */
+static BitmapOr *
+_readBitmapOr(void)
+{
+	READ_LOCALS(BitmapOr);
+
+	ReadCommonPlan((Plan *) &local_node->plan);
+
+	READ_NODE_FIELD(bitmapplans);
+
+	READ_DONE();
+}
+
+/*
+ * ReadCommonScan
+ *	Assign the basic stuff of all nodes that inherit from Scan
+ */
+static void
+ReadCommonScan(Scan *local_node)
+{
+	READ_TEMP_LOCALS();
+
+	ReadCommonPlan((Plan *) &local_node->plan);
+
+	READ_UINT_FIELD(scanrelid);
+}
+
+/*
+ * _readScan
+ */
+static Scan *
+_readScan(void)
+{
+	READ_LOCALS_NO_FIELDS(Scan);
+
+	ReadCommonScan(local_node);
+
+	READ_DONE();
+}
+
+/*
+ * _readSeqScan
+ */
+static SeqScan *
+_readSeqScan(void)
+{
+	READ_LOCALS_NO_FIELDS(SeqScan);
+
+	ReadCommonScan((Scan *) local_node);
+
+	READ_DONE();
+}
+
+/*
+ * _readSampleScan
+ */
+static SampleScan *
+_readSampleScan(void)
+{
+	READ_LOCALS(SampleScan);
+
+	ReadCommonScan((Scan *) &local_node->scan);
+
+	READ_NODE_FIELD(tablesample);
+
+	READ_DONE();
+}
+
+/*
+ * _readIndexScan
+ */
+static IndexScan *
+_readIndexScan(void)
+{
+	READ_LOCALS(IndexScan);
+
+	ReadCommonScan((Scan *) &local_node->scan);
+
+	READ_OID_FIELD(indexid);
+	READ_NODE_FIELD(indexqual);
+	READ_NODE_FIELD(indexqualorig);
+	READ_NODE_FIELD(indexorderby);
+	READ_NODE_FIELD(indexorderbyorig);
+	READ_NODE_FIELD(indexorderbyops);
+	READ_ENUM_FIELD(indexorderdir, ScanDirection);
+
+	READ_DONE();
+}
+
+/*
+ * _readIndexOnlyScan
+ */
+static IndexOnlyScan *
+_readIndexOnlyScan(void)
+{
+	READ_LOCALS(IndexOnlyScan);
+
+	ReadCommonScan((Scan *) &local_node->scan);
+
+	READ_OID_FIELD(indexid);
+	READ_NODE_FIELD(indexqual);
+	READ_NODE_FIELD(indexorderby);
+	READ_NODE_FIELD(indextlist);
+	READ_ENUM_FIELD(indexorderdir, ScanDirection);
+
+	READ_DONE();
+}
+
+/*
+ * _readBitmapIndexScan
+ */
+static BitmapIndexScan *
+_readBitmapIndexScan(void)
+{
+	READ_LOCALS(BitmapIndexScan);
+
+	ReadCommonScan((Scan *) &local_node->scan);
+
+	READ_OID_FIELD(indexid);
+	READ_NODE_FIELD(indexqual);
+	READ_NODE_FIELD(indexqualorig);
+
+	READ_DONE();
+}
+
+/*
+ * _readBitmapHeapScan
+ */
+static BitmapHeapScan *
+_readBitmapHeapScan(void)
+{
+	READ_LOCALS(BitmapHeapScan);
+
+	ReadCommonScan((Scan *) &local_node->scan);
+
+	READ_NODE_FIELD(bitmapqualorig);
+
+	READ_DONE();
+}
+
+/*
+ * _readTidScan
+ */
+static TidScan *
+_readTidScan(void)
+{
+	READ_LOCALS(TidScan);
+
+	ReadCommonScan((Scan *) &local_node->scan);
+
+	READ_NODE_FIELD(tidquals);
+
+	READ_DONE();
+}
+
+/*
+ * _readSubqueryScan
+ */
+static SubqueryScan *
+_readSubqueryScan(void)
+{
+	READ_LOCALS(SubqueryScan);
+
+	ReadCommonScan((Scan *) &local_node->scan);
+
+	READ_NODE_FIELD(subplan);
+
+	READ_DONE();
+}
+
+/*
+ * _readFunctionScan
+ */
+static FunctionScan *
+_readFunctionScan(void)
+{
+	READ_LOCALS(FunctionScan);
+
+	ReadCommonScan((Scan *) &local_node->scan);
+
+	READ_NODE_FIELD(functions);
+	READ_BOOL_FIELD(funcordinality);
+
+	READ_DONE();
+}
+
+/*
+ * _readValuesScan
+ */
+static ValuesScan *
+_readValuesScan(void)
+{
+	READ_LOCALS(ValuesScan);
+
+	ReadCommonScan((Scan *) &local_node->scan);
+
+	READ_NODE_FIELD(values_lists);
+
+	READ_DONE();
+}
+
+/*
+ * _readCteScan
+ */
+static CteScan *
+_readCteScan(void)
+{
+	READ_LOCALS(CteScan);
+
+	ReadCommonScan((Scan *) &local_node->scan);
+
+	READ_INT_FIELD(ctePlanId);
+	READ_INT_FIELD(cteParam);
+
+	READ_DONE();
+}
+
+/*
+ * _readWorkTableScan
+ */
+static WorkTableScan *
+_readWorkTableScan(void)
+{
+	READ_LOCALS(WorkTableScan);
+
+	ReadCommonScan((Scan *) &local_node->scan);
+
+	READ_INT_FIELD(wtParam);
+
+	READ_DONE();
+}
+
+/*
+ * _readForeignScan
+ */
+static ForeignScan *
+_readForeignScan(void)
+{
+	READ_LOCALS(ForeignScan);
+
+	ReadCommonScan((Scan *) &local_node->scan);
+
+	READ_OID_FIELD(fs_server);
+	READ_NODE_FIELD(fdw_exprs);
+	READ_NODE_FIELD(fdw_private);
+	READ_NODE_FIELD(fdw_scan_tlist);
+	READ_BITMAPSET_FIELD(fs_relids);
+	READ_BOOL_FIELD(fsSystemCol);
+
+	READ_DONE();
+}
+
+/*
+ * ReadCommonJoin
+ *	Assign the basic stuff of all nodes that inherit from Join
+ */
+static void
+ReadCommonJoin(Join *local_node)
+{
+	READ_TEMP_LOCALS();
+
+	ReadCommonPlan((Plan *) &local_node->plan);
+
+	READ_ENUM_FIELD(jointype, JoinType);
+	READ_NODE_FIELD(joinqual);
+}
+
+/*
+ * _readJoin
+ */
+static Join *
+_readJoin(void)
+{
+	READ_LOCALS_NO_FIELDS(Join);
+
+	ReadCommonJoin(local_node);
+
+	READ_DONE();
+}
+
+/*
+ * _readNestLoop
+ */
+static NestLoop *
+_readNestLoop(void)
+{
+	READ_LOCALS(NestLoop);
+
+	ReadCommonJoin((Join *) &local_node->join);
+
+	READ_NODE_FIELD(nestParams);
+
+	READ_DONE();
+}
+
+/*
+ * _readMergeJoin
+ */
+static MergeJoin *
+_readMergeJoin(void)
+{
+	int			numCols;
+
+	READ_LOCALS(MergeJoin);
+
+	ReadCommonJoin((Join *) &local_node->join);
+
+	READ_NODE_FIELD(mergeclauses);
+
+	numCols = list_length(local_node->mergeclauses);
+
+	READ_OID_ARRAY(mergeFamilies, numCols);
+	READ_OID_ARRAY(mergeCollations, numCols);
+	READ_INT_ARRAY(mergeStrategies, numCols);
+	READ_BOOL_ARRAY(mergeNullsFirst, numCols);
+
+	READ_DONE();
+}
+
+/*
+ * _readHashJoin
+ */
+static HashJoin *
+_readHashJoin(void)
+{
+	READ_LOCALS(HashJoin);
+
+	ReadCommonJoin((Join *) &local_node->join);
+
+	READ_NODE_FIELD(hashclauses);
+
+	READ_DONE();
+}
+
+/*
+ * _readMaterial
+ */
+static Material *
+_readMaterial(void)
+{
+	READ_LOCALS_NO_FIELDS(Material);
+
+	ReadCommonPlan((Plan *) &local_node->plan);
+
+	READ_DONE();
+}
+
+/*
+ * _readSort
+ */
+static Sort *
+_readSort(void)
+{
+	READ_LOCALS(Sort);
+
+	ReadCommonPlan((Plan *) &local_node->plan);
+
+	READ_INT_FIELD(numCols);
+	READ_ATTRNUMBER_ARRAY(sortColIdx, local_node->numCols);
+	READ_OID_ARRAY(sortOperators, local_node->numCols);
+	READ_OID_ARRAY(collations, local_node->numCols);
+	READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+
+	READ_DONE();
+}
+
+/*
+ * _readGroup
+ */
+static Group *
+_readGroup(void)
+{
+	READ_LOCALS(Group);
+
+	ReadCommonPlan((Plan *) &local_node->plan);
+
+	READ_INT_FIELD(numCols);
+	READ_ATTRNUMBER_ARRAY(grpColIdx, local_node->numCols);
+	READ_OID_ARRAY(grpOperators, local_node->numCols);
+
+	READ_DONE();
+}
+
+/*
+ * _readAgg
+ */
+static Agg *
+_readAgg(void)
+{
+	READ_LOCALS(Agg);
+
+	ReadCommonPlan((Plan *) &local_node->plan);
+
+	READ_ENUM_FIELD(aggstrategy, AggStrategy);
+	READ_INT_FIELD(numCols);
+	READ_ATTRNUMBER_ARRAY(grpColIdx, local_node->numCols);
+	READ_OID_ARRAY(grpOperators, local_node->numCols);
+	READ_LONG_FIELD(numGroups);
+	READ_NODE_FIELD(groupingSets);
+	READ_NODE_FIELD(chain);
+
+	READ_DONE();
+}
+
+/*
+ * _readWindowAgg
+ */
+static WindowAgg *
+_readWindowAgg(void)
+{
+	READ_LOCALS(WindowAgg);
+
+	ReadCommonPlan((Plan *) &local_node->plan);
+
+	READ_UINT_FIELD(winref);
+	READ_INT_FIELD(partNumCols);
+	READ_ATTRNUMBER_ARRAY(partColIdx, local_node->partNumCols);
+	READ_OID_ARRAY(partOperators, local_node->partNumCols);
+	READ_INT_FIELD(ordNumCols);
+	READ_ATTRNUMBER_ARRAY(ordColIdx, local_node->ordNumCols);
+	READ_OID_ARRAY(ordOperators, local_node->ordNumCols);
+	READ_INT_FIELD(frameOptions);
+	READ_NODE_FIELD(startOffset);
+	READ_NODE_FIELD(endOffset);
+
+	READ_DONE();
+}
+
+/*
+ * _readUnique
+ */
+static Unique *
+_readUnique(void)
+{
+	READ_LOCALS(Unique);
+
+	ReadCommonPlan((Plan *) &local_node->plan);
+
+	READ_INT_FIELD(numCols);
+	READ_ATTRNUMBER_ARRAY(uniqColIdx, local_node->numCols);
+	READ_OID_ARRAY(uniqOperators, local_node->numCols);
+
+	READ_DONE();
+}
+
+/*
+ * _readHash
+ */
+static Hash *
+_readHash(void)
+{
+	READ_LOCALS(Hash);
+
+	ReadCommonPlan((Plan *) &local_node->plan);
+
+	READ_OID_FIELD(skewTable);
+	READ_INT_FIELD(skewColumn);
+	READ_BOOL_FIELD(skewInherit);
+	READ_OID_FIELD(skewColType);
+	READ_INT_FIELD(skewColTypmod);
+
+	READ_DONE();
+}
+
+/*
+ * _readSetOp
+ */
+static SetOp *
+_readSetOp(void)
+{
+	READ_LOCALS(SetOp);
+
+	ReadCommonPlan((Plan *) &local_node->plan);
+
+	READ_ENUM_FIELD(cmd, SetOpCmd);
+	READ_ENUM_FIELD(strategy, SetOpStrategy);
+	READ_INT_FIELD(numCols);
+	READ_ATTRNUMBER_ARRAY(dupColIdx, local_node->numCols);
+	READ_OID_ARRAY(dupOperators, local_node->numCols);
+	READ_INT_FIELD(flagColIdx);
+	READ_INT_FIELD(firstFlag);
+	READ_LONG_FIELD(numGroups);
+
+	READ_DONE();
+}
+
+/*
+ * _readLockRows
+ */
+static LockRows *
+_readLockRows(void)
+{
+	READ_LOCALS(LockRows);
+
+	ReadCommonPlan((Plan *) &local_node->plan);
+
+	READ_NODE_FIELD(rowMarks);
+	READ_INT_FIELD(epqParam);
+
+	READ_DONE();
+}
+
+/*
+ * _readLimit
+ */
+static Limit *
+_readLimit(void)
+{
+	READ_LOCALS(Limit);
+
+	ReadCommonPlan((Plan *) &local_node->plan);
+
+	READ_NODE_FIELD(limitOffset);
+	READ_NODE_FIELD(limitCount);
+
+	READ_DONE();
+}
+
+/*
+ * _readNestLoopParam
+ */
+static NestLoopParam *
+_readNestLoopParam(void)
+{
+	READ_LOCALS(NestLoopParam);
+
+	READ_INT_FIELD(paramno);
+	READ_NODE_FIELD(paramval);
+
+	READ_DONE();
+}
+
+/*
+ * _readPlanRowMark
+ */
+static PlanRowMark *
+_readPlanRowMark(void)
+{
+	READ_LOCALS(PlanRowMark);
+
+	READ_UINT_FIELD(rti);
+	READ_UINT_FIELD(prti);
+	READ_UINT_FIELD(rowmarkId);
+	READ_ENUM_FIELD(markType, RowMarkType);
+	READ_INT_FIELD(allMarkTypes);
+	READ_ENUM_FIELD(strength, LockClauseStrength);
+	READ_ENUM_FIELD(waitPolicy, LockWaitPolicy);
+	READ_BOOL_FIELD(isParent);
+
+	READ_DONE();
+}
+
+/*
+ * _readPlanInvalItem
+ */
+static PlanInvalItem *
+_readPlanInvalItem(void)
+{
+	READ_LOCALS(PlanInvalItem);
+
+	READ_INT_FIELD(cacheId);
+	READ_UINT_FIELD(hashValue);
+
+	READ_DONE();
+}
+
+/*
+ * _readSubPlan
+ */
+static SubPlan *
+_readSubPlan(void)
+{
+	READ_LOCALS(SubPlan);
+
+	READ_ENUM_FIELD(subLinkType, SubLinkType);
+	READ_NODE_FIELD(testexpr);
+	READ_NODE_FIELD(paramIds);
+	READ_INT_FIELD(plan_id);
+	READ_STRING_FIELD(plan_name);
+	READ_OID_FIELD(firstColType);
+	READ_INT_FIELD(firstColTypmod);
+	READ_OID_FIELD(firstColCollation);
+	READ_BOOL_FIELD(useHashTable);
+	READ_BOOL_FIELD(unknownEqFalse);
+	READ_NODE_FIELD(setParam);
+	READ_NODE_FIELD(parParam);
+	READ_NODE_FIELD(args);
+	READ_FLOAT_FIELD(startup_cost);
+	READ_FLOAT_FIELD(per_call_cost);
+
+	READ_DONE();
+}
+
+/*
+ * _readAlternativeSubPlan
+ */
+static AlternativeSubPlan *
+_readAlternativeSubPlan(void)
+{
+	READ_LOCALS(AlternativeSubPlan);
+
+	READ_NODE_FIELD(subplans);
+
+	READ_DONE();
+}
 
 /*
  * parseNodeString
@@ -1504,8 +2338,94 @@ parseNodeString(void)
 		return_value = _readTableSampleClause();
 	else if (MATCH("NOTIFY", 6))
 		return_value = _readNotifyStmt();
+	else if (MATCH("DEFELEM", 7))
+		return_value = _readDefElem();
 	else if (MATCH("DECLARECURSOR", 13))
 		return_value = _readDeclareCursorStmt();
+	else if (MATCH("PLANNEDSTMT", 11))
+		return_value = _readPlannedStmt();
+	else if (MATCH("PLAN", 4))
+		return_value = _readPlan();
+	else if (MATCH("RESULT", 6))
+		return_value = _readResult();
+	else if (MATCH("MODIFYTABLE", 11))
+		return_value = _readModifyTable();
+	else if (MATCH("APPEND", 6))
+		return_value = _readAppend();
+	else if (MATCH("MERGEAPPEND", 11))
+		return_value = _readMergeAppend();
+	else if (MATCH("RECURSIVEUNION", 14))
+		return_value = _readRecursiveUnion();
+	else if (MATCH("BITMAPAND", 9))
+		return_value = _readBitmapAnd();
+	else if (MATCH("BITMAPOR", 8))
+		return_value = _readBitmapOr();
+	else if (MATCH("SCAN", 4))
+		return_value = _readScan();
+	else if (MATCH("SEQSCAN", 7))
+		return_value = _readSeqScan();
+	else if (MATCH("SAMPLESCAN", 10))
+		return_value = _readSampleScan();
+	else if (MATCH("INDEXSCAN", 9))
+		return_value = _readIndexScan();
+	else if (MATCH("INDEXONLYSCAN", 13))
+		return_value = _readIndexOnlyScan();
+	else if (MATCH("BITMAPINDEXSCAN", 15))
+		return_value = _readBitmapIndexScan();
+	else if (MATCH("BITMAPHEAPSCAN", 14))
+		return_value = _readBitmapHeapScan();
+	else if (MATCH("TIDSCAN", 7))
+		return_value = _readTidScan();
+	else if (MATCH("SUBQUERYSCAN", 12))
+		return_value = _readSubqueryScan();
+	else if (MATCH("FUNCTIONSCAN", 12))
+		return_value = _readFunctionScan();
+	else if (MATCH("VALUESSCAN", 10))
+		return_value = _readValuesScan();
+	else if (MATCH("CTESCAN", 7))
+		return_value = _readCteScan();
+	else if (MATCH("WORKTABLESCAN", 13))
+		return_value = _readWorkTableScan();
+	else if (MATCH("FOREIGNSCAN", 11))
+		return_value = _readForeignScan();
+	else if (MATCH("JOIN", 4))
+		return_value = _readJoin();
+	else if (MATCH("NESTLOOP", 8))
+		return_value = _readNestLoop();
+	else if (MATCH("MERGEJOIN", 9))
+		return_value = _readMergeJoin();
+	else if (MATCH("HASHJOIN", 8))
+		return_value = _readHashJoin();
+	else if (MATCH("MATERIAL", 8))
+		return_value = _readMaterial();
+	else if (MATCH("SORT", 4))
+		return_value = _readSort();
+	else if (MATCH("GROUP", 5))
+		return_value = _readGroup();
+	else if (MATCH("AGG", 3))
+		return_value = _readAgg();
+	else if (MATCH("WINDOWAGG", 9))
+		return_value = _readWindowAgg();
+	else if (MATCH("UNIQUE", 6))
+		return_value = _readUnique();
+	else if (MATCH("HASH", 4))
+		return_value = _readHash();
+	else if (MATCH("SETOP", 5))
+		return_value = _readSetOp();
+	else if (MATCH("LOCKROWS", 8))
+		return_value = _readLockRows();
+	else if (MATCH("LIMIT", 5))
+		return_value = _readLimit();
+	else if (MATCH("NESTLOOPPARAM", 13))
+		return_value = _readNestLoopParam();
+	else if (MATCH("PLANROWMARK", 11))
+		return_value = _readPlanRowMark();
+	else if (MATCH("PLANINVALITEM", 13))
+		return_value = _readPlanInvalItem();
+	else if (MATCH("SUBPLAN", 7))
+		return_value = _readSubPlan();
+	else if (MATCH("ALTERNATIVESUBPLAN", 18))
+		return_value = _readAlternativeSubPlan();
 	else
 	{
 		elog(ERROR, "badly formatted node string \"%.32s\"...", token);
@@ -1576,3 +2496,99 @@ readDatum(bool typbyval)
 
 	return res;
 }
+
+/*
+ * readAttrNumberCols
+ */
+static AttrNumber *
+readAttrNumberCols(int numCols)
+{
+	int			tokenLength,
+				i;
+	char	   *token;
+	AttrNumber *attr_vals;
+
+	if (numCols <= 0)
+		return NULL;
+
+	attr_vals = (AttrNumber *) palloc(numCols * sizeof(AttrNumber));
+	for (i = 0; i < numCols; i++)
+	{
+		token = pg_strtok(&tokenLength);
+		attr_vals[i] = atoi(token);
+	}
+
+	return attr_vals;
+}
+
+/*
+ * readOidCols
+ */
+static Oid *
+readOidCols(int numCols)
+{
+	int			tokenLength,
+				i;
+	char	   *token;
+	Oid		   *oid_vals;
+
+	if (numCols <= 0)
+		return NULL;
+
+	oid_vals = (Oid *) palloc(numCols * sizeof(Oid));
+	for (i = 0; i < numCols; i++)
+	{
+		token = pg_strtok(&tokenLength);
+		oid_vals[i] = atooid(token);
+	}
+
+	return oid_vals;
+}
+
+/*
+ * readIntCols
+ */
+static int *
+readIntCols(int numCols)
+{
+	int			tokenLength,
+				i;
+	char	   *token;
+	int		   *int_vals;
+
+	if (numCols <= 0)
+		return NULL;
+
+	int_vals = (int *) palloc(numCols * sizeof(int));
+	for (i = 0; i < numCols; i++)
+	{
+		token = pg_strtok(&tokenLength);
+		int_vals[i] = atoi(token);
+	}
+
+	return int_vals;
+}
+
+/*
+ * readBoolCols
+ */
+static bool *
+readBoolCols(int numCols)
+{
+	int			tokenLength,
+				i;
+	char	   *token;
+	bool	   *bool_vals;
+
+	if (numCols <= 0)
+		return NULL;
+
+	bool_vals = (bool *) palloc(numCols * sizeof(bool));
+	for (i = 0; i < numCols; i++)
+	{
+		token = pg_strtok(&tokenLength);
+		bool_vals[i] = strtobool(token);
+	}
+
+	return bool_vals;
+}
read_funcs_test_v2.patch (application/octet-stream)
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 06be922..635f47b 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -135,6 +135,171 @@ static Plan *build_grouping_chain(PlannerInfo *root,
 					 long numGroups,
 					 Plan *result_plan);
 
+/*
+ * fix_node_funcids
+ *		Set the opfuncid (procedure OID) in an OpExpr node,
+ *		for plan tree.
+ *
+ * We need it mainly to fix the opfuncid in nodes of plantree
+ * after reading the planned statement by worker backend.
+ */
+static void
+fix_node_funcids(Plan *node)
+{
+	ListCell   *temp;
+
+	/*
+	 * do nothing when we get to the end of a leaf on tree.
+	 */
+	if (node == NULL)
+		return;
+
+	fix_opfuncids((Node*) node->qual);
+	fix_opfuncids((Node*) node->targetlist);
+
+	switch (nodeTag(node))
+	{
+		case T_Result:
+			fix_opfuncids((Node*) (((Result *) node)->resconstantqual));
+			break;
+		case T_ModifyTable:
+			foreach(temp, (List *) ((ModifyTable *) node)->plans)
+				fix_node_funcids((Plan *) lfirst(temp));
+
+			fix_opfuncids((Node*) (((ModifyTable *) node)->withCheckOptionLists));
+			fix_opfuncids((Node*) (((ModifyTable *) node)->returningLists));
+			/*
+			 * we should fix funcids for fdwPrivLists, but it is not
+			 * clear what kind of expressions it can contain.
+			 */
+			fix_opfuncids((Node*) (((ModifyTable *) node)->onConflictSet));
+			fix_opfuncids((Node*) (((ModifyTable *) node)->onConflictWhere));
+			fix_opfuncids((Node*) (((ModifyTable *) node)->exclRelTlist));
+			break;
+		case T_Append:
+			foreach(temp, (List *) ((Append *) node)->appendplans)
+				fix_node_funcids((Plan *) lfirst(temp));
+			break;
+		case T_MergeAppend:
+			foreach(temp, (List *) ((MergeAppend *) node)->mergeplans)
+				fix_node_funcids((Plan *) lfirst(temp));
+			break;
+		case T_RecursiveUnion:
+			break;
+		case T_BitmapAnd:
+			foreach(temp, (List *) ((BitmapAnd *) node)->bitmapplans)
+				fix_node_funcids((Plan *) lfirst(temp));
+			break;
+		case T_BitmapOr:
+			foreach(temp, (List *) ((BitmapOr *) node)->bitmapplans)
+				fix_node_funcids((Plan *) lfirst(temp));
+			break;
+		case T_Scan:
+			break;
+		case T_SeqScan:
+			break;
+		case T_SampleScan:
+			fix_opfuncids((Node*) (((SampleScan *) node)->tablesample));
+			break;
+		case T_IndexScan:
+			fix_opfuncids((Node*) (((IndexScan *) node)->indexqual));
+			fix_opfuncids((Node*) (((IndexScan *) node)->indexqualorig));
+			fix_opfuncids((Node*) (((IndexScan *) node)->indexorderby));
+			fix_opfuncids((Node*) (((IndexScan *) node)->indexorderbyorig));
+			break;
+		case T_IndexOnlyScan:
+			fix_opfuncids((Node*) (((IndexOnlyScan *) node)->indexqual));
+			fix_opfuncids((Node*) (((IndexOnlyScan *) node)->indexorderby));
+			fix_opfuncids((Node*) (((IndexOnlyScan *) node)->indextlist));
+			break;
+		case T_BitmapIndexScan:
+			fix_opfuncids((Node*) (((BitmapIndexScan *) node)->indexqual));
+			fix_opfuncids((Node*) (((BitmapIndexScan *) node)->indexqualorig));
+			break;
+		case T_BitmapHeapScan:
+			fix_opfuncids((Node*) (((BitmapHeapScan *) node)->bitmapqualorig));
+			break;
+		case T_TidScan:
+			fix_opfuncids((Node*) (((TidScan *) node)->tidquals));
+			break;
+		case T_SubqueryScan:
+			fix_node_funcids((Plan *) ((SubqueryScan *) node)->subplan);
+			break;
+		case T_FunctionScan:
+			fix_opfuncids((Node*) (((FunctionScan *) node)->functions));
+			break;
+		case T_ValuesScan:
+			fix_opfuncids((Node*) (((ValuesScan *) node)->values_lists));
+			break;
+		case T_CteScan:
+			break;
+		case T_WorkTableScan:
+			break;
+		case T_ForeignScan:
+			fix_opfuncids((Node*) (((ForeignScan *) node)->fdw_exprs));
+			/*
+			 * we should fix funcids for fdw_private, but it is not
+			 * clear what kind of expressions it can contain.
+			 */
+			fix_opfuncids((Node*) (((ForeignScan *) node)->fdw_scan_tlist));
+			break;
+		case T_Join:
+			fix_opfuncids((Node*) (((Join *) node)->joinqual));
+			break;
+		case T_NestLoop:
+			fix_opfuncids((Node*) (((NestLoop *) node)->join.joinqual));
+			foreach(temp, (List *) ((NestLoop*) node)->nestParams)
+				fix_opfuncids((Node*) ((NestLoopParam *) lfirst(temp))->paramval);
+			break;
+		case T_MergeJoin:
+			fix_opfuncids((Node*) (((MergeJoin *) node)->join.joinqual));
+			fix_opfuncids((Node*) (((MergeJoin *) node)->mergeclauses));
+			break;
+		case T_HashJoin:
+			fix_opfuncids((Node*) (((HashJoin *) node)->join.joinqual));
+			fix_opfuncids((Node*) (((HashJoin *) node)->hashclauses));
+			break;
+		case T_Material:
+			break;
+		case T_Sort:
+			break;
+		case T_Group:
+			break;
+		case T_Agg:
+			foreach(temp, (List *) ((Agg *) node)->chain)
+				fix_node_funcids((Plan *) lfirst(temp));
+			break;
+		case T_WindowAgg:
+			fix_opfuncids((Node*) (((WindowAgg *) node)->startOffset));
+			fix_opfuncids((Node*) (((WindowAgg *) node)->endOffset));
+			break;
+		case T_Unique:
+			break;
+		case T_Hash:
+			break;
+		case T_SetOp:
+			break;
+		case T_LockRows:
+			break;
+		case T_Limit:
+			fix_opfuncids((Node*) (((Limit *) node)->limitOffset));
+			fix_opfuncids((Node*) (((Limit *) node)->limitCount));
+			break;
+		case T_SubPlan:
+			fix_opfuncids((Node*) ((SubPlan *) node));
+			break;
+		case T_AlternativeSubPlan:
+			fix_opfuncids((Node*) ((AlternativeSubPlan *) node));
+			break;
+		default:
+			elog(ERROR, "unrecognized node type: %d", (int) nodeTag(node));
+			break;
+	}
+
+	fix_node_funcids(node->lefttree);
+	fix_node_funcids(node->righttree);
+}
+
 /*****************************************************************************
  *
  *	   Query optimizer entry point
@@ -152,12 +317,25 @@ PlannedStmt *
 planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
 {
 	PlannedStmt *result;
+	PlannedStmt *verify_plannedstmt;
+	char		*plannedstmt_str;
+	ListCell   *temp;
 
 	if (planner_hook)
 		result = (*planner_hook) (parse, cursorOptions, boundParams);
 	else
 		result = standard_planner(parse, cursorOptions, boundParams);
-	return result;
+
+	plannedstmt_str = nodeToString(result);
+
+	verify_plannedstmt = (PlannedStmt *) stringToNode(plannedstmt_str);
+
+	fix_node_funcids(verify_plannedstmt->planTree);
+
+	foreach(temp, (List *) verify_plannedstmt->subplans)
+		fix_node_funcids((Plan *) lfirst(temp));
+
+	return verify_plannedstmt;
 }
 
 PlannedStmt *
#347Amit Kapila
amit.kapila16@gmail.com
In reply to: Kouhei Kaigai (#341)
Re: Parallel Seq Scan

On Wed, Sep 23, 2015 at 6:42 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

On Thu, Sep 17, 2015 at 4:44 PM, Robert Haas <robertmhaas@gmail.com> wrote:

To verify the patch, I have done two things: first, I added elog calls to
the newly supported read functions; then, in the planner, I used
nodeToString and stringToNode on the planned_stmt and used the
newly generated planned_stmt for further execution. After making these
changes, I ran make check-world and ensured that it covers all the
newly added nodes.

Note that, as we don't populate funcids in expressions during read, they
have to be filled in by traversing the tree and updating the different
expressions based on node type. The attached patch (read_funcs_test_v1)
contains the changes required for testing the patch. I am not very sure
what to do about some of the ForeignScan fields (fdw_private) in order
to update the funcid, as the data in those expressions could be
FDW-specific.

This is only for testing anyway, so it doesn't matter much, but the same
will be required to support reading of a ForeignScan node by a worker.

Because of the interface contract, it is the role of the FDW driver to put
only nodes that are safe for copyObject in the fdw_exprs and fdw_private
fields. As long as the FDW driver does not violate this, fdw_exprs and
fdw_private will be reproduced on the worker side. (Of course, we cannot
guarantee that nobody keeps a local pointer in the private field...)
Sorry, I cannot understand the sentence about funcid population. It seems
to me funcid is displayed as-is, and _readFuncExpr() does nothing
special...?

Here I am referring to the operator's funcid (see function _readOpExpr()).
It is hard-coded to InvalidOid during read. Currently fix_opfuncids() is
used to fill in the funcid for expressions; I am not sure whether it can be
used for fdw_private data (I tried it, but it failed; maybe it is not
required to fix any funcid there at all) or for other fields in ForeignScan.
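
To make the opfuncid point concrete, here is a minimal standalone sketch. It
is not PostgreSQL code: ToyOpExpr, toy_catalog and the toy_* functions are
hypothetical stand-ins, and the OID values are illustrative only. It models a
reader that restores the operator OID but leaves the function OID invalid,
followed by a fix-up pass that fills it in from a lookup.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int Oid;
#define InvalidOid ((Oid) 0)

/* Toy stand-in for OpExpr: only the two fields relevant here. */
typedef struct ToyOpExpr
{
	Oid			opno;		/* operator OID, preserved by serialization */
	Oid			opfuncid;	/* implementing function, not restored on read */
} ToyOpExpr;

/* Hypothetical operator -> function mapping (stand-in for a catalog). */
typedef struct ToyOperator
{
	Oid			opno;
	Oid			oprcode;
} ToyOperator;

static const ToyOperator toy_catalog[] = {
	{96, 65},				/* illustrative values only */
	{521, 147}
};

/* Look up the function behind an operator; InvalidOid if unknown. */
Oid
toy_get_opcode(Oid opno)
{
	size_t		i;

	for (i = 0; i < sizeof(toy_catalog) / sizeof(toy_catalog[0]); i++)
		if (toy_catalog[i].opno == opno)
			return toy_catalog[i].oprcode;
	return InvalidOid;
}

/* Mimic a reader that restores opno but leaves opfuncid unset. */
ToyOpExpr
toy_read_opexpr(Oid opno)
{
	ToyOpExpr	expr;

	expr.opno = opno;
	expr.opfuncid = InvalidOid;
	return expr;
}

/* Fill in opfuncid only when it is missing, as a fix-up pass would. */
void
toy_fix_opfuncid(ToyOpExpr *expr)
{
	if (expr->opfuncid == InvalidOid)
		expr->opfuncid = toy_get_opcode(expr->opno);
}
```

The fix-up is idempotent: a node whose opfuncid is already set is left alone,
so running the pass over a tree that was never serialized is harmless.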

I have observed that in the out functions we output the fdw_private field,
which indicates that ideally we should be able to read it.

Robert Haas said:

I think it would be worth doing something like this:

#define READ_ATTRNUMBER_ARRAY(fldname, len) \
token = pg_strtok(&length); \
local_node->fldname = readAttrNumberCols(len);

And similarly for READ_OID_ARRAY, READ_BOOL_ARRAY, READ_INT_ARRAY.
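
As a rough illustration of the macro idea, the following standalone sketch
reads a fixed-length integer array from a token stream through one macro.
Everything here is an assumption made for the example: TokenCursor,
toy_read_int_cols, TOY_READ_INT_ARRAY and ToySortNode are hypothetical names,
strtol stands in for the real tokenizer, and malloc stands in for palloc.

```c
#include <assert.h>
#include <stdlib.h>

/* Cursor over a whitespace-separated token stream (toy tokenizer state). */
typedef struct TokenCursor
{
	const char *pos;
} TokenCursor;

/* Read len integers from the cursor into a freshly malloc'd array. */
int *
toy_read_int_cols(TokenCursor *cur, int len)
{
	int		   *vals;
	int			i;

	if (len <= 0)
		return NULL;
	vals = malloc(sizeof(int) * (size_t) len);
	if (vals == NULL)
		return NULL;
	for (i = 0; i < len; i++)
	{
		char	   *end;

		/* strtol skips leading whitespace and reports where it stopped */
		vals[i] = (int) strtol(cur->pos, &end, 10);
		cur->pos = end;
	}
	return vals;
}

/* One macro per element type keeps every reader call site uniform. */
#define TOY_READ_INT_ARRAY(node, fldname, len, cur) \
	((node)->fldname = toy_read_int_cols((cur), (len)))

/* Toy node with a counted array field. */
typedef struct ToySortNode
{
	int			numCols;
	int		   *sortColIdx;
} ToySortNode;
```

With a macro per element type, the format of an array is decided in one
place, which is the same reason given for centralizing it on the output side.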

Like this?

https://github.com/kaigai/sepgsql/blob/readfuncs/src/backend/nodes/readfuncs.c#L133

I think outfuncs.c could also use a similar macro to centralize the format
of arrays.

Actually, most boolean arrays are displayed using booltostr(); only
_outMergeJoin() uses the "%d" format to display booleans as integers.
That is a bit inconsistent.

Yes, I noticed the same and thought of sending a patch, which I did just
before sending this mail.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#348Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#346)
Re: Parallel Seq Scan

On Wed, Sep 23, 2015 at 3:22 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Instead of passing the Plan down by casting, how about passing
&local_node->plan? And similarly for scans and joins.

Changed as per suggestion.

The point of this change was to make it so that we wouldn't need the
casts any more. You changed it so we didn't, but then didn't actually
get rid of them. I did that, tweaked a comment, and committed this.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#349Robert Haas
robertmhaas@gmail.com
In reply to: Haribabu Kommi (#338)
Re: Parallel Seq Scan

On Tue, Sep 22, 2015 at 3:14 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:

copy_user_generic_string system call is because of file read operations.
In my test, I gave the shared_buffers as 12GB with the table size of 18GB.

OK, cool. So that's actually good: all that work would have to be
done either way, and parallelism lets several CPUs work on it at once.

The _spin_lock calls are from the signals that are generated by the workers.
With the increase of tuple queue size, there is a change in kernel system
calls usage.

And this part is not so good: that's additional work created by
parallelism that wouldn't have to be done if we weren't in parallel
mode. Of course, it's impossible to eliminate that, but we should try
to reduce it.

- From the above performance readings, increasing the tuple queue size
benefits a smaller number of workers more than a larger number of workers.

That makes sense to me, because there's a separate queue for each
worker. If we have more workers, then the total amount of queue space
available rises in proportion to the number of workers available.

Workers are started irrespective of the system load. Suppose the user
configures 16 workers but, because of a sudden increase in the system
load, only 2 or 3 CPUs are idle. In this case, if any query eligible for
parallel seq scan is executed, the backend may start 16 workers, which can
increase overall system usage and may decrease the performance of the
other backend sessions?

Yep, that could happen. It's something we should work on, but the
first version isn't going to try to be that smart. It's similar to
the problem we already have with work_mem, and I want to work on it,
but we need to get this working first.

If the query has two parallel seq scan plan nodes, how will the workers
be distributed across the two nodes? Currently parallel_seqscan_degree is
used per plan node; even if we change that to per query, I think we need
worker distribution logic, instead of letting a single plan node use all
the workers.

Yes, we need that, too. Again, at some point.

A select with a limit clause has a performance drawback with parallel seq
scan in some scenarios, because of its very low selectivity compared to
seq scan; it would be better if we document it. Users can then take the
necessary actions for queries with a limit clause.

This is something I want to think further about in the near future.
We don't have a great plan for shutting down workers when no further
tuples are needed because, for example, an upper node has filled a
limit. That makes using parallel query in contexts like Limit and
InitPlan significantly more costly than you might expect. Perhaps we
should avoid parallel plans altogether in those contexts, or maybe
there is some other approach that can work. I haven't figured it out
yet.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#350Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#309)
Re: Parallel Seq Scan

On Thu, Sep 3, 2015 at 6:21 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

[ new patches ]

More review comments:

ExecParallelEstimate() and ExecParallelInitializeDSM() should use
planstate_tree_walker to iterate over the whole planstate tree.
That's because the parallel node need not be at the top of the tree.
By using the walker, you'll find plan state nodes that need work of
the relevant type even if they are deeply buried. The logic should be
something like:

if (node == NULL) return false;
switch (nodeTag(node)) { ... /* parallel aware node enumerated here */ }
return planstate_tree_walker(node, ExecParallelEstimate, context);

The function signature should be changed to bool
ExecParallelEstimate(PlanState *planstate, parallel_estimate_ctx
*context) where parallel_estimate_ctx is a structure containing
ParallelContext *context and Size *psize. The comment about handling
only a few node types should go away, because by using the
planstate_tree_walker we can iterate over anything.
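
The callback shape described above can be sketched outside PostgreSQL as
follows. This is a toy model under stated assumptions: ToyPlanState,
ToyEstimateContext and the toy_* functions are hypothetical, and the toy
walker only visits lefttree and righttree, whereas the real
planstate_tree_walker covers other kinds of children as well.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum ToyTag
{
	TOY_PLAIN,
	TOY_PARALLEL_AWARE
} ToyTag;

typedef struct ToyPlanState
{
	ToyTag		tag;
	size_t		dsm_bytes;	/* space this node needs, if parallel aware */
	struct ToyPlanState *lefttree;
	struct ToyPlanState *righttree;
} ToyPlanState;

typedef struct ToyEstimateContext
{
	size_t		total;		/* running total of estimated space */
} ToyEstimateContext;

typedef bool (*toy_walker_fn) (ToyPlanState *node, void *context);

/* Visit the children of node; stop early if the callback returns true. */
bool
toy_planstate_tree_walker(ToyPlanState *node, toy_walker_fn walker,
						  void *context)
{
	if (node->lefttree && walker(node->lefttree, context))
		return true;
	if (node->righttree && walker(node->righttree, context))
		return true;
	return false;
}

/* Callback: handle this node, then let the walker descend. */
bool
toy_parallel_estimate(ToyPlanState *node, void *context)
{
	ToyEstimateContext *ctx = context;

	if (node == NULL)
		return false;

	switch (node->tag)
	{
		case TOY_PARALLEL_AWARE:
			ctx->total += node->dsm_bytes;
			break;
		default:
			break;
	}

	return toy_planstate_tree_walker(node, toy_parallel_estimate, context);
}
```

Because the callback never recurses on its own, a parallel-aware node is
found wherever it sits in the tree, and the walker alone decides which
children exist.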

It looks to me like there would be trouble if an initPlan or subPlan
were kept below a Funnel, or as I guess we're going to call it, a
Gather node. That's because a SubPlan doesn't actually have a pointer
to the node tree for the sub-plan in it. It just has an index into
PlannedStmt.subplans. But create_parallel_worker_plannedstmt sets the
subplans list to NIL. So that's not gonna work. Now maybe there's no
way for an initPlan or a subPlan to creep down under the funnel, but I
don't immediately see what would prevent it.

+ * There should be atleast thousand pages to scan for each worker.

"at least a thousand"

+cost_patialseqscan(Path *path, PlannerInfo *root,

patial->partial.

I also don't see where you are checking that a partial seq scan has
nothing parallel-restricted in its quals.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#351Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#350)
1 attachment(s)
Re: Parallel Seq Scan

On Thu, Sep 3, 2015 at 6:21 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

[ new patches ]

Still more review comments:

+                               /* Allow space for terminating zero-byte */
+                               size = add_size(size, 1);

This is pointless. The length is already stored separately, and if it
weren't, this wouldn't be adequate anyway because a varlena can
contain NUL bytes. It won't if it's text, but it could be bytea or
numeric or whatever.
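
A small standalone sketch of why a terminator cannot delimit such a value.
ToyBytes is a hypothetical length-prefixed type, only loosely modelled on a
varlena; the point is that the stored length is the sole reliable size once
the payload may contain zero bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy length-prefixed value, loosely modelled on a varlena. */
typedef struct ToyBytes
{
	size_t		len;		/* number of payload bytes */
	const unsigned char *data;
} ToyBytes;

/* Space needed to copy the payload: the stored length, nothing extra. */
size_t
toy_payload_size(const ToyBytes *v)
{
	return v->len;
}

/* What a terminator-based scan would report for the same payload. */
size_t
toy_terminator_scan(const ToyBytes *v)
{
	const unsigned char *nul = memchr(v->data, 0, v->len);

	return nul ? (size_t) (nul - v->data) : v->len;
}
```

For a payload such as the four bytes 01 00 02 03, the terminator scan stops
after the first byte while the stored length correctly reports four.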

RestoreBoundParams is broken, because it can do unaligned reads, which
will core dump on some architectures (and merely be slow on others).
If there are two or more parameters, and the first one is a varlena
with a length that is not a multiple of MAXIMUM_ALIGNOF, the second
SerializedParamExternData will be misaligned.
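
Two common ways to avoid the misaligned read are to pad each entry to an
alignment boundary or to copy fields out with memcpy instead of
dereferencing a cast pointer. The sketch below shows both in isolation; the
toy_* helpers are hypothetical and are not taken from the patch under
review.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Round len up to the next multiple of align (align must be a power of 2). */
size_t
toy_align_up(size_t len, size_t align)
{
	return (len + align - 1) & ~(align - 1);
}

/* Read a 32-bit value from any byte offset without an unaligned load. */
uint32_t
toy_read_u32(const char *buf, size_t offset)
{
	uint32_t	val;

	memcpy(&val, buf + offset, sizeof(val));
	return val;
}

/* Write a 32-bit value at any byte offset, again via memcpy. */
void
toy_write_u32(char *buf, size_t offset, uint32_t val)
{
	memcpy(buf + offset, &val, sizeof(val));
}
```

Padding costs a few bytes per entry but lets the reader use the data in
place; memcpy keeps the format dense at the price of a copy per field.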

Also, it's pretty lame that we send the useless pointer even for a
pass-by-reference data type and then overwrite the bad pointer with a
good one a few lines later. It would be better to design the
serialization format so that we don't send the bogus pointer over the
wire in the first place.

It's also problematic in my view that there is so much duplicated code
here. SerializedParamExternData and SerializedParamExecData are very
similar and there are large swaths of very similar code to handle each
case. Both structures contain Datum value, Size length, bool isnull,
and Oid ptype, albeit not in the same order for some reason. The only
difference is that SerializedParamExternData contains uint16 pflags
and SerializedParamExecData contains int paramid. I think we need to
refactor this code to get rid of all this duplication. I suggest that
we decide to represent a datum here in a uniform fashion, perhaps like
this:

First, store a 4-byte header word. If this is -2, the value is NULL
and no data follows. If it's -1, the value is pass-by-value and
sizeof(Datum) bytes follow. If it's >0, the value is
pass-by-reference and the value gives the number of following bytes
that should be copied into a brand-new palloc'd chunk.

Using a format like this, we can serialize and restore datums in
various contexts - including bind and exec params - without having to
rewrite the code each time. For example, for param extern data, you
can dump an array of all the ptypes and paramids and then follow it
with all of the params one after another. For param exec data, you
can dump an array of all the ptypes and paramids and then follow it
with the values one after another. The code that reads and writes the
datums in both cases can be the same. If we need to send datums in
other contexts, we can use the same code for it.

The attached patch - which I even tested! - shows what I have in mind.
It can save and restore the ParamListInfo (bind parameters). I
haven't tried to adapt it to the exec parameters because I don't quite
understand what you are doing there yet, but you can see that the
datum-serialization logic is separated from the stuff that knows about
the details of ParamListInfo, so datumSerialize() should be reusable
for other purposes. This also doesn't have the other problems
mentioned above.

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

datum-and-param-serialize.patchapplication/x-patch; name=datum-and-param-serialize.patchDownload
diff --git a/src/backend/nodes/params.c b/src/backend/nodes/params.c
index fb803f8..d093263 100644
--- a/src/backend/nodes/params.c
+++ b/src/backend/nodes/params.c
@@ -16,6 +16,7 @@
 #include "postgres.h"
 
 #include "nodes/params.h"
+#include "storage/shmem.h"
 #include "utils/datum.h"
 #include "utils/lsyscache.h"
 
@@ -73,3 +74,157 @@ copyParamList(ParamListInfo from)
 
 	return retval;
 }
+
+/*
+ * Estimate the amount of space required to serialize a ParamListInfo.
+ */
+Size
+EstimateParamListSpace(ParamListInfo paramLI)
+{
+	int		i;
+	Size	sz = sizeof(int);
+
+	if (paramLI == NULL || paramLI->numParams <= 0)
+		return sz;
+
+	for (i = 0; i < paramLI->numParams; i++)
+	{
+		ParamExternData *prm = &paramLI->params[i];
+		int16		typLen;
+		bool		typByVal;
+
+		/* give hook a chance in case parameter is dynamic */
+		if (!OidIsValid(prm->ptype) && paramLI->paramFetch != NULL)
+			(*paramLI->paramFetch) (paramLI, i + 1);
+
+		sz = add_size(sz, sizeof(Oid));			/* space for type OID */
+		sz = add_size(sz, sizeof(uint16));		/* space for pflags */
+
+		/* space for datum/isnull */
+		if (OidIsValid(prm->ptype))
+			get_typlenbyval(prm->ptype, &typLen, &typByVal);
+		else
+		{
+			/* If no type OID, assume by-value, like copyParamList does. */
+			typLen = sizeof(Datum);
+			typByVal = true;
+		}
+		sz = add_size(sz,
+			datumEstimateSpace(prm->value, prm->isnull, typByVal, typLen));
+	}
+
+	return sz;
+}
+
+/*
+ * Serialize a paramListInfo structure into caller-provided storage.
+ *
+ * We write the number of parameters first, as a 4-byte integer, and then
+ * write details for each parameter in turn.  The details for each parameter
+ * consist of a 4-byte type OID, 2 bytes of flags, and then the datum as
+ * serialized by datumSerialize().  The caller is responsible for ensuring
+ * that there is enough storage to store the number of bytes that will be
+ * written; use EstimateParamListSpace to find out how many will be needed.
+ * *start_address is updated to point to the byte immediately following those
+ * written.
+ *
+ * RestoreParamList can be used to recreate a ParamListInfo based on the
+ * serialized representation; this will be a static, self-contained copy
+ * just as copyParamList would create.
+ */
+void
+SerializeParamList(ParamListInfo paramLI, char **start_address)
+{
+	int			nparams;
+	int			i;
+
+	/* Write number of parameters. */
+	if (paramLI == NULL || paramLI->numParams <= 0)
+		nparams = 0;
+	else
+		nparams = paramLI->numParams;
+	memcpy(*start_address, &nparams, sizeof(int));
+	*start_address += sizeof(int);
+
+	/* Write each parameter in turn. */
+	for (i = 0; i < nparams; i++)
+	{
+		ParamExternData *prm = &paramLI->params[i];
+		int16		typLen;
+		bool		typByVal;
+
+		/* give hook a chance in case parameter is dynamic */
+		if (!OidIsValid(prm->ptype) && paramLI->paramFetch != NULL)
+			(*paramLI->paramFetch) (paramLI, i + 1);
+
+		/* Write type OID. */
+		memcpy(*start_address, &prm->ptype, sizeof(Oid));
+		*start_address += sizeof(Oid);
+
+		/* Write flags. */
+		memcpy(*start_address, &prm->pflags, sizeof(uint16));
+		*start_address += sizeof(uint16);
+
+		/* Write datum/isnull. */
+		if (OidIsValid(prm->ptype))
+			get_typlenbyval(prm->ptype, &typLen, &typByVal);
+		else
+		{
+			/* If no type OID, assume by-value, like copyParamList does. */
+			typLen = sizeof(Datum);
+			typByVal = true;
+		}
+		datumSerialize(prm->value, prm->isnull, typByVal, typLen,
+					   start_address);
+	}
+}
+
+/*
+ * Copy a ParamListInfo structure.
+ *
+ * The result is allocated in CurrentMemoryContext.
+ *
+ * Note: the intent of this function is to make a static, self-contained
+ * set of parameter values.  If dynamic parameter hooks are present, we
+ * intentionally do not copy them into the result.  Rather, we forcibly
+ * instantiate all available parameter values and copy the datum values.
+ */
+ParamListInfo
+RestoreParamList(char **start_address)
+{
+	ParamListInfo paramLI;
+	Size		size;
+	int			i;
+	int			nparams;
+
+	memcpy(&nparams, *start_address, sizeof(int));
+	*start_address += sizeof(int);
+
+	size = offsetof(ParamListInfoData, params) +
+		nparams * sizeof(ParamExternData);
+
+	paramLI = (ParamListInfo) palloc(size);
+	paramLI->paramFetch = NULL;
+	paramLI->paramFetchArg = NULL;
+	paramLI->parserSetup = NULL;
+	paramLI->parserSetupArg = NULL;
+	paramLI->numParams = nparams;
+
+	for (i = 0; i < nparams; i++)
+	{
+		ParamExternData *prm = &paramLI->params[i];
+
+		/* Read type OID. */
+		memcpy(&prm->ptype, *start_address, sizeof(Oid));
+		*start_address += sizeof(Oid);
+
+		/* Read flags. */
+		memcpy(&prm->pflags, *start_address, sizeof(uint16));
+		*start_address += sizeof(uint16);
+
+		/* Read datum/isnull. */
+		prm->value = datumRestore(start_address, &prm->isnull);
+	}
+
+	return paramLI;
+}
diff --git a/src/backend/utils/adt/datum.c b/src/backend/utils/adt/datum.c
index e8af030..3d9e354 100644
--- a/src/backend/utils/adt/datum.c
+++ b/src/backend/utils/adt/datum.c
@@ -246,3 +246,121 @@ datumIsEqual(Datum value1, Datum value2, bool typByVal, int typLen)
 	}
 	return res;
 }
+
+/*-------------------------------------------------------------------------
+ * datumEstimateSpace
+ *
+ * Compute the amount of space that datumSerialize will require for a
+ * particular Datum.
+ *-------------------------------------------------------------------------
+ */
+Size
+datumEstimateSpace(Datum value, bool isnull, bool typByVal, int typLen)
+{
+	Size	sz = sizeof(int);
+
+	if (!isnull)
+	{
+		/* no need to use add_size, can't overflow */
+		if (typByVal)
+			sz += sizeof(Datum);
+		else
+			sz += datumGetSize(value, typByVal, typLen);
+	}
+
+	return sz;
+}
+
+/*-------------------------------------------------------------------------
+ * datumSerialize
+ *
+ * Serialize a possibly-NULL datum into caller-provided storage.
+ *
+ * The format is as follows: first, we write a 4-byte header word, which
+ * is either the length of a pass-by-reference datum, -1 for a
+ * pass-by-value datum, or -2 for a NULL.  If the value is NULL, nothing
+ * further is written.  If it is pass-by-value, sizeof(Datum) bytes
+ * follow.  Otherwise, the number of bytes indicated by the header word
+ * follow.  The caller is responsible for ensuring that there is enough
+ * storage to store the number of bytes that will be written; use
+ * datumEstimateSpace() to find out how many will be needed.
+ * *start_address is updated to point to the byte immediately following
+ * those written.
+ *-------------------------------------------------------------------------
+ */
+void
+datumSerialize(Datum value, bool isnull, bool typByVal, int typLen,
+			   char **start_address)
+{
+	int		header;
+
+	/* Write header word. */
+	if (isnull)
+		header = -2;
+	else if (typByVal)
+		header = -1;
+	else
+		header = datumGetSize(value, typByVal, typLen);
+	memcpy(*start_address, &header, sizeof(int));
+	*start_address += sizeof(int);
+
+	/* If not null, write payload bytes. */
+	if (!isnull)
+	{
+		if (typByVal)
+		{
+			memcpy(*start_address, &value, sizeof(Datum));
+			*start_address += sizeof(Datum);
+		}
+		else
+		{
+			memcpy(*start_address, DatumGetPointer(value), header);
+			*start_address += header;
+		}
+	}
+}
+
+/*-------------------------------------------------------------------------
+ * datumRestore
+ *
+ * Restore a possibly-NULL datum previously serialized by datumSerialize.
+ * *start_address is updated according to the number of bytes consumed.
+ *-------------------------------------------------------------------------
+ */
+Datum
+datumRestore(char **start_address, bool *isnull)
+{
+	int		header;
+	void   *d;
+
+	/* Read header word. */
+	memcpy(&header, *start_address, sizeof(int));
+	*start_address += sizeof(int);
+
+	/* If this datum is NULL, we can stop here. */
+	if (header == -2)
+	{
+		*isnull = true;
+		return (Datum) 0;
+	}
+
+	/* OK, datum is not null. */
+	*isnull = false;
+
+	/* If this datum is pass-by-value, sizeof(Datum) bytes follow. */
+	if (header == -1)
+	{
+		Datum		val;
+
+		memcpy(&val, *start_address, sizeof(Datum));
+		*start_address += sizeof(Datum);
+		return val;
+	}
+
+	/* Pass-by-reference case; copy indicated number of bytes. */
+	Assert(header > 0);
+	d = palloc(header);
+	memcpy(d, *start_address, header);
+	*start_address += header;
+	return PointerGetDatum(d);
+}
diff --git a/src/include/nodes/params.h b/src/include/nodes/params.h
index a0f7dd0..83bebde 100644
--- a/src/include/nodes/params.h
+++ b/src/include/nodes/params.h
@@ -102,5 +102,8 @@ typedef struct ParamExecData
 
 /* Functions found in src/backend/nodes/params.c */
 extern ParamListInfo copyParamList(ParamListInfo from);
+extern Size EstimateParamListSpace(ParamListInfo paramLI);
+extern void SerializeParamList(ParamListInfo paramLI, char **start_address);
+extern ParamListInfo RestoreParamList(char **start_address);
 
 #endif   /* PARAMS_H */
diff --git a/src/include/utils/datum.h b/src/include/utils/datum.h
index c572f79..e9d4be5 100644
--- a/src/include/utils/datum.h
+++ b/src/include/utils/datum.h
@@ -46,4 +46,14 @@ extern Datum datumTransfer(Datum value, bool typByVal, int typLen);
 extern bool datumIsEqual(Datum value1, Datum value2,
 			 bool typByVal, int typLen);
 
+/*
+ * Serialize and restore datums so that we can transfer them to parallel
+ * workers.
+ */
+extern Size datumEstimateSpace(Datum value, bool isnull, bool typByVal,
+				   int typLen);
+extern void datumSerialize(Datum value, bool isnull, bool typByVal,
+			   int typLen, char **start_address);
+extern Datum datumRestore(char **start_address, bool *isnull);
+
 #endif   /* DATUM_H */
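
For readers who want to try the header-word format outside the server, here
is a standalone model of it. It is a sketch, not the patch: ToyDatum and the
toy_datum_* functions are hypothetical names, uintptr_t stands in for Datum,
malloc stands in for palloc, and the by-reference length is passed in by the
caller instead of being computed by datumGetSize().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uintptr_t ToyDatum;		/* stand-in for Datum */

/* Bytes needed: header word, plus payload unless the value is NULL. */
size_t
toy_datum_estimate(bool isnull, bool byval, int reflen)
{
	size_t		sz = sizeof(int);

	if (!isnull)
		sz += byval ? sizeof(ToyDatum) : (size_t) reflen;
	return sz;
}

/* Write header (-2 NULL, -1 by-value, else byte count), then payload. */
void
toy_datum_serialize(ToyDatum value, bool isnull, bool byval, int reflen,
					char **addr)
{
	int			header = isnull ? -2 : (byval ? -1 : reflen);

	memcpy(*addr, &header, sizeof(int));
	*addr += sizeof(int);
	if (isnull)
		return;
	if (byval)
	{
		memcpy(*addr, &value, sizeof(ToyDatum));
		*addr += sizeof(ToyDatum);
	}
	else
	{
		memcpy(*addr, (const void *) value, (size_t) header);
		*addr += header;
	}
}

/* Inverse of toy_datum_serialize; by-reference data is malloc'd. */
ToyDatum
toy_datum_restore(char **addr, bool *isnull)
{
	int			header;
	ToyDatum	val = 0;
	void	   *copy;

	memcpy(&header, *addr, sizeof(int));
	*addr += sizeof(int);
	*isnull = (header == -2);
	if (header == -2)
		return 0;
	if (header == -1)
	{
		memcpy(&val, *addr, sizeof(ToyDatum));
		*addr += sizeof(ToyDatum);
		return val;
	}
	assert(header > 0);
	copy = malloc((size_t) header);
	if (copy == NULL)
		abort();
	memcpy(copy, *addr, (size_t) header);
	*addr += header;
	return (ToyDatum) copy;
}
```

A caller would sum toy_datum_estimate() over its values, allocate that many
bytes, serialize each value in turn, and later restore them in the same
order; every access goes through memcpy, so no entry needs to be aligned.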
#352Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#318)
2 attachment(s)
Re: Parallel Seq Scan

On Thu, Sep 17, 2015 at 2:28 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Sep 3, 2015 at 6:21 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I have fixed most of the review comments raised in this mail
as well as in other e-mails and rebased the patch on commit
020235a5. Although I have fixed many things, quite a few
comments are yet to be handled. This patch is mainly a rebased
version to ease the review. We can continue the discussion on
the leftover items and I will address those in subsequent patches.

+ if (es->analyze && nodeTag(plan) == T_Funnel)

Why not IsA()?

Changed as per suggestion.

+ FinishParallelSetupAndAccumStats((FunnelState *)planstate);

Shouldn't there be a space before planstate?

Yes, fixed.

+ /* inform executor to collect buffer usage stats from parallel workers. */

+ estate->total_time = queryDesc->totaltime ? 1 : 0;

Boy, the comment sure doesn't seem to match the code.

This is not required as per latest version of patch, so I have removed the
same.

+         * Accumulate the stats by parallel workers before stopping the
+         * node.

Suggest: "Accumulate stats from parallel workers before stopping node".

+             * If we are not able to send the tuple, then we assume that
+             * destination has closed and we won't be able to send any more
+             * tuples so we just end the loop.

Suggest: "If we are not able to send the tuple, we assume the
destination has closed and no more tuples can be sent. If that's the
case, end the loop."

Changed as per suggestion.

+static void
+EstimateParallelSupportInfoSpace(ParallelContext *pcxt, ParamListInfo params,

+                                 List *serialized_param_exec_vals,
+                                 int instOptions, Size *params_size,
+                                 Size *params_exec_size);
+static void
+StoreParallelSupportInfo(ParallelContext *pcxt, ParamListInfo params,
+                         List *serialized_param_exec_vals,
+                         int instOptions, Size params_size,
+                         Size params_exec_size,
+                         char **inst_options_space,
+                         char **buffer_usage_space);

Whitespace doesn't look like PostgreSQL style. Maybe run pgindent on
the newly-added files?

I have run pgindent on all new files as well as on the changes for the
funnel-related patch (parallel_seqscan_funnel_v18.patch). I have not run
pgindent on the changes in the partial seq scan patch, as I feel there are
still a couple of things, like handling of subplans, that need to be done
for that patch.

+/*
+ * This is required for parallel plan execution to fetch the information
+ * from dsm.
+ */

This comment doesn't really say anything. Can we get a better one?

Changed.

+    /*
+     * We expect each worker to populate the BufferUsage structure
+     * allocated by master backend and then master backend will aggregate
+     * all the usage along with it's own, so account it for each worker.
+     */

This also needs improvement. Especially because...

+    /*
+     * We expect each worker to populate the instrumentation structure
+     * allocated by master backend and then master backend will aggregate
+     * all the information, so account it for each worker.
+     */

...it's almost identical to this one.

Combined both of the comments.

+     * Store bind parameter's list in dynamic shared memory.  This is
+     * used for parameters in prepared query.

s/bind parameter's list/bind parameters/. I think you could drop the
second sentence, too.

+    /*
+     * Store PARAM_EXEC parameters list in dynamic shared memory.  This is
+     * used for evaluation plan->initPlan params.
+     */

So is the previous block for PARAM_EXTERN and this is PARAM_EXEC? If
so, maybe that could be more clearly laid out.

Changed as per suggestion.

+GetParallelSupportInfo(shm_toc *toc, ParamListInfo *params,

Could this be a static function? Will it really be needed outside this file?

And is there any use case for letting some of the arguments be NULL?
Seems kind of an awkward API.

No, so I have changed it.

+bool
+ExecParallelBufferUsageAccum(Node *node)
+{
+    if (node == NULL)
+        return false;
+
+    switch (nodeTag(node))
+    {
+        case T_FunnelState:
+            {
+                FinishParallelSetupAndAccumStats((FunnelState*)node);
+                return true;
+            }
+            break;
+        default:
+            break;
+    }
+
+    (void) planstate_tree_walker((Node*)((PlanState *)node)->lefttree, NULL,
+                                 ExecParallelBufferUsageAccum, 0);
+    (void) planstate_tree_walker((Node*)((PlanState *)node)->righttree, NULL,
+                                 ExecParallelBufferUsageAccum, 0);
+    return false;
+}

This seems wacky. I mean, isn't the point of planstate_tree_walker()
that the callback itself doesn't have to handle recursion like this?
And if not, then this wouldn't be adequate anyway, because some
planstate nodes have children that are not in lefttree or righttree
(cf. explain.c).

After rebasing, this part has changed entirely and I think now it is
as per your suggestions.

+    currentRelation = ExecOpenScanRelation(estate,
+                                           ((SeqScan *) node->ss.ps.plan)->scanrelid,
+                                           eflags);

I can't see how this can possibly be remotely correct. The funnel
node shouldn't be limited to scanning a baserel (cf. fdw_scan_tlist).

Yes, I have removed the InitFunnel related code as discussed upthread.

+void ExecAccumulateInstInfo(FunnelState *node)

Another place where pgindent would help. There are a bunch of others
I noticed too, but I'm just mentioning a few here to make the point.

+ buffer_usage_worker = (BufferUsage *)(buffer_usage + (i * sizeof(BufferUsage)));

Cast it to a BufferUsage * first. Then you can use &foo[i] to find
the i'th element.

Changed as per suggestion.
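
A minimal sketch of the cast-then-index suggestion, assuming the per-worker space is one contiguous array of structs. `DemoBufferUsage` and both helper functions are hypothetical stand-ins; the real `BufferUsage` has more fields.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for BufferUsage; the real struct has more fields. */
typedef struct DemoBufferUsage
{
	long		shared_blks_hit;
	long		shared_blks_read;
} DemoBufferUsage;

/* Byte-offset form, as in the quoted patch line: arithmetic on char *. */
static DemoBufferUsage *
worker_slot_by_offset(char *space, int i)
{
	assert(space != NULL && i >= 0);
	return (DemoBufferUsage *) (space + (size_t) i * sizeof(DemoBufferUsage));
}

/* Suggested form: cast once, then let indexing scale by the element size. */
static DemoBufferUsage *
worker_slot_by_index(char *space, int i)
{
	DemoBufferUsage *usage = (DemoBufferUsage *) space;

	assert(space != NULL && i >= 0);
	return &usage[i];
}
```

Both forms yield the same address; the indexed form just leaves the `sizeof` scaling to the compiler.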

+    /*
+     * Re-initialize the parallel context and workers to perform
+     * rescan of relation.  We want to gracefully shutdown all the
+     * workers so that they should be able to propagate any error
+     * or other information to master backend before dying.
+     */
+    FinishParallelSetupAndAccumStats(node);

Somehow, this makes me feel like that function is badly named.

Okay, changed the function name to start with Destroy*.

+/*
+ * _readPlanInvalItem
+ */
+static PlanInvalItem *
+_readPlanInvalItem(void)
+{
+    READ_LOCALS(PlanInvalItem);
+
+    READ_INT_FIELD(cacheId);
+    READ_UINT_FIELD(hashValue);
+
+    READ_DONE();
+}

I don't see why we should need to be able to copy PlanInvalItems. In
fact, it seems like a bad idea.

This has been addressed as part of readfuncs related patch.

+#parallel_setup_cost = 0.0  # same scale as above
+#define DEFAULT_PARALLEL_SETUP_COST  0.0

This value is probably a bit on the low side.

Okay, for now I have changed it to 1000; let me know if you think it is
still on the low side.

+int parallel_seqscan_degree = 0;

I think we should have a GUC for the maximum degree of parallelism in
a query generally, not the maximum degree of parallel sequential scan.

How about degree_of_parallelism?

Will change the patch if you are okay with the proposed name.

+    if (parallel_seqscan_degree >= MaxConnections)
+    {
+        write_stderr("%s: parallel_scan_degree must be less than max_connections\n", progname);
+        ExitPostmaster(1);
+    }

I think this check is thoroughly unnecessary. It's comparing to the
wrong thing anyway, because what actually matters is
max_worker_processes, not max_connections. But in any case there is
no need for the check. If somebody stupidly tries an unreasonable
value for the maximum degree of parallelism, they won't get that many
workers, but nothing will break. It's no worse than setting any other
query planner costing parameter to an insane value.

Removed this check.

--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -126,6 +126,7 @@ extern void heap_rescan_set_params(HeapScanDesc scan, ScanKey key,
extern void heap_endscan(HeapScanDesc scan);
extern HeapTuple heap_getnext(HeapScanDesc scan, ScanDirection direction);
+
extern bool heap_fetch(Relation relation, Snapshot snapshot,

Stray whitespace change.

Removed.

This looks like another instance of using the walker incorrectly.
Nodes where you just want to let the walk continue shouldn't need to
be enumerated; dispatching like this should be the default case.

+               case T_Result:
+                       fix_opfuncids((Node*) (((Result *)node)->resconstantqual));
+                       break;

Seems similarly wrong.

I have not changed this, as it will go away completely once I rebase onto
your recent commit 9f1255ac.

+ * cost_patialseqscan

Typo. The actual function name has the same typo.

Changed as per your suggestion.

+       num_parallel_workers = parallel_seqscan_degree;
+       if (parallel_seqscan_degree <= estimated_parallel_workers)
+               num_parallel_workers = parallel_seqscan_degree;
+       else
+               num_parallel_workers = estimated_parallel_workers;

Use Min?

Okay, that is better; changed it accordingly.
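
For reference, the quoted if/else collapses into a single `Min()` call. The sketch below redefines the macro in the same shape as the one in PostgreSQL's `c.h` so that it stands alone; `choose_num_workers` is a hypothetical helper name, not something from the patch.

```c
#include <assert.h>

/* Same shape as Min() in PostgreSQL's c.h; redefined so the sketch stands alone. */
#ifndef Min
#define Min(x, y)		((x) < (y) ? (x) : (y))
#endif

/*
 * Hypothetical helper: cap the planner's estimate by the configured degree.
 * Equivalent to the quoted if/else, which picks whichever value is smaller.
 */
static int
choose_num_workers(int configured_degree, int estimated_workers)
{
	assert(configured_degree >= 0 && estimated_workers >= 0);
	return Min(configured_degree, estimated_workers);
}
```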

Fixed the below Comments by Hari

+ if (inst_options)
+ {
+ instoptions = shm_toc_lookup(toc, PARALLEL_KEY_INST_OPTIONS);
+ *inst_options = *instoptions;
+ if (inst_options)

Same pointer variable check, it should be if (*inst_options) as per the
estimate and store functions.

Fixed.
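
The difference between the two checks is easy to miss, so here is a small standalone sketch. `fetch_inst_options` is a hypothetical helper; in the patch the value comes from `shm_toc_lookup()`, which is replaced here by a plain pointer argument.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper mirroring the fixed check: copy the flags out of the
 * shared area and report whether any instrumentation option is set.
 */
static bool
fetch_inst_options(const int *shared_options, int *inst_options)
{
	assert(shared_options != NULL && inst_options != NULL);

	*inst_options = *shared_options;

	/*
	 * Test the value, not the pointer.  'inst_options' is known to be
	 * non-NULL here, so "if (inst_options)" would always be true.
	 */
	return *inst_options != 0;
}
```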

+ if (funnelstate->ss.ps.ps_ProjInfo)
+ slot = funnelstate->ss.ps.ps_ProjInfo->pi_slot;
+ else
+ slot = funnelstate->ss.ss_ScanTupleSlot;

Currently, there will not be a projinfo for the funnel node, so it always
uses the scan tuple slot. If that ever differs, we need to add the
ExecProject call in the ExecFunnel function. It is not present currently;
either we can document it or add the function call.

Added an appropriate comment as discussed upthread.

+ if (parallel_seqscan_degree >= MaxConnections)
+ {
+ write_stderr("%s: parallel_scan_degree must be less than max_connections\n", progname);
+ ExitPostmaster(1);
+ }

The error condition works only during server start. A user can still set
parallel_seqscan_degree higher than max_connections at the superuser
session level, etc.

removed this check.

+ if (!parallelstmt->inst_options)
+ (*receiver->rDestroy) (receiver);

Why does the receiver need to be destroyed only when there is no
instrumentation?

It should be destroyed unconditionally.

While committing tqueue related changes (commit 4a4e6893), you have
changed the signature of below API:
+CreateTupleQueueDestReceiver(shm_mq_handle *handle).

I am not sure how we can pass the handle to this API without changing
the prototypes of other APIs, or maybe we could use it directly instead of
CreateDestReceiver(). Also, at other places we already use Set
APIs to set the receiver params, so for now I have defined a
Set API to achieve the same.

So to summarize, the following are the main items which are still left:

1. Rename Funnel to Gather, per discussion.

2. Add an ID to each plan node and use that ID as the DSM key.

I feel this needs some more discussion.

3. Currently subplans are not passed to workers, so any query which
generates subplan that needs to be pushed to worker is going to have
unpredictable behaviour. This needs some more discussion.

4. Below comment by Robert -
I would expect EXPLAIN should show the # of workers planned, and
EXPLAIN ANALYZE should show both the planned and actual values.

5. Serialization of bind parameters might need some work [1].

[1]: /messages/by-id/CA+TgmoZevF_DAhqbo4j7VhwmEzSZT3wprZviqW4zvao1qgC_wA@mail.gmail.com

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

parallel_seqscan_funnel_v18.patch (application/octet-stream)
diff --git a/src/backend/access/common/printtup.c b/src/backend/access/common/printtup.c
index baed981..639451a 100644
--- a/src/backend/access/common/printtup.c
+++ b/src/backend/access/common/printtup.c
@@ -26,9 +26,9 @@
 
 static void printtup_startup(DestReceiver *self, int operation,
 				 TupleDesc typeinfo);
-static void printtup(TupleTableSlot *slot, DestReceiver *self);
-static void printtup_20(TupleTableSlot *slot, DestReceiver *self);
-static void printtup_internal_20(TupleTableSlot *slot, DestReceiver *self);
+static bool printtup(TupleTableSlot *slot, DestReceiver *self);
+static bool printtup_20(TupleTableSlot *slot, DestReceiver *self);
+static bool printtup_internal_20(TupleTableSlot *slot, DestReceiver *self);
 static void printtup_shutdown(DestReceiver *self);
 static void printtup_destroy(DestReceiver *self);
 
@@ -299,7 +299,7 @@ printtup_prepare_info(DR_printtup *myState, TupleDesc typeinfo, int numAttrs)
  *		printtup --- print a tuple in protocol 3.0
  * ----------------
  */
-static void
+static bool
 printtup(TupleTableSlot *slot, DestReceiver *self)
 {
 	TupleDesc	typeinfo = slot->tts_tupleDescriptor;
@@ -376,13 +376,15 @@ printtup(TupleTableSlot *slot, DestReceiver *self)
 	/* Return to caller's context, and flush row's temporary memory */
 	MemoryContextSwitchTo(oldcontext);
 	MemoryContextReset(myState->tmpcontext);
+
+	return true;
 }
 
 /* ----------------
  *		printtup_20 --- print a tuple in protocol 2.0
  * ----------------
  */
-static void
+static bool
 printtup_20(TupleTableSlot *slot, DestReceiver *self)
 {
 	TupleDesc	typeinfo = slot->tts_tupleDescriptor;
@@ -452,6 +454,8 @@ printtup_20(TupleTableSlot *slot, DestReceiver *self)
 	/* Return to caller's context, and flush row's temporary memory */
 	MemoryContextSwitchTo(oldcontext);
 	MemoryContextReset(myState->tmpcontext);
+
+	return true;
 }
 
 /* ----------------
@@ -528,7 +532,7 @@ debugStartup(DestReceiver *self, int operation, TupleDesc typeinfo)
  *		debugtup - print one tuple for an interactive backend
  * ----------------
  */
-void
+bool
 debugtup(TupleTableSlot *slot, DestReceiver *self)
 {
 	TupleDesc	typeinfo = slot->tts_tupleDescriptor;
@@ -553,6 +557,8 @@ debugtup(TupleTableSlot *slot, DestReceiver *self)
 		printatt((unsigned) i + 1, typeinfo->attrs[i], value);
 	}
 	printf("\t----\n");
+
+	return true;
 }
 
 /* ----------------
@@ -564,7 +570,7 @@ debugtup(TupleTableSlot *slot, DestReceiver *self)
  * This is largely same as printtup_20, except we use binary formatting.
  * ----------------
  */
-static void
+static bool
 printtup_internal_20(TupleTableSlot *slot, DestReceiver *self)
 {
 	TupleDesc	typeinfo = slot->tts_tupleDescriptor;
@@ -636,4 +642,6 @@ printtup_internal_20(TupleTableSlot *slot, DestReceiver *self)
 	/* Return to caller's context, and flush row's temporary memory */
 	MemoryContextSwitchTo(oldcontext);
 	MemoryContextReset(myState->tmpcontext);
+
+	return true;
 }
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 8db1b35..b55c4dc 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -4414,7 +4414,7 @@ copy_dest_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
 /*
  * copy_dest_receive --- receive one tuple
  */
-static void
+static bool
 copy_dest_receive(TupleTableSlot *slot, DestReceiver *self)
 {
 	DR_copy    *myState = (DR_copy *) self;
@@ -4426,6 +4426,8 @@ copy_dest_receive(TupleTableSlot *slot, DestReceiver *self)
 	/* And send the data */
 	CopyOneRowTo(cstate, InvalidOid, slot->tts_values, slot->tts_isnull);
 	myState->processed++;
+
+	return true;
 }
 
 /*
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 41183f6..418b0f6 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -62,7 +62,7 @@ typedef struct
 static ObjectAddress CreateAsReladdr = {InvalidOid, InvalidOid, 0};
 
 static void intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo);
-static void intorel_receive(TupleTableSlot *slot, DestReceiver *self);
+static bool intorel_receive(TupleTableSlot *slot, DestReceiver *self);
 static void intorel_shutdown(DestReceiver *self);
 static void intorel_destroy(DestReceiver *self);
 
@@ -482,7 +482,7 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
 /*
  * intorel_receive --- receive one tuple
  */
-static void
+static bool
 intorel_receive(TupleTableSlot *slot, DestReceiver *self)
 {
 	DR_intorel *myState = (DR_intorel *) self;
@@ -507,6 +507,8 @@ intorel_receive(TupleTableSlot *slot, DestReceiver *self)
 				myState->bistate);
 
 	/* We know this is a newly created relation, so there are no indexes */
+
+	return true;
 }
 
 /*
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index f0d9e94..e99ad56 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -20,6 +20,7 @@
 #include "commands/defrem.h"
 #include "commands/prepare.h"
 #include "executor/hashjoin.h"
+#include "executor/nodeFunnel.h"
 #include "foreign/fdwapi.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
@@ -730,6 +731,7 @@ ExplainPreScanNode(PlanState *planstate, Bitmapset **rels_used)
 	{
 		case T_SeqScan:
 		case T_SampleScan:
+		case T_Funnel:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -853,6 +855,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_SampleScan:
 			pname = sname = "Sample Scan";
 			break;
+		case T_Funnel:
+			pname = sname = "Funnel";
+			break;
 		case T_IndexScan:
 			pname = sname = "Index Scan";
 			break;
@@ -1003,6 +1008,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	{
 		case T_SeqScan:
 		case T_SampleScan:
+		case T_Funnel:
 		case T_BitmapHeapScan:
 		case T_TidScan:
 		case T_SubqueryScan:
@@ -1147,6 +1153,15 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	}
 
 	/*
+	 * Aggregate instrumentation information of all the backend workers for
+	 * Funnel node.  Though we already accumulate this information when last
+	 * tuple is fetched from Funnel node, this is to cover cases when we don't
+	 * fetch all tuples from a node such as for Limit node.
+	 */
+	if (es->analyze && IsA(plan, Funnel))
+		DestroyParallelSetupAndAccumStats((FunnelState *) planstate);
+
+	/*
 	 * We have to forcibly clean up the instrumentation state because we
 	 * haven't done ExecutorEnd yet.  This is pretty grotty ...
 	 *
@@ -1276,6 +1291,14 @@ ExplainNode(PlanState *planstate, List *ancestors,
 				show_instrumentation_count("Rows Removed by Filter", 1,
 										   planstate, es);
 			break;
+		case T_Funnel:
+			show_scan_qual(plan->qual, "Filter", planstate, ancestors, es);
+			if (plan->qual)
+				show_instrumentation_count("Rows Removed by Filter", 1,
+										   planstate, es);
+			ExplainPropertyInteger("Number of Workers",
+								   ((Funnel *) plan)->num_workers, es);
+			break;
 		case T_FunctionScan:
 			if (es->verbose)
 			{
@@ -2335,6 +2358,7 @@ ExplainTargetRel(Plan *plan, Index rti, ExplainState *es)
 	{
 		case T_SeqScan:
 		case T_SampleScan:
+		case T_Funnel:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 5492e59..750a59c 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -56,7 +56,7 @@ typedef struct
 static int	matview_maintenance_depth = 0;
 
 static void transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo);
-static void transientrel_receive(TupleTableSlot *slot, DestReceiver *self);
+static bool transientrel_receive(TupleTableSlot *slot, DestReceiver *self);
 static void transientrel_shutdown(DestReceiver *self);
 static void transientrel_destroy(DestReceiver *self);
 static void refresh_matview_datafill(DestReceiver *dest, Query *query,
@@ -422,7 +422,7 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
 /*
  * transientrel_receive --- receive one tuple
  */
-static void
+static bool
 transientrel_receive(TupleTableSlot *slot, DestReceiver *self)
 {
 	DR_transientrel *myState = (DR_transientrel *) self;
@@ -441,6 +441,8 @@ transientrel_receive(TupleTableSlot *slot, DestReceiver *self)
 				myState->bistate);
 
 	/* We know this is a newly created relation, so there are no indexes */
+
+	return true;
 }
 
 /*
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 249534b..fb34864 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -13,11 +13,11 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = execAmi.o execCurrent.o execGrouping.o execIndexing.o execJunk.o \
-       execMain.o execProcnode.o execQual.o execScan.o execTuples.o \
+       execMain.o execParallel.o execProcnode.o execQual.o execScan.o execTuples.o \
        execUtils.o functions.o instrument.o nodeAppend.o nodeAgg.o \
        nodeBitmapAnd.o nodeBitmapOr.o \
-       nodeBitmapHeapscan.o nodeBitmapIndexscan.o nodeCustom.o nodeHash.o \
-       nodeHashjoin.o nodeIndexscan.o nodeIndexonlyscan.o \
+       nodeBitmapHeapscan.o nodeBitmapIndexscan.o nodeCustom.o nodeFunnel.o \
+       nodeHash.o nodeHashjoin.o nodeIndexscan.o nodeIndexonlyscan.o \
        nodeLimit.o nodeLockRows.o \
        nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
        nodeNestloop.o nodeFunctionscan.o nodeRecursiveunion.o nodeResult.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 93e1e9a..4915151 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -24,6 +24,7 @@
 #include "executor/nodeCustom.h"
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeFunctionscan.h"
+#include "executor/nodeFunnel.h"
 #include "executor/nodeGroup.h"
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
@@ -160,6 +161,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSampleScan((SampleScanState *) node);
 			break;
 
+		case T_FunnelState:
+			ExecReScanFunnel((FunnelState *) node);
+			break;
+
 		case T_IndexScanState:
 			ExecReScanIndexScan((IndexScanState *) node);
 			break;
@@ -467,6 +472,9 @@ ExecSupportsBackwardScan(Plan *node)
 			/* Simplify life for tablesample methods by disallowing this */
 			return false;
 
+		case T_Funnel:
+			return false;
+
 		case T_IndexScan:
 			return IndexSupportsBackwardScan(((IndexScan *) node)->indexid) &&
 				TargetListSupportsBackwardScan(node->targetlist);
diff --git a/src/backend/executor/execCurrent.c b/src/backend/executor/execCurrent.c
index bcd287f..650fcc5 100644
--- a/src/backend/executor/execCurrent.c
+++ b/src/backend/executor/execCurrent.c
@@ -262,6 +262,7 @@ search_plan_tree(PlanState *node, Oid table_oid)
 			 */
 		case T_SeqScanState:
 		case T_SampleScanState:
+		case T_FunnelState:
 		case T_IndexScanState:
 		case T_IndexOnlyScanState:
 		case T_BitmapHeapScanState:
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 85ff46b..66e015b 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -45,9 +45,11 @@
 #include "commands/matview.h"
 #include "commands/trigger.h"
 #include "executor/execdebug.h"
+#include "executor/execParallel.h"
 #include "foreign/fdwapi.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
+#include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
 #include "parser/parsetree.h"
 #include "storage/bufmgr.h"
@@ -354,7 +356,11 @@ standard_ExecutorRun(QueryDesc *queryDesc,
 		(*dest->rShutdown) (dest);
 
 	if (queryDesc->totaltime)
+	{
+		/* Accumulate stats from parallel workers before stopping node */
+		(void) ExecParallelBufferUsageAccum((Node *) queryDesc->planstate);
 		InstrStopNode(queryDesc->totaltime, estate->es_processed);
+	}
 
 	MemoryContextSwitchTo(oldcontext);
 }
@@ -1581,7 +1587,15 @@ ExecutePlan(EState *estate,
 		 * practice, this is probably always the case at this point.)
 		 */
 		if (sendTuples)
-			(*dest->receiveSlot) (slot, dest);
+		{
+			/*
+			 * If we are not able to send the tuple, we assume the destination
+			 * has closed and no more tuples can be sent. If that's the case,
+			 * end the loop.
+			 */
+			if (!((*dest->receiveSlot) (slot, dest)))
+				break;
+		}
 
 		/*
 		 * Count tuples processed, if this is a SELECT.  (For other operation
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
new file mode 100644
index 0000000..726bc7f
--- /dev/null
+++ b/src/backend/executor/execParallel.c
@@ -0,0 +1,542 @@
+/*-------------------------------------------------------------------------
+ *
+ * execParallel.c
+ *	  Support routines for setting up backend workers for parallel execution.
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/execParallel.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execParallel.h"
+#include "executor/nodeFunnel.h"
+#include "nodes/nodeFuncs.h"
+#include "optimizer/planmain.h"
+#include "optimizer/planner.h"
+#include "tcop/tcopprot.h"
+
+
+#define PARALLEL_TUPLE_QUEUE_SIZE					65536
+
+static void ParallelQueryMain(dsm_segment *seg, shm_toc *toc);
+static void EstimateParallelSupportInfoSpace(ParallelContext *pcxt, ParamListInfo params,
+								 List *serialized_param_exec_vals,
+								 int instOptions, Size *params_size,
+								 Size *params_exec_size);
+static void StoreParallelSupportInfo(ParallelContext *pcxt, ParamListInfo params,
+						 List *serialized_param_exec_vals,
+						 int instOptions, Size params_size,
+						 Size params_exec_size,
+						 char **inst_options_space,
+						 char **buffer_usage_space);
+static void EstimatePlannedStmtSpace(ParallelContext *pcxt, PlanState *planstate,
+						 char *plannedstmt_str, Size *plannedstmt_len,
+						 Size *pscan_size);
+static void StorePlannedStmt(ParallelContext *pcxt, PlanState *planstate,
+				 char *plannedstmt_str, Size plannedstmt_size,
+				 Size pscan_size);
+static void EstimateResponseQueueSpace(ParallelContext *pcxt);
+static void StoreResponseQueue(ParallelContext *pcxt,
+				   shm_mq_handle ***responseqp);
+static void
+			ExecParallelGetPlannedStmt(shm_toc *toc, PlannedStmt **plannedstmt);
+static void GetParallelSupportInfo(shm_toc *toc, ParamListInfo *params,
+					   List **serialized_param_exec_vals,
+					   int *inst_options, char **instrument,
+					   char **buffer_usage);
+static void SetupResponseQueue(dsm_segment *seg, shm_toc *toc, shm_mq **mq,
+				   shm_mq_handle **responseq);
+
+
+/*
+ * This is required for parallel plan execution to fetch the information
+ * from dsm.
+ */
+static shm_toc *parallel_shm_toc = NULL;
+
+typedef struct parallel_estimate_ctx
+{
+	ParallelContext *context;
+	Size	   *psize;
+
+} parallel_estimate_ctx;
+
+static bool ExecParallelEstimate(Node *node,
+					 parallel_estimate_ctx *pestcontext);
+static bool ExecParallelInitializeDSM(Node *node,
+						  parallel_estimate_ctx *pestcontext);
+
+/*
+ * EstimateParallelSupportInfoSpace
+ *
+ * Estimate the amount of space required to record information of bind
+ * parameters, PARAM_EXEC parameters and instrumentation information that
+ * need to be retrieved from parallel workers.
+ */
+void
+EstimateParallelSupportInfoSpace(ParallelContext *pcxt, ParamListInfo params,
+								 List *serialized_param_exec_vals,
+								 int instOptions, Size *params_size,
+								 Size *params_exec_size)
+{
+	*params_size = EstimateBoundParametersSpace(params);
+	shm_toc_estimate_chunk(&pcxt->estimator, *params_size);
+
+	*params_exec_size = EstimateExecParametersSpace(serialized_param_exec_vals);
+	shm_toc_estimate_chunk(&pcxt->estimator, *params_exec_size);
+
+	/*
+	 * We expect each worker to populate the BufferUsage and instrumentation
+	 * structure allocated by master backend and then master backend will
+	 * aggregate all the usage along with it's own, so account it for each
+	 * worker.
+	 */
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   sizeof(BufferUsage) * pcxt->nworkers);
+
+	/* account for instrumentation options. */
+	shm_toc_estimate_chunk(&pcxt->estimator, sizeof(int));
+
+	if (instOptions)
+	{
+		shm_toc_estimate_chunk(&pcxt->estimator,
+							   sizeof(Instrumentation) * pcxt->nworkers);
+		/* keys for parallel support information. */
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+	}
+
+	/* keys for parallel support information. */
+	shm_toc_estimate_keys(&pcxt->estimator, 4);
+}
+
+/*
+ * StoreParallelSupportInfo
+ *
+ * Sets up the bind parameters, PARAM_EXEC parameters and instrumentation
+ * information required for parallel execution.
+ */
+void
+StoreParallelSupportInfo(ParallelContext *pcxt, ParamListInfo params,
+						 List *serialized_param_exec_vals,
+						 int instOptions, Size params_size,
+						 Size params_exec_size,
+						 char **inst_options_space,
+						 char **buffer_usage_space)
+{
+	char	   *paramsdata;
+	char	   *paramsexecdata;
+	int		   *inst_options;
+
+	/* Store PARAM_EXTERN aka bind parameters in dynamic shared memory */
+	paramsdata = shm_toc_allocate(pcxt->toc, params_size);
+	SerializeBoundParams(params, params_size, paramsdata);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_PARAMS, paramsdata);
+
+	/*
+	 * Store PARAM_EXEC parameters in dynamic shared memory.  This is used for
+	 * evaluation plan->initPlan params.
+	 */
+	paramsexecdata = shm_toc_allocate(pcxt->toc, params_exec_size);
+	SerializeExecParams(serialized_param_exec_vals, params_exec_size, paramsexecdata);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_PARAMS_EXEC, paramsexecdata);
+
+	/*
+	 * Allocate space for BufferUsage information to be filled by each worker.
+	 */
+	*buffer_usage_space =
+		shm_toc_allocate(pcxt->toc, sizeof(BufferUsage) * pcxt->nworkers);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFF_USAGE, *buffer_usage_space);
+
+	/* Store instrument options in dynamic shared memory. */
+	inst_options = shm_toc_allocate(pcxt->toc, sizeof(int));
+	*inst_options = instOptions;
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_INST_OPTIONS, inst_options);
+
+	/*
+	 * Allocate space for instrumentation information to be filled by each
+	 * worker.
+	 */
+	if (instOptions)
+	{
+		*inst_options_space =
+			shm_toc_allocate(pcxt->toc, sizeof(Instrumentation) * pcxt->nworkers);
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_INST_INFO, *inst_options_space);
+	}
+}
+
+/*
+ * EstimatePlannedStmtSpace
+ *
+ * Estimate the amount of space required to record information of planned
+ * statement and parallel node specific information that need to be copied
+ * to parallel workers.
+ */
+void
+EstimatePlannedStmtSpace(ParallelContext *pcxt, PlanState *planstate,
+						 char *plannedstmt_str, Size *plannedstmt_len,
+						 Size *pscan_size)
+{
+	parallel_estimate_ctx *pestcontext;
+
+	pestcontext = palloc(sizeof(parallel_estimate_ctx));
+	pestcontext->context = pcxt;
+	pestcontext->psize = pscan_size;
+
+	/* Estimate space for planned statement. */
+	*plannedstmt_len = strlen(plannedstmt_str) + 1;
+	shm_toc_estimate_chunk(&pcxt->estimator, *plannedstmt_len);
+
+	/* keys for planned statement information. */
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+	(void) ExecParallelEstimate((Node *) planstate, pestcontext);
+}
+
+/*
+ * StorePlannedStmt
+ *
+ * Sets up the planned statement and node specific information.
+ */
+void
+StorePlannedStmt(ParallelContext *pcxt, PlanState *planstate,
+				 char *plannedstmt_str, Size plannedstmt_size,
+				 Size pscan_size)
+{
+	char	   *plannedstmtdata;
+	parallel_estimate_ctx *pestcontext;
+
+	pestcontext = palloc(sizeof(parallel_estimate_ctx));
+	pestcontext->context = pcxt;
+	pestcontext->psize = &pscan_size;
+
+	/* Store planned statement in dynamic shared memory. */
+	plannedstmtdata = shm_toc_allocate(pcxt->toc, plannedstmt_size);
+	memcpy(plannedstmtdata, plannedstmt_str, plannedstmt_size);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_PLANNEDSTMT, plannedstmtdata);
+
+	(void) ExecParallelInitializeDSM((Node *) planstate, pestcontext);
+}
+
+/*
+ * EstimateResponseQueueSpace
+ *
+ * Estimate the amount of space required to record information of tuple
+ * queues that need to be established between parallel workers and master
+ * backend.
+ */
+void
+EstimateResponseQueueSpace(ParallelContext *pcxt)
+{
+	/* Estimate space for tuple queues. */
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   (Size) PARALLEL_TUPLE_QUEUE_SIZE * pcxt->nworkers);
+
+	/* keys for response queue. */
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/*
+ * StoreResponseQueue
+ *
+ * It sets up the response queues for backend workers to return tuples
+ * to the main backend and start the workers.
+ */
+void
+StoreResponseQueue(ParallelContext *pcxt,
+				   shm_mq_handle ***responseqp)
+{
+	shm_mq	   *mq;
+	char	   *tuple_queue_space;
+	int			i;
+
+	/* Allocate memory for shared memory queue handles. */
+	*responseqp = (shm_mq_handle **) palloc(pcxt->nworkers * sizeof(shm_mq_handle *));
+
+	/*
+	 * Establish one message queue per worker in dynamic shared memory. These
+	 * queues should be used to transmit tuple data.
+	 */
+	tuple_queue_space =
+		shm_toc_allocate(pcxt->toc, PARALLEL_TUPLE_QUEUE_SIZE * pcxt->nworkers);
+	for (i = 0; i < pcxt->nworkers; ++i)
+	{
+		mq = shm_mq_create(tuple_queue_space + i * PARALLEL_TUPLE_QUEUE_SIZE,
+						   (Size) PARALLEL_TUPLE_QUEUE_SIZE);
+
+		shm_mq_set_receiver(mq, MyProc);
+
+		/*
+		 * Attach the queue before launching a worker, so that we'll
+		 * automatically detach the queue if we error out.  Otherwise, the
+		 * worker might sit there trying to write the queue long after we've
+		 * gone away.
+		 */
+		(*responseqp)[i] = shm_mq_attach(mq, pcxt->seg, NULL);
+	}
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_TUPLE_QUEUE, tuple_queue_space);
+}
+
+/*
+ * ExecParallelEstimate
+ *
+ * Estimate the amount of space required to record information of
+ * parallel node that need to be copied to parallel workers.
+ */
+static bool
+ExecParallelEstimate(Node *node, parallel_estimate_ctx *pestcontext)
+{
+	if (node == NULL)
+		return false;
+
+	switch (nodeTag(node))
+	{
+		default:
+			break;
+	}
+
+	return planstate_tree_walker((PlanState *) node, ExecParallelEstimate, pestcontext);
+}
+
+/*
+ *	ExecParallelInitializeDSM
+ *
+ *		Store the information of parallel node in dsm.
+ */
+static bool
+ExecParallelInitializeDSM(Node *node, parallel_estimate_ctx *pestcontext)
+{
+	if (node == NULL)
+		return false;
+
+	switch (nodeTag(node))
+	{
+		default:
+			break;
+	}
+
+	return planstate_tree_walker((PlanState *) node, ExecParallelInitializeDSM, pestcontext);
+}
+
+/*
+ * InitializeParallelWorkers
+ *
+ * Sets up the required infrastructure for backend workers to perform
+ * execution and return results to the main backend.
+ */
+void
+InitializeParallelWorkers(PlanState *planstate,
+						  List *serialized_param_exec_vals,
+						  EState *estate,
+						  char **inst_options_space,
+						  char **buffer_usage_space,
+						  shm_mq_handle ***responseqp,
+						  ParallelContext **pcxtp,
+						  int nWorkers)
+{
+	Size		params_size,
+				params_exec_size,
+				pscan_size,
+				plannedstmt_size;
+	char	   *plannedstmt_str;
+	PlannedStmt *plannedstmt;
+	ParallelContext *pcxt;
+
+	pcxt = CreateParallelContext(ParallelQueryMain, nWorkers);
+
+	plannedstmt = create_parallel_worker_plannedstmt(planstate->plan,
+													 estate->es_range_table,
+										 estate->es_plannedstmt->nParamExec);
+	plannedstmt_str = nodeToString(plannedstmt);
+
+	EstimatePlannedStmtSpace(pcxt, planstate, plannedstmt_str,
+							 &plannedstmt_size, &pscan_size);
+	EstimateParallelSupportInfoSpace(pcxt, estate->es_param_list_info,
+									 serialized_param_exec_vals,
+									 estate->es_instrument, &params_size,
+									 &params_exec_size);
+	EstimateResponseQueueSpace(pcxt);
+
+	InitializeParallelDSM(pcxt);
+
+	StorePlannedStmt(pcxt, planstate, plannedstmt_str,
+					 plannedstmt_size, pscan_size);
+	StoreParallelSupportInfo(pcxt, estate->es_param_list_info,
+							 serialized_param_exec_vals,
+							 estate->es_instrument,
+							 params_size,
+							 params_exec_size,
+							 inst_options_space,
+							 buffer_usage_space);
+	StoreResponseQueue(pcxt, responseqp);
+
+	/* Return results to caller. */
+	*pcxtp = pcxt;
+}
+
+/*
+ * GetParallelSupportInfo
+ *
+ * Look up based on keys in dynamic shared memory segment and get the
+ * bind parameters, PARAM_EXEC parameters and instrumentation information
+ * required to perform parallel operation.
+ */
+static void
+GetParallelSupportInfo(shm_toc *toc, ParamListInfo *params,
+					   List **serialized_param_exec_vals,
+					   int *inst_options, char **instrument,
+					   char **buffer_usage)
+{
+	char	   *paramsdata;
+	char	   *paramsexecdata;
+	char	   *inst_options_space;
+	char	   *buffer_usage_space;
+	int		   *instoptions;
+
+	paramsdata = shm_toc_lookup(toc, PARALLEL_KEY_PARAMS);
+	*params = RestoreBoundParams(paramsdata);
+
+	paramsexecdata = shm_toc_lookup(toc, PARALLEL_KEY_PARAMS_EXEC);
+	*serialized_param_exec_vals = RestoreExecParams(paramsexecdata);
+
+	instoptions = shm_toc_lookup(toc, PARALLEL_KEY_INST_OPTIONS);
+	*inst_options = *instoptions;
+	if (*inst_options)
+	{
+		inst_options_space = shm_toc_lookup(toc, PARALLEL_KEY_INST_INFO);
+		*instrument = (inst_options_space +
+					   ParallelWorkerNumber * sizeof(Instrumentation));
+	}
+
+	buffer_usage_space = shm_toc_lookup(toc, PARALLEL_KEY_BUFF_USAGE);
+	*buffer_usage = (buffer_usage_space + ParallelWorkerNumber * sizeof(BufferUsage));
+}
+
+/*
+ * ExecParallelGetPlannedStmt
+ *
+ * Look up based on keys in dynamic shared memory segment and get the
+ * planned statement required to perform parallel operation.
+ */
+void
+ExecParallelGetPlannedStmt(shm_toc *toc, PlannedStmt **plannedstmt)
+{
+	char	   *plannedstmtdata;
+
+	plannedstmtdata = shm_toc_lookup(toc, PARALLEL_KEY_PLANNEDSTMT);
+
+	*plannedstmt = (PlannedStmt *) stringToNode(plannedstmtdata);
+
+	/* Fill in opfuncid values if missing */
+	fix_node_funcids((*plannedstmt)->planTree);
+}
+
+/*
+ * SetupResponseQueue
+ *
+ * Look up based on keys in dynamic shared memory segment and get the
+ * tuple queue information for a particular worker, attach to the queue
+ * and redirect all further responses from worker backend via that queue.
+ */
+void
+SetupResponseQueue(dsm_segment *seg, shm_toc *toc, shm_mq **mq,
+				   shm_mq_handle **responseq)
+{
+	char	   *tuple_queue_space;
+
+	tuple_queue_space = shm_toc_lookup(toc, PARALLEL_KEY_TUPLE_QUEUE);
+	*mq = (shm_mq *) (tuple_queue_space +
+					  ParallelWorkerNumber * PARALLEL_TUPLE_QUEUE_SIZE);
+
+	shm_mq_set_sender(*mq, MyProc);
+	*responseq = shm_mq_attach(*mq, seg, NULL);
+}
+
+/*
+ * GetParallelShmToc
+ */
+shm_toc *
+GetParallelShmToc(void)
+{
+	return parallel_shm_toc;
+}
+
+/*
+ * ParallelQueryMain
+ *
+ * Execute the operation and return the tuples or other information to
+ * the node driving the parallelism.
+ */
+void
+ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
+{
+	shm_mq	   *mq;
+	shm_mq_handle *responseq;
+	PlannedStmt *plannedstmt;
+	ParamListInfo params;
+	List	   *serialized_param_exec_vals;
+	int			inst_options;
+	char	   *instrument = NULL;
+	char	   *buffer_usage = NULL;
+	ParallelStmt *parallelstmt;
+
+	SetupResponseQueue(seg, toc, &mq, &responseq);
+
+	ExecParallelGetPlannedStmt(toc, &plannedstmt);
+	GetParallelSupportInfo(toc, &params, &serialized_param_exec_vals,
+						   &inst_options, &instrument, &buffer_usage);
+
+	parallelstmt = palloc(sizeof(ParallelStmt));
+
+	parallelstmt->plannedstmt = plannedstmt;
+	parallelstmt->params = params;
+	parallelstmt->serialized_param_exec_vals = serialized_param_exec_vals;
+	parallelstmt->inst_options = inst_options;
+	parallelstmt->instrument = instrument;
+	parallelstmt->buffer_usage = buffer_usage;
+	parallelstmt->responseq = responseq;
+
+	parallel_shm_toc = toc;
+
+	/* Execute the worker command. */
+	exec_parallel_stmt(parallelstmt);
+
+	/*
+	 * Once we are done with sending tuples, detach from shared memory message
+	 * queue used to send tuples.
+	 */
+	shm_mq_detach(mq);
+}
+
+/*
+ * ExecParallelBufferUsageAccum
+ *
+ * Recursively accumulate the stats for all the funnel nodes in a plan
+ * state tree.
+ */
+bool
+ExecParallelBufferUsageAccum(Node *node)
+{
+	if (node == NULL)
+		return false;
+
+	switch (nodeTag(node))
+	{
+		case T_FunnelState:
+			{
+				DestroyParallelSetupAndAccumStats((FunnelState *) node);
+				return true;
+			}
+			break;
+		default:
+			break;
+	}
+
+	return planstate_tree_walker((PlanState *) node, ExecParallelBufferUsageAccum, NULL);
+}
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 03c2feb..c181bf2 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -100,6 +100,7 @@
 #include "executor/nodeMergejoin.h"
 #include "executor/nodeModifyTable.h"
 #include "executor/nodeNestloop.h"
+#include "executor/nodeFunnel.h"
 #include "executor/nodeRecursiveunion.h"
 #include "executor/nodeResult.h"
 #include "executor/nodeSamplescan.h"
@@ -196,6 +197,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 													  estate, eflags);
 			break;
 
+		case T_Funnel:
+			result = (PlanState *) ExecInitFunnel((Funnel *) node,
+												  estate, eflags);
+			break;
+
 		case T_IndexScan:
 			result = (PlanState *) ExecInitIndexScan((IndexScan *) node,
 													 estate, eflags);
@@ -416,6 +422,10 @@ ExecProcNode(PlanState *node)
 			result = ExecSampleScan((SampleScanState *) node);
 			break;
 
+		case T_FunnelState:
+			result = ExecFunnel((FunnelState *) node);
+			break;
+
 		case T_IndexScanState:
 			result = ExecIndexScan((IndexScanState *) node);
 			break;
@@ -658,6 +668,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSampleScan((SampleScanState *) node);
 			break;
 
+		case T_FunnelState:
+			ExecEndFunnel((FunnelState *) node);
+			break;
+
 		case T_IndexScanState:
 			ExecEndIndexScan((IndexScanState *) node);
 			break;
diff --git a/src/backend/executor/execTuples.c b/src/backend/executor/execTuples.c
index a05d8b1..d5619bd 100644
--- a/src/backend/executor/execTuples.c
+++ b/src/backend/executor/execTuples.c
@@ -1313,7 +1313,7 @@ do_tup_output(TupOutputState *tstate, Datum *values, bool *isnull)
 	ExecStoreVirtualTuple(slot);
 
 	/* send the tuple to the receiver */
-	(*tstate->dest->receiveSlot) (slot, tstate->dest);
+	(void) (*tstate->dest->receiveSlot) (slot, tstate->dest);
 
 	/* clean up */
 	ExecClearTuple(slot);
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index a735815..3213684 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -964,3 +964,27 @@ ShutdownExprContext(ExprContext *econtext, bool isCommit)
 
 	MemoryContextSwitchTo(oldcontext);
 }
+
+/*
+ * Populate the values of PARAM_EXEC parameters.
+ *
+ * This is used by worker backends to fill in the values of PARAM_EXEC
+ * parameters after fetching the same from dynamic shared memory.
+ * This needs to be called before ExecutorRun.
+ */
+void
+PopulateParamExecParams(QueryDesc *queryDesc,
+						List *serialized_param_exec_vals)
+{
+	ListCell   *lparam;
+
+	foreach(lparam, serialized_param_exec_vals)
+	{
+		SerializedParamExecData *param_val = (SerializedParamExecData *) lfirst(lparam);
+
+		queryDesc->estate->es_param_exec_vals[param_val->paramid].value =
+			param_val->value;
+		queryDesc->estate->es_param_exec_vals[param_val->paramid].isnull =
+			param_val->isnull;
+	}
+}
diff --git a/src/backend/executor/functions.c b/src/backend/executor/functions.c
index 812a610..863bd64 100644
--- a/src/backend/executor/functions.c
+++ b/src/backend/executor/functions.c
@@ -167,7 +167,7 @@ static Datum postquel_get_single_result(TupleTableSlot *slot,
 static void sql_exec_error_callback(void *arg);
 static void ShutdownSQLFunction(Datum arg);
 static void sqlfunction_startup(DestReceiver *self, int operation, TupleDesc typeinfo);
-static void sqlfunction_receive(TupleTableSlot *slot, DestReceiver *self);
+static bool sqlfunction_receive(TupleTableSlot *slot, DestReceiver *self);
 static void sqlfunction_shutdown(DestReceiver *self);
 static void sqlfunction_destroy(DestReceiver *self);
 
@@ -1903,7 +1903,7 @@ sqlfunction_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
 /*
  * sqlfunction_receive --- receive one tuple
  */
-static void
+static bool
 sqlfunction_receive(TupleTableSlot *slot, DestReceiver *self)
 {
 	DR_sqlfunction *myState = (DR_sqlfunction *) self;
@@ -1913,6 +1913,8 @@ sqlfunction_receive(TupleTableSlot *slot, DestReceiver *self)
 
 	/* Store the filtered tuple into the tuplestore */
 	tuplestore_puttupleslot(myState->tstore, slot);
+
+	return true;
 }
 
 /*
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index f5351eb..639eb04 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -19,9 +19,6 @@
 
 BufferUsage pgBufferUsage;
 
-static void BufferUsageAccumDiff(BufferUsage *dst,
-					 const BufferUsage *add, const BufferUsage *sub);
-
 
 /* Allocate new instrumentation structure(s) */
 Instrumentation *
@@ -127,8 +124,29 @@ InstrEndLoop(Instrumentation *instr)
 	instr->tuplecount = 0;
 }
 
+/*
+ * Aggregate the instrumentation information.  This is used to aggregate
+ * the information of worker backends.  We only need to sum the buffer
+ * usage and tuple count statistics; for the timing-related statistics
+ * it is sufficient to have the master backend's information.
+ */
+void
+InstrAggNode(Instrumentation *instr1, Instrumentation *instr2)
+{
+	/* count the returned tuples */
+	instr1->tuplecount += instr2->tuplecount;
+
+	instr1->nfiltered1 += instr2->nfiltered1;
+	instr1->nfiltered2 += instr2->nfiltered2;
+
+	/* Add instr2's buffer usage to instr1's totals */
+	if (instr1->need_bufusage)
+		BufferUsageAdd(&instr1->bufusage, &instr2->bufusage);
+
+}
+
 /* dst += add - sub */
-static void
+void
 BufferUsageAccumDiff(BufferUsage *dst,
 					 const BufferUsage *add,
 					 const BufferUsage *sub)
@@ -148,3 +166,21 @@ BufferUsageAccumDiff(BufferUsage *dst,
 	INSTR_TIME_ACCUM_DIFF(dst->blk_write_time,
 						  add->blk_write_time, sub->blk_write_time);
 }
+
+/* dst += add */
+void
+BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
+{
+	dst->shared_blks_hit += add->shared_blks_hit;
+	dst->shared_blks_read += add->shared_blks_read;
+	dst->shared_blks_dirtied += add->shared_blks_dirtied;
+	dst->shared_blks_written += add->shared_blks_written;
+	dst->local_blks_hit += add->local_blks_hit;
+	dst->local_blks_read += add->local_blks_read;
+	dst->local_blks_dirtied += add->local_blks_dirtied;
+	dst->local_blks_written += add->local_blks_written;
+	dst->temp_blks_read += add->temp_blks_read;
+	dst->temp_blks_written += add->temp_blks_written;
+	INSTR_TIME_ADD(dst->blk_read_time, add->blk_read_time);
+	INSTR_TIME_ADD(dst->blk_write_time, add->blk_write_time);
+}
diff --git a/src/backend/executor/nodeFunnel.c b/src/backend/executor/nodeFunnel.c
new file mode 100644
index 0000000..81f4074
--- /dev/null
+++ b/src/backend/executor/nodeFunnel.c
@@ -0,0 +1,399 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeFunnel.c
+ *	  Support routines for scanning a relation via multiple workers.
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeFunnel.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * INTERFACE ROUTINES
+ *		ExecFunnel				scans a relation using worker backends.
+ *		ExecInitFunnel			creates and initializes a funnel node.
+ *		ExecEndFunnel			releases any storage allocated.
+ *		ExecReScanFunnel		re-initializes the workers and rescans a relation via them.
+ */
+#include "postgres.h"
+
+#include "access/relscan.h"
+#include "executor/execdebug.h"
+#include "executor/execParallel.h"
+#include "executor/nodeFunnel.h"
+#include "executor/nodeSubplan.h"
+#include "utils/rel.h"
+
+
+static TupleTableSlot *funnel_getnext(FunnelState *funnelstate);
+static void ExecAccumulateInstInfo(FunnelState *node);
+static void ExecAccumulateBufUsageInfo(FunnelState *node);
+
+
+/* ----------------------------------------------------------------
+ *		ExecInitFunnel
+ * ----------------------------------------------------------------
+ */
+FunnelState *
+ExecInitFunnel(Funnel * node, EState *estate, int eflags)
+{
+	FunnelState *funnelstate;
+
+	/* Funnel node doesn't have innerPlan node. */
+	Assert(innerPlan(node) == NULL);
+
+	/*
+	 * create state structure
+	 */
+	funnelstate = makeNode(FunnelState);
+	funnelstate->ss.ps.plan = (Plan *) node;
+	funnelstate->ss.ps.state = estate;
+	funnelstate->fs_workersReady = false;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * create expression context for node
+	 */
+	ExecAssignExprContext(estate, &funnelstate->ss.ps);
+
+	/*
+	 * initialize child expressions
+	 */
+	funnelstate->ss.ps.targetlist = (List *)
+		ExecInitExpr((Expr *) node->scan.plan.targetlist,
+					 (PlanState *) funnelstate);
+	funnelstate->ss.ps.qual = (List *)
+		ExecInitExpr((Expr *) node->scan.plan.qual,
+					 (PlanState *) funnelstate);
+
+	/*
+	 * tuple table initialization
+	 */
+	ExecInitResultTupleSlot(estate, &funnelstate->ss.ps);
+	ExecInitScanTupleSlot(estate, &funnelstate->ss);
+
+	/*
+	 * now initialize outer plan
+	 */
+	outerPlanState(funnelstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+
+	funnelstate->ss.ps.ps_TupFromTlist = false;
+
+	/*
+	 * Initialize result tuple type and projection info.
+	 */
+	ExecAssignResultTypeFromTL(&funnelstate->ss.ps);
+	ExecAssignProjectionInfo(&funnelstate->ss.ps, NULL);
+
+	return funnelstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecFunnel(node)
+ *
+ *		Scans the relation via multiple workers and returns
+ *		the next qualifying tuple.
+ * ----------------------------------------------------------------
+ */
+TupleTableSlot *
+ExecFunnel(FunnelState *node)
+{
+	int			i;
+	TupleTableSlot *slot;
+
+	/*
+	 * Initialize the parallel context and workers on first execution. We do
+	 * this on first execution rather than during node initialization, as it
+	 * needs to allocate a large dynamic shared memory segment, so it is
+	 * better to do so only if it is really needed.
+	 */
+	if (!node->pcxt)
+	{
+		EState	   *estate = node->ss.ps.state;
+		ExprContext *econtext = node->ss.ps.ps_ExprContext;
+		bool		any_worker_launched = false;
+		List	   *serialized_param_exec;
+
+		/*
+		 * Evaluate the InitPlan and pass the PARAM_EXEC params, so that
+		 * values can be shared with worker backends.  This differs from the
+		 * lazy evaluation of InitPlans used elsewhere: instead of sharing
+		 * the InitPlan with all the workers and letting them execute it, we
+		 * pass the values, which can be used directly by the worker
+		 * backends.
+		 */
+		serialized_param_exec = ExecAndFormSerializeParamExec(econtext,
+									   node->ss.ps.plan->lefttree->allParam);
+
+		/* Initialize the workers required to execute funnel node. */
+		InitializeParallelWorkers(node->ss.ps.lefttree,
+								  serialized_param_exec,
+								  estate,
+								  &node->inst_options_space,
+								  &node->buffer_usage_space,
+								  &node->responseq,
+								  &node->pcxt,
+							   ((Funnel *) (node->ss.ps.plan))->num_workers);
+
+		outerPlanState(node)->toc = node->pcxt->toc;
+
+		/*
+		 * Register backend workers. If the required number of workers is not
+		 * available, we perform the scan with the workers that are; if no
+		 * workers are available at all, the funnel node will just scan
+		 * locally.
+		 */
+		LaunchParallelWorkers(node->pcxt);
+
+		node->funnel = CreateTupleQueueFunnel();
+
+		for (i = 0; i < node->pcxt->nworkers; ++i)
+		{
+			if (node->pcxt->worker[i].bgwhandle)
+			{
+				shm_mq_set_handle((node->responseq)[i], node->pcxt->worker[i].bgwhandle);
+				RegisterTupleQueueOnFunnel(node->funnel, (node->responseq)[i]);
+				any_worker_launched = true;
+			}
+		}
+
+		if (any_worker_launched)
+			node->fs_workersReady = true;
+	}
+
+	slot = funnel_getnext(node);
+
+	if (TupIsNull(slot))
+	{
+		/*
+		 * Destroy the parallel context once we have fetched all the tuples.
+		 * This ensures that if the same statement needs Funnel nodes for
+		 * multiple parts of the statement, it won't accumulate a lot of DSM
+		 * segments, and the workers can be made available for use by other
+		 * parts of the statement.
+		 */
+		DestroyParallelSetupAndAccumStats(node);
+	}
+	return slot;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndFunnel
+ *
+ *		frees any storage allocated through C routines.
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndFunnel(FunnelState *node)
+{
+	Relation	relation;
+
+	relation = node->ss.ss_currentRelation;
+
+	/*
+	 * Free the exprcontext
+	 */
+	ExecFreeExprContext(&node->ss.ps);
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+	ExecEndNode(outerPlanState(node));
+
+	DestroyParallelSetupAndAccumStats(node);
+}
+
+/*
+ * funnel_getnext
+ *
+ * Get the next tuple from the shared memory queues.  This function
+ * is responsible for fetching tuples from all the queues associated
+ * with worker backends used in the funnel scan; if there is no
+ * data available from the queues or no worker is available, it
+ * fetches the data from the local node.
+ */
+TupleTableSlot *
+funnel_getnext(FunnelState *funnelstate)
+{
+	PlanState  *outerPlan;
+	TupleTableSlot *outerTupleSlot;
+	TupleTableSlot *slot;
+	HeapTuple	tup;
+
+	/*
+	 * We can use the projection info of the funnel for the tuples received
+	 * from worker backends, as currently in all cases the worker backends
+	 * send the projected tuple as required by the funnel node.
+	 */
+	slot = funnelstate->ss.ps.ps_ProjInfo->pi_slot;
+
+	while ((!funnelstate->all_workers_done && funnelstate->fs_workersReady) ||
+		   !funnelstate->local_scan_done)
+	{
+		if (!funnelstate->all_workers_done && funnelstate->fs_workersReady)
+		{
+			/* wait only if local scan is done */
+			tup = TupleQueueFunnelNext(funnelstate->funnel,
+									   !funnelstate->local_scan_done,
+									   &funnelstate->all_workers_done);
+
+			if (HeapTupleIsValid(tup))
+			{
+				ExecStoreTuple(tup,		/* tuple to store */
+							   slot,	/* slot to store in */
+							   InvalidBuffer,	/* buffer associated with this
+												 * tuple */
+							   true);	/* pfree this pointer if not from heap */
+
+				return slot;
+			}
+		}
+		if (!funnelstate->local_scan_done)
+		{
+			outerPlan = outerPlanState(funnelstate);
+
+			outerTupleSlot = ExecProcNode(outerPlan);
+
+			if (!TupIsNull(outerTupleSlot))
+				return outerTupleSlot;
+
+			funnelstate->local_scan_done = true;
+		}
+	}
+
+	return ExecClearTuple(slot);
+}
+
+/* ----------------------------------------------------------------
+ *		DestroyParallelSetupAndAccumStats
+ *
+ *		Destroy the setup for parallel workers.  Collect all the
+ *		stats after workers are stopped, else some work done by
+ *		workers won't be accounted.
+ * ----------------------------------------------------------------
+ */
+void
+DestroyParallelSetupAndAccumStats(FunnelState *node)
+{
+	if (node->pcxt)
+	{
+		/*
+		 * Ensure all workers have finished before destroying the parallel
+		 * context to ensure a clean exit.
+		 */
+		if (node->fs_workersReady)
+		{
+			DestroyTupleQueueFunnel(node->funnel);
+			node->funnel = NULL;
+			WaitForParallelWorkersToFinish(node->pcxt);
+		}
+
+		/*
+		 * Aggregate the buffer usage stats from all workers.  This is
+		 * required by external modules like pg_stat_statements.
+		 */
+		ExecAccumulateBufUsageInfo(node);
+
+		/*
+		 * Aggregate instrumentation information of all the backend workers
+		 * for Funnel node.  This has to be done before we destroy the
+		 * parallel context.
+		 */
+		if (node->ss.ps.state->es_instrument)
+			ExecAccumulateInstInfo(node);
+
+		/* destroy parallel context. */
+		DestroyParallelContext(node->pcxt);
+		node->pcxt = NULL;
+
+		node->fs_workersReady = false;
+		node->all_workers_done = false;
+		node->local_scan_done = false;
+	}
+}
+
+/* ----------------------------------------------------------------
+ *		ExecAccumulateInstInfo
+ *
+ *		Accumulate instrumentation information of all the workers
+ * ----------------------------------------------------------------
+ */
+void
+ExecAccumulateInstInfo(FunnelState *node)
+{
+	int			i;
+	Instrumentation *instrument_worker;
+	int			nworkers;
+	char	   *inst_info_workers;
+
+	if (node->pcxt)
+	{
+		nworkers = node->pcxt->nworkers;
+		inst_info_workers = node->inst_options_space;
+		for (i = 0; i < nworkers; i++)
+		{
+			instrument_worker = (Instrumentation *) (inst_info_workers + (i * sizeof(Instrumentation)));
+			InstrAggNode(node->ss.ps.instrument, instrument_worker);
+		}
+	}
+}
+
+/* ----------------------------------------------------------------
+ *		ExecAccumulateBufUsageInfo
+ *
+ *		Accumulate buffer usage information of all the workers
+ * ----------------------------------------------------------------
+ */
+void
+ExecAccumulateBufUsageInfo(FunnelState *node)
+{
+	int			i;
+	int			nworkers;
+	BufferUsage *buffer_usage_worker;
+	char	   *buffer_usage;
+
+	if (node->pcxt)
+	{
+		nworkers = node->pcxt->nworkers;
+		buffer_usage = node->buffer_usage_space;
+
+		for (i = 0; i < nworkers; i++)
+		{
+			buffer_usage_worker = &((BufferUsage *) buffer_usage)[i];
+			BufferUsageAdd(&pgBufferUsage, buffer_usage_worker);
+		}
+	}
+}
+
+/* ----------------------------------------------------------------
+ *						Join Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecReScanFunnel
+ *
+ *		Re-initializes the workers and rescans a relation via them.
+ * ----------------------------------------------------------------
+ */
+void
+ExecReScanFunnel(FunnelState *node)
+{
+	/*
+	 * Re-initialize the parallel context and workers to perform a rescan of
+	 * the relation.  We want to gracefully shut down all the workers so that
+	 * they can propagate any error or other information to the master
+	 * backend before dying.
+	 */
+	DestroyParallelSetupAndAccumStats(node);
+
+	ExecReScan(node->ss.ps.lefttree);
+}
diff --git a/src/backend/executor/nodeNestloop.c b/src/backend/executor/nodeNestloop.c
index e66bcda..50c2f39 100644
--- a/src/backend/executor/nodeNestloop.c
+++ b/src/backend/executor/nodeNestloop.c
@@ -144,6 +144,7 @@ ExecNestLoop(NestLoopState *node)
 			{
 				NestLoopParam *nlp = (NestLoopParam *) lfirst(lc);
 				int			paramno = nlp->paramno;
+				TupleDesc	tdesc = outerTupleSlot->tts_tupleDescriptor;
 				ParamExecData *prm;
 
 				prm = &(econtext->ecxt_param_exec_vals[paramno]);
@@ -154,6 +155,7 @@ ExecNestLoop(NestLoopState *node)
 				prm->value = slot_getattr(outerTupleSlot,
 										  nlp->paramval->varattno,
 										  &(prm->isnull));
+				prm->ptype = tdesc->attrs[nlp->paramval->varattno - 1]->atttypid;
 				/* Flag parameter value as changed */
 				innerPlan->chgParam = bms_add_member(innerPlan->chgParam,
 													 paramno);
diff --git a/src/backend/executor/nodeSubplan.c b/src/backend/executor/nodeSubplan.c
index 9eb4d63..4c33bcf 100644
--- a/src/backend/executor/nodeSubplan.c
+++ b/src/backend/executor/nodeSubplan.c
@@ -30,11 +30,14 @@
 #include <math.h>
 
 #include "access/htup_details.h"
+#include "catalog/pg_type.h"
 #include "executor/executor.h"
 #include "executor/nodeSubplan.h"
 #include "nodes/makefuncs.h"
+#include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
 #include "utils/array.h"
+#include "utils/datum.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 
@@ -281,12 +284,14 @@ ExecScanSubPlan(SubPlanState *node,
 	forboth(l, subplan->parParam, pvar, node->args)
 	{
 		int			paramid = lfirst_int(l);
+		ExprState  *exprstate = (ExprState *) lfirst(pvar);
 		ParamExecData *prm = &(econtext->ecxt_param_exec_vals[paramid]);
 
-		prm->value = ExecEvalExprSwitchContext((ExprState *) lfirst(pvar),
+		prm->value = ExecEvalExprSwitchContext(exprstate,
 											   econtext,
 											   &(prm->isnull),
 											   NULL);
+		prm->ptype = exprType((Node *) exprstate->expr);
 		planstate->chgParam = bms_add_member(planstate->chgParam, paramid);
 	}
 
@@ -399,6 +404,7 @@ ExecScanSubPlan(SubPlanState *node,
 			prmdata = &(econtext->ecxt_param_exec_vals[paramid]);
 			Assert(prmdata->execPlan == NULL);
 			prmdata->value = slot_getattr(slot, col, &(prmdata->isnull));
+			prmdata->ptype = tdesc->attrs[col - 1]->atttypid;
 			col++;
 		}
 
@@ -551,6 +557,7 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
 		 !TupIsNull(slot);
 		 slot = ExecProcNode(planstate))
 	{
+		TupleDesc	tdesc = slot->tts_tupleDescriptor;
 		int			col = 1;
 		ListCell   *plst;
 		bool		isnew;
@@ -568,6 +575,7 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
 			Assert(prmdata->execPlan == NULL);
 			prmdata->value = slot_getattr(slot, col,
 										  &(prmdata->isnull));
+			prmdata->ptype = tdesc->attrs[col - 1]->atttypid;
 			col++;
 		}
 		slot = ExecProject(node->projRight, NULL);
@@ -954,6 +962,7 @@ ExecSetParamPlan(SubPlanState *node, ExprContext *econtext)
 	ListCell   *l;
 	bool		found = false;
 	ArrayBuildStateAny *astate = NULL;
+	Oid			ptype;
 
 	if (subLinkType == ANY_SUBLINK ||
 		subLinkType == ALL_SUBLINK)
@@ -961,6 +970,8 @@ ExecSetParamPlan(SubPlanState *node, ExprContext *econtext)
 	if (subLinkType == CTE_SUBLINK)
 		elog(ERROR, "CTE subplans should not be executed via ExecSetParamPlan");
 
+	ptype = exprType((Node *) node->xprstate.expr);
+
 	/* Initialize ArrayBuildStateAny in caller's context, if needed */
 	if (subLinkType == ARRAY_SUBLINK)
 		astate = initArrayResultAny(subplan->firstColType,
@@ -983,12 +994,14 @@ ExecSetParamPlan(SubPlanState *node, ExprContext *econtext)
 	forboth(l, subplan->parParam, pvar, node->args)
 	{
 		int			paramid = lfirst_int(l);
+		ExprState  *exprstate = (ExprState *) lfirst(pvar);
 		ParamExecData *prm = &(econtext->ecxt_param_exec_vals[paramid]);
 
-		prm->value = ExecEvalExprSwitchContext((ExprState *) lfirst(pvar),
+		prm->value = ExecEvalExprSwitchContext(exprstate,
 											   econtext,
 											   &(prm->isnull),
 											   NULL);
+		prm->ptype = exprType((Node *) exprstate->expr);
 		planstate->chgParam = bms_add_member(planstate->chgParam, paramid);
 	}
 
@@ -1011,6 +1024,7 @@ ExecSetParamPlan(SubPlanState *node, ExprContext *econtext)
 
 			prm->execPlan = NULL;
 			prm->value = BoolGetDatum(true);
+			prm->ptype = ptype;
 			prm->isnull = false;
 			found = true;
 			break;
@@ -1062,6 +1076,7 @@ ExecSetParamPlan(SubPlanState *node, ExprContext *econtext)
 			prm->execPlan = NULL;
 			prm->value = heap_getattr(node->curTuple, i, tdesc,
 									  &(prm->isnull));
+			prm->ptype = tdesc->attrs[i - 1]->atttypid;
 			i++;
 		}
 	}
@@ -1084,6 +1099,7 @@ ExecSetParamPlan(SubPlanState *node, ExprContext *econtext)
 											true);
 		prm->execPlan = NULL;
 		prm->value = node->curArray;
+		prm->ptype = ptype;
 		prm->isnull = false;
 	}
 	else if (!found)
@@ -1096,6 +1112,7 @@ ExecSetParamPlan(SubPlanState *node, ExprContext *econtext)
 
 			prm->execPlan = NULL;
 			prm->value = BoolGetDatum(false);
+			prm->ptype = ptype;
 			prm->isnull = false;
 		}
 		else
@@ -1108,6 +1125,7 @@ ExecSetParamPlan(SubPlanState *node, ExprContext *econtext)
 
 				prm->execPlan = NULL;
 				prm->value = (Datum) 0;
+				prm->ptype = VOIDOID;
 				prm->isnull = true;
 			}
 		}
@@ -1238,3 +1256,47 @@ ExecAlternativeSubPlan(AlternativeSubPlanState *node,
 					   isNull,
 					   isDone);
 }
+
+/*
+ * ExecAndFormSerializeParamExec
+ *
+ * Execute the subplan stored in each PARAM_EXEC param if it has not been
+ * executed yet, and form the serialized structure required for passing
+ * the values to worker backends.
+ */
+List *
+ExecAndFormSerializeParamExec(ExprContext *econtext, Bitmapset *params)
+{
+	List	   *lparam = NIL;
+	SerializedParamExecData *sparamdata;
+	ParamExecData *prm;
+	int			paramid;
+
+	paramid = -1;
+	while ((paramid = bms_next_member(params, paramid)) >= 0)
+	{
+		/*
+		 * PARAM_EXEC params (internal executor parameters) are stored in the
+		 * ecxt_param_exec_vals array, and can be accessed by array index.
+		 */
+		sparamdata = palloc0(sizeof(SerializedParamExecData));
+
+		prm = &(econtext->ecxt_param_exec_vals[paramid]);
+		if (prm->execPlan != NULL)
+		{
+			/* Parameter not evaluated yet, so go do it */
+			ExecSetParamPlan(prm->execPlan, econtext);
+			/* ExecSetParamPlan should have processed this param... */
+			Assert(prm->execPlan == NULL);
+		}
+
+		sparamdata->paramid = paramid;
+		sparamdata->ptype = prm->ptype;
+		sparamdata->value = prm->value;
+		sparamdata->isnull = prm->isnull;
+
+		lparam = lappend(lparam, sparamdata);
+	}
+
+	return lparam;
+}
diff --git a/src/backend/executor/spi.c b/src/backend/executor/spi.c
index 300401e..a60f228 100644
--- a/src/backend/executor/spi.c
+++ b/src/backend/executor/spi.c
@@ -1774,7 +1774,7 @@ spi_dest_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
  *		store tuple retrieved by Executor into SPITupleTable
  *		of current SPI procedure
  */
-void
+bool
 spi_printtup(TupleTableSlot *slot, DestReceiver *self)
 {
 	SPITupleTable *tuptable;
@@ -1809,6 +1809,8 @@ spi_printtup(TupleTableSlot *slot, DestReceiver *self)
 	(tuptable->free)--;
 
 	MemoryContextSwitchTo(oldcxt);
+
+	return true;
 }
 
 /*
diff --git a/src/backend/executor/tqueue.c b/src/backend/executor/tqueue.c
index d0edf4e..4e27b30 100644
--- a/src/backend/executor/tqueue.c
+++ b/src/backend/executor/tqueue.c
@@ -41,14 +41,24 @@ struct TupleQueueFunnel
 /*
  * Receive a tuple.
  */
-static void
+static bool
 tqueueReceiveSlot(TupleTableSlot *slot, DestReceiver *self)
 {
 	TQueueDestReceiver *tqueue = (TQueueDestReceiver *) self;
 	HeapTuple	tuple;
+	shm_mq_result result;
 
 	tuple = ExecMaterializeSlot(slot);
-	shm_mq_send(tqueue->handle, tuple->t_len, tuple->t_data, false);
+	result = shm_mq_send(tqueue->handle, tuple->t_len, tuple->t_data, false);
+
+	if (result == SHM_MQ_DETACHED)
+		return false;
+	else if (result != SHM_MQ_SUCCESS)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("unable to send tuples")));
+
+	return true;
 }
 
 /*
@@ -82,7 +92,7 @@ tqueueDestroyReceiver(DestReceiver *self)
  * Create a DestReceiver that writes tuples to a tuple queue.
  */
 DestReceiver *
-CreateTupleQueueDestReceiver(shm_mq_handle *handle)
+CreateTupleQueueDestReceiver(void)
 {
 	TQueueDestReceiver *self;
 
@@ -93,12 +103,25 @@ CreateTupleQueueDestReceiver(shm_mq_handle *handle)
 	self->pub.rShutdown = tqueueShutdownReceiver;
 	self->pub.rDestroy = tqueueDestroyReceiver;
 	self->pub.mydest = DestTupleQueue;
-	self->handle = handle;
+
+	/* private fields will be set by SetTupleQueueDestReceiverParams */
 
 	return (DestReceiver *) self;
 }
 
 /*
+ * Set parameters for a TupleQueueDestReceiver
+ */
+void
+SetTupleQueueDestReceiverParams(DestReceiver *self,
+								shm_mq_handle *handle)
+{
+	TQueueDestReceiver *myState = (TQueueDestReceiver *) self;
+
+	myState->handle = handle;
+}
+
+/*
  * Create a tuple queue funnel.
  */
 TupleQueueFunnel *
diff --git a/src/backend/executor/tstoreReceiver.c b/src/backend/executor/tstoreReceiver.c
index c1fdeb7..b0862ae 100644
--- a/src/backend/executor/tstoreReceiver.c
+++ b/src/backend/executor/tstoreReceiver.c
@@ -37,8 +37,8 @@ typedef struct
 } TStoreState;
 
 
-static void tstoreReceiveSlot_notoast(TupleTableSlot *slot, DestReceiver *self);
-static void tstoreReceiveSlot_detoast(TupleTableSlot *slot, DestReceiver *self);
+static bool tstoreReceiveSlot_notoast(TupleTableSlot *slot, DestReceiver *self);
+static bool tstoreReceiveSlot_detoast(TupleTableSlot *slot, DestReceiver *self);
 
 
 /*
@@ -90,19 +90,21 @@ tstoreStartupReceiver(DestReceiver *self, int operation, TupleDesc typeinfo)
  * Receive a tuple from the executor and store it in the tuplestore.
  * This is for the easy case where we don't have to detoast.
  */
-static void
+static bool
 tstoreReceiveSlot_notoast(TupleTableSlot *slot, DestReceiver *self)
 {
 	TStoreState *myState = (TStoreState *) self;
 
 	tuplestore_puttupleslot(myState->tstore, slot);
+
+	return true;
 }
 
 /*
  * Receive a tuple from the executor and store it in the tuplestore.
  * This is for the case where we have to detoast any toasted values.
  */
-static void
+static bool
 tstoreReceiveSlot_detoast(TupleTableSlot *slot, DestReceiver *self)
 {
 	TStoreState *myState = (TStoreState *) self;
@@ -152,6 +154,8 @@ tstoreReceiveSlot_detoast(TupleTableSlot *slot, DestReceiver *self)
 	/* And release any temporary detoasted values */
 	for (i = 0; i < nfree; i++)
 		pfree(DatumGetPointer(myState->tofree[i]));
+
+	return true;
 }
 
 /*
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 62355aa..ec605bd 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -382,6 +382,27 @@ _copySampleScan(const SampleScan *from)
 }
 
 /*
+ * _copyFunnel
+ */
+static Funnel *
+_copyFunnel(const Funnel *from)
+{
+	Funnel	   *newnode = makeNode(Funnel);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopyScanFields((const Scan *) from, (Scan *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(num_workers);
+
+	return newnode;
+}
+
+/*
  * _copyIndexScan
  */
 static IndexScan *
@@ -4240,6 +4261,9 @@ copyObject(const void *from)
 		case T_SampleScan:
 			retval = _copySampleScan(from);
 			break;
+		case T_Funnel:
+			retval = _copyFunnel(from);
+			break;
 		case T_IndexScan:
 			retval = _copyIndexScan(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index c91273c..bc9b481 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -458,6 +458,16 @@ _outSampleScan(StringInfo str, const SampleScan *node)
 }
 
 static void
+_outFunnel(StringInfo str, const Funnel *node)
+{
+	WRITE_NODE_TYPE("FUNNEL");
+
+	_outScanInfo(str, (const Scan *) node);
+
+	WRITE_UINT_FIELD(num_workers);
+}
+
+static void
 _outIndexScan(StringInfo str, const IndexScan *node)
 {
 	WRITE_NODE_TYPE("INDEXSCAN");
@@ -3008,6 +3018,9 @@ _outNode(StringInfo str, const void *obj)
 			case T_SampleScan:
 				_outSampleScan(str, obj);
 				break;
+			case T_Funnel:
+				_outFunnel(str, obj);
+				break;
 			case T_IndexScan:
 				_outIndexScan(str, obj);
 				break;
diff --git a/src/backend/nodes/params.c b/src/backend/nodes/params.c
index fb803f8..5751386 100644
--- a/src/backend/nodes/params.c
+++ b/src/backend/nodes/params.c
@@ -16,9 +16,22 @@
 #include "postgres.h"
 
 #include "nodes/params.h"
+#include "storage/shmem.h"
 #include "utils/datum.h"
 #include "utils/lsyscache.h"
 
+/*
+ * For each bind parameter, pass this structure, followed by the parameter
+ * value unless the parameter is pass-by-value or NULL.
+ */
+typedef struct SerializedParamExternData
+{
+	Datum		value;			/* pass-by-val are directly stored */
+	Size		length;			/* length of parameter value */
+	bool		isnull;			/* is it NULL? */
+	uint16		pflags;			/* flag bits, same as in original Param */
+	Oid			ptype;			/* parameter's datatype, or 0 */
+}	SerializedParamExternData;
 
 /*
  * Copy a ParamListInfo structure.
@@ -73,3 +86,357 @@ copyParamList(ParamListInfo from)
 
 	return retval;
 }
+
+/*
+ * Estimate the amount of space required to serialize the bound
+ * parameters.
+ */
+Size
+EstimateBoundParametersSpace(ParamListInfo paramInfo)
+{
+	Size		size;
+	int			i;
+
+	/* Add space required for saving numParams */
+	size = sizeof(int);
+
+	if (paramInfo)
+	{
+		/* Add space required for saving the param data */
+		for (i = 0; i < paramInfo->numParams; i++)
+		{
+			/*
+			 * for each parameter, calculate the size of fixed part of
+			 * parameter (SerializedParamExternData) and length of parameter
+			 * value.
+			 */
+			ParamExternData *oprm;
+			int16		typLen;
+			bool		typByVal;
+			Size		length;
+
+			length = sizeof(SerializedParamExternData);
+
+			oprm = &paramInfo->params[i];
+
+			get_typlenbyval(oprm->ptype, &typLen, &typByVal);
+
+			/*
+			 * pass-by-value parameters are directly stored in
+			 * SerializedParamExternData, so no need of additional space for
+			 * them.
+			 */
+			if (!(typByVal || oprm->isnull))
+			{
+				length += datumGetSize(oprm->value, typByVal, typLen);
+				size = add_size(size, length);
+
+				/* Allow space for terminating zero-byte */
+				size = add_size(size, 1);
+			}
+			else
+				size = add_size(size, length);
+		}
+	}
+
+	return size;
+}
+
+/*
+ * Serialize the bind parameters into the memory, beginning at start_address.
+ * maxsize should be at least as large as the value returned by
+ * EstimateBoundParametersSpace.
+ */
+void
+SerializeBoundParams(ParamListInfo paramInfo, Size maxsize, char *start_address)
+{
+	char	   *curptr;
+	SerializedParamExternData *retval;
+	int			i;
+
+	/*
+	 * First, store the number of bind parameters.  If there are none, there
+	 * is nothing more to store.
+	 */
+	if (paramInfo && paramInfo->numParams > 0)
+		*(int *) start_address = paramInfo->numParams;
+	else
+	{
+		*(int *) start_address = 0;
+		return;
+	}
+	curptr = start_address + sizeof(int);
+
+
+	for (i = 0; i < paramInfo->numParams; i++)
+	{
+		ParamExternData *oprm;
+		int16		typLen;
+		bool		typByVal;
+		Size		datumlength,
+					length;
+		const char *s;
+
+		Assert(curptr <= start_address + maxsize);
+		retval = (SerializedParamExternData *) curptr;
+		oprm = &paramInfo->params[i];
+
+		retval->isnull = oprm->isnull;
+		retval->pflags = oprm->pflags;
+		retval->ptype = oprm->ptype;
+		retval->value = oprm->value;
+
+		curptr = curptr + sizeof(SerializedParamExternData);
+
+		if (retval->isnull)
+			continue;
+
+		get_typlenbyval(oprm->ptype, &typLen, &typByVal);
+
+		if (!typByVal)
+		{
+			datumlength = datumGetSize(oprm->value, typByVal, typLen);
+			s = (char *) DatumGetPointer(oprm->value);
+			memcpy(curptr, s, datumlength);
+			length = datumlength;
+			curptr[length] = '\0';
+			retval->length = length;
+			curptr += length + 1;
+		}
+	}
+}
+
+/*
+ * RestoreBoundParams
+ *		Restore bind parameters from the specified address.
+ *
+ * The params are palloc'd in CurrentMemoryContext.
+ */
+ParamListInfo
+RestoreBoundParams(char *start_address)
+{
+	ParamListInfo retval;
+	Size		size;
+	int			num_params,
+				i;
+	char	   *curptr;
+
+	num_params = *(int *) start_address;
+
+	if (num_params <= 0)
+		return NULL;
+
+	size = offsetof(ParamListInfoData, params) +
+		num_params * sizeof(ParamExternData);
+	retval = (ParamListInfo) palloc(size);
+	retval->paramFetch = NULL;
+	retval->paramFetchArg = NULL;
+	retval->parserSetup = NULL;
+	retval->parserSetupArg = NULL;
+	retval->numParams = num_params;
+
+	curptr = start_address + sizeof(int);
+
+	for (i = 0; i < num_params; i++)
+	{
+		SerializedParamExternData *nprm;
+		char	   *s;
+		int16		typLen;
+		bool		typByVal;
+
+		nprm = (SerializedParamExternData *) curptr;
+
+		/* copy the parameter info */
+		retval->params[i].isnull = nprm->isnull;
+		retval->params[i].pflags = nprm->pflags;
+		retval->params[i].ptype = nprm->ptype;
+		retval->params[i].value = nprm->value;
+
+		curptr = curptr + sizeof(SerializedParamExternData);
+
+		if (nprm->isnull)
+			continue;
+
+		get_typlenbyval(nprm->ptype, &typLen, &typByVal);
+
+		if (!typByVal)
+		{
+			s = palloc(nprm->length + 1);
+			memcpy(s, curptr, nprm->length + 1);
+			retval->params[i].value = PointerGetDatum(s);
+
+			curptr += nprm->length + 1;
+		}
+	}
+
+	return retval;
+}
+
+/*
+ * Estimate the amount of space required to serialize the PARAM_EXEC
+ * parameters.
+ */
+Size
+EstimateExecParametersSpace(List *serialized_param_exec_vals)
+{
+	Size		size;
+	ListCell   *lparam;
+
+	/*
+	 * Add space required for saving the number of PARAM_EXEC parameters
+	 * that need to be serialized.
+	 */
+	size = sizeof(int);
+
+	foreach(lparam, serialized_param_exec_vals)
+	{
+		int16		typLen;
+		bool		typByVal;
+		Size		length;
+		SerializedParamExecData *param_val = (SerializedParamExecData *) lfirst(lparam);
+
+		length = sizeof(SerializedParamExecData);
+
+		get_typlenbyval(param_val->ptype, &typLen, &typByVal);
+
+		/*
+		 * Pass-by-value parameters are stored directly in
+		 * SerializedParamExecData, so they need no additional space.
+		 */
+		if (!(typByVal || param_val->isnull))
+		{
+			length += datumGetSize(param_val->value, typByVal, typLen);
+			size = add_size(size, length);
+
+			/* Allow space for terminating zero-byte */
+			size = add_size(size, 1);
+		}
+		else
+			size = add_size(size, length);
+	}
+
+	return size;
+}
+
+/*
+ * Serialize the PARAM_EXEC parameters into the memory, beginning at
+ * start_address.  maxsize should be at least as large as the value
+ * returned by EstimateExecParametersSpace.
+ */
+void
+SerializeExecParams(List *serialized_param_exec_vals, Size maxsize,
+					char *start_address)
+{
+	char	   *curptr;
+	SerializedParamExecData *retval;
+	ListCell   *lparam;
+
+	/*
+	 * First, store the number of PARAM_EXEC parameters that need to be
+	 * serialized.
+	 */
+	if (serialized_param_exec_vals)
+		*(int *) start_address = list_length(serialized_param_exec_vals);
+	else
+	{
+		*(int *) start_address = 0;
+		return;
+	}
+
+	curptr = start_address + sizeof(int);
+
+	foreach(lparam, serialized_param_exec_vals)
+	{
+		int16		typLen;
+		bool		typByVal;
+		Size		datumlength,
+					length;
+		const char *s;
+		SerializedParamExecData *param_val = (SerializedParamExecData *) lfirst(lparam);
+
+		retval = (SerializedParamExecData *) curptr;
+
+		retval->paramid = param_val->paramid;
+		retval->value = param_val->value;
+		retval->isnull = param_val->isnull;
+		retval->ptype = param_val->ptype;
+
+		curptr = curptr + sizeof(SerializedParamExecData);
+
+		if (retval->isnull)
+			continue;
+
+		get_typlenbyval(retval->ptype, &typLen, &typByVal);
+
+		if (!typByVal)
+		{
+			datumlength = datumGetSize(retval->value, typByVal, typLen);
+			s = (char *) DatumGetPointer(retval->value);
+			memcpy(curptr, s, datumlength);
+			length = datumlength;
+			curptr[length] = '\0';
+			retval->length = length;
+			curptr += length + 1;
+		}
+	}
+}
+
+/*
+ * RestoreExecParams
+ *		Restore PARAM_EXEC parameters from the specified address.
+ *
+ * The params are palloc'd in CurrentMemoryContext.
+ */
+List *
+RestoreExecParams(char *start_address)
+{
+	List	   *lparamexecvals = NIL;
+	int			num_params,
+				i;
+	char	   *curptr;
+
+	num_params = *(int *) start_address;
+
+	if (num_params <= 0)
+		return NIL;
+
+	curptr = start_address + sizeof(int);
+
+	for (i = 0; i < num_params; i++)
+	{
+		SerializedParamExecData *nprm;
+		SerializedParamExecData *outparam;
+		char	   *s;
+		int16		typLen;
+		bool		typByVal;
+
+		nprm = (SerializedParamExecData *) curptr;
+
+		outparam = palloc0(sizeof(SerializedParamExecData));
+
+		/* copy the parameter info */
+		outparam->isnull = nprm->isnull;
+		outparam->value = nprm->value;
+		outparam->paramid = nprm->paramid;
+		outparam->ptype = nprm->ptype;
+
+		curptr = curptr + sizeof(SerializedParamExecData);
+
+		get_typlenbyval(nprm->ptype, &typLen, &typByVal);
+
+		/* a by-reference value follows the fixed part unless the param is NULL */
+		if (!nprm->isnull && !typByVal)
+		{
+			s = palloc(nprm->length + 1);
+			memcpy(s, curptr, nprm->length + 1);
+			outparam->value = PointerGetDatum(s);
+
+			curptr += nprm->length + 1;
+		}
+
+		/* append NULL params too, so that every paramid reaches the worker */
+		lparamexecvals = lappend(lparamexecvals, outparam);
+	}
+
+	return lparamexecvals;
+}
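
For readers following the serialization layout above, here is a minimal standalone sketch of the same idea: a count, then for each parameter a fixed-size header, followed by the payload and a terminating zero byte only when the value is by-reference and not NULL. It is not code from the patch. DemoHeader, DemoParam, and the demo_* functions are hypothetical stand-ins, the byval flag replaces the get_typlenbyval() lookup, and malloc replaces palloc.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the fixed part of a serialized parameter. */
typedef struct DemoHeader
{
	long		value;		/* by-value payload, stored inline */
	size_t		length;		/* payload bytes that follow (by-reference only) */
	int			isnull;		/* is it NULL? */
	int			byval;		/* assumption: replaces get_typlenbyval() */
} DemoHeader;

typedef struct DemoParam
{
	DemoHeader	hdr;
	const char *data;		/* by-reference payload, or NULL */
} DemoParam;

/* Space needed: a count, one header per param, plus payload and a zero byte. */
static size_t
demo_estimate(const DemoParam *params, int n)
{
	size_t		size = sizeof(int);
	int			i;

	for (i = 0; i < n; i++)
	{
		size += sizeof(DemoHeader);
		if (!params[i].hdr.byval && !params[i].hdr.isnull)
			size += params[i].hdr.length + 1;
	}
	return size;
}

/* Write the params starting at buf; return the number of bytes used. */
static size_t
demo_serialize(const DemoParam *params, int n, char *buf)
{
	char	   *cur = buf;
	int			i;

	memcpy(cur, &n, sizeof(int));
	cur += sizeof(int);
	for (i = 0; i < n; i++)
	{
		/* memcpy does not require cur to be aligned for DemoHeader */
		memcpy(cur, &params[i].hdr, sizeof(DemoHeader));
		cur += sizeof(DemoHeader);
		if (!params[i].hdr.byval && !params[i].hdr.isnull)
		{
			memcpy(cur, params[i].data, params[i].hdr.length);
			cur[params[i].hdr.length] = '\0';
			cur += params[i].hdr.length + 1;
		}
	}
	return (size_t) (cur - buf);
}

/* Read up to max params back; return how many were restored, or -1. */
static int
demo_restore(const char *buf, DemoParam *out, int max)
{
	const char *cur = buf;
	int			n;
	int			i;

	memcpy(&n, cur, sizeof(int));
	cur += sizeof(int);
	for (i = 0; i < n && i < max; i++)
	{
		memcpy(&out[i].hdr, cur, sizeof(DemoHeader));
		cur += sizeof(DemoHeader);
		out[i].data = NULL;
		if (!out[i].hdr.byval && !out[i].hdr.isnull)
		{
			char	   *copy = malloc(out[i].hdr.length + 1);

			if (copy == NULL)
				return -1;
			memcpy(copy, cur, out[i].hdr.length + 1);
			out[i].data = copy;
			cur += out[i].hdr.length + 1;
		}
	}
	return i;
}
```

The sketch copies each header with memcpy instead of casting the cursor to a struct pointer. That is a design choice of the sketch: after a variable-length payload the cursor is not necessarily aligned for the header type.
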
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index d107d76..0a40f6c 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -11,6 +11,8 @@
  *	cpu_tuple_cost		Cost of typical CPU time to process a tuple
  *	cpu_index_tuple_cost  Cost of typical CPU time to process an index tuple
  *	cpu_operator_cost	Cost of CPU time to execute an operator or function
+ *	cpu_tuple_comm_cost	Cost of CPU time to pass a tuple from worker to master backend
+ *	parallel_setup_cost	Cost of setting up shared memory for parallelism
  *
  * We expect that the kernel will typically do some amount of read-ahead
  * optimization; this in conjunction with seek costs means that seq_page_cost
@@ -102,11 +104,15 @@ double		random_page_cost = DEFAULT_RANDOM_PAGE_COST;
 double		cpu_tuple_cost = DEFAULT_CPU_TUPLE_COST;
 double		cpu_index_tuple_cost = DEFAULT_CPU_INDEX_TUPLE_COST;
 double		cpu_operator_cost = DEFAULT_CPU_OPERATOR_COST;
+double		cpu_tuple_comm_cost = DEFAULT_CPU_TUPLE_COMM_COST;
+double		parallel_setup_cost = DEFAULT_PARALLEL_SETUP_COST;
 
 int			effective_cache_size = DEFAULT_EFFECTIVE_CACHE_SIZE;
 
 Cost		disable_cost = 1.0e10;
 
+int			parallel_seqscan_degree = 0;
+
 bool		enable_seqscan = true;
 bool		enable_indexscan = true;
 bool		enable_indexonlyscan = true;
@@ -290,6 +296,42 @@ cost_samplescan(Path *path, PlannerInfo *root,
 }
 
 /*
+ * cost_funnel
+ *	  Determines and returns the cost of funnel path.
+ *
+ * 'baserel' is the relation to be scanned
+ * 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
+ */
+void
+cost_funnel(FunnelPath *path, PlannerInfo *root,
+			RelOptInfo *baserel, ParamPathInfo *param_info)
+{
+	Cost		startup_cost = 0;
+	Cost		run_cost = 0;
+
+	/* Should only be applied to base relations */
+	Assert(baserel->relid > 0);
+	Assert(baserel->rtekind == RTE_RELATION);
+
+	/* Mark the path with the correct row estimate */
+	if (param_info)
+		path->path.rows = param_info->ppi_rows;
+	else
+		path->path.rows = baserel->rows;
+
+	startup_cost = path->subpath->startup_cost;
+
+	run_cost = path->subpath->total_cost - path->subpath->startup_cost;
+
+	/* Parallel setup and communication cost. */
+	startup_cost += parallel_setup_cost;
+	run_cost += cpu_tuple_comm_cost * baserel->tuples;
+
+	path->path.startup_cost = startup_cost;
+	path->path.total_cost = (startup_cost + run_cost);
+}
+
+/*
  * cost_index
  *	  Determines and returns the cost of scanning a relation using an index.
  *
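
As a worked illustration of the arithmetic in cost_funnel() above, the sketch below reproduces the same formula outside the planner. The funnel takes over the subpath's startup and run cost, then adds a one-time setup cost and a per-tuple communication cost. DemoCost and demo_cost_funnel are hypothetical names used only here; the real function works on FunnelPath and RelOptInfo.

```c
#include <assert.h>

/* Hypothetical stand-in for the startup/total cost pair carried by a Path. */
typedef struct DemoCost
{
	double		startup_cost;
	double		total_cost;
} DemoCost;

/*
 * Same arithmetic as cost_funnel(): inherit the subpath's costs, then add
 * the parallel setup cost to startup and the per-tuple communication cost
 * (multiplied by the relation's tuple count) to the run cost.
 */
static DemoCost
demo_cost_funnel(DemoCost subpath, double rel_tuples,
				 double setup_cost, double tuple_comm_cost)
{
	DemoCost	result;
	double		startup_cost = subpath.startup_cost + setup_cost;
	double		run_cost = (subpath.total_cost - subpath.startup_cost)
		+ tuple_comm_cost * rel_tuples;

	result.startup_cost = startup_cost;
	result.total_cost = startup_cost + run_cost;
	return result;
}
```

With the values shown in postgresql.conf.sample in this patch (cpu_tuple_comm_cost = 0.1, parallel_setup_cost = 1000.0), a subpath with startup cost 0 and total cost 1500 over a 10000-tuple relation gives a funnel startup cost of 1000 and a total cost of 3500.
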
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 404c6f5..e987922 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -60,6 +60,8 @@ static SeqScan *create_seqscan_plan(PlannerInfo *root, Path *best_path,
 					List *tlist, List *scan_clauses);
 static SampleScan *create_samplescan_plan(PlannerInfo *root, Path *best_path,
 					   List *tlist, List *scan_clauses);
+static Funnel *create_funnel_plan(PlannerInfo *root,
+				   FunnelPath *best_path);
 static Scan *create_indexscan_plan(PlannerInfo *root, IndexPath *best_path,
 					  List *tlist, List *scan_clauses, bool indexonly);
 static BitmapHeapScan *create_bitmap_scan_plan(PlannerInfo *root,
@@ -104,6 +106,9 @@ static void copy_plan_costsize(Plan *dest, Plan *src);
 static SeqScan *make_seqscan(List *qptlist, List *qpqual, Index scanrelid);
 static SampleScan *make_samplescan(List *qptlist, List *qpqual, Index scanrelid,
 				TableSampleClause *tsc);
+static Funnel *make_funnel(List *qptlist, List *qpqual,
+			Index scanrelid, int nworkers,
+			Plan *subplan);
 static IndexScan *make_indexscan(List *qptlist, List *qpqual, Index scanrelid,
 			   Oid indexid, List *indexqual, List *indexqualorig,
 			   List *indexorderby, List *indexorderbyorig,
@@ -273,6 +278,10 @@ create_plan_recurse(PlannerInfo *root, Path *best_path)
 			plan = create_unique_plan(root,
 									  (UniquePath *) best_path);
 			break;
+		case T_Funnel:
+			plan = (Plan *) create_funnel_plan(root,
+											   (FunnelPath *) best_path);
+			break;
 		default:
 			elog(ERROR, "unrecognized node type: %d",
 				 (int) best_path->pathtype);
@@ -560,6 +569,7 @@ disuse_physical_tlist(PlannerInfo *root, Plan *plan, Path *path)
 	{
 		case T_SeqScan:
 		case T_SampleScan:
+		case T_Funnel:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -1194,6 +1204,66 @@ create_samplescan_plan(PlannerInfo *root, Path *best_path,
 }
 
 /*
+ * create_funnel_plan
+ *
+ * Returns a funnel plan for the base relation scanned by
+ * 'best_path'.
+ */
+static Funnel *
+create_funnel_plan(PlannerInfo *root, FunnelPath *best_path)
+{
+	Funnel	   *funnel_plan;
+	Plan	   *subplan;
+	List	   *tlist;
+	RelOptInfo *rel = best_path->path.parent;
+	Index		scan_relid = best_path->path.parent->relid;
+
+	/*
+	 * For table scans, rather than using the relation targetlist (which is
+	 * only those Vars actually needed by the query), we prefer to generate a
+	 * tlist containing all Vars in order.  This will allow the executor to
+	 * optimize away projection of the table tuples, if possible.  (Note that
+	 * planner.c may replace the tlist we generate here, forcing projection to
+	 * occur.)
+	 */
+	if (use_physical_tlist(root, rel))
+	{
+		tlist = build_physical_tlist(root, rel);
+		/* if fail because of dropped cols, use regular method */
+		if (tlist == NIL)
+			tlist = build_path_tlist(root, &best_path->path);
+	}
+	else
+	{
+		tlist = build_path_tlist(root, &best_path->path);
+	}
+
+	/* it should be a base rel... */
+	Assert(scan_relid > 0);
+	Assert(best_path->path.parent->rtekind == RTE_RELATION);
+
+	subplan = create_plan_recurse(root, best_path->subpath);
+
+	/*
+	 * The quals for the subplan and the top-level plan are the same: either
+	 * all quals are pushed down to the subplan (partialseqscan plan) or a
+	 * parallel plan is not chosen.
+	 */
+	funnel_plan = make_funnel(tlist,
+							  subplan->qual,
+							  scan_relid,
+							  best_path->num_workers,
+							  subplan);
+
+	copy_path_costsize(&funnel_plan->scan.plan, &best_path->path);
+
+	/* use parallel mode for parallel plans. */
+	root->glob->parallelModeNeeded = true;
+
+	return funnel_plan;
+}
+
+/*
  * create_indexscan_plan
  *	  Returns an indexscan plan for the base relation scanned by 'best_path'
  *	  with restriction clauses 'scan_clauses' and targetlist 'tlist'.
@@ -3462,6 +3532,27 @@ make_samplescan(List *qptlist,
 	return node;
 }
 
+static Funnel *
+make_funnel(List *qptlist,
+			List *qpqual,
+			Index scanrelid,
+			int nworkers,
+			Plan *subplan)
+{
+	Funnel	   *node = makeNode(Funnel);
+	Plan	   *plan = &node->scan.plan;
+
+	/* cost should be inserted by caller */
+	plan->targetlist = qptlist;
+	plan->qual = qpqual;
+	plan->lefttree = subplan;
+	plan->righttree = NULL;
+	node->scan.scanrelid = scanrelid;
+	node->num_workers = nworkers;
+
+	return node;
+}
+
 static IndexScan *
 make_indexscan(List *qptlist,
 			   List *qpqual,
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 06be922..efad32b 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -342,6 +342,53 @@ standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
 	return result;
 }
 
+PlannedStmt *
+create_parallel_worker_plannedstmt(Plan *plan,
+								   List *rangetable,
+								   int num_exec_params)
+{
+	PlannedStmt *result;
+	ListCell   *tlist;
+
+	/*
+	 * Avoid removing junk entries in worker as those are required by upper
+	 * nodes in master backend.
+	 */
+	foreach(tlist, plan->targetlist)
+	{
+		TargetEntry *tle = (TargetEntry *) lfirst(tlist);
+
+		tle->resjunk = false;
+	}
+
+	/* build the PlannedStmt result */
+	result = makeNode(PlannedStmt);
+
+	result->commandType = CMD_SELECT;
+	result->queryId = 0;
+	result->hasReturning = false;
+	result->hasModifyingCTE = false;
+	result->canSetTag = true;
+	result->transientPlan = false;
+	result->planTree = plan;
+	result->rtable = rangetable;
+	result->resultRelations = NIL;
+	result->utilityStmt = NULL;
+	result->subplans = NIL;
+	result->rewindPlanIDs = NULL;
+	result->rowMarks = NIL;
+	result->nParamExec = num_exec_params;
+
+	/*
+	 * Don't bother to set parameters used for invalidation as worker backend
+	 * plans are not saved, so can't be invalidated.
+	 */
+	result->relationOids = NIL;
+	result->invalItems = NIL;
+	result->hasRowSecurity = false;
+
+	return result;
+}
 
 /*--------------------
  * subquery_planner
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index daeb584..2cff5a9 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -465,6 +465,26 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 					fix_scan_expr(root, (Node *) splan->tablesample, rtoffset);
 			}
 			break;
+		case T_Funnel:
+			{
+				Funnel	   *splan = (Funnel *) plan;
+
+				/*
+				 * The target list for the lefttree of the funnel plan must be
+				 * the same as for the funnel scan, since both nodes need to
+				 * produce the same projection.  Do this assignment before
+				 * fixing references, as references are fixed separately for
+				 * the lefttree node.
+				 */
+				splan->scan.plan.lefttree->targetlist = splan->scan.plan.targetlist;
+
+				splan->scan.scanrelid += rtoffset;
+				splan->scan.plan.targetlist =
+					fix_scan_list(root, splan->scan.plan.targetlist, rtoffset);
+				splan->scan.plan.qual =
+					fix_scan_list(root, splan->scan.plan.qual, rtoffset);
+			}
+			break;
 		case T_IndexScan:
 			{
 				IndexScan  *splan = (IndexScan *) plan;
@@ -2265,6 +2285,40 @@ fix_opfuncids_walker(Node *node, void *context)
 }
 
 /*
+ * fix_node_funcids
+ *		Set the opfuncid (procedure OID) in the OpExpr nodes
+ *		of a plan tree.
+ *
+ * This is needed mainly to fix the opfuncid in plan tree nodes
+ * after a worker backend has read the planned statement.
+ * Currently only a limited set of nodes can be executed by a
+ * worker backend; this API can be extended as its usage grows
+ * in future.
+ */
+void
+fix_node_funcids(Plan *node)
+{
+	/*
+	 * do nothing when we get to the end of a leaf on tree.
+	 */
+	if (node == NULL)
+		return;
+
+	fix_opfuncids((Node *) node->qual);
+	fix_opfuncids((Node *) node->targetlist);
+
+	switch (nodeTag(node))
+	{
+		default:
+			elog(ERROR, "unrecognized node type: %d", (int) nodeTag(node));
+			break;
+	}
+
+	fix_node_funcids(node->lefttree);
+	fix_node_funcids(node->righttree);
+}
+
+/*
  * set_opfuncid
  *		Set the opfuncid (procedure OID) in an OpExpr node,
  *		if it hasn't been set already.
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index d0bc412..073a7f5 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2243,6 +2243,10 @@ finalize_plan(PlannerInfo *root, Plan *plan, Bitmapset *valid_params,
 			context.paramids = bms_add_members(context.paramids, scan_params);
 			break;
 
+		case T_Funnel:
+			context.paramids = bms_add_members(context.paramids, scan_params);
+			break;
+
 		case T_IndexScan:
 			finalize_primnode((Node *) ((IndexScan *) plan)->indexqual,
 							  &context);
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 935bc2b..c886075 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -732,6 +732,32 @@ create_samplescan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer
 }
 
 /*
+ * create_funnel_path
+ *
+ *	  Creates a path corresponding to a funnel scan, returning the
+ *	  pathnode.
+ */
+FunnelPath *
+create_funnel_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
+				   Relids required_outer, int nworkers)
+{
+	FunnelPath *pathnode = makeNode(FunnelPath);
+
+	pathnode->path.pathtype = T_Funnel;
+	pathnode->path.parent = rel;
+	pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
+														  required_outer);
+	pathnode->path.pathkeys = NIL;		/* Funnel has unordered result */
+
+	pathnode->subpath = subpath;
+	pathnode->num_workers = nworkers;
+
+	cost_funnel(pathnode, root, rel, pathnode->path.param_info);
+
+	return pathnode;
+}
+
+/*
  * create_index_path
  *	  Creates a path node for an index scan.
  *
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index baa43b2..58fc6d3 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -103,6 +103,7 @@
 #include "miscadmin.h"
 #include "pg_getopt.h"
 #include "pgstat.h"
+#include "optimizer/cost.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/bgworker_internals.h"
 #include "postmaster/fork_process.h"
diff --git a/src/backend/tcop/dest.c b/src/backend/tcop/dest.c
index d645751..57014ee 100644
--- a/src/backend/tcop/dest.c
+++ b/src/backend/tcop/dest.c
@@ -45,9 +45,10 @@
  *		dummy DestReceiver functions
  * ----------------
  */
-static void
+static bool
 donothingReceive(TupleTableSlot *slot, DestReceiver *self)
 {
+	return true;
 }
 
 static void
@@ -132,7 +133,7 @@ CreateDestReceiver(CommandDest dest)
 			return CreateTransientRelDestReceiver(InvalidOid);
 
 		case DestTupleQueue:
-			return CreateTupleQueueDestReceiver(NULL);
+			return CreateTupleQueueDestReceiver();
 	}
 
 	/* should never get here */
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index d1f43c5..43a8114 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -42,6 +42,8 @@
 #include "catalog/pg_type.h"
 #include "commands/async.h"
 #include "commands/prepare.h"
+#include "executor/execParallel.h"
+#include "executor/tqueue.h"
 #include "libpq/libpq.h"
 #include "libpq/pqformat.h"
 #include "libpq/pqsignal.h"
@@ -1192,6 +1194,93 @@ exec_simple_query(const char *query_string)
 }
 
 /*
+ * exec_parallel_stmt
+ *
+ * Execute the plan for backend worker.
+ */
+void
+exec_parallel_stmt(ParallelStmt *parallelstmt)
+{
+	DestReceiver *receiver;
+	QueryDesc  *queryDesc;
+	MemoryContext oldcontext;
+	MemoryContext plancontext;
+	BufferUsage bufusage_start;
+	BufferUsage bufusage_end = {0};
+
+	set_ps_display("SELECT", false);
+
+	/*
+	 * Unlike exec_simple_query(), in backend worker we won't allow
+	 * transaction control statements, so we can allow plancontext to be
+	 * created in TopTransaction context.
+	 */
+	plancontext = AllocSetContextCreate(CurrentMemoryContext,
+										"worker plan",
+										ALLOCSET_DEFAULT_MINSIZE,
+										ALLOCSET_DEFAULT_INITSIZE,
+										ALLOCSET_DEFAULT_MAXSIZE);
+
+	oldcontext = MemoryContextSwitchTo(plancontext);
+
+	receiver = CreateDestReceiver(DestTupleQueue);
+	SetTupleQueueDestReceiverParams(receiver, parallelstmt->responseq);
+
+	/* Create a QueryDesc for the query */
+	queryDesc = CreateQueryDesc(parallelstmt->plannedstmt, "",
+								GetActiveSnapshot(), InvalidSnapshot,
+								receiver, parallelstmt->params,
+								parallelstmt->inst_options);
+
+	PushActiveSnapshot(queryDesc->snapshot);
+
+	/* call ExecutorStart to prepare the plan for execution */
+	ExecutorStart(queryDesc, 0);
+
+	PopulateParamExecParams(queryDesc, parallelstmt->serialized_param_exec_vals);
+
+	bufusage_start = pgBufferUsage;
+
+	/* run the plan */
+	ExecutorRun(queryDesc, ForwardScanDirection, 0L);
+
+	/*
+	 * Calculate the buffer usage for this statement run.  Plugins such as
+	 * pg_stat_statements need it to report the total usage for statement
+	 * execution.
+	 */
+	BufferUsageAccumDiff(&bufusage_end,
+						 &pgBufferUsage, &bufusage_start);
+
+	/* run cleanup too */
+	ExecutorFinish(queryDesc);
+
+	/* copy buffer usage into shared memory. */
+	memcpy(parallelstmt->buffer_usage,
+		   &bufusage_end,
+		   sizeof(BufferUsage));
+
+	/*
+	 * copy instrumentation information into shared memory if requested by
+	 * master backend.
+	 */
+	if (parallelstmt->inst_options)
+		memcpy(parallelstmt->instrument,
+			   queryDesc->planstate->instrument,
+			   sizeof(Instrumentation));
+
+	ExecutorEnd(queryDesc);
+
+	PopActiveSnapshot();
+
+	FreeQueryDesc(queryDesc);
+
+	(*receiver->rDestroy) (receiver);
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
+/*
  * exec_parse_message
  *
  * Execute a "Parse" protocol message.
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 0df86a2..5eab231 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1117,7 +1117,13 @@ RunFromStore(Portal portal, ScanDirection direction, long count,
 			if (!ok)
 				break;
 
-			(*dest->receiveSlot) (slot, dest);
+			/*
+			 * If we are not able to send the tuple, we assume the destination
+			 * has closed and no more tuples can be sent. If that's the case,
+			 * end the loop.
+			 */
+			if (!((*dest->receiveSlot) (slot, dest)))
+				break;
 
 			ExecClearTuple(slot);
 
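
The change above relies on receiveSlot returning a bool, so that a sender can stop as soon as the destination reports that it is closed. Below is a minimal standalone sketch of that contract. DemoReceiver, demo_receive, and demo_run are hypothetical names, and the tuple is reduced to an int; this is not the PostgreSQL DestReceiver API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical miniature of a receiver with a bool-returning callback. */
typedef struct DemoReceiver DemoReceiver;
struct DemoReceiver
{
	bool		(*receiveSlot) (int tuple, DemoReceiver *self);
	int			capacity;	/* tuples accepted before the destination closes */
	int			received;	/* tuples accepted so far */
};

/* Accept tuples until capacity is reached, then report "closed". */
static bool
demo_receive(int tuple, DemoReceiver *self)
{
	(void) tuple;			/* the payload is irrelevant to this sketch */
	if (self->received >= self->capacity)
		return false;
	self->received++;
	return true;
}

/*
 * Send tuples until they run out or the receiver refuses one.
 * Returns the number of tuples the receiver accepted.
 */
static int
demo_run(const int *tuples, int ntuples, DemoReceiver *dest)
{
	int			i;

	for (i = 0; i < ntuples; i++)
	{
		/* a false return means the destination has closed: end the loop */
		if (!((*dest->receiveSlot) (tuples[i], dest)))
			break;
	}
	return i;
}
```

A receiver that always returns true, as donothingReceive and the tstore receivers do in this patch, leaves the loop's behaviour unchanged. Only a receiver that can close, such as the tuple queue, ends the loop early.
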
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 17053af..c30298b 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -588,6 +588,8 @@ const char *const config_group_names[] =
 	gettext_noop("Statistics / Query and Index Statistics Collector"),
 	/* AUTOVACUUM */
 	gettext_noop("Autovacuum"),
+	/* PARALLEL_QUERY */
+	gettext_noop("Parallel Query"),
 	/* CLIENT_CONN */
 	gettext_noop("Client Connection Defaults"),
 	/* CLIENT_CONN_STATEMENT */
@@ -2535,6 +2537,16 @@ static struct config_int ConfigureNamesInt[] =
 	},
 
 	{
+		{"parallel_seqscan_degree", PGC_SUSET, PARALLEL_QUERY,
+			gettext_noop("Sets the maximum number of simultaneously running backend worker processes."),
+			NULL
+		},
+		&parallel_seqscan_degree,
+		0, 0, MAX_BACKENDS,
+		NULL, NULL, NULL
+	},
+
+	{
 		{"autovacuum_work_mem", PGC_SIGHUP, RESOURCES_MEM,
 			gettext_noop("Sets the maximum memory to be used by each autovacuum worker process."),
 			NULL,
@@ -2711,6 +2723,26 @@ static struct config_real ConfigureNamesReal[] =
 		DEFAULT_CPU_OPERATOR_COST, 0, DBL_MAX,
 		NULL, NULL, NULL
 	},
+	{
+		{"cpu_tuple_comm_cost", PGC_USERSET, QUERY_TUNING_COST,
+			gettext_noop("Sets the planner's estimate of the cost of "
+				  "passing each tuple (row) from worker to master backend."),
+			NULL
+		},
+		&cpu_tuple_comm_cost,
+		DEFAULT_CPU_TUPLE_COMM_COST, 0, DBL_MAX,
+		NULL, NULL, NULL
+	},
+	{
+		{"parallel_setup_cost", PGC_USERSET, QUERY_TUNING_COST,
+			gettext_noop("Sets the planner's estimate of the cost of "
+				  "setting up environment (shared memory) for parallelism."),
+			NULL
+		},
+		&parallel_setup_cost,
+		DEFAULT_PARALLEL_SETUP_COST, 0, DBL_MAX,
+		NULL, NULL, NULL
+	},
 
 	{
 		{"cursor_tuple_fraction", PGC_USERSET, QUERY_TUNING_OTHER,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 8c65287..1ddf317 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -290,6 +290,8 @@
 #cpu_tuple_cost = 0.01			# same scale as above
 #cpu_index_tuple_cost = 0.005		# same scale as above
 #cpu_operator_cost = 0.0025		# same scale as above
+#cpu_tuple_comm_cost = 0.1		# same scale as above
+#parallel_setup_cost = 1000.0	# same scale as above
 #effective_cache_size = 4GB
 
 # - Genetic Query Optimizer -
@@ -502,6 +504,11 @@
 					# autovacuum, -1 means use
 					# vacuum_cost_limit
 
+#------------------------------------------------------------------------------
+# PARALLEL QUERY PARAMETERS
+#------------------------------------------------------------------------------
+
+#parallel_seqscan_degree = 0		# max number of worker backend subprocesses
 
 #------------------------------------------------------------------------------
 # CLIENT CONNECTION DEFAULTS
diff --git a/src/include/access/printtup.h b/src/include/access/printtup.h
index 46c4148..92ec882 100644
--- a/src/include/access/printtup.h
+++ b/src/include/access/printtup.h
@@ -25,11 +25,11 @@ extern void SendRowDescriptionMessage(TupleDesc typeinfo, List *targetlist,
 
 extern void debugStartup(DestReceiver *self, int operation,
 			 TupleDesc typeinfo);
-extern void debugtup(TupleTableSlot *slot, DestReceiver *self);
+extern bool debugtup(TupleTableSlot *slot, DestReceiver *self);
 
 /* XXX these are really in executor/spi.c */
 extern void spi_dest_startup(DestReceiver *self, int operation,
 				 TupleDesc typeinfo);
-extern void spi_printtup(TupleTableSlot *slot, DestReceiver *self);
+extern bool spi_printtup(TupleTableSlot *slot, DestReceiver *self);
 
 #endif   /* PRINTTUP_H */
diff --git a/src/include/executor/execParallel.h b/src/include/executor/execParallel.h
new file mode 100644
index 0000000..9bcf9af
--- /dev/null
+++ b/src/include/executor/execParallel.h
@@ -0,0 +1,61 @@
+/*--------------------------------------------------------------------
+ * execParallel.h
+ *		POSTGRES parallel execution interface
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/executor/execParallel.h
+ *--------------------------------------------------------------------
+ */
+#ifndef EXECPARALLEL_H
+#define EXECPARALLEL_H
+
+/*---------------------------------------------------------------------
+ * External module API.
+ *---------------------------------------------------------------------
+ */
+
+#include "libpq/pqmq.h"
+#include "nodes/execnodes.h"
+#include "nodes/parsenodes.h"
+#include "nodes/plannodes.h"
+
+/* Table-of-contents constants for our dynamic shared memory segment. */
+#define PARALLEL_KEY_PLANNEDSTMT	0
+#define PARALLEL_KEY_PARAMS			1
+#define PARALLEL_KEY_PARAMS_EXEC	2
+#define PARALLEL_KEY_BUFF_USAGE		3
+#define PARALLEL_KEY_INST_OPTIONS	4
+#define PARALLEL_KEY_INST_INFO		5
+#define PARALLEL_KEY_TUPLE_QUEUE	6
+#define PARALLEL_KEY_SCAN			7
+
+extern int	parallel_seqscan_degree;
+
+/* worker statement required for parallel execution. */
+typedef struct ParallelStmt
+{
+	PlannedStmt *plannedstmt;
+	ParamListInfo params;
+	List	   *serialized_param_exec_vals;
+	shm_mq_handle *responseq;
+	int			inst_options;
+	char	   *instrument;
+	char	   *buffer_usage;
+} ParallelStmt;
+
+extern void InitializeParallelWorkers(PlanState *planstate,
+						  List *serialized_param_exec_vals,
+						  EState *estate,
+						  char **inst_options_space,
+						  char **buffer_usage_space,
+						  shm_mq_handle ***responseqp,
+						  ParallelContext **pcxtp,
+						  int nWorkers);
+extern shm_toc *GetParallelShmToc(void);
+extern bool ExecParallelBufferUsageAccum(Node *node);
+extern void ExecAssociateBufferStatsToDSM(BufferUsage *buf_usage,
+							  ParallelStmt *parallel_stmt);
+#endif   /* EXECPARALLEL_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 226f905..cb05f52 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -273,6 +273,8 @@ extern TupleDesc ExecCleanTypeFromTL(List *targetList, bool hasoid);
 extern TupleDesc ExecTypeFromExprList(List *exprList);
 extern void ExecTypeSetColNames(TupleDesc typeInfo, List *namesList);
 extern void UpdateChangedParamSet(PlanState *node, Bitmapset *newchg);
+extern void PopulateParamExecParams(QueryDesc *queryDesc,
+						List *serialized_param_exec_vals);
 
 typedef struct TupOutputState
 {
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index c9a2129..5e1b04e 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -69,5 +69,12 @@ extern Instrumentation *InstrAlloc(int n, int instrument_options);
 extern void InstrStartNode(Instrumentation *instr);
 extern void InstrStopNode(Instrumentation *instr, double nTuples);
 extern void InstrEndLoop(Instrumentation *instr);
+extern void InstrAggNode(Instrumentation *instr1, Instrumentation *instr2);
+extern void
+			InstrAggBufferUsage(BufferUsage *buffer_usage_dst, BufferUsage *buffer_usage_add);
+extern void BufferUsageAccumDiff(BufferUsage *dst,
+					 const BufferUsage *add,
+					 const BufferUsage *sub);
+extern void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
 
 #endif   /* INSTRUMENT_H */
diff --git a/src/include/executor/nodeFunnel.h b/src/include/executor/nodeFunnel.h
new file mode 100644
index 0000000..fdb51d4
--- /dev/null
+++ b/src/include/executor/nodeFunnel.h
@@ -0,0 +1,25 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeFunnel.h
+ *		prototypes for nodeFunnel.c
+ *
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeFunnel.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEFUNNEL_H
+#define NODEFUNNEL_H
+
+#include "nodes/execnodes.h"
+
+extern FunnelState *ExecInitFunnel(Funnel *node, EState *estate, int eflags);
+extern TupleTableSlot *ExecFunnel(FunnelState *node);
+extern void ExecEndFunnel(FunnelState *node);
+extern void DestroyParallelSetupAndAccumStats(FunnelState *node);
+extern void ExecReScanFunnel(FunnelState *node);
+
+#endif   /* NODEFUNNEL_H */
diff --git a/src/include/executor/nodeSubplan.h b/src/include/executor/nodeSubplan.h
index 3732ad4..ded4b19 100644
--- a/src/include/executor/nodeSubplan.h
+++ b/src/include/executor/nodeSubplan.h
@@ -24,4 +24,7 @@ extern void ExecReScanSetParamPlan(SubPlanState *node, PlanState *parent);
 
 extern void ExecSetParamPlan(SubPlanState *node, ExprContext *econtext);
 
+extern List *
+			ExecAndFormSerializeParamExec(ExprContext *econtext, Bitmapset *params);
+
 #endif   /* NODESUBPLAN_H */
diff --git a/src/include/executor/tqueue.h b/src/include/executor/tqueue.h
index 6f8eb73..d63416e 100644
--- a/src/include/executor/tqueue.h
+++ b/src/include/executor/tqueue.h
@@ -18,7 +18,9 @@
 #include "tcop/dest.h"
 
 /* Use this to send tuples to a shm_mq. */
-extern DestReceiver *CreateTupleQueueDestReceiver(shm_mq_handle *handle);
+extern DestReceiver *CreateTupleQueueDestReceiver(void);
+extern void SetTupleQueueDestReceiverParams(DestReceiver *self,
+								shm_mq_handle *handle);
 
 /* Use these to receive tuples from a shm_mq. */
 typedef struct TupleQueueFunnel TupleQueueFunnel;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 4ae2f3e..6f6128a 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -16,7 +16,9 @@
 
 #include "access/genam.h"
 #include "access/heapam.h"
+#include "access/parallel.h"
 #include "executor/instrument.h"
+#include "executor/tqueue.h"
 #include "lib/pairingheap.h"
 #include "nodes/params.h"
 #include "nodes/plannodes.h"
@@ -1049,6 +1051,13 @@ typedef struct PlanState
 	Bitmapset  *chgParam;		/* set of IDs of changed Params */
 
 	/*
+	 * At execution time, the master backend initializes the parallel scan
+	 * descriptor and stores it in a dynamic shared memory segment; parallel
+	 * workers retrieve it from there.
+	 */
+	shm_toc    *toc;
+
+	/*
 	 * Other run-time state needed by most if not all node types.
 	 */
 	TupleTableSlot *ps_ResultTupleSlot; /* slot for my result tuples */
@@ -1273,6 +1282,35 @@ typedef struct SampleScanState
 } SampleScanState;
 
 /*
+ * FunnelState extends ScanState by storing additional information
+ * related to parallel workers.
+ *		pcxt				parallel context for managing generic state information
+ *							required for parallelism.
+ *		responseq			shared memory queues to receive data from workers.
+ *		funnel				maintains the runtime information about queues used to
+ *							receive data from parallel workers.
+ *		inst_options_space	to accumulate instrumentation information from all
+ *							parallel workers.
+ *		buffer_usage_space	to accumulate buffer usage information from all
+ *							parallel workers.
+ *		fs_workersReady		indicates that workers are launched.
+ *		all_workers_done	indicates that all the data from workers has been received.
+ *		local_scan_done		indicates that local scan is completed.
+ */
+typedef struct FunnelState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	ParallelContext *pcxt;
+	shm_mq_handle **responseq;
+	TupleQueueFunnel *funnel;
+	char	   *inst_options_space;
+	char	   *buffer_usage_space;
+	bool		fs_workersReady;
+	bool		all_workers_done;
+	bool		local_scan_done;
+} FunnelState;
+
+/*
  * These structs store information about index quals that don't have simple
  * constant right-hand sides.  See comments for ExecIndexBuildScanKeys()
  * for discussion.
diff --git a/src/include/nodes/nodeFuncs.h b/src/include/nodes/nodeFuncs.h
index 36b5dac..4f8be76 100644
--- a/src/include/nodes/nodeFuncs.h
+++ b/src/include/nodes/nodeFuncs.h
@@ -13,6 +13,7 @@
 #ifndef NODEFUNCS_H
 #define NODEFUNCS_H
 
+#include "access/parallel.h"
 #include "nodes/parsenodes.h"
 
 
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 274480e..48f7160 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -52,6 +52,7 @@ typedef enum NodeTag
 	T_Scan,
 	T_SeqScan,
 	T_SampleScan,
+	T_Funnel,
 	T_IndexScan,
 	T_IndexOnlyScan,
 	T_BitmapIndexScan,
@@ -99,6 +100,7 @@ typedef enum NodeTag
 	T_ScanState,
 	T_SeqScanState,
 	T_SampleScanState,
+	T_FunnelState,
 	T_IndexScanState,
 	T_IndexOnlyScanState,
 	T_BitmapIndexScanState,
@@ -223,6 +225,7 @@ typedef enum NodeTag
 	T_IndexOptInfo,
 	T_ParamPathInfo,
 	T_Path,
+	T_FunnelPath,
 	T_IndexPath,
 	T_BitmapHeapPath,
 	T_BitmapAndPath,
diff --git a/src/include/nodes/params.h b/src/include/nodes/params.h
index a0f7dd0..05b9042 100644
--- a/src/include/nodes/params.h
+++ b/src/include/nodes/params.h
@@ -14,6 +14,8 @@
 #ifndef PARAMS_H
 #define PARAMS_H
 
+#include "nodes/pg_list.h"
+
 /* To avoid including a pile of parser headers, reference ParseState thus: */
 struct ParseState;
 
@@ -96,11 +98,46 @@ typedef struct ParamExecData
 {
 	void	   *execPlan;		/* should be "SubPlanState *" */
 	Datum		value;
+
+	/*
+	 * parameter's datatype, or 0.  This is required so that datum value can
+	 * be read and used for other purposes like passing it to worker backend
+	 * via shared memory.  This is required only for evaluation of initPlan's,
+	 * however for consistency we set this for Subplan as well.  We left it
+	 * for other cases like CTE or RecursiveUnion cases where this structure
+	 * is not used for evaluation of subplans.
+	 */
+	Oid			ptype;
 	bool		isnull;
 } ParamExecData;
 
+/*
+ * This structure is used to pass PARAM_EXEC parameters to backend
+ * workers.  For each PARAM_EXEC parameter, pass this structure
+ * followed by value except for pass-by-value parameters.
+ */
+typedef struct SerializedParamExecData
+{
+	int			paramid;		/* parameter id of this param */
+	Size		length;			/* length of parameter value */
+	Oid			ptype;			/* parameter's datatype, or 0 */
+	Datum		value;
+	bool		isnull;
+}	SerializedParamExecData;
+
 
 /* Functions found in src/backend/nodes/params.c */
 extern ParamListInfo copyParamList(ParamListInfo from);
 
+extern Size
+			EstimateBoundParametersSpace(ParamListInfo params);
+extern void
+			SerializeBoundParams(ParamListInfo params, Size maxsize, char *start_address);
+extern ParamListInfo RestoreBoundParams(char *start_address);
+extern Size
+			EstimateExecParametersSpace(List *serialized_param_exec_vals);
+extern void SerializeExecParams(List *serialized_param_exec_vals, Size maxsize,
+					char *start_address);
+extern List *
+			RestoreExecParams(char *start_address);
 #endif   /* PARAMS_H */
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index cc259f1..69302af 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -296,6 +296,16 @@ typedef struct SampleScan
 	struct TableSampleClause *tablesample;
 } SampleScan;
 
+/* ------------
+ *		Funnel node
+ * ------------
+ */
+typedef struct Funnel
+{
+	Scan		scan;
+	int			num_workers;
+} Funnel;
+
 /* ----------------
  *		index scan node
  *
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 79bed33..0ec1a45 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -761,6 +761,13 @@ typedef struct Path
 	/* pathkeys is a List of PathKey nodes; see above */
 } Path;
 
+typedef struct FunnelPath
+{
+	Path		path;
+	Path	   *subpath;		/* path for each worker */
+	int			num_workers;
+} FunnelPath;
+
 /* Macro for extracting a path's parameterization relids; beware double eval */
 #define PATH_REQ_OUTER(path)  \
 	((path)->param_info ? (path)->param_info->ppi_req_outer : (Relids) NULL)
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index dd43e45..8ca7d0f 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -26,6 +26,13 @@
 #define DEFAULT_CPU_TUPLE_COST	0.01
 #define DEFAULT_CPU_INDEX_TUPLE_COST 0.005
 #define DEFAULT_CPU_OPERATOR_COST  0.0025
+#define DEFAULT_CPU_TUPLE_COMM_COST 0.1
+/*
+ * XXX - We have kept a reasonably high value for the default parallel
+ * setup cost.  In the future we might want to change this value based
+ * on results.
+ */
+#define DEFAULT_PARALLEL_SETUP_COST  1000.0
 
 #define DEFAULT_EFFECTIVE_CACHE_SIZE  524288	/* measured in pages */
 
@@ -48,8 +55,11 @@ extern PGDLLIMPORT double random_page_cost;
 extern PGDLLIMPORT double cpu_tuple_cost;
 extern PGDLLIMPORT double cpu_index_tuple_cost;
 extern PGDLLIMPORT double cpu_operator_cost;
+extern PGDLLIMPORT double cpu_tuple_comm_cost;
+extern PGDLLIMPORT double parallel_setup_cost;
 extern PGDLLIMPORT int effective_cache_size;
 extern Cost disable_cost;
+extern int	parallel_seqscan_degree;
 extern bool enable_seqscan;
 extern bool enable_indexscan;
 extern bool enable_indexonlyscan;
@@ -70,6 +80,8 @@ extern void cost_seqscan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 			 ParamPathInfo *param_info);
 extern void cost_samplescan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 				ParamPathInfo *param_info);
+extern void cost_funnel(FunnelPath *path, PlannerInfo *root,
+			RelOptInfo *baserel, ParamPathInfo *param_info);
 extern void cost_index(IndexPath *path, PlannerInfo *root,
 		   double loop_count);
 extern void cost_bitmap_heap_scan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 161644c..29b344f 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -34,6 +34,9 @@ extern Path *create_seqscan_path(PlannerInfo *root, RelOptInfo *rel,
 					Relids required_outer);
 extern Path *create_samplescan_path(PlannerInfo *root, RelOptInfo *rel,
 					   Relids required_outer);
+extern FunnelPath *create_funnel_path(PlannerInfo *root,
+				   RelOptInfo *rel, Path *subpath, Relids required_outer,
+				   int nworkers);
 extern IndexPath *create_index_path(PlannerInfo *root,
 				  IndexOptInfo *index,
 				  List *indexclauses,
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index 52b077a..67a8582 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -133,6 +133,7 @@ extern bool query_is_distinct_for(Query *query, List *colnos, List *opids);
  */
 extern Plan *set_plan_references(PlannerInfo *root, Plan *plan);
 extern void fix_opfuncids(Node *node);
+extern void fix_node_funcids(Plan *node);
 extern void set_opfuncid(OpExpr *opexpr);
 extern void set_sa_opfuncid(ScalarArrayOpExpr *opexpr);
 extern void record_plan_function_dependency(PlannerInfo *root, Oid funcid);
diff --git a/src/include/optimizer/planner.h b/src/include/optimizer/planner.h
index b10a504..623223c 100644
--- a/src/include/optimizer/planner.h
+++ b/src/include/optimizer/planner.h
@@ -14,6 +14,7 @@
 #ifndef PLANNER_H
 #define PLANNER_H
 
+#include "nodes/parsenodes.h"
 #include "nodes/plannodes.h"
 #include "nodes/relation.h"
 
@@ -29,6 +30,8 @@ extern PlannedStmt *planner(Query *parse, int cursorOptions,
 		ParamListInfo boundParams);
 extern PlannedStmt *standard_planner(Query *parse, int cursorOptions,
 				 ParamListInfo boundParams);
+extern PlannedStmt *create_parallel_worker_plannedstmt(Plan *plan,
+								   List *rangetable, int num_exec_params);
 
 extern Plan *subquery_planner(PlannerGlobal *glob, Query *parse,
 				 PlannerInfo *parent_root,
diff --git a/src/include/tcop/dest.h b/src/include/tcop/dest.h
index b560672..91acd60 100644
--- a/src/include/tcop/dest.h
+++ b/src/include/tcop/dest.h
@@ -104,7 +104,9 @@ typedef enum
  *		pointers that the executor must call.
  *
  * Note: the receiveSlot routine must be passed a slot containing a TupleDesc
- * identical to the one given to the rStartup routine.
+ * identical to the one given to the rStartup routine.  It returns bool where
+ * a "true" value means "continue processing" and a "false" value means
+ * "stop early, just as if we'd reached the end of the scan".
  * ----------------
  */
 typedef struct _DestReceiver DestReceiver;
@@ -112,7 +114,7 @@ typedef struct _DestReceiver DestReceiver;
 struct _DestReceiver
 {
 	/* Called for each tuple to be output: */
-	void		(*receiveSlot) (TupleTableSlot *slot,
+	bool		(*receiveSlot) (TupleTableSlot *slot,
 											DestReceiver *self);
 	/* Per-executor-run initialization and shutdown: */
 	void		(*rStartup) (DestReceiver *self,
diff --git a/src/include/tcop/tcopprot.h b/src/include/tcop/tcopprot.h
index 96c5b8b..6f319c1 100644
--- a/src/include/tcop/tcopprot.h
+++ b/src/include/tcop/tcopprot.h
@@ -19,6 +19,7 @@
 #ifndef TCOPPROT_H
 #define TCOPPROT_H
 
+#include "executor/execParallel.h"
 #include "nodes/params.h"
 #include "nodes/parsenodes.h"
 #include "nodes/plannodes.h"
@@ -84,5 +85,6 @@ extern void set_debug_options(int debug_flag,
 extern bool set_plan_disabling_options(const char *arg,
 						   GucContext context, GucSource source);
 extern const char *get_stats_option_name(const char *arg);
+extern void exec_parallel_stmt(ParallelStmt *parallelscan);
 
 #endif   /* TCOPPROT_H */
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 7a58ddb..3505d31 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -85,6 +85,7 @@ enum config_group
 	STATS_MONITORING,
 	STATS_COLLECTOR,
 	AUTOVACUUM,
+	PARALLEL_QUERY,
 	CLIENT_CONN,
 	CLIENT_CONN_STATEMENT,
 	CLIENT_CONN_LOCALE,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a037f81..ad2a33f 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -707,6 +707,9 @@ FunctionParameterMode
 FunctionScan
 FunctionScanPerFuncState
 FunctionScanState
+Funnel
+FunnelPath
+FunnelState
 FuzzyAttrMatchState
 GBT_NUMKEY
 GBT_NUMKEY_R
@@ -1195,6 +1198,8 @@ OverrideSearchPath
 OverrideStackEntry
 PACE_HEADER
 PACL
+parallel_estimate_ctx
+ParallelStmt
 PATH
 PBOOL
 PCtxtHandle
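
The dest.h hunk above changes receiveSlot to return bool, where "true" means "continue processing" and "false" means "stop early, just as if we'd reached the end of the scan". A minimal sketch of that contract follows. It is a standalone illustration, not PostgreSQL code: DemoReceiver, demo_receive and demo_run_scan are hypothetical stand-ins for DestReceiver, a receiveSlot callback and the executor's tuple loop, and tuples are modelled as plain ints.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for DestReceiver; not the PostgreSQL definition. */
typedef struct DemoReceiver DemoReceiver;

struct DemoReceiver
{
	/* Called for each tuple; returns false to ask the caller to stop early. */
	bool		(*receiveSlot) (int tuple, DemoReceiver *self);
	int			limit;			/* assumed: receiver wants at most this many */
	int			received;		/* tuples accepted so far */
};

/* Accept the tuple, then report whether more tuples are wanted. */
static bool
demo_receive(int tuple, DemoReceiver *self)
{
	(void) tuple;				/* payload is irrelevant to this sketch */
	self->received++;
	return self->received < self->limit;
}

/*
 * Offer up to ntuples tuples to the receiver.  Stops as soon as the
 * receiver returns false, exactly as if the scan had run out of tuples.
 * Returns the number of tuples that were offered.
 */
static int
demo_run_scan(DemoReceiver *dest, int ntuples)
{
	int			sent = 0;

	assert(dest != NULL && dest->receiveSlot != NULL);

	for (int i = 0; i < ntuples; i++)
	{
		sent++;
		if (!dest->receiveSlot(i, dest))
			break;				/* receiver asked us to stop early */
	}
	return sent;
}
```

With limit = 3, a call such as demo_run_scan(&r, 10) offers three tuples and then stops. The point of the bool return is that the producer does not need to know why the consumer is done.
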
Attachment: parallel_seqscan_partialseqscan_v18.patch (application/octet-stream)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index bcf9871..096166e 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -63,6 +63,7 @@
 #include "storage/predicate.h"
 #include "storage/procarray.h"
 #include "storage/smgr.h"
+#include "storage/spin.h"
 #include "storage/standby.h"
 #include "utils/datum.h"
 #include "utils/inval.h"
@@ -80,12 +81,16 @@ bool		synchronize_seqscans = true;
 static HeapScanDesc heap_beginscan_internal(Relation relation,
 						Snapshot snapshot,
 						int nkeys, ScanKey key,
+						ParallelHeapScanDesc parallel_scan,
 						bool allow_strat,
 						bool allow_sync,
 						bool allow_pagemode,
 						bool is_bitmapscan,
 						bool is_samplescan,
 						bool temp_snap);
+static BlockNumber heap_parallelscan_nextpage(HeapScanDesc scan,
+								bool *pscan_finished);
+static void heap_parallelscan_initialize_startblock(HeapScanDesc scan);
 static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
 					TransactionId xid, CommandId cid, int options);
 static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
@@ -226,7 +231,10 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
 	 * results for a non-MVCC snapshot, the caller must hold some higher-level
 	 * lock that ensures the interesting tuple(s) won't change.)
 	 */
-	scan->rs_nblocks = RelationGetNumberOfBlocks(scan->rs_rd);
+	if (scan->rs_parallel != NULL)
+		scan->rs_nblocks = scan->rs_parallel->phs_nblocks;
+	else
+		scan->rs_nblocks = RelationGetNumberOfBlocks(scan->rs_rd);
 
 	/*
 	 * If the table is large relative to NBuffers, use a bulk-read access
@@ -272,7 +280,10 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
 	else if (allow_sync && synchronize_seqscans)
 	{
 		scan->rs_syncscan = true;
-		scan->rs_startblock = ss_get_location(scan->rs_rd, scan->rs_nblocks);
+		if (scan->rs_parallel != NULL)
+			heap_parallelscan_initialize_startblock(scan);
+		else
+			scan->rs_startblock = ss_get_location(scan->rs_rd, scan->rs_nblocks);
 	}
 	else
 	{
@@ -496,7 +507,32 @@ heapgettup(HeapScanDesc scan,
 				tuple->t_data = NULL;
 				return;
 			}
-			page = scan->rs_startblock; /* first page */
+			if (scan->rs_parallel != NULL)
+			{
+				bool	pscan_finished;
+
+				page = heap_parallelscan_nextpage(scan, &pscan_finished);
+
+				/*
+				 * Return NULL if the scan is finished.  It can happen that by
+				 * the time one of the workers starts its scan, the others have
+				 * already finished scanning the relation, so this worker has
+				 * nothing left to scan.  Report the scan location before
+				 * finishing so that the final state of the position hint is
+				 * back at the start of the rel.
+				 */
+				if (pscan_finished)
+				{
+					if (scan->rs_syncscan)
+						ss_report_location(scan->rs_rd, page);
+
+					Assert(!BufferIsValid(scan->rs_cbuf));
+					tuple->t_data = NULL;
+					return;
+				}
+			}
+			else
+				page = scan->rs_startblock; /* first page */
 			heapgetpage(scan, page);
 			lineoff = FirstOffsetNumber;		/* first offnum */
 			scan->rs_inited = true;
@@ -519,6 +555,9 @@ heapgettup(HeapScanDesc scan,
 	}
 	else if (backward)
 	{
+		/* backward parallel scan not supported */
+		Assert(scan->rs_parallel == NULL);
+
 		if (!scan->rs_inited)
 		{
 			/*
@@ -671,11 +710,22 @@ heapgettup(HeapScanDesc scan,
 		}
 		else
 		{
-			page++;
-			if (page >= scan->rs_nblocks)
-				page = 0;
-			finished = (page == scan->rs_startblock) ||
-				(scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks == 0 : false);
+			if (scan->rs_parallel != NULL)
+			{
+				bool	pscan_finished = false;
+
+				page = heap_parallelscan_nextpage(scan, &pscan_finished);
+				finished = pscan_finished;
+			}
+			else
+			{
+				page++;
+				if (page >= scan->rs_nblocks)
+					page = 0;
+
+				finished = (page == scan->rs_startblock) ||
+						   (scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
+			}
 
 			/*
 			 * Report our new scan position for synchronization purposes. We
@@ -773,7 +823,32 @@ heapgettup_pagemode(HeapScanDesc scan,
 				tuple->t_data = NULL;
 				return;
 			}
-			page = scan->rs_startblock; /* first page */
+			if (scan->rs_parallel != NULL)
+			{
+				bool	pscan_finished;
+
+				page = heap_parallelscan_nextpage(scan, &pscan_finished);
+
+				/*
+				 * Return NULL if the scan is finished.  It can happen that by
+				 * the time one of the workers starts its scan, the others have
+				 * already finished scanning the relation, so this worker has
+				 * nothing left to scan.  Report the scan location before
+				 * finishing so that the final state of the position hint is
+				 * back at the start of the rel.
+				 */
+				if (pscan_finished)
+				{
+					if (scan->rs_syncscan)
+						ss_report_location(scan->rs_rd, page);
+
+					Assert(!BufferIsValid(scan->rs_cbuf));
+					tuple->t_data = NULL;
+					return;
+				}
+			}
+			else
+				page = scan->rs_startblock; /* first page */
 			heapgetpage(scan, page);
 			lineindex = 0;
 			scan->rs_inited = true;
@@ -793,6 +868,9 @@ heapgettup_pagemode(HeapScanDesc scan,
 	}
 	else if (backward)
 	{
+		/* backward parallel scan not supported */
+		Assert(scan->rs_parallel == NULL);
+
 		if (!scan->rs_inited)
 		{
 			/*
@@ -934,11 +1012,22 @@ heapgettup_pagemode(HeapScanDesc scan,
 		}
 		else
 		{
-			page++;
-			if (page >= scan->rs_nblocks)
-				page = 0;
-			finished = (page == scan->rs_startblock) ||
-				(scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks == 0 : false);
+			if (scan->rs_parallel != NULL)
+			{
+				bool	pscan_finished = false;
+
+				page = heap_parallelscan_nextpage(scan, &pscan_finished);
+				finished = pscan_finished;
+			}
+			else
+			{
+				page++;
+				if (page >= scan->rs_nblocks)
+					page = 0;
+
+				finished = (page == scan->rs_startblock) ||
+						   (scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
+			}
 
 			/*
 			 * Report our new scan position for synchronization purposes. We
@@ -1341,7 +1430,7 @@ HeapScanDesc
 heap_beginscan(Relation relation, Snapshot snapshot,
 			   int nkeys, ScanKey key)
 {
-	return heap_beginscan_internal(relation, snapshot, nkeys, key,
+	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
 								   true, true, true, false, false, false);
 }
 
@@ -1351,7 +1440,7 @@ heap_beginscan_catalog(Relation relation, int nkeys, ScanKey key)
 	Oid			relid = RelationGetRelid(relation);
 	Snapshot	snapshot = RegisterSnapshot(GetCatalogSnapshot(relid));
 
-	return heap_beginscan_internal(relation, snapshot, nkeys, key,
+	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
 								   true, true, true, false, false, true);
 }
 
@@ -1360,7 +1449,7 @@ heap_beginscan_strat(Relation relation, Snapshot snapshot,
 					 int nkeys, ScanKey key,
 					 bool allow_strat, bool allow_sync)
 {
-	return heap_beginscan_internal(relation, snapshot, nkeys, key,
+	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
 								   allow_strat, allow_sync, true,
 								   false, false, false);
 }
@@ -1369,7 +1458,7 @@ HeapScanDesc
 heap_beginscan_bm(Relation relation, Snapshot snapshot,
 				  int nkeys, ScanKey key)
 {
-	return heap_beginscan_internal(relation, snapshot, nkeys, key,
+	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
 								   false, false, true, true, false, false);
 }
 
@@ -1378,7 +1467,7 @@ heap_beginscan_sampling(Relation relation, Snapshot snapshot,
 						int nkeys, ScanKey key,
 					  bool allow_strat, bool allow_sync, bool allow_pagemode)
 {
-	return heap_beginscan_internal(relation, snapshot, nkeys, key,
+	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
 								   allow_strat, allow_sync, allow_pagemode,
 								   false, true, false);
 }
@@ -1386,6 +1475,7 @@ heap_beginscan_sampling(Relation relation, Snapshot snapshot,
 static HeapScanDesc
 heap_beginscan_internal(Relation relation, Snapshot snapshot,
 						int nkeys, ScanKey key,
+						ParallelHeapScanDesc parallel_scan,
 						bool allow_strat,
 						bool allow_sync,
 						bool allow_pagemode,
@@ -1418,6 +1508,7 @@ heap_beginscan_internal(Relation relation, Snapshot snapshot,
 	scan->rs_allow_strat = allow_strat;
 	scan->rs_allow_sync = allow_sync;
 	scan->rs_temp_snap = temp_snap;
+	scan->rs_parallel = parallel_scan;
 
 	/*
 	 * we can use page-at-a-time mode if it's an MVCC-safe snapshot
@@ -1532,6 +1623,159 @@ heap_endscan(HeapScanDesc scan)
 }
 
 /* ----------------
+ *		heap_parallelscan_estimate - estimate storage for ParallelHeapScanDesc
+ *
+ *		Sadly, this doesn't reduce to a constant, because the size required
+ *		to serialize the snapshot can vary.
+ * ----------------
+ */
+Size
+heap_parallelscan_estimate(Snapshot snapshot)
+{
+	return add_size(offsetof(ParallelHeapScanDescData, phs_snapshot_data),
+					EstimateSnapshotSpace(snapshot));
+}
+
+/* ----------------
+ *		heap_parallelscan_initialize - initialize ParallelHeapScanDesc
+ *
+ *		Must allow as many bytes of shared memory as returned by
+ *		heap_parallelscan_estimate.  Call this just once in the leader
+ *		process; then, individual workers attach via heap_beginscan_parallel.
+ * ----------------
+ */
+void
+heap_parallelscan_initialize(ParallelHeapScanDesc target, Relation relation,
+							 Snapshot snapshot)
+{
+	target->phs_relid = RelationGetRelid(relation);
+	target->phs_nblocks = RelationGetNumberOfBlocks(relation);
+	SpinLockInit(&target->phs_mutex);
+	target->phs_cblock = InvalidBlockNumber;
+	target->phs_firstpass = true;
+	SerializeSnapshot(snapshot, target->phs_snapshot_data);
+}
+
+/* ----------------
+ *		heap_parallelscan_initialize_startblock - initialize the startblock for
+ *					parallel scan.
+ *
+ *		Only the first worker of parallel scan will initialize the start
+ *		block for scan and others will use that information to indicate
+ *		the end of scan.
+ * ----------------
+ */
+static void
+heap_parallelscan_initialize_startblock(HeapScanDesc scan)
+{
+	ParallelHeapScanDesc parallel_scan;
+
+	Assert(scan->rs_parallel);
+
+	parallel_scan = scan->rs_parallel;
+
+	/*
+	 * phs_cblock is still InvalidBlockNumber only for the first worker to get
+	 * here, so that worker picks the start block for everyone.
+	 */
+	SpinLockAcquire(&parallel_scan->phs_mutex);
+	if (parallel_scan->phs_cblock == InvalidBlockNumber)
+	{
+		scan->rs_startblock = ss_get_location(scan->rs_rd, scan->rs_nblocks);
+		parallel_scan->phs_cblock = scan->rs_startblock;
+		parallel_scan->phs_startblock = scan->rs_startblock;
+	}
+	else
+		scan->rs_startblock = parallel_scan->phs_startblock;
+	SpinLockRelease(&parallel_scan->phs_mutex);
+}
+
+/* ----------------
+ *		heap_parallelscan_nextpage - get the next page to scan
+ *
+ *		Reaching the position where the parallel scan started indicates
+ *		the end of the scan.  Note, however, that other backends could still
+ *		be scanning if they grabbed a page to scan and aren't done with it yet.
+ *		Resets the current position for the parallel scan to the beginning of
+ *		the relation if the next page to scan is past the last page of the
+ *		relation.
+ * ----------------
+ */
+static BlockNumber
+heap_parallelscan_nextpage(HeapScanDesc scan,
+						   bool *pscan_finished)
+{
+	BlockNumber	page = InvalidBlockNumber;
+	ParallelHeapScanDesc parallel_scan;
+
+	Assert(scan->rs_parallel);
+
+	parallel_scan = scan->rs_parallel;
+
+	*pscan_finished = false;
+
+	/* we treat InvalidBlockNumber specially here to avoid overflow */
+	SpinLockAcquire(&parallel_scan->phs_mutex);
+	if (parallel_scan->phs_cblock != InvalidBlockNumber)
+		page = parallel_scan->phs_cblock++;
+
+	if (page >= scan->rs_nblocks)
+	{
+		parallel_scan->phs_cblock = 0;
+		page = parallel_scan->phs_cblock++;
+	}
+
+	/*
+	 * scan position will be same as start position once during start
+	 * of scan and then at end of scan.
+	 */
+	if (parallel_scan->phs_firstpass && page == parallel_scan->phs_startblock)
+		parallel_scan->phs_firstpass = false;
+	else if (!parallel_scan->phs_firstpass && page == parallel_scan->phs_startblock)
+	{
+		*pscan_finished = true;
+		parallel_scan->phs_cblock--;
+	}
+	SpinLockRelease(&parallel_scan->phs_mutex);
+
+	return page;
+}
+
+/* ----------------
+ *		heap_beginscan_parallel - join a parallel scan
+ *
+ *		Caller must hold a suitable lock on the correct relation.
+ * ----------------
+ */
+HeapScanDesc
+heap_beginscan_parallel(Relation relation, ParallelHeapScanDesc parallel_scan)
+{
+	Snapshot		snapshot;
+
+	Assert(RelationGetRelid(relation) == parallel_scan->phs_relid);
+	snapshot = RestoreSnapshot(parallel_scan->phs_snapshot_data);
+	RegisterSnapshot(snapshot);
+
+	return heap_beginscan_internal(relation, snapshot, 0, NULL, parallel_scan,
+								   true, true, true, false, false, true);
+}
+
+/* ----------------
+ *		heap_parallel_rescan		- restart a parallel relation scan
+ * ----------------
+ */
+void
+heap_parallel_rescan(ParallelHeapScanDesc pscan,
+					 HeapScanDesc scan)
+{
+	if (pscan != NULL)
+		scan->rs_parallel = pscan;
+
+	heap_rescan(scan,			/* scan desc */
+				NULL);			/* new scan keys */
+}
+
+/* ----------------
  *		heap_getnext	- retrieve next tuple in scan
  *
  *		Fix to work with index relations.
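
The page hand-out in heap_parallelscan_nextpage above can be modelled in isolation. The sketch below is a single-process simplification, not the patch's code: DemoParallelScan, demo_scan_init, demo_next_page and demo_count_pages are hypothetical names, block numbers are plain uint32_t, and the spinlock is left out because nothing here runs concurrently. It also assumes the start block has already been chosen, which the patch does separately in heap_parallelscan_initialize_startblock. It shows the shared cursor wrapping past the last block and the scan ending when the cursor comes back to the start block a second time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified model of the shared parallel scan state. */
typedef struct DemoParallelScan
{
	uint32_t	nblocks;		/* number of blocks in the relation */
	uint32_t	startblock;		/* block where the scan began */
	uint32_t	cblock;			/* next block to hand out */
	bool		firstpass;		/* true until startblock is handed out once */
} DemoParallelScan;

static void
demo_scan_init(DemoParallelScan *ps, uint32_t nblocks, uint32_t startblock)
{
	assert(nblocks > 0 && startblock < nblocks);
	ps->nblocks = nblocks;
	ps->startblock = startblock;
	ps->cblock = startblock;
	ps->firstpass = true;
}

/*
 * Hand out the next block.  Sets *finished once the cursor reaches the
 * start block for the second time; the returned block must then be ignored.
 * In the real patch this body runs under a spinlock.
 */
static uint32_t
demo_next_page(DemoParallelScan *ps, bool *finished)
{
	uint32_t	page;

	*finished = false;

	page = ps->cblock++;
	if (page >= ps->nblocks)
	{
		/* ran off the end of the relation: wrap to block 0 */
		ps->cblock = 0;
		page = ps->cblock++;
	}

	if (ps->firstpass && page == ps->startblock)
		ps->firstpass = false;	/* first visit: the scan is just starting */
	else if (!ps->firstpass && page == ps->startblock)
	{
		*finished = true;		/* second visit: every block was handed out */
		ps->cblock--;			/* keep the cursor parked on startblock */
	}
	return page;
}

/* Drain the scan and return how many blocks were handed out. */
static uint32_t
demo_count_pages(DemoParallelScan *ps)
{
	uint32_t	n = 0;
	bool		finished;

	for (;;)
	{
		(void) demo_next_page(ps, &finished);
		if (finished)
			break;
		n++;
	}
	return n;
}
```

For a five-block relation starting at block 3, this model hands out blocks 3, 4, 0, 1, 2 and then reports the scan as finished. Later callers keep getting "finished" because the cursor is parked on the start block.
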
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index e99ad56..9fe5d91 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -731,6 +731,7 @@ ExplainPreScanNode(PlanState *planstate, Bitmapset **rels_used)
 	{
 		case T_SeqScan:
 		case T_SampleScan:
+		case T_PartialSeqScan:
 		case T_Funnel:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
@@ -855,6 +856,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_SampleScan:
 			pname = sname = "Sample Scan";
 			break;
+		case T_PartialSeqScan:
+			pname = sname = "Partial Seq Scan";
+			break;
 		case T_Funnel:
 			pname = sname = "Funnel";
 			break;
@@ -1008,6 +1012,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	{
 		case T_SeqScan:
 		case T_SampleScan:
+		case T_PartialSeqScan:
 		case T_Funnel:
 		case T_BitmapHeapScan:
 		case T_TidScan:
@@ -1282,6 +1287,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 							 planstate, ancestors, es);
 			/* FALL THRU to print additional fields the same as SeqScan */
 		case T_SeqScan:
+		case T_PartialSeqScan:
 		case T_ValuesScan:
 		case T_CteScan:
 		case T_WorkTableScan:
@@ -2358,6 +2364,7 @@ ExplainTargetRel(Plan *plan, Index rti, ExplainState *es)
 	{
 		case T_SeqScan:
 		case T_SampleScan:
+		case T_PartialSeqScan:
 		case T_Funnel:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index fb34864..9ccb40e 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -20,8 +20,8 @@ OBJS = execAmi.o execCurrent.o execGrouping.o execIndexing.o execJunk.o \
        nodeHash.o nodeHashjoin.o nodeIndexscan.o nodeIndexonlyscan.o \
        nodeLimit.o nodeLockRows.o \
        nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
-       nodeNestloop.o nodeFunctionscan.o nodeRecursiveunion.o nodeResult.o \
-       nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
+       nodeNestloop.o nodeFunctionscan.o nodePartialSeqscan.o nodeRecursiveunion.o \
+       nodeResult.o nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
        nodeValuesscan.o nodeCtescan.o nodeWorktablescan.o \
        nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
        nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 4915151..dc45c20 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -38,6 +38,7 @@
 #include "executor/nodeMergejoin.h"
 #include "executor/nodeModifyTable.h"
 #include "executor/nodeNestloop.h"
+#include "executor/nodePartialSeqscan.h"
 #include "executor/nodeRecursiveunion.h"
 #include "executor/nodeResult.h"
 #include "executor/nodeSamplescan.h"
@@ -161,6 +162,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSampleScan((SampleScanState *) node);
 			break;
 
+		case T_PartialSeqScanState:
+			ExecReScanPartialSeqScan((PartialSeqScanState *) node);
+			break;
+
 		case T_FunnelState:
 			ExecReScanFunnel((FunnelState *) node);
 			break;
@@ -473,6 +478,7 @@ ExecSupportsBackwardScan(Plan *node)
 			return false;
 
 		case T_Funnel:
+		case T_PartialSeqScan:
 			return false;
 
 		case T_IndexScan:
diff --git a/src/backend/executor/execCurrent.c b/src/backend/executor/execCurrent.c
index 650fcc5..7a44462 100644
--- a/src/backend/executor/execCurrent.c
+++ b/src/backend/executor/execCurrent.c
@@ -262,6 +262,7 @@ search_plan_tree(PlanState *node, Oid table_oid)
 			 */
 		case T_SeqScanState:
 		case T_SampleScanState:
+		case T_PartialSeqScanState:
 		case T_FunnelState:
 		case T_IndexScanState:
 		case T_IndexOnlyScanState:
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 726bc7f..e24478a 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -17,6 +17,7 @@
 
 #include "executor/execParallel.h"
 #include "executor/nodeFunnel.h"
+#include "executor/nodePartialSeqscan.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/planmain.h"
 #include "optimizer/planner.h"
@@ -297,6 +298,18 @@ ExecParallelEstimate(Node *node, parallel_estimate_ctx *pestcontext)
 
 	switch (nodeTag(node))
 	{
+		case T_PartialSeqScanState:
+			{
+				EState	   *estate = ((PartialSeqScanState *) node)->ss.ps.state;
+
+				*pestcontext->psize = heap_parallelscan_estimate(estate->es_snapshot);
+				shm_toc_estimate_chunk(&pestcontext->context->estimator, *pestcontext->psize);
+
+				/* key for partial scan information. */
+				shm_toc_estimate_keys(&pestcontext->context->estimator, 1);
+				return true;
+			}
+			break;
 		default:
 			break;
 	}
@@ -312,11 +325,30 @@ ExecParallelEstimate(Node *node, parallel_estimate_ctx *pestcontext)
 static bool
 ExecParallelInitializeDSM(Node *node, parallel_estimate_ctx *pestcontext)
 {
+	ParallelHeapScanDesc pscan;
+
 	if (node == NULL)
 		return false;
 
 	switch (nodeTag(node))
 	{
+		case T_PartialSeqScanState:
+			{
+				EState	   *estate = ((PartialSeqScanState *) node)->ss.ps.state;
+
+				/*
+				 * Store parallel heap scan descriptor in dynamic shared
+				 * memory.
+				 */
+				pscan = shm_toc_allocate(pestcontext->context->toc,
+										 *pestcontext->psize);
+				heap_parallelscan_initialize(pscan,
+					   ((PartialSeqScanState *) node)->ss.ss_currentRelation,
+											 estate->es_snapshot);
+				shm_toc_insert(pestcontext->context->toc, PARALLEL_KEY_SCAN, pscan);
+				return true;
+			}
+			break;
 		default:
 			break;
 	}
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index c181bf2..e24a439 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -100,6 +100,7 @@
 #include "executor/nodeMergejoin.h"
 #include "executor/nodeModifyTable.h"
 #include "executor/nodeNestloop.h"
+#include "executor/nodePartialSeqscan.h"
 #include "executor/nodeFunnel.h"
 #include "executor/nodeRecursiveunion.h"
 #include "executor/nodeResult.h"
@@ -197,6 +198,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 													  estate, eflags);
 			break;
 
+		case T_PartialSeqScan:
+			result = (PlanState *) ExecInitPartialSeqScan((PartialSeqScan *) node,
+														  estate, eflags);
+			break;
+
 		case T_Funnel:
 			result = (PlanState *) ExecInitFunnel((Funnel *) node,
 												  estate, eflags);
@@ -422,6 +428,10 @@ ExecProcNode(PlanState *node)
 			result = ExecSampleScan((SampleScanState *) node);
 			break;
 
+		case T_PartialSeqScanState:
+			result = ExecPartialSeqScan((PartialSeqScanState *) node);
+			break;
+
 		case T_FunnelState:
 			result = ExecFunnel((FunnelState *) node);
 			break;
@@ -668,6 +678,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSampleScan((SampleScanState *) node);
 			break;
 
+		case T_PartialSeqScanState:
+			ExecEndPartialSeqScan((PartialSeqScanState *) node);
+			break;
+
 		case T_FunnelState:
 			ExecEndFunnel((FunnelState *) node);
 			break;
diff --git a/src/backend/executor/nodePartialSeqscan.c b/src/backend/executor/nodePartialSeqscan.c
new file mode 100644
index 0000000..e4a125a
--- /dev/null
+++ b/src/backend/executor/nodePartialSeqscan.c
@@ -0,0 +1,308 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodePartialSeqscan.c
+ *	  Support routines for partial sequential scans of relations.
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodePartialSeqscan.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * INTERFACE ROUTINES
+ *		ExecPartialSeqScan				scans a relation partially.
+ *		PartialSeqNext					retrieve next tuple from heap.
+ *		ExecInitPartialSeqScan			creates and initializes a partial seqscan node.
+ *		ExecEndPartialSeqScan			releases any storage allocated.
+ */
+#include "postgres.h"
+
+#include "access/relscan.h"
+#include "executor/execdebug.h"
+#include "executor/execParallel.h"
+#include "executor/nodePartialSeqscan.h"
+#include "utils/rel.h"
+
+
+
+/* ----------------------------------------------------------------
+ *						Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		PartialSeqNext
+ *
+ *		This is a workhorse for ExecPartialSeqScan
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+PartialSeqNext(PartialSeqScanState *node)
+{
+	HeapTuple	tuple;
+	HeapScanDesc scandesc;
+	EState	   *estate;
+	ScanDirection direction;
+	TupleTableSlot *slot;
+
+	/*
+	 * get information from the estate and scan state
+	 */
+	scandesc = node->ss.ss_currentScanDesc;
+	estate = node->ss.ps.state;
+	direction = estate->es_direction;
+	slot = node->ss.ss_ScanTupleSlot;
+
+	/*
+	 * get the next tuple from the table
+	 */
+	tuple = heap_getnext(scandesc, direction);
+
+	/*
+	 * save the tuple and the buffer returned to us by the access methods in
+	 * our scan tuple slot and return the slot.  Note: we pass 'false' because
+	 * tuples returned by heap_getnext() are pointers onto disk pages and were
+	 * not created with palloc() and so should not be pfree()'d.  Note also
+	 * that ExecStoreTuple will increment the refcount of the buffer; the
+	 * refcount will not be dropped until the tuple table slot is cleared.
+	 */
+	if (tuple)
+		ExecStoreTuple(tuple,	/* tuple to store */
+					   slot,	/* slot to store in */
+					   scandesc->rs_cbuf,		/* buffer associated with this
+												 * tuple */
+					   false);	/* don't pfree this pointer */
+	else
+		ExecClearTuple(slot);
+
+	return slot;
+}
+
+/*
+ * PartialSeqRecheck -- access method routine to recheck a tuple in EvalPlanQual
+ */
+static bool
+PartialSeqRecheck(PartialSeqScanState *node, TupleTableSlot *slot)
+{
+	/*
+	 * Note that unlike IndexScan, PartialSeqScan never uses keys in
+	 * heap_beginscan (and this is very bad), so there are no keys to
+	 * recheck here.
+	 */
+	return true;
+}
+
+/* ----------------------------------------------------------------
+ *		InitPartialScanRelation
+ *
+ *		Set up to access the scan relation.
+ * ----------------------------------------------------------------
+ */
+static void
+InitPartialScanRelation(PartialSeqScanState *node, EState *estate, int eflags)
+{
+	Relation	currentRelation;
+	shm_toc    *toc;
+
+	/*
+	 * get the relation object id from the relid'th entry in the range table,
+	 * open that relation and acquire appropriate lock on it.
+	 */
+	currentRelation = ExecOpenScanRelation(estate,
+									  ((Scan *) node->ss.ps.plan)->scanrelid,
+										   eflags);
+
+	/*
+	 * Parallel scan descriptor is initialized and stored in dynamic shared
+	 * memory segment by master backend and parallel workers retrieve it from
+	 * shared memory.  We set 'toc' (place to lookup parallel scan descriptor)
+	 * as retrieved by attaching to dsm for parallel workers whereas master
+	 * backend stores it directly in partial scan state node after
+	 * initializing workers.
+	 */
+	toc = GetParallelShmToc();
+	if (toc)
+		node->ss.ps.toc = toc;
+
+	node->ss.ss_currentRelation = currentRelation;
+
+	/* and report the scan tuple slot's rowtype */
+	ExecAssignScanType(&node->ss, RelationGetDescr(currentRelation));
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitPartialSeqScan
+ * ----------------------------------------------------------------
+ */
+PartialSeqScanState *
+ExecInitPartialSeqScan(PartialSeqScan *node, EState *estate, int eflags)
+{
+	PartialSeqScanState *scanstate;
+
+	/*
+	 * Once upon a time it was possible to have an outerPlan of a SeqScan, but
+	 * not any more.
+	 */
+	Assert(outerPlan(node) == NULL);
+	Assert(innerPlan(node) == NULL);
+
+	/*
+	 * create state structure
+	 */
+	scanstate = makeNode(PartialSeqScanState);
+	scanstate->ss.ps.plan = (Plan *) node;
+	scanstate->ss.ps.state = estate;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * create expression context for node
+	 */
+	ExecAssignExprContext(estate, &scanstate->ss.ps);
+
+	/*
+	 * initialize child expressions
+	 */
+	scanstate->ss.ps.targetlist = (List *)
+		ExecInitExpr((Expr *) node->plan.targetlist,
+					 (PlanState *) scanstate);
+	scanstate->ss.ps.qual = (List *)
+		ExecInitExpr((Expr *) node->plan.qual,
+					 (PlanState *) scanstate);
+
+	/*
+	 * tuple table initialization
+	 */
+	ExecInitResultTupleSlot(estate, &scanstate->ss.ps);
+	ExecInitScanTupleSlot(estate, &scanstate->ss);
+
+	/*
+	 * initialize scan relation
+	 */
+	InitPartialScanRelation(scanstate, estate, eflags);
+
+	scanstate->ss.ps.ps_TupFromTlist = false;
+
+	/*
+	 * Initialize result tuple type and projection info.
+	 */
+	ExecAssignResultTypeFromTL(&scanstate->ss.ps);
+	ExecAssignScanProjectionInfo(&scanstate->ss);
+
+	return scanstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecPartialSeqScan(node)
+ *
+ *		Scans the relation and returns the next qualifying tuple.
+ *		We call the ExecScan() routine and pass it the appropriate
+ *		access method functions.
+ * ----------------------------------------------------------------
+ */
+TupleTableSlot *
+ExecPartialSeqScan(PartialSeqScanState *node)
+{
+	/*
+	 * Initialize the scan on first execution.  Normally we initialize it
+	 * during the ExecutorStart phase; however, this node needs the
+	 * ParallelHeapScanDesc to initialize the scan, and that is set up by
+	 * the Funnel node during the ExecutorRun phase.
+	 */
+	if (!node->scan_initialized)
+	{
+		ParallelHeapScanDesc pscan;
+
+		/*
+		 * The parallel scan descriptor is initialized and stored in the
+		 * dynamic shared memory segment by the master backend; parallel
+		 * workers and the master's local scan retrieve it from there.  If a
+		 * heap scan descriptor already exists at this point, this is a
+		 * rescan, so re-initialize it instead of beginning a new scan.
+		 */
+		Assert(node->ss.ps.toc);
+
+		pscan = shm_toc_lookup(node->ss.ps.toc, PARALLEL_KEY_SCAN);
+
+		if (!node->ss.ss_currentScanDesc)
+		{
+			node->ss.ss_currentScanDesc =
+				heap_beginscan_parallel(node->ss.ss_currentRelation, pscan);
+		}
+		else
+		{
+			heap_parallel_rescan(pscan, node->ss.ss_currentScanDesc);
+		}
+
+		node->scan_initialized = true;
+	}
+
+	return ExecScan((ScanState *) node,
+					(ExecScanAccessMtd) PartialSeqNext,
+					(ExecScanRecheckMtd) PartialSeqRecheck);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndPartialSeqScan
+ *
+ *		frees any storage allocated through C routines.
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndPartialSeqScan(PartialSeqScanState *node)
+{
+	Relation	relation;
+	HeapScanDesc scanDesc;
+
+	/*
+	 * get information from node
+	 */
+	relation = node->ss.ss_currentRelation;
+	scanDesc = node->ss.ss_currentScanDesc;
+
+	/*
+	 * Free the exprcontext
+	 */
+	ExecFreeExprContext(&node->ss.ps);
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+
+	/*
+	 * close heap scan
+	 */
+	if (scanDesc)
+		heap_endscan(scanDesc);
+
+	/*
+	 * close the heap relation.
+	 */
+	ExecCloseScanRelation(relation);
+}
+
+/* ----------------------------------------------------------------
+ *						Join Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecReScanPartialSeqScan
+ *
+ *		Rescans the relation.
+ * ----------------------------------------------------------------
+ */
+void
+ExecReScanPartialSeqScan(PartialSeqScanState *node)
+{
+	if (node->scan_initialized)
+		node->scan_initialized = false;
+
+	ExecScanReScan((ScanState *) node);
+}
diff --git a/src/backend/executor/nodeResult.c b/src/backend/executor/nodeResult.c
index 8d3dde0..b348bfd 100644
--- a/src/backend/executor/nodeResult.c
+++ b/src/backend/executor/nodeResult.c
@@ -75,6 +75,13 @@ ExecResult(ResultState *node)
 	econtext = node->ps.ps_ExprContext;
 
 	/*
+	 * A Result node can be added as a gating node on top of a PartialSeqScan
+	 * node, so we need to pass the toc information down to the outer node.
+	 */
+	if (node->ps.toc)
+		outerPlanState(node)->toc = node->ps.toc;
+
+	/*
 	 * check constant qualifications like (2 > 1), if not already done
 	 */
 	if (node->rs_checkqual)
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index ec605bd..e62fd14 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -382,6 +382,22 @@ _copySampleScan(const SampleScan *from)
 }
 
 /*
+ * _copyPartialSeqScan
+ */
+static PartialSeqScan *
+_copyPartialSeqScan(const PartialSeqScan *from)
+{
+	PartialSeqScan    *newnode = makeNode(PartialSeqScan);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopyScanFields((const Scan *) from, (Scan *) newnode);
+
+	return newnode;
+}
+
+/*
  * _copyFunnel
  */
 static Funnel *
@@ -4261,6 +4277,9 @@ copyObject(const void *from)
 		case T_SampleScan:
 			retval = _copySampleScan(from);
 			break;
+		case T_PartialSeqScan:
+			retval = _copyPartialSeqScan(from);
+			break;
 		case T_Funnel:
 			retval = _copyFunnel(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index bc9b481..8a145d9 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -458,6 +458,14 @@ _outSampleScan(StringInfo str, const SampleScan *node)
 }
 
 static void
+_outPartialSeqScan(StringInfo str, const PartialSeqScan *node)
+{
+	WRITE_NODE_TYPE("PARTIALSEQSCAN");
+
+	_outScanInfo(str, (const Scan *) node);
+}
+
+static void
 _outFunnel(StringInfo str, const Funnel *node)
 {
 	WRITE_NODE_TYPE("FUNNEL");
@@ -3018,6 +3026,9 @@ _outNode(StringInfo str, const void *obj)
 			case T_SampleScan:
 				_outSampleScan(str, obj);
 				break;
+			case T_PartialSeqScan:
+				_outPartialSeqScan(str, obj);
+				break;
 			case T_Funnel:
 				_outFunnel(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index ef88209..95a8503 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1665,6 +1665,19 @@ _readSampleScan(void)
 }
 
 /*
+ * _readPartialSeqScan
+ */
+static PartialSeqScan *
+_readPartialSeqScan(void)
+{
+	READ_LOCALS_NO_FIELDS(PartialSeqScan);
+
+	ReadCommonScan(local_node);
+
+	READ_DONE();
+}
+
+/*
  * _readIndexScan
  */
 static IndexScan *
@@ -2366,6 +2379,8 @@ parseNodeString(void)
 		return_value = _readSeqScan();
 	else if (MATCH("SAMPLESCAN", 10))
 		return_value = _readSampleScan();
+	else if (MATCH("PARTIALSEQSCAN", 14))
+		return_value = _readPartialSeqScan();
 	else if (MATCH("INDEXSCAN", 9))
 		return_value = _readIndexScan();
 	else if (MATCH("INDEXONLYSCAN", 13))
diff --git a/src/backend/optimizer/path/Makefile b/src/backend/optimizer/path/Makefile
index 6864a62..6e462b1 100644
--- a/src/backend/optimizer/path/Makefile
+++ b/src/backend/optimizer/path/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = allpaths.o clausesel.o costsize.o equivclass.o indxpath.o \
-       joinpath.o joinrels.o pathkeys.o tidpath.o
+       joinpath.o joinrels.o pathkeys.o parallelpath.o tidpath.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 8fc1cfd..c2ae95d 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -477,6 +477,9 @@ set_plain_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
 	/* Consider sequential scan */
 	add_path(rel, create_seqscan_path(root, rel, required_outer));
 
+	/* Consider parallel scans */
+	create_parallelscan_paths(root, rel, required_outer);
+
 	/* Consider index scans */
 	create_index_paths(root, rel);
 
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 0a40f6c..5e8cd56 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -296,6 +296,50 @@ cost_samplescan(Path *path, PlannerInfo *root,
 }
 
 /*
+ * cost_partialseqscan
+ *	  Determines and returns the cost of scanning a relation partially.
+ *
+ * 'baserel' is the relation to be scanned
+ * 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
+ * 'nworkers' is the number of workers among which the work will be
+ *			distributed
+ */
+void
+cost_partialseqscan(Path *path, PlannerInfo *root,
+					RelOptInfo *baserel, ParamPathInfo *param_info,
+					int nworkers)
+{
+	Cost		startup_cost = 0;
+	Cost		run_cost = 0;
+
+	cost_seqscan(path, root, baserel, param_info);
+
+	startup_cost = path->startup_cost;
+
+	run_cost = path->total_cost - startup_cost;
+
+	/*
+	 * Account for small cost for communication related to scan
+	 * via the ParallelHeapScanDesc.
+	 */
+	run_cost += 0.01;
+
+	/*
+	 * Runtime cost will be equally shared by all workers.
+	 * The assumption here is that disk access cost will also be
+	 * equally shared between workers, which is generally true
+	 * unless there are too many workers working on a relatively
+	 * small number of blocks.  If we come across any such case,
+	 * then we can think of changing the current cost model for
+	 * partial sequential scan.
+	 */
+	run_cost = run_cost / (nworkers + 1);
+
+	path->startup_cost = startup_cost;
+	path->total_cost = startup_cost + run_cost;
+}
+
+/*
  * cost_funnel
  *	  Determines and returns the cost of funnel path.
  *
diff --git a/src/backend/optimizer/path/parallelpath.c b/src/backend/optimizer/path/parallelpath.c
new file mode 100644
index 0000000..a5a25cd
--- /dev/null
+++ b/src/backend/optimizer/path/parallelpath.c
@@ -0,0 +1,90 @@
+/*-------------------------------------------------------------------------
+ *
+ * parallelpath.c
+ *	  Routines to determine parallel paths for scanning a given relation.
+ *
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/optimizer/path/parallelpath.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heapam.h"
+#include "optimizer/cost.h"
+#include "optimizer/pathnode.h"
+#include "optimizer/paths.h"
+#include "parser/parsetree.h"
+#include "utils/rel.h"
+
+
+/*
+ * create_parallelscan_paths
+ *	  Create paths corresponding to parallel scans of the given rel.
+ *	  Currently we only support partial sequential scan.
+ *
+ *	  Candidate paths are added to the rel's pathlist (using add_path).
+ */
+void
+create_parallelscan_paths(PlannerInfo *root, RelOptInfo *rel,
+						  Relids required_outer)
+{
+	int			num_parallel_workers = 0;
+	int			estimated_parallel_workers = 0;
+	Oid			reloid;
+	Relation	relation;
+	Path	   *subpath;
+
+	/*
+	 * parallel scan is possible only if user has set parallel_seqscan_degree
+	 * to value greater than 0 and the query is parallel-safe.
+	 */
+	if (parallel_seqscan_degree <= 0 || !root->glob->parallelModeOK)
+		return;
+
+	/*
+	 * There should be at least a thousand pages to scan for each worker. This
+	 * number is somewhat arbitrary; however, we don't want to spawn workers
+	 * to scan smaller relations as that will be costly.
+	 */
+	estimated_parallel_workers = rel->pages / 1000;
+
+	if (estimated_parallel_workers <= 0)
+		return;
+
+	reloid = planner_rt_fetch(rel->relid, root)->relid;
+
+	relation = heap_open(reloid, NoLock);
+
+	/*
+	 * Temporary relations can't be scanned by parallel workers as they are
+	 * visible only to local sessions.
+	 */
+	if (RelationUsesLocalBuffers(relation))
+	{
+		heap_close(relation, NoLock);
+		return;
+	}
+
+	heap_close(relation, NoLock);
+
+	num_parallel_workers = Min(parallel_seqscan_degree,
+							   estimated_parallel_workers);
+
+	/*
+	 * Create the partial scan path which each worker backend needs to
+	 * execute.
+	 */
+	subpath = create_partialseqscan_path(root, rel, required_outer,
+										 num_parallel_workers);
+
+	/* Create the funnel path which master backend needs to execute. */
+	add_path(rel, (Path *) create_funnel_path(root, rel, subpath,
+											  required_outer,
+											  num_parallel_workers));
+}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index e987922..67d74db 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -60,6 +60,8 @@ static SeqScan *create_seqscan_plan(PlannerInfo *root, Path *best_path,
 					List *tlist, List *scan_clauses);
 static SampleScan *create_samplescan_plan(PlannerInfo *root, Path *best_path,
 					   List *tlist, List *scan_clauses);
+static Scan *create_partialseqscan_plan(PlannerInfo *root, Path *best_path,
+						List *tlist, List *scan_clauses);
 static Funnel *create_funnel_plan(PlannerInfo *root,
 				   FunnelPath *best_path);
 static Scan *create_indexscan_plan(PlannerInfo *root, IndexPath *best_path,
@@ -106,6 +108,8 @@ static void copy_plan_costsize(Plan *dest, Plan *src);
 static SeqScan *make_seqscan(List *qptlist, List *qpqual, Index scanrelid);
 static SampleScan *make_samplescan(List *qptlist, List *qpqual, Index scanrelid,
 				TableSampleClause *tsc);
+static PartialSeqScan *make_partialseqscan(List *qptlist, List *qpqual,
+									Index scanrelid);
 static Funnel *make_funnel(List *qptlist, List *qpqual,
 			Index scanrelid, int nworkers,
 			Plan *subplan);
@@ -239,6 +243,7 @@ create_plan_recurse(PlannerInfo *root, Path *best_path)
 	{
 		case T_SeqScan:
 		case T_SampleScan:
+		case T_PartialSeqScan:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -365,6 +370,13 @@ create_scan_plan(PlannerInfo *root, Path *best_path)
 												   scan_clauses);
 			break;
 
+		case T_PartialSeqScan:
+			plan = (Plan *) create_partialseqscan_plan(root,
+													   best_path,
+													   tlist,
+													   scan_clauses);
+			break;
+
 		case T_IndexScan:
 			plan = (Plan *) create_indexscan_plan(root,
 												  (IndexPath *) best_path,
@@ -569,6 +581,7 @@ disuse_physical_tlist(PlannerInfo *root, Plan *plan, Path *path)
 	{
 		case T_SeqScan:
 		case T_SampleScan:
+		case T_PartialSeqScan:
 		case T_Funnel:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
@@ -1204,6 +1217,46 @@ create_samplescan_plan(PlannerInfo *root, Path *best_path,
 }
 
 /*
+ * create_partialseqscan_plan
+ *
+ * Returns a partial seqscan plan for the base relation scanned by
+ * 'best_path' with restriction clauses 'scan_clauses' and targetlist
+ * 'tlist'.
+ */
+static Scan *
+create_partialseqscan_plan(PlannerInfo *root, Path *best_path,
+						   List *tlist, List *scan_clauses)
+{
+	Scan    *scan_plan;
+	Index		scan_relid = best_path->parent->relid;
+
+	/* it should be a base rel... */
+	Assert(scan_relid > 0);
+	Assert(best_path->parent->rtekind == RTE_RELATION);
+
+	/* Sort clauses into best execution order */
+	scan_clauses = order_qual_clauses(root, scan_clauses);
+
+	/* Reduce RestrictInfo list to bare expressions; ignore pseudoconstants */
+	scan_clauses = extract_actual_clauses(scan_clauses, false);
+
+	/* Replace any outer-relation variables with nestloop params */
+	if (best_path->param_info)
+	{
+		scan_clauses = (List *)
+			replace_nestloop_params(root, (Node *) scan_clauses);
+	}
+
+	scan_plan = (Scan *) make_partialseqscan(tlist,
+											 scan_clauses,
+											 scan_relid);
+
+	copy_path_costsize(&scan_plan->plan, best_path);
+
+	return scan_plan;
+}
+
+/*
  * create_funnel_plan
  *
  * Returns a funnel plan for the base relation scanned by
@@ -3532,6 +3585,24 @@ make_samplescan(List *qptlist,
 	return node;
 }
 
+static PartialSeqScan *
+make_partialseqscan(List *qptlist,
+					List *qpqual,
+					Index scanrelid)
+{
+	PartialSeqScan *node = makeNode(PartialSeqScan);
+	Plan	   *plan = &node->plan;
+
+	/* cost should be inserted by caller */
+	plan->targetlist = qptlist;
+	plan->qual = qpqual;
+	plan->lefttree = NULL;
+	plan->righttree = NULL;
+	node->scanrelid = scanrelid;
+
+	return node;
+}
+
 static Funnel *
 make_funnel(List *qptlist,
 			List *qpqual,
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 2cff5a9..90b611b 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -442,6 +442,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
 			{
 				SeqScan    *splan = (SeqScan *) plan;
 
@@ -2309,6 +2310,11 @@ fix_node_funcids(Plan *node)
 
 	switch (nodeTag(node))
 	{
+		case T_Result:
+			fix_opfuncids((Node*) (((Result *)node)->resconstantqual));
+			break;
+		case T_PartialSeqScan:
+			break;
 		default:
 			elog(ERROR, "unrecognized node type: %d", (int) nodeTag(node));
 			break;
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 073a7f5..37b5909 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2243,6 +2243,7 @@ finalize_plan(PlannerInfo *root, Plan *plan, Bitmapset *valid_params,
 			context.paramids = bms_add_members(context.paramids, scan_params);
 			break;
 
+		case T_PartialSeqScan:
 		case T_Funnel:
 			context.paramids = bms_add_members(context.paramids, scan_params);
 			break;
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index c886075..628fb84 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -732,6 +732,28 @@ create_samplescan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer
 }
 
 /*
+ * create_partialseqscan_path
+ *	  Creates a path corresponding to a partial sequential scan, returning the
+ *	  pathnode.
+ */
+Path *
+create_partialseqscan_path(PlannerInfo *root, RelOptInfo *rel,
+						   Relids required_outer, int nworkers)
+{
+	Path	   *pathnode = makeNode(Path);
+
+	pathnode->pathtype = T_PartialSeqScan;
+	pathnode->parent = rel;
+	pathnode->param_info = get_baserel_parampathinfo(root, rel,
+													 required_outer);
+	pathnode->pathkeys = NIL;	/* partialseqscan has unordered result */
+
+	cost_partialseqscan(pathnode, root, rel, pathnode->param_info, nworkers);
+
+	return pathnode;
+}
+
+/*
  * create_funnel_path
  *
  *	  Creates a path corresponding to a funnel scan, returning the
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 75e6b72..ead8411 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -96,8 +96,9 @@ extern Relation heap_openrv_extended(const RangeVar *relation,
 
 #define heap_close(r,l)  relation_close(r,l)
 
-/* struct definition appears in relscan.h */
+/* struct definitions appear in relscan.h */
 typedef struct HeapScanDescData *HeapScanDesc;
+typedef struct ParallelHeapScanDescData *ParallelHeapScanDesc;
 
 /*
  * HeapScanIsValid
@@ -121,11 +122,17 @@ extern void heap_setscanlimits(HeapScanDesc scan, BlockNumber startBlk,
 				   BlockNumber endBlk);
 extern void heapgetpage(HeapScanDesc scan, BlockNumber page);
 extern void heap_rescan(HeapScanDesc scan, ScanKey key);
+extern void heap_parallel_rescan(ParallelHeapScanDesc pscan, HeapScanDesc scan);
 extern void heap_rescan_set_params(HeapScanDesc scan, ScanKey key,
 					 bool allow_strat, bool allow_sync, bool allow_pagemode);
 extern void heap_endscan(HeapScanDesc scan);
 extern HeapTuple heap_getnext(HeapScanDesc scan, ScanDirection direction);
 
+extern Size heap_parallelscan_estimate(Snapshot snapshot);
+extern void heap_parallelscan_initialize(ParallelHeapScanDesc target,
+							 Relation relation, Snapshot snapshot);
+extern HeapScanDesc heap_beginscan_parallel(Relation, ParallelHeapScanDesc);
+
 extern bool heap_fetch(Relation relation, Snapshot snapshot,
 		   HeapTuple tuple, Buffer *userbuf, bool keep_buf,
 		   Relation stats_relation);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 6e62319..f962f83 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -20,6 +20,17 @@
 #include "access/itup.h"
 #include "access/tupdesc.h"
 
+/* Struct for parallel scan setup */
+typedef struct ParallelHeapScanDescData
+{
+	Oid			phs_relid;
+	BlockNumber	phs_nblocks;
+	slock_t		phs_mutex;
+	BlockNumber phs_cblock;
+	BlockNumber phs_startblock;
+	bool		phs_firstpass;
+	char		phs_snapshot_data[FLEXIBLE_ARRAY_MEMBER];
+}	ParallelHeapScanDescData;
 
 typedef struct HeapScanDescData
 {
@@ -49,6 +60,7 @@ typedef struct HeapScanDescData
 	BlockNumber rs_cblock;		/* current block # in scan, if any */
 	Buffer		rs_cbuf;		/* current buffer in scan, if any */
 	/* NB: if rs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
+	ParallelHeapScanDesc rs_parallel; /* parallel scan information */
 
 	/* these fields only used in page-at-a-time mode and for bitmap scans */
 	int			rs_cindex;		/* current tuple's index in vistuples */
diff --git a/src/include/executor/nodePartialSeqscan.h b/src/include/executor/nodePartialSeqscan.h
new file mode 100644
index 0000000..cec09ad
--- /dev/null
+++ b/src/include/executor/nodePartialSeqscan.h
@@ -0,0 +1,25 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodePartialSeqscan.h
+ *		prototypes for nodePartialSeqscan.c
+ *
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodePartialSeqscan.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEPARTIALSEQSCAN_H
+#define NODEPARTIALSEQSCAN_H
+
+#include "nodes/execnodes.h"
+
+extern PartialSeqScanState *ExecInitPartialSeqScan(PartialSeqScan *node,
+					   EState *estate, int eflags);
+extern TupleTableSlot *ExecPartialSeqScan(PartialSeqScanState *node);
+extern void ExecEndPartialSeqScan(PartialSeqScanState *node);
+extern void ExecReScanPartialSeqScan(PartialSeqScanState *node);
+
+#endif   /* NODEPARTIALSEQSCAN_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index ee82999..508f69d 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1282,6 +1282,16 @@ typedef struct SampleScanState
 } SampleScanState;
 
 /*
+ * PartialSeqScanState extends ScanState by storing additional information
+ * related to scan.
+ */
+typedef struct PartialSeqScanState
+{
+	ScanState		ss;				/* its first field is NodeTag */
+	bool			scan_initialized; /* used to determine if the scan is initialized */
+} PartialSeqScanState;
+
+/*
  * FunnelState extends ScanState by storing additional information
  * related to parallel workers.
  *		pcxt				parallel context for managing generic state information
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 48f7160..039f053 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -52,6 +52,7 @@ typedef enum NodeTag
 	T_Scan,
 	T_SeqScan,
 	T_SampleScan,
+	T_PartialSeqScan,
 	T_Funnel,
 	T_IndexScan,
 	T_IndexOnlyScan,
@@ -100,6 +101,7 @@ typedef enum NodeTag
 	T_ScanState,
 	T_SeqScanState,
 	T_SampleScanState,
+	T_PartialSeqScanState,
 	T_FunnelState,
 	T_IndexScanState,
 	T_IndexOnlyScanState,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 69302af..2d25a01 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -296,6 +296,12 @@ typedef struct SampleScan
 	struct TableSampleClause *tablesample;
 } SampleScan;
 
+/* ----------------
+ *		partial sequential scan node
+ * ----------------
+ */
+typedef SeqScan PartialSeqScan;
+
 /* ------------
  *		Funnel node
  * ------------
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 8ca7d0f..4ecafd4 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -80,6 +80,9 @@ extern void cost_seqscan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 			 ParamPathInfo *param_info);
 extern void cost_samplescan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 				ParamPathInfo *param_info);
+extern void cost_partialseqscan(Path *path, PlannerInfo *root,
+						RelOptInfo *baserel, ParamPathInfo *param_info,
+						int nworkers);
 extern void cost_funnel(FunnelPath *path, PlannerInfo *root,
 			RelOptInfo *baserel, ParamPathInfo *param_info);
 extern void cost_index(IndexPath *path, PlannerInfo *root,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 29b344f..8c8a1a2 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -34,6 +34,8 @@ extern Path *create_seqscan_path(PlannerInfo *root, RelOptInfo *rel,
 					Relids required_outer);
 extern Path *create_samplescan_path(PlannerInfo *root, RelOptInfo *rel,
 					   Relids required_outer);
+extern Path *create_partialseqscan_path(PlannerInfo *root, RelOptInfo *rel,
+						Relids required_outer, int nworkers);
 extern FunnelPath *create_funnel_path(PlannerInfo *root,
 				   RelOptInfo *rel, Path *subpath, Relids required_outer,
 				   int nworkers);
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 87123a5..e7db9ab 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -55,6 +55,14 @@ extern void debug_print_rel(PlannerInfo *root, RelOptInfo *rel);
 #endif
 
 /*
+ * parallelpath.c
+ *	  routines to generate parallel scan paths
+ */
+
+extern void create_parallelscan_paths(PlannerInfo *root, RelOptInfo *rel,
+								Relids required_outer);
+
+/*
  * indxpath.c
  *	  routines to generate index paths
  */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index ad2a33f..f06c7c2 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1200,6 +1200,8 @@ PACE_HEADER
 PACL
 parallel_estimate_ctx
 ParallelStmt
+PartialSeqScan
+PartialSeqScanState
 PATH
 PBOOL
 PCtxtHandle
#353Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#352)
Re: Parallel Seq Scan

On Thu, Sep 24, 2015 at 2:31 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I have fixed most of the review comments raised in this mail
as well as other e-mails and rebased the patch on commit-
020235a5. Even though I have fixed many of the things, but
still quite a few comments are yet to be handled. This patch
is mainly a rebased version to ease the review. We can continue
to have discussion on the left over things and I will address
those in consecutive patches.

Thanks for the update. Here are some more review comments:

1. parallel_seqscan_degree is still not what we should have here. As
previously mentioned, I think we should have a GUC for the maximum
degree of parallelism in a query generally, not the maximum degree of
parallel sequential scan.

2. fix_node_funcids() can go away because of commit
9f1255ac859364a86264a67729dbd1a36dd63ff2.

3. cost_patialseqscan is still misspelled. I pointed this out before, too.

4. Much more seriously than any of the above,
create_parallelscan_paths() generates plans that are badly broken:

rhaas=# explain select * from pgbench_accounts where filler < random()::text;
QUERY PLAN
-----------------------------------------------------------------------------------------
Funnel on pgbench_accounts (cost=0.00..35357.73 rows=3333333 width=97)
Filter: ((filler)::text < (random())::text)
Number of Workers: 10
-> Partial Seq Scan on pgbench_accounts (cost=0.00..35357.73
rows=3333333 width=97)
Filter: ((filler)::text < (random())::text)
(5 rows)

This is wrong both because random() is parallel-restricted and thus
can't be executed in a parallel worker, and also because enforcing a
volatile qual twice is no good.

rhaas=# explain select * from pgbench_accounts where aid % 10000 = 0;
QUERY PLAN
---------------------------------------------------------------------------------------
Funnel on pgbench_accounts (cost=0.00..28539.55 rows=50000 width=97)
Filter: ((aid % 10000) = 0)
Number of Workers: 10
-> Partial Seq Scan on pgbench_accounts (cost=0.00..28539.55
rows=50000 width=97)
Filter: ((aid % 10000) = 0)
(5 rows)

This will work, but it's a bad plan because we shouldn't need to
re-apply the filter condition in the parallel leader after we've
already checked it in the worker.

rhaas=# explain select * from pgbench_accounts a where a.bid not in
(select bid from pgbench_branches);
QUERY PLAN
-------------------------------------------------------------------------------------------
Funnel on pgbench_accounts a (cost=2.25..26269.07 rows=5000000 width=97)
Filter: (NOT (hashed SubPlan 1))
Number of Workers: 10
-> Partial Seq Scan on pgbench_accounts a (cost=2.25..26269.07
rows=5000000 width=97)
Filter: (NOT (hashed SubPlan 1))
SubPlan 1
-> Seq Scan on pgbench_branches (cost=0.00..2.00 rows=100 width=4)
SubPlan 1
-> Seq Scan on pgbench_branches (cost=0.00..2.00 rows=100 width=4)
(9 rows)

This will not work, because the subplan isn't available inside the
worker. Trying it without the EXPLAIN crashes the server. This is
more or less the same issue as one of the known issues you already
mentioned, but I mention it again here because I think this case is
closely related to the previous two.

Basically, you need to have some kind of logic for deciding which
things need to go below the funnel and which on the funnel itself.
The stuff that's safe should get pushed down, and the stuff that's not
safe should get attached to the funnel. The unsafe stuff is whatever
contains references to initplans or subplans, or anything that
contains parallel-restricted functions ... and there might be some
other stuff, too, but at least those things.
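
The rule in the paragraph above (safe quals go below the funnel, unsafe ones stay on it) can be sketched as follows. This is only an illustration under stated assumptions: QualSketch, qual_is_pushdown_safe() and split_quals() are hypothetical names, not part of the patch or of PostgreSQL, and the two boolean flags stand in for whatever expression-tree walk the planner would really need.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a qual clause (not PostgreSQL's Expr). */
typedef struct QualSketch
{
	const char *text;					/* printable form of the clause */
	bool		has_subplan_ref;		/* references an initplan or subplan */
	bool		has_restricted_func;	/* calls a parallel-restricted function */
} QualSketch;

/* A clause may run in a worker only if nothing inside it is unsafe. */
static bool
qual_is_pushdown_safe(const QualSketch *q)
{
	return !q->has_subplan_ref && !q->has_restricted_func;
}

/*
 * Split quals[] into clauses that go below the funnel (worker_quals) and
 * clauses that stay attached to the funnel (funnel_quals).  Both output
 * arrays must have room for nquals pointers.  Returns the number of
 * pushed-down clauses; *nfunnel receives the number kept on the funnel.
 */
static size_t
split_quals(const QualSketch *quals, size_t nquals,
			const QualSketch **worker_quals,
			const QualSketch **funnel_quals, size_t *nfunnel)
{
	size_t		nworker = 0;

	assert(quals != NULL || nquals == 0);
	*nfunnel = 0;
	for (size_t i = 0; i < nquals; i++)
	{
		if (qual_is_pushdown_safe(&quals[i]))
			worker_quals[nworker++] = &quals[i];
		else
			funnel_quals[(*nfunnel)++] = &quals[i];
	}
	return nworker;
}
```

With the three example queries above, only the `aid % 10000 = 0` clause would land in worker_quals; the random() clause and the SubPlan clause would stay on the funnel, which then has to evaluate them itself.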

Instead of preventing initplans or subplans from getting pushed down
into the funnel, we could instead try to teach the system to push them
down. However, that's very complicated; e.g. a subplan that
references a CTE isn't safe to push down, and a subplan that
references another subplan must be pushed down if that other subplan
is pushed down, and an initplan that contains volatile functions can't
be pushed down because each worker would execute it separately and
they might not all get the same answer, and an initplan that
references a temporary table can't be pushed down because it can't be
referenced from a worker. All in all, it seems better not to go there
right now.

Now, when you fix this, you're quickly going to run into another
problem, which is that as you have this today, the funnel node does
not actually enforce its filter condition, so the EXPLAIN plan is a
dastardly lie. And when you try to fix that, you're going to run into
a third problem, which is that the stuff the funnel node needs in
order to evaluate its filter condition might not be in the partial seq
scan's target list. So you need to fix both of those problems, too.
Even if you cheat and just don't generate a parallel path at all
except when all quals can be pushed down, you're still going to have
to fix these problems: it's not OK to print out a filter condition on
the funnel as if you were going to enforce it, and then not actually
enforce it there.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#354Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#350)
Re: Parallel Seq Scan

On Thu, Sep 24, 2015 at 12:00 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Sep 3, 2015 at 6:21 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

[ new patches ]

It looks to me like there would be trouble if an initPlan or subPlan
were kept below a Funnel, or as I guess we're going to call it, a
Gather node. That's because a SubPlan doesn't actually have a pointer
to the node tree for the sub-plan in it. It just has an index into
PlannedStmt.subplans. But create_parallel_worker_plannedstmt sets the
subplans list to NIL. So that's not gonna work. Now maybe there's no
way for an initPlan or a subPlan to creep down under the funnel, but I
don't immediately see what would prevent it.

I think initPlan will work with the existing patches, as we always
execute it in master and then send the result to workers. Refer to the
code below from the funnel patch:

ExecFunnel()
{
..
+ /*
+ * Evaluate the InitPlan and pass the PARAM_EXEC params, so that
+ * values can be shared with worker backend. This is different from
+ * the way InitPlans are evaluated (lazy evaluation) at other places
+ * as instead of sharing the InitPlan to all the workers and let them
+ * execute, we pass the values which can be directly used by worker
+ * backends.
+ */
+ serialized_param_exec = ExecAndFormSerializeParamExec(econtext,
+ node->ss.ps.plan->lefttree->allParam);
}

For SubPlan, as mentioned in yesterday's mail, the patch does not deal
with it yet, but I think it should work if we assign the subplans to the
planned statement we are passing to the worker. Here we need to avoid
pushing down unsafe subplans; some more thought is required to see
what exactly needs to be done.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#355Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#353)
Re: Parallel Seq Scan

On Fri, Sep 25, 2015 at 8:50 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Sep 24, 2015 at 2:31 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I have fixed most of the review comments raised in this mail
as well as other e-mails and rebased the patch on commit-
020235a5. Even though I have fixed many of the things, but
still quite a few comments are yet to be handled. This patch
is mainly a rebased version to ease the review. We can continue
to have discussion on the left over things and I will address
those in consecutive patches.

Thanks for the update. Here are some more review comments:

1. parallel_seqscan_degree is still not what we should have here. As
previously mentioned, I think we should have a GUC for the maximum
degree of parallelism in a query generally, not the maximum degree of
parallel sequential scan.

Agreed and upthread I have asked about your opinion after proposing a
suitable name.

Some suitable names could be:
degree_of_parallelism, max_degree_of_parallelism,
degree_of_query_parallelism, max_degree_of_query_parallelism

2. fix_node_funcids() can go away because of commit
9f1255ac859364a86264a67729dbd1a36dd63ff2.

Agreed.

3. cost_patialseqscan is still misspelled. I pointed this out before, too.

In the latest patch (parallel_seqscan_partialseqscan_v18.patch) posted by
me yesterday, this was fixed. Am I missing something, or are you by any
chance referring to the wrong version of the patch?

4. Much more seriously than any of the above,
create_parallelscan_paths() generates plans that are badly broken:

rhaas=# explain select * from pgbench_accounts where filler < random()::text;
QUERY PLAN
-----------------------------------------------------------------------------------------
Funnel on pgbench_accounts (cost=0.00..35357.73 rows=3333333 width=97)
Filter: ((filler)::text < (random())::text)
Number of Workers: 10
-> Partial Seq Scan on pgbench_accounts (cost=0.00..35357.73
rows=3333333 width=97)
Filter: ((filler)::text < (random())::text)
(5 rows)

This is wrong both because random() is parallel-restricted and thus
can't be executed in a parallel worker, and also because enforcing a
volatile qual twice is no good.

Yes, the patch needs more work in terms of dealing with parallel-restricted
expressions/functions. One idea which I have explored previously is to
push down only safe clauses to workers (via partialseqscan node) and
execute restricted clauses in master (via Funnel node). My analysis
is as follows:

Usage of restricted functions in quals -
During create_plan() phase, separate out the quals that need to be
executed at funnel node versus quals that need to be executed on
partial seq scan node (do something similar to what is done in
create_indexscan_plan for index and non-index quals).

Basically PartialSeqScan node can contain two different lists of quals,
one for non-restrictive quals and the other for restrictive quals, and then
Funnel node can retrieve restrictive quals from partialseqscan node,
assuming partialseqscan node is its left child.

Now, I think the above is only possible under the assumption that
partialseqscan node is always a left child of funnel node; however, that is
not true because a gating node (Result node) can be added between the
two, and tomorrow there could be more cases where other nodes will be
added between the two. If we consider the case of aggregation, the
situation will be more complex, as all the quals should be executed
before partial aggregation.

Unless there is a good way to achieve the partial execution of quals,
I think it is better to prohibit parallel plans if there is any restrictive
clause. Yet another way could be to not push down a qual if it contains
restricted functions and execute it at Funnel node, but I think that will be
quite costly due to the additional flow of tuples from worker backends.

Usage of restricted functions in target list -
One way could be that if the target list contains any restricted function,
then the parallel worker always needs to send the complete tuple, and the
target list will be evaluated by the master backend during funnel node
execution. I think some of the restrictions that apply to quals apply to
this as well, especially if there are other nodes like aggregation in
between Funnel and partialseqscan.

rhaas=# explain select * from pgbench_accounts where aid % 10000 = 0;
QUERY PLAN
---------------------------------------------------------------------------------------
Funnel on pgbench_accounts (cost=0.00..28539.55 rows=50000 width=97)
Filter: ((aid % 10000) = 0)
Number of Workers: 10
-> Partial Seq Scan on pgbench_accounts (cost=0.00..28539.55
rows=50000 width=97)
Filter: ((aid % 10000) = 0)
(5 rows)

This will work, but it's a bad plan because we shouldn't need to
re-apply the filter condition in the parallel leader after we've
already checked it in the worker.

It's only shown in Explain plan, but it never executes quals at funnel node.
This needs to be changed and the change will depend on previous case.

rhaas=# explain select * from pgbench_accounts a where a.bid not in
(select bid from pgbench_branches);
QUERY PLAN
-------------------------------------------------------------------------------------------
Funnel on pgbench_accounts a (cost=2.25..26269.07 rows=5000000 width=97)
Filter: (NOT (hashed SubPlan 1))
Number of Workers: 10
-> Partial Seq Scan on pgbench_accounts a (cost=2.25..26269.07
rows=5000000 width=97)
Filter: (NOT (hashed SubPlan 1))
SubPlan 1
-> Seq Scan on pgbench_branches (cost=0.00..2.00 rows=100 width=4)
SubPlan 1
-> Seq Scan on pgbench_branches (cost=0.00..2.00 rows=100 width=4)
(9 rows)

This will not work, because the subplan isn't available inside the
worker. Trying it without the EXPLAIN crashes the server. This is
more or less the same issue as one of the known issues you already
mentioned, but I mention it again here because I think this case is
closely related to the previous two.

Basically, you need to have some kind of logic for deciding which
things need to go below the funnel and which on the funnel itself.
The stuff that's safe should get pushed down, and the stuff that's not
safe should get attached to the funnel. The unsafe stuff is whatever
contains references to initplans or subplans, or anything that
contains parallel-restricted functions ... and there might be some
other stuff, too, but at least those things.

Yes, another thing as we discussed previously that needs to be prohibited
is nested-funnel nodes.

Instead of preventing initplans or subplans from getting pushed down
into the funnel, we could instead try to teach the system to push them
down. However, that's very complicated; e.g. a subplan that
references a CTE isn't safe to push down, and a subplan that
references another subplan must be pushed down if that other subplan
is pushed down, and an initplan that contains volatile functions can't
be pushed down because each worker would execute it separately and
they might not all get the same answer, and an initplan that
references a temporary table can't be pushed down because it can't be
referenced from a worker. All in all, it seems better not to go there
right now.

As mentioned in my previous mail, I think initplans should work as we
are executing them in master backend, however handling of subplans
needs more thoughts.

Now, when you fix this, you're quickly going to run into another
problem, which is that as you have this today, the funnel node does
not actually enforce its filter condition, so the EXPLAIN plan is a
dastardly lie. And when you try to fix that, you're going to run into
a third problem, which is that the stuff the funnel node needs in
order to evaluate its filter condition might not be in the partial seq
scan's target list. So you need to fix both of those problems, too.
Even if you cheat and just don't generate a parallel path at all
except when all quals can be pushed down, you're still going to have
to fix these problems: it's not OK to print out a filter condition on
the funnel as if you were going to enforce it, and then not actually
enforce it there.

Agreed and fix depends on what we decide to do for un-safe quals.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#356Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#351)
Re: Parallel Seq Scan

On Thu, Sep 24, 2015 at 8:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Sep 3, 2015 at 6:21 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

[ new patches ]

Still more review comments:

+ /* Allow space for terminating zero-byte */
+ size = add_size(size, 1);

This is pointless. The length is already stored separately, and if it
weren't, this wouldn't be adequate anyway because a varlena can
contain NUL bytes. It won't if it's text, but it could be bytea or
numeric or whatever.

Agreed and I think we can do without that as well.

RestoreBoundParams is broken, because it can do unaligned reads, which
will core dump on some architectures (and merely be slow on others).
If there are two or more parameters, and the first one is a varlena
with a length that is not a multiple of MAXIMUM_ALIGNOF, the second
SerializedParamExternData will be misaligned.

Agreed, but can't we fix that problem even in the current patch by using
*ALIGN (TYPEALIGN) macro?
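
For illustration, the padding approach asked about above amounts to rounding each record's end offset up to the maximum alignment before the next record is placed. The sketch below is hypothetical: sketch_typealign() and next_record_offset() are made-up names, SKETCH_MAXALIGN of 8 is an assumption (the real MAXIMUM_ALIGNOF is platform-specific), and only the power-of-two rounding trick is standard.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed maximum alignment; the real value depends on the platform. */
#define SKETCH_MAXALIGN 8

/* Round len up to the next multiple of alignval, which must be a power of two. */
static size_t
sketch_typealign(size_t alignval, size_t len)
{
	assert(alignval != 0 && (alignval & (alignval - 1)) == 0);
	return (len + (alignval - 1)) & ~(alignval - 1);
}

/*
 * Given a record that starts at an aligned offset 'start' and consists of a
 * fixed header of hdr_size bytes followed by payload_len bytes of varlena
 * data, return the aligned offset at which the next record may start.
 */
static size_t
next_record_offset(size_t start, size_t hdr_size, size_t payload_len)
{
	return sketch_typealign(SKETCH_MAXALIGN, start + hdr_size + payload_len);
}
```

For example, a 24-byte header followed by a 5-byte varlena ends at offset 29, so the next record would start at 32 rather than at the misaligned offset 29. Both the writer and the reader would have to apply the same rounding.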

Also, it's pretty lame that we send the useless pointer even for a
pass-by-reference data type and then overwrite the bad pointer with a
good one a few lines later. It would be better to design the
serialization format so that we don't send the bogus pointer over the
wire in the first place.

It's also problematic in my view that there is so much duplicated code
here. SerializedParamExternData and SerializedParamExecData are very
similar and there are large swaths of very similar code to handle each
case. Both structures contain Datum value, Size length, bool isnull,
and Oid ptype, albeit not in the same order for some reason. The only
difference is that SerializedParamExternData contains uint16 pflags
and SerializedParamExecData contains int paramid. I think we need to
refactor this code to get rid of all this duplication.

The reason for having different structures is that both structures are
derived from their non-serialized versions (ParamExternData and
ParamExecData). I agree that we can avoid code duplication in the
serialization and restoration of both structures; however, doing it with a
structure compatible with the non-serialized version seems to give better
code readability and to be easier to enhance (think of the case where we
need to add a new parameter to the non-serialized version). This can vary
somewhat with each individual's perspective, and I won't argue much here if
you say that having a native format as you are suggesting is better in terms
of code readability and maintenance.

I suggest that
we decide to represent a datum here in a uniform fashion, perhaps like
this:

First, store a 4-byte header word. If this is -2, the value is NULL
and no data follows. If it's -1, the value is pass-by-value and
sizeof(Datum) bytes follow. If it's >0, the value is
pass-by-reference and the value gives the number of following bytes
that should be copied into a brand-new palloc'd chunk.

Using a format like this, we can serialize and restore datums in
various contexts - including bind and exec params - without having to
rewrite the code each time. For example, for param extern data, you
can dump an array of all the ptypes and paramids and then follow it
with all of the params one after another. For param exec data, you
can dump an array of all the ptypes and paramids and then follow it
with the values one after another. The code that reads and writes the
datums in both cases can be the same. If we need to send datums in
other contexts, we can use the same code for it.
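
The header-word format described above can be sketched in a few lines. This is not the attached patch: DatumSketch, sketch_datum_serialize() and sketch_datum_restore() are hypothetical names, uintptr_t stands in for Datum, and malloc() stands in for palloc(). Only the encoding (-2 for NULL, -1 for pass-by-value, a positive length for pass-by-reference) is taken from the description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uintptr_t DatumSketch;	/* hypothetical stand-in for Datum */

/* Append one value at *pos using the 4-byte header format; advances *pos. */
static void
sketch_datum_serialize(DatumSketch value, bool isnull, bool byval,
					   int32_t len, char **pos)
{
	int32_t		header;

	if (isnull)
		header = -2;			/* NULL: no data follows */
	else if (byval)
		header = -1;			/* pass-by-value: sizeof(Datum) bytes follow */
	else
	{
		assert(len > 0);
		header = len;			/* pass-by-reference: len bytes follow */
	}
	memcpy(*pos, &header, sizeof(header));
	*pos += sizeof(header);

	if (isnull)
		return;
	if (byval)
	{
		memcpy(*pos, &value, sizeof(value));
		*pos += sizeof(value);
	}
	else
	{
		memcpy(*pos, (const void *) value, (size_t) len);
		*pos += len;
	}
}

/*
 * Read one value at *pos; advances *pos.  A pass-by-reference value is
 * copied into a freshly allocated chunk that the caller must free.
 */
static DatumSketch
sketch_datum_restore(const char **pos, bool *isnull)
{
	int32_t		header;
	DatumSketch value = 0;
	void	   *chunk;

	memcpy(&header, *pos, sizeof(header));
	*pos += sizeof(header);

	*isnull = (header == -2);
	if (*isnull)
		return 0;
	if (header == -1)
	{
		memcpy(&value, *pos, sizeof(value));
		*pos += sizeof(value);
		return value;
	}
	assert(header > 0);
	chunk = malloc((size_t) header);
	assert(chunk != NULL);
	memcpy(chunk, *pos, (size_t) header);
	*pos += header;
	return (DatumSketch) chunk;
}
```

Because every field is moved with memcpy(), nothing here depends on the alignment of the position in the buffer, and no pointer value is ever written to the buffer for a pass-by-reference datum.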

The attached patch - which I even tested! - shows what I have in mind.
It can save and restore the ParamListInfo (bind parameters). I
haven't tried to adapt it to the exec parameters because I don't quite
understand what you are doing there yet, but you can see that the
datum-serialization logic is separated from the stuff that knows about
the details of ParamListInfo, so datumSerialize() should be reusable
for other purposes.

I think we can adapt this for Param exec data parameters. I can take
care of integrating this with funnel patch and changing the param exec data
parameters related code if we decide to go via this way.

This also doesn't have the other problems
mentioned above.

I have a question here: why doesn't this format have a similar problem
to the current version? Basically, in the current patch the second read of
SerializedParamExternData can be misaligned, and for the same reason in your
patch the second read of Oid could be misaligned?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#357Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#354)
Re: Parallel Seq Scan

On Fri, Sep 25, 2015 at 12:00 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think initPlan will work with the existing patches, as we always
execute it in master and then send the result to workers. Refer to the
code below from the funnel patch:

Sure, *if* that's what we're doing, then it will work. But if an
initPlan actually attaches below a funnel, then it will break. Maybe
that can't happen; I'm just sayin' ...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#358Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#355)
Re: Parallel Seq Scan

On Fri, Sep 25, 2015 at 12:55 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

In the latest patch (parallel_seqscan_partialseqscan_v18.patch) posted by
me yesterday, this was fixed. Am I missing something, or are you by any
chance referring to the wrong version of the patch?

You're right, I'm wrong. Sorry.

Yes, the patch needs more work in terms of dealing with parallel-restricted
expressions/functions. One idea which I have explored previously is to
push down only safe clauses to workers (via partialseqscan node) and
execute restricted clauses in master (via Funnel node). My analysis
is as follows:

Usage of restricted functions in quals -
During create_plan() phase, separate out the quals that need to be
executed at funnel node versus quals that need to be executed on
partial seq scan node (do something similar to what is done in
create_indexscan_plan for index and non-index quals).

Basically PartialSeqScan node can contain two different lists of quals,
one for non-restrictive quals and the other for restrictive quals, and then
Funnel node can retrieve restrictive quals from partialseqscan node,
assuming partialseqscan node is its left child.

Now, I think the above is only possible under the assumption that
partialseqscan node is always a left child of funnel node; however, that is
not true because a gating node (Result node) can be added between the
two, and tomorrow there could be more cases where other nodes will be
added between the two. If we consider the case of aggregation, the
situation will be more complex, as all the quals should be executed
before partial aggregation.

What's the situation where the gating Result node sneaks in there?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#359Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#356)
Re: Parallel Seq Scan

On Fri, Sep 25, 2015 at 7:46 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I have a question here: why doesn't this format have a similar problem
to the current version? Basically, in the current patch the second read of
SerializedParamExternData can be misaligned, and for the same reason in your
patch the second read of Oid could be misaligned?

memcpy() can cope with unaligned data; structure member assignment can't.
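
A minimal sketch of that distinction, with hypothetical helper names: reading a 4-byte value through memcpy() is valid at any byte offset, whereas dereferencing a cast pointer such as `*(const uint32_t *) src` at a misaligned address is undefined behavior in C and can fault on strict-alignment architectures.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Read a 32-bit value from any byte offset; memcpy has no alignment requirement. */
static uint32_t
read_u32_unaligned(const char *src)
{
	uint32_t	v;

	assert(src != NULL);
	memcpy(&v, src, sizeof(v));
	return v;
}

/* Write a 32-bit value at any byte offset. */
static void
write_u32_unaligned(char *dst, uint32_t v)
{
	assert(dst != NULL);
	memcpy(dst, &v, sizeof(v));
}
```

For instance, a value written at `buf + 1` or `buf + 7` reads back intact with these helpers regardless of how `buf` itself is aligned.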

I've worked some of this code over fairly heavily today and I'm pretty
happy with how my copy of execParallel.c looks now, but I've run into
one issue where I wanted to check with you. There are three places
where Instrumentation can be attached to a query: a ResultRelInfo's
ri_TrigInstrument (which doesn't matter for us because we don't
support parallel write queries, and triggers don't run on reads), a
PlanState's instrument, and a QueryDesc's total time.

Your patch makes provision to copy ONE Instrumentation structure per
worker back to the parallel leader. I assumed this must be the
QueryDesc's totaltime, but it looks like it's actually the PlanState
of the top node passed to the worker. That's of course no good if we
ever push more than one node down to the worker, which we may very
well want to do in the initial version, and surely want to do
eventually. We can't just deal with the top node and forget all the
others. Is that really what's happening here, or am I confused?

Assuming I'm not confused, I'm planning to see about fixing this...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#360Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#358)
Re: Parallel Seq Scan

On Sat, Sep 26, 2015 at 5:54 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Sep 25, 2015 at 12:55 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Yes, the patch needs more work in terms of dealing with parallel-restricted
expressions/functions. One idea which I have explored previously is to
push down only safe clauses to workers (via partialseqscan node) and
execute restricted clauses in master (via Funnel node). My analysis
is as follows:

Usage of restricted functions in quals -
During create_plan() phase, separate out the quals that need to be
executed at funnel node versus quals that need to be executed on
partial seq scan node (do something similar to what is done in
create_indexscan_plan for index and non-index quals).

Basically PartialSeqScan node can contain two different lists of quals,
one for non-restrictive quals and the other for restrictive quals, and then
Funnel node can retrieve restrictive quals from partialseqscan node,
assuming partialseqscan node is its left child.

Now, I think the above is only possible under the assumption that
partialseqscan node is always a left child of funnel node; however, that is
not true because a gating node (Result node) can be added between the
two, and tomorrow there could be more cases where other nodes will be
added between the two. If we consider the case of aggregation, the
situation will be more complex, as all the quals should be executed
before partial aggregation.

What's the situation where the gating Result node sneaks in there?

The plan node structure will be something like
Funnel->Result->PartialSeqScan. Both Result and PartialSeqScan are left
children in this structure. Assume that at the PartialSeqScan node we have
identified separate lists of restrictive and safe clauses. (While forming
the PartialSeqScan node, we can go through each clause and separate out the
safe and restrictive clause lists with logic similar to what we follow in
create_indexscan_plan(). Here we need to be careful: if there is an OR
condition between a restrictive and a safe expression, both expressions
should be executed at the Funnel node.) We then need to propagate that
information to the Funnel node while creating the Funnel plan. In this case,
while forming the Funnel plan node, we need to traverse the left children,
and if any node (here the PartialSeqScan node) has restrictive clauses, we
need to pull those up to the Funnel node. So the situation is that we need
to traverse the whole plan tree beneath the Funnel node and pull up the
restrictive clauses from the nodes that contain them.
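
The pull-up described above can be sketched in a few lines of C. This is only an illustration under stated assumptions: ToyQual, ToyPlan, toy_add_qual and toy_pull_up_restricted_quals are hypothetical names, not PostgreSQL types or functions, and the sketch follows only left children. A clause that ORs a restrictive and a safe expression is assumed to have been marked restricted as a whole before this step runs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_QUALS 8

/* Hypothetical stand-in for a qual clause; not a PostgreSQL type. */
typedef struct ToyQual
{
    const char *text;
    bool        restricted;     /* true if it must run in the master */
} ToyQual;

/* Hypothetical stand-in for a plan node that has only a left child. */
typedef struct ToyPlan
{
    const char     *name;
    struct ToyPlan *lefttree;
    ToyQual         quals[MAX_QUALS];
    int             nquals;
} ToyPlan;

/* Append one qual to a node's qual list. */
void
toy_add_qual(ToyPlan *node, const char *text, bool restricted)
{
    assert(node->nquals < MAX_QUALS);
    node->quals[node->nquals].text = text;
    node->quals[node->nquals].restricted = restricted;
    node->nquals++;
}

/*
 * Walk the left-child chain beneath the funnel.  Every restricted qual found
 * on a lower node is appended to the funnel's qual list and removed from the
 * node that held it; safe quals stay where they are.  Returns the number of
 * quals moved.
 */
int
toy_pull_up_restricted_quals(ToyPlan *funnel)
{
    int         moved = 0;
    ToyPlan    *node;

    for (node = funnel->lefttree; node != NULL; node = node->lefttree)
    {
        int         kept = 0;
        int         i;

        for (i = 0; i < node->nquals; i++)
        {
            if (node->quals[i].restricted)
            {
                toy_add_qual(funnel, node->quals[i].text, true);
                moved++;
            }
            else
                node->quals[kept++] = node->quals[i];
        }
        node->nquals = kept;
    }
    return moved;
}
```

With a Funnel->Result->PartialSeqScan chain, the walk passes through the Result node untouched and lifts only the restricted quals of the scan node, which is the case the gating node makes necessary.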

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#361Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#359)
Re: Parallel Seq Scan

On Sat, Sep 26, 2015 at 6:07 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Sep 25, 2015 at 7:46 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I have a question here: why doesn't this format have a similar problem
to the current version? Basically, in the current patch the second read of
SerializedParamExternData can be misaligned, and for the same reason in
your patch the second read of Oid could be misaligned?

memcpy() can cope with unaligned data; structure member assignment can't.

So doesn't coping mean that it anyway has to pay the performance
penalty to make it equivalent to an aligned address access? Apart from that,
today I read about memcpy's behaviour in case of unaligned addresses, and
it seems from some of the information on the net that it could be unsafe
[1],[2].

I've worked some of this code over fairly heavily today and I'm pretty
happy with how my copy of execParallel.c looks now, but I've run into
one issue where I wanted to check with you. There are three places
where Instrumentation can be attached to a query: a ResultRelInfo's
ri_TrigInstrument (which doesn't matter for us because we don't
support parallel write queries, and triggers don't run on reads), a
PlanState's instrument, and a QueryDesc's total time.

Your patch makes provision to copy ONE Instrumentation structure per
worker back to the parallel leader. I assumed this must be the
QueryDesc's totaltime, but it looks like it's actually the PlanState
of the top node passed to the worker. That's of course no good if we
ever push more than one node down to the worker, which we may very
well want to do in the initial version, and surely want to do
eventually. We can't just deal with the top node and forget all the
others. Is that really what's happening here, or am I confused?

Yes, you have figured it out correctly. I was under the impression that we
would have single-node execution in the worker for the first version and
would extend it later.

QueryDesc's totaltime holds instrumentation information for plugins
like pg_stat_statements, and we need only the total buffer usage
of each worker to make it work, as the other information is already
collected in the master backend, so I think that should work as I have
written.

Assuming I'm not confused, I'm planning to see about fixing this...

Can't we just traverse the queryDesc->planstate tree and fetch/add
all the instrument information if there are multiple nodes?

[1]: http://gcc.gnu.org/ml/gcc-bugs/2000-03/msg00155.html
[2]: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53016

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#362Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#357)
Re: Parallel Seq Scan

On Sat, Sep 26, 2015 at 5:52 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Sep 25, 2015 at 12:00 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think initPlan will work with the existing patches as we are always
executing it in master and then sending the result to workers. Refer
below code in funnel patch:

Sure, *if* that's what we're doing, then it will work. But if an
initPlan actually attaches below a funnel, then it will break.

Currently, this is handled only for initPlans of the left tree of Funnel;
if we want to push multiple nodes under Funnel, it won't work as
it is. I think even if we want to make that work, we would need to
traverse the whole tree under Funnel and do what is currently done for
each initPlan we encounter. In general, I think the idea of
passing the results of initPlans to workers seems like the right way
of dealing with initPlans, and by the way this was a suggestion made
by you some time back.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#363Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#361)
Re: Parallel Seq Scan

On Sat, Sep 26, 2015 at 12:38 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Sat, Sep 26, 2015 at 6:07 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Assuming I'm not confused, I'm planning to see about fixing this...

Can't we just traverse the queryDesc->planstate tree and fetch/add
all the instrument information if there are multiple nodes?

I think my suggestion above won't work, because we want
this information on a per-node basis in the master, as EXPLAIN will display
each node's information separately.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#364Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#361)
Re: Parallel Seq Scan

On Sat, Sep 26, 2015 at 3:08 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

memcpy() can cope with unaligned data; structure member assignment can't.

So doesn't coping mean that it anyway has to pay the performance
penalty to make it equivalent to an aligned address access? Apart from that,
today I read about memcpy's behaviour in case of unaligned addresses, and
it seems from some of the information on the net that it could be unsafe
[1],[2].

I'm not concerned about the performance penalty for unaligned access
in this case; I'm concerned about the fact that on some platforms it
causes a segmentation fault. The links you've provided there are
examples of cases where that wasn't true, and people reported that as
a bug in memcpy.
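
To make the distinction concrete, here is a minimal sketch of reading and writing a 32-bit value at an arbitrary byte offset with memcpy(). The helper names read_u32_unaligned and write_u32_unaligned are made up for this illustration and do not come from either patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Read a 32-bit value that starts at an arbitrary byte offset.  memcpy()
 * works on bytes and makes no alignment assumption, so this is well defined
 * even when (buf + offset) is not 4-byte aligned.  By contrast, casting
 * (buf + offset) to uint32_t * and dereferencing it is undefined behavior
 * for a misaligned address and can fault on strict-alignment platforms.
 */
uint32_t
read_u32_unaligned(const char *buf, size_t offset)
{
    uint32_t    value;

    assert(buf != NULL);
    memcpy(&value, buf + offset, sizeof(value));
    return value;
}

/* Store a 32-bit value at an arbitrary byte offset, again via memcpy(). */
void
write_u32_unaligned(char *buf, size_t offset, uint32_t value)
{
    assert(buf != NULL);
    memcpy(buf + offset, &value, sizeof(value));
}
```

A serialized parameter list can therefore pack fields back to back without padding, provided every field is copied in and out this way rather than accessed through a struct pointer.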

Yes, you have figured it out correctly. I was under the impression that we
would have single-node execution in the worker for the first version and
would extend it later.

No, I really want it to work with multiple nodes from the start, and
I've pretty much got that working here now.

QueryDesc's totaltime holds instrumentation information for plugins
like pg_stat_statements, and we need only the total buffer usage
of each worker to make it work, as the other information is already
collected in the master backend, so I think that should work as I have
written.

I don't think that's right at all. First, an extension can choose to
look at any part of the Instrumentation, not just the buffer usage.
Secondly, the buffer usage inside QueryDesc's totaltime isn't the same
as the global pgBufferUsage.

Assuming I'm not confused, I'm planning to see about fixing this...

Can't we just traverse the queryDesc->planstate tree and fetch/add
all the instrument information if there are multiple nodes?

Well you need to add each node's information in each worker to the
corresponding node in the leader. You're not just adding them all up.
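
A rough sketch of that per-node accumulation, assuming each entry carries a node identifier that is the same in the leader and in every worker. ToyNodeInstr and toy_accumulate_worker are hypothetical names, and the two counters stand in for the full Instrumentation structure.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, pared-down per-node instrumentation counters. */
typedef struct ToyNodeInstr
{
    int         plan_node_id;
    double      ntuples;
    long        blks_read;
} ToyNodeInstr;

/*
 * Add each of one worker's entries to the leader entry that has the same
 * plan_node_id.  Entries are matched by ID, not by array position, and
 * different nodes are never summed together.  Returns the number of worker
 * entries that found a matching leader node.
 */
int
toy_accumulate_worker(ToyNodeInstr *leader, int nleader,
                      const ToyNodeInstr *worker, int nworker)
{
    int         matched = 0;
    int         w;

    assert(leader != NULL && worker != NULL);

    for (w = 0; w < nworker; w++)
    {
        int         l;

        for (l = 0; l < nleader; l++)
        {
            if (leader[l].plan_node_id == worker[w].plan_node_id)
            {
                leader[l].ntuples += worker[w].ntuples;
                leader[l].blks_read += worker[w].blks_read;
                matched++;
                break;
            }
        }
    }
    return matched;
}
```

Calling this once per worker leaves every leader node with its own totals, which is what EXPLAIN needs in order to print each node separately.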

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#365Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#364)
Re: Parallel Seq Scan

On Sat, Sep 26, 2015 at 8:38 AM, Robert Haas <robertmhaas@gmail.com> wrote:

QueryDesc's totaltime holds instrumentation information for plugins
like pg_stat_statements, and we need only the total buffer usage
of each worker to make it work, as the other information is already
collected in the master backend, so I think that should work as I have
written.

I don't think that's right at all. First, an extension can choose to
look at any part of the Instrumentation, not just the buffer usage.
Secondly, the buffer usage inside QueryDesc's totaltime isn't the same
as the global pgBufferUsage.

Oh... but I'm wrong. As long as our local pgBufferUsage gets updated
correctly to incorporate the data from the other workers, the
InstrStopNode(queryDesc->totaltime) will suck in those statistics.
And the only other things getting updated are nTuples (which shouldn't
count anything that the workers did), firsttuple (similarly), and
counter (where the behavior is a bit more arguable, but just counting
the master's wall-clock time is at least defensible). So now I think
you're right: this should be OK.
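
The reasoning can be sketched with a toy start/stop pair that records the difference in a backend-global counter. All names here (ToyBufferUsage, toy_global_usage, ToyTotalTime and the toy_* functions) are hypothetical stand-ins for pgBufferUsage, the QueryDesc's totaltime and the instrument.c routines; the real structures have more fields.

```c
#include <assert.h>

/* Hypothetical, pared-down buffer usage counters. */
typedef struct ToyBufferUsage
{
    long        blks_hit;
    long        blks_read;
} ToyBufferUsage;

/* Stand-in for the backend-global pgBufferUsage counter. */
ToyBufferUsage toy_global_usage = {0, 0};

/* Stand-in for the query-level instrumentation (totaltime). */
typedef struct ToyTotalTime
{
    ToyBufferUsage start;       /* snapshot taken at start */
    ToyBufferUsage total;       /* accumulated usage */
    int         running;
} ToyTotalTime;

/* Snapshot the global counter when the measured interval begins. */
void
toy_instr_start(ToyTotalTime *t)
{
    assert(!t->running);
    t->start = toy_global_usage;
    t->running = 1;
}

/* Add whatever the global counter gained since the snapshot. */
void
toy_instr_stop(ToyTotalTime *t)
{
    assert(t->running);
    t->total.blks_hit += toy_global_usage.blks_hit - t->start.blks_hit;
    t->total.blks_read += toy_global_usage.blks_read - t->start.blks_read;
    t->running = 0;
}

/* Fold one worker's reported usage into the leader's global counter. */
void
toy_accum_worker_usage(const ToyBufferUsage *worker)
{
    toy_global_usage.blks_hit += worker->blks_hit;
    toy_global_usage.blks_read += worker->blks_read;
}
```

Because the stop step only looks at the delta of the global counter, any worker usage folded in between start and stop is picked up without the query-level structure knowing that workers exist.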

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#366Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#365)
2 attachment(s)
Re: Parallel Seq Scan

On Sat, Sep 26, 2015 at 10:16 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Sat, Sep 26, 2015 at 8:38 AM, Robert Haas <robertmhaas@gmail.com> wrote:

QueryDesc's totaltime holds instrumentation information for plugins
like pg_stat_statements, and we need only the total buffer usage
of each worker to make it work, as the other information is already
collected in the master backend, so I think that should work as I have
written.

I don't think that's right at all. First, an extension can choose to
look at any part of the Instrumentation, not just the buffer usage.
Secondly, the buffer usage inside QueryDesc's totaltime isn't the same
as the global pgBufferUsage.

Oh... but I'm wrong. As long as our local pgBufferUsage gets updated
correctly to incorporate the data from the other workers, the
InstrStopNode(queryDesc->totaltime) will suck in those statistics.
And the only other things getting updated are nTuples (which shouldn't
count anything that the workers did), firsttuple (similarly), and
counter (where the behavior is a bit more arguable, but just counting
the master's wall-clock time is at least defensible). So now I think
you're right: this should be OK.

OK, so here's a patch extracted from your
parallel_seqscan_partialseqscan_v18.patch with a fairly substantial
amount of rework by me:

- I left out the Funnel node itself; this is just the infrastructure
portion of the patch. I also left out the stop-the-executor early
stuff and the serialization of PARAM_EXEC values. I want to have
those things, but I think they need more thought and study first.
- I reorganized the code a fair amount into a form that I thought
was clearer, and certainly is closer to what I did previously in
parallel.c. I found your version had lots of functions with lots of
parameters, and I found that made the logic difficult to follow, at
least for me. As part of that, I munged the interface a bit so that
execParallel.c returns a structure with a bunch of pointers in it
instead of separately returning each one as an out parameter. I think
that's cleaner. If we need to add more stuff in the future, that way
we don't break existing callers.
- I reworked the interface with instrument.c and tried to preserve
something of an abstraction boundary there. I also changed the way
that stuff accumulated statistics to include more things; I couldn't
see any reason to make it as narrow as you had it.
- I did a bunch of cosmetic cleanup, including changing function names
and rewriting comments.
- I replaced your code for serializing and restoring a ParamListInfo
with my version.
- I fixed the code so that it can handle collecting instrumentation
data from multiple nodes, bringing all the data back to the leader and
associating it with the right plan node. This involved giving every
plan node a unique ID, as discussed with Tom on another recent thread.
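
As a rough illustration of the last point, the sketch below numbers the nodes of a small binary plan tree with a preorder walk. ToyNode and toy_assign_plan_node_ids are hypothetical names; the patch assigns its IDs in setrefs.c, and the order used there may differ from this walk. All the matching needs is that every node gets a distinct ID that is identical in the leader and the workers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a plan node with up to two children. */
typedef struct ToyNode
{
    int             plan_node_id;
    struct ToyNode *lefttree;
    struct ToyNode *righttree;
} ToyNode;

/*
 * Assign consecutive IDs starting at next_id, visiting the node itself
 * first, then its left subtree, then its right subtree.  Returns the next
 * unused ID, which equals the number of nodes when the walk starts at 0.
 */
int
toy_assign_plan_node_ids(ToyNode *node, int next_id)
{
    assert(next_id >= 0);

    if (node == NULL)
        return next_id;

    node->plan_node_id = next_id++;
    next_id = toy_assign_plan_node_ids(node->lefttree, next_id);
    return toy_assign_plan_node_ids(node->righttree, next_id);
}
```

Since workers receive a serialized copy of the same plan, the IDs travel with the nodes, and instrumentation collected in a worker can be matched to the leader's node by ID alone.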

After I did all that, I wrote some test code, which is also attached
here, that adds a new GUC force_parallel_worker. If you set that GUC,
when you run a query, it'll run the query in a parallel worker and
feed the results back to the master. I've tested this and it seems to
work, at least on the queries where you'd expect it to work. It's
just test code, so it doesn't have error checking or make any attempt
not to push down queries that will fail in parallel mode. But you can
use it to see what happens. You can also run queries under EXPLAIN
ANALYZE this way and, lo and behold, the worker stats show up attached
to the correct plan nodes.

I intend to commit this patch (but not the crappy test code, of
course) pretty soon, and then I'm going to start working on the
portion of the patch that actually adds the Funnel node, which I think
you are working on renaming to Gather. I think that getting that part
committed is likely to be pretty straightforward; it doesn't need to
do a lot more than call this stuff and tell it to go do its thing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

parallel-executor-test.patch (application/x-patch)
From 7c5fa754fada96553930820b970da876fb1dc468 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Fri, 25 Sep 2015 17:50:19 -0400
Subject: [PATCH 2/2] Test code.

---
 src/backend/executor/execMain.c     | 47 ++++++++++++++++++++++++++++++++++++-
 src/backend/executor/execParallel.c |  4 +++-
 src/backend/utils/misc/guc.c        | 11 +++++++++
 src/include/utils/guc.h             |  2 ++
 4 files changed, 62 insertions(+), 2 deletions(-)

diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 85ff46b..4863afd 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -45,6 +45,8 @@
 #include "commands/matview.h"
 #include "commands/trigger.h"
 #include "executor/execdebug.h"
+#include "executor/execParallel.h"
+#include "executor/tqueue.h"
 #include "foreign/fdwapi.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
@@ -339,13 +341,56 @@ standard_ExecutorRun(QueryDesc *queryDesc,
 	 * run plan
 	 */
 	if (!ScanDirectionIsNoMovement(direction))
-		ExecutePlan(estate,
+	{
+		if (force_parallel_worker && !IsParallelWorker())
+		{
+			ParallelExecutorInfo *pei;
+			TupleQueueFunnel *funnel;
+			TupleDesc tupType;
+			TupleTableSlot *slot;
+
+			EnterParallelMode();
+
+			/* Utterly disregarding sanity checks, let's try this out... */
+			pei = ExecInitParallelPlan(queryDesc->planstate, estate, 1);
+			LaunchParallelWorkers(pei->pcxt);
+
+			/* Set up to send results wherever they're supposed to go. */
+			funnel = CreateTupleQueueFunnel();
+			RegisterTupleQueueOnFunnel(funnel, pei->tqueue[0]);
+			tupType = ExecGetResultType(queryDesc->planstate);
+			slot = MakeSingleTupleTableSlot(tupType);
+
+			/* Read tuples from the worker and send them to the receiver. */
+			for (;;)
+			{
+				HeapTuple tup;
+				bool done;
+
+				tup = TupleQueueFunnelNext(funnel, false, &done);
+				if (done)
+					break;
+				Assert(tup != NULL);
+				ExecStoreTuple(tup, slot, InvalidBuffer, true);
+				(*dest->receiveSlot) (slot, dest);
+			}
+
+			/* Clean up. */
+			ExecParallelFinish(pei);
+			DestroyParallelContext(pei->pcxt);
+			ExitParallelMode();
+		}
+		else
+		{
+			ExecutePlan(estate,
 					queryDesc->planstate,
 					operation,
 					sendTuples,
 					count,
 					direction,
 					dest);
+		}
+	}
 
 	/*
 	 * shutdown tuple receiver, if we started it
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index a409a9a..9c8bf4b 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -568,7 +568,6 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
 	ExecutorStart(queryDesc, 0);
 	ExecutorRun(queryDesc, ForwardScanDirection, 0L);
 	ExecutorFinish(queryDesc);
-	ExecutorEnd(queryDesc);
 
 	/* Report buffer usage during parallel execution. */
 	buffer_usage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE);
@@ -579,6 +578,9 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
 		ExecParallelReportInstrumentation(queryDesc->planstate,
 										  instrumentation);
 
+	/* Must do this after capturing instrumentation. */
+	ExecutorEnd(queryDesc);
+
 	/* Cleanup. */
 	FreeQueryDesc(queryDesc);
 	(*receiver->rDestroy) (receiver);
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 17053af..11c54dd 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -112,6 +112,8 @@ extern char *temp_tablespaces;
 extern bool ignore_checksum_failure;
 extern bool synchronize_seqscans;
 
+bool force_parallel_worker;
+
 #ifdef TRACE_SYNCSCAN
 extern bool trace_syncscan;
 #endif
@@ -755,6 +757,15 @@ static const unit_conversion time_unit_conversion_table[] =
 static struct config_bool ConfigureNamesBool[] =
 {
 	{
+		{"force_parallel_worker", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Force use of a parallel worker in the executor."),
+			NULL
+		},
+		&force_parallel_worker,
+		false,
+		NULL, NULL, NULL
+	},
+	{
 		{"enable_seqscan", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of sequential-scan plans."),
 			NULL
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index dc167f9..702e067 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -433,4 +433,6 @@ extern void assign_search_path(const char *newval, void *extra);
 extern bool check_wal_buffers(int *newval, void **extra, GucSource source);
 extern void assign_xlog_sync_method(int new_sync_method, void *extra);
 
+extern bool force_parallel_worker;
+
 #endif   /* GUC_H */
-- 
2.3.8 (Apple Git-58)

parallel-executor.patch (application/x-patch)
From ed7fdc90c3835b43f74f5bd1f45253ceef64b9e6 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Fri, 25 Sep 2015 13:57:25 -0400
Subject: [PATCH 1/2] Parallel executor support.

---
 src/backend/executor/Makefile        |   3 +-
 src/backend/executor/execParallel.c  | 585 +++++++++++++++++++++++++++++++++++
 src/backend/executor/instrument.c    |  78 +++++
 src/backend/executor/tqueue.c        |   4 +-
 src/backend/nodes/copyfuncs.c        |   1 +
 src/backend/nodes/outfuncs.c         |   1 +
 src/backend/nodes/params.c           | 155 ++++++++++
 src/backend/nodes/readfuncs.c        |   1 +
 src/backend/optimizer/plan/planner.c |   1 +
 src/backend/optimizer/plan/setrefs.c |   5 +
 src/backend/utils/adt/datum.c        | 118 +++++++
 src/include/executor/execParallel.h  |  36 +++
 src/include/executor/instrument.h    |   5 +
 src/include/nodes/params.h           |   3 +
 src/include/nodes/plannodes.h        |   1 +
 src/include/nodes/relation.h         |   2 +
 src/include/utils/datum.h            |  10 +
 17 files changed, 1007 insertions(+), 2 deletions(-)
 create mode 100644 src/backend/executor/execParallel.c
 create mode 100644 src/include/executor/execParallel.h

diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 249534b..f5e1e1a 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -13,7 +13,8 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = execAmi.o execCurrent.o execGrouping.o execIndexing.o execJunk.o \
-       execMain.o execProcnode.o execQual.o execScan.o execTuples.o \
+       execMain.o execParallel.o execProcnode.o execQual.o \
+       execScan.o execTuples.o \
        execUtils.o functions.o instrument.o nodeAppend.o nodeAgg.o \
        nodeBitmapAnd.o nodeBitmapOr.o \
        nodeBitmapHeapscan.o nodeBitmapIndexscan.o nodeCustom.o nodeHash.o \
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
new file mode 100644
index 0000000..a409a9a
--- /dev/null
+++ b/src/backend/executor/execParallel.c
@@ -0,0 +1,585 @@
+/*-------------------------------------------------------------------------
+ *
+ * execParallel.c
+ *	  Support routines for parallel execution.
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/execParallel.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execParallel.h"
+#include "executor/executor.h"
+#include "executor/tqueue.h"
+#include "nodes/nodeFuncs.h"
+#include "optimizer/planmain.h"
+#include "optimizer/planner.h"
+#include "storage/spin.h"
+#include "tcop/tcopprot.h"
+#include "utils/memutils.h"
+#include "utils/snapmgr.h"
+
+/*
+ * Magic numbers for parallel executor communication.  We use constants
+ * greater than any 32-bit integer here so that values < 2^32 can be used
+ * by individual parallel nodes to store their own state.
+ */
+#define PARALLEL_KEY_PLANNEDSTMT		UINT64CONST(0xE000000000000001)
+#define PARALLEL_KEY_PARAMS				UINT64CONST(0xE000000000000002)
+#define PARALLEL_KEY_BUFFER_USAGE		UINT64CONST(0xE000000000000003)
+#define PARALLEL_KEY_TUPLE_QUEUE		UINT64CONST(0xE000000000000004)
+#define PARALLEL_KEY_INSTRUMENTATION	UINT64CONST(0xE000000000000005)
+
+#define PARALLEL_TUPLE_QUEUE_SIZE		65536
+
+/* DSM structure for accumulating per-PlanState instrumentation. */
+typedef struct SharedPlanStateInstrumentation
+{
+	int plan_node_id;
+	slock_t mutex;
+	Instrumentation	instr;
+} SharedPlanStateInstrumentation;
+
+/* DSM structure holding the instrumentation of every PlanState node. */
+struct SharedExecutorInstrumentation
+{
+	int instrument_options;
+	int ps_ninstrument;			/* # of ps_instrument structures following */
+	SharedPlanStateInstrumentation ps_instrument[FLEXIBLE_ARRAY_MEMBER];
+};
+
+/* Context object for ExecParallelEstimate. */
+typedef struct ExecParallelEstimateContext
+{
+	ParallelContext *pcxt;
+	int nnodes;
+} ExecParallelEstimateContext;
+
+/* Context object for ExecParallelInitializeDSM. */
+typedef struct ExecParallelInitializeDSMContext
+{
+	ParallelContext *pcxt;
+	SharedExecutorInstrumentation *instrumentation;
+	int nnodes;
+} ExecParallelInitializeDSMContext;
+
+/* Helper functions that run in the parallel leader. */
+static char *ExecSerializePlan(Plan *plan, List *rangetable);
+static bool ExecParallelEstimate(PlanState *node,
+					 ExecParallelEstimateContext *e);
+static bool ExecParallelInitializeDSM(PlanState *node,
+					 ExecParallelInitializeDSMContext *d);
+static shm_mq_handle **ExecParallelSetupTupleQueues(ParallelContext *pcxt);
+static bool ExecParallelRetrieveInstrumentation(PlanState *planstate,
+						  SharedExecutorInstrumentation *instrumentation);
+
+/* Helper functions that run in the parallel worker. */
+static void ParallelQueryMain(dsm_segment *seg, shm_toc *toc);
+static DestReceiver *ExecParallelGetReceiver(dsm_segment *seg, shm_toc *toc);
+
+/*
+ * Create a serialized representation of the plan to be sent to each worker.
+ */
+static char *
+ExecSerializePlan(Plan *plan, List *rangetable)
+{
+	PlannedStmt *pstmt;
+	ListCell   *tlist;
+
+	/* We can't scribble on the original plan, so make a copy. */
+	plan = copyObject(plan);
+
+	/*
+	 * The worker will start its own copy of the executor, and that copy will
+	 * insert a junk filter if the toplevel node has any resjunk entries. We
+	 * don't want that to happen, because while resjunk columns shouldn't be
+	 * sent back to the user, here the tuples are coming back to another
+	 * backend which may very well need them.  So mutate the target list
+	 * accordingly.  This is sort of a hack; there might be better ways to do
+	 * this...
+	 */
+	foreach(tlist, plan->targetlist)
+	{
+		TargetEntry *tle = (TargetEntry *) lfirst(tlist);
+
+		tle->resjunk = false;
+	}
+
+	/*
+	 * Create a dummy PlannedStmt.  Most of the fields don't need to be valid
+	 * for our purposes, but the worker will need at least a minimal
+	 * PlannedStmt to start the executor.
+	 */
+	pstmt = makeNode(PlannedStmt);
+	pstmt->commandType = CMD_SELECT;
+	pstmt->queryId = 0;
+	pstmt->hasReturning = 0;
+	pstmt->hasModifyingCTE = 0;
+	pstmt->canSetTag = 1;
+	pstmt->transientPlan = 0;
+	pstmt->planTree = plan;
+	pstmt->rtable = rangetable;
+	pstmt->resultRelations = NIL;
+	pstmt->utilityStmt = NULL;
+	pstmt->subplans = NIL;
+	pstmt->rewindPlanIDs = NULL;
+	pstmt->rowMarks = NIL;
+	pstmt->nParamExec = 0;
+	pstmt->relationOids = NIL;
+	pstmt->invalItems = NIL;	/* workers can't replan anyway... */
+	pstmt->hasRowSecurity = false;
+
+	/* Return serialized copy of our dummy PlannedStmt. */
+	return nodeToString(pstmt);
+}
+
+/*
+ * Ordinary plan nodes won't do anything here, but parallel-aware plan nodes
+ * may need some state which is shared across all parallel workers.  Before
+ * we size the DSM, give them a chance to call shm_toc_estimate_chunk or
+ * shm_toc_estimate_keys on &pcxt->estimator.
+ *
+ * While we're at it, count the number of PlanState nodes in the tree, so
+ * we know how many SharedPlanStateInstrumentation structures we need.
+ */
+static bool
+ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
+{
+	if (planstate == NULL)
+		return false;
+
+	/* Count this node. */
+	e->nnodes++;
+
+	/*
+	 * XXX. Call estimators for parallel-aware nodes here, when we have
+	 * some.
+	 */
+
+	return planstate_tree_walker(planstate, ExecParallelEstimate, e);
+}
+
+/*
+ * Ordinary plan nodes won't do anything here, but parallel-aware plan nodes
+ * may need to initialize shared state in the DSM before parallel workers
+ * are available.  They can allocate the space they previously estimated using
+ * shm_toc_allocate, and add the keys they previously estimated using
+ * shm_toc_insert, in each case targeting pcxt->toc.
+ */
+static bool
+ExecParallelInitializeDSM(PlanState *planstate,
+						  ExecParallelInitializeDSMContext *d)
+{
+	if (planstate == NULL)
+		return false;
+
+	/* If instrumentation is enabled, initialize array slot for this node. */
+	if (d->instrumentation != NULL)
+	{
+		SharedPlanStateInstrumentation *instrumentation;
+
+		instrumentation = &d->instrumentation->ps_instrument[d->nnodes];
+		Assert(d->nnodes < d->instrumentation->ps_ninstrument);
+		instrumentation->plan_node_id = planstate->plan->plan_node_id;
+		SpinLockInit(&instrumentation->mutex);
+		InstrInit(&instrumentation->instr,
+				  d->instrumentation->instrument_options);
+	}
+
+	/* Count this node. */
+	d->nnodes++;
+
+	/*
+	 * XXX. Call initializers for parallel-aware plan nodes, when we have
+	 * some.
+	 */
+
+	return planstate_tree_walker(planstate, ExecParallelInitializeDSM, d);
+}
+
+/*
+ * Set up the response queues that parallel workers use to return tuples
+ * to the main backend.  The workers themselves are launched elsewhere.
+ */
+static shm_mq_handle **
+ExecParallelSetupTupleQueues(ParallelContext *pcxt)
+{
+	shm_mq_handle **responseq;
+	char	   *tqueuespace;
+	int			i;
+
+	/* Skip this if no workers. */
+	if (pcxt->nworkers == 0)
+		return NULL;
+
+	/* Allocate memory for shared memory queue handles. */
+	responseq = (shm_mq_handle **)
+		palloc(pcxt->nworkers * sizeof(shm_mq_handle *));
+
+	/* Allocate space from the DSM for the queues themselves. */
+	tqueuespace = shm_toc_allocate(pcxt->toc,
+								 PARALLEL_TUPLE_QUEUE_SIZE * pcxt->nworkers);
+
+	/* Create the queues, and become the receiver for each. */
+	for (i = 0; i < pcxt->nworkers; ++i)
+	{
+		shm_mq	   *mq;
+
+		mq = shm_mq_create(tqueuespace + i * PARALLEL_TUPLE_QUEUE_SIZE,
+						   (Size) PARALLEL_TUPLE_QUEUE_SIZE);
+
+		shm_mq_set_receiver(mq, MyProc);
+		responseq[i] = shm_mq_attach(mq, pcxt->seg, NULL);
+	}
+
+	/* Add array of queues to shm_toc, so others can find it. */
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_TUPLE_QUEUE, tqueuespace);
+
+	/* Return array of handles. */
+	return responseq;
+}
+
+/*
+ * Sets up the required infrastructure for backend workers to perform
+ * execution and return results to the main backend.
+ */
+ParallelExecutorInfo *
+ExecInitParallelPlan(PlanState *planstate, EState *estate, int nworkers)
+{
+	ParallelExecutorInfo *pei;
+	ParallelContext *pcxt;
+	ExecParallelEstimateContext e;
+	ExecParallelInitializeDSMContext d;
+	char	   *pstmt_data;
+	char	   *pstmt_space;
+	char	   *param_space;
+	BufferUsage *bufusage_space;
+	SharedExecutorInstrumentation *instrumentation = NULL;
+	int			pstmt_len;
+	int			param_len;
+	int			instrumentation_len = 0;
+
+	/* Allocate object for return value. */
+	pei = palloc0(sizeof(ParallelExecutorInfo));
+	pei->planstate = planstate;
+
+	/* Fix up and serialize plan to be sent to workers. */
+	pstmt_data = ExecSerializePlan(planstate->plan, estate->es_range_table);
+
+	/* Create a parallel context. */
+	pcxt = CreateParallelContext(ParallelQueryMain, nworkers);
+	pei->pcxt = pcxt;
+
+	/*
+	 * Before telling the parallel context to create a dynamic shared memory
+	 * segment, we need to figure out how big it should be.  Estimate space
+	 * for the various things we need to store.
+	 */
+
+	/* Estimate space for serialized PlannedStmt. */
+	pstmt_len = strlen(pstmt_data) + 1;
+	shm_toc_estimate_chunk(&pcxt->estimator, pstmt_len);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+	/* Estimate space for serialized ParamListInfo. */
+	param_len = EstimateParamListSpace(estate->es_param_list_info);
+	shm_toc_estimate_chunk(&pcxt->estimator, param_len);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+	/*
+	 * Estimate space for BufferUsage.
+	 *
+	 * If EXPLAIN is not in use and there are no extensions loaded that care,
+	 * we could skip this.  But we have no way of knowing whether anyone's
+	 * looking at pgBufferUsage, so do it unconditionally.
+	 */
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   sizeof(BufferUsage) * pcxt->nworkers);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+	/* Estimate space for tuple queues. */
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   PARALLEL_TUPLE_QUEUE_SIZE * pcxt->nworkers);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+	/*
+	 * Give parallel-aware nodes a chance to add to the estimates, and get
+	 * a count of how many PlanState nodes there are.
+	 */
+	e.pcxt = pcxt;
+	e.nnodes = 0;
+	ExecParallelEstimate(planstate, &e);
+
+	/* Estimate space for instrumentation, if required. */
+	if (estate->es_instrument)
+	{
+		instrumentation_len =
+			offsetof(SharedExecutorInstrumentation, ps_instrument)
+			+ sizeof(SharedPlanStateInstrumentation) * e.nnodes;
+		shm_toc_estimate_chunk(&pcxt->estimator, instrumentation_len);
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+	}
+
+	/* Everyone's had a chance to ask for space, so now create the DSM. */
+	InitializeParallelDSM(pcxt);
+
+	/*
+	 * OK, now we have a dynamic shared memory segment, and it should be big
+	 * enough to store all of the data we estimated we would want to put into
+	 * it, plus whatever general stuff (not specifically executor-related) the
+	 * ParallelContext itself needs to store there.  None of the space we
+	 * asked for has been allocated or initialized yet, though, so do that.
+	 */
+
+	/* Store serialized PlannedStmt. */
+	pstmt_space = shm_toc_allocate(pcxt->toc, pstmt_len);
+	memcpy(pstmt_space, pstmt_data, pstmt_len);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_PLANNEDSTMT, pstmt_space);
+
+	/* Store serialized ParamListInfo. */
+	param_space = shm_toc_allocate(pcxt->toc, param_len);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_PARAMS, param_space);
+	SerializeParamList(estate->es_param_list_info, &param_space);
+
+	/* Allocate space for each worker's BufferUsage; no need to initialize. */
+	bufusage_space = shm_toc_allocate(pcxt->toc,
+									  sizeof(BufferUsage) * pcxt->nworkers);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFFER_USAGE, bufusage_space);
+	pei->buffer_usage = bufusage_space;
+
+	/* Set up tuple queues. */
+	pei->tqueue = ExecParallelSetupTupleQueues(pcxt);
+
+	/*
+	 * If instrumentation options were supplied, allocate space for the
+	 * data.  It only gets partially initialized here; the rest happens
+	 * during ExecParallelInitializeDSM.
+	 */
+	if (estate->es_instrument)
+	{
+		instrumentation = shm_toc_allocate(pcxt->toc, instrumentation_len);
+		instrumentation->instrument_options = estate->es_instrument;
+		instrumentation->ps_ninstrument = e.nnodes;
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION,
+					   instrumentation);
+		pei->instrumentation = instrumentation;
+	}
+
+	/*
+	 * Give parallel-aware nodes a chance to initialize their shared data.
+	 * This also initializes the elements of instrumentation->ps_instrument,
+	 * if it exists.
+	 */
+	d.pcxt = pcxt;
+	d.instrumentation = instrumentation;
+	d.nnodes = 0;
+	ExecParallelInitializeDSM(planstate, &d);
+
+	/*
+	 * Make sure that the world hasn't shifted under our feet.  This could
+	 * probably just be an Assert(), but let's be conservative for now.
+	 */
+	if (e.nnodes != d.nnodes)
+		elog(ERROR, "inconsistent count of PlanState nodes");
+
+	/* OK, we're ready to rock and roll. */
+	return pei;
+}
+
+/*
+ * Copy instrumentation information about this node and its descendents from
+ * dynamic shared memory.
+ */
+static bool
+ExecParallelRetrieveInstrumentation(PlanState *planstate,
+						  SharedExecutorInstrumentation *instrumentation)
+{
+	int		i;
+	int		plan_node_id = planstate->plan->plan_node_id;
+	SharedPlanStateInstrumentation *ps_instrument;
+
+	/* Find the instrumentation for this node. */
+	for (i = 0; i < instrumentation->ps_ninstrument; ++i)
+		if (instrumentation->ps_instrument[i].plan_node_id == plan_node_id)
+			break;
+	if (i >= instrumentation->ps_ninstrument)
+		elog(ERROR, "plan node %d not found", plan_node_id);
+
+	/* No need to acquire the spinlock here; workers have exited already. */
+	ps_instrument = &instrumentation->ps_instrument[i];
+	InstrAggNode(planstate->instrument, &ps_instrument->instr);
+
+	return planstate_tree_walker(planstate, ExecParallelRetrieveInstrumentation,
+								 instrumentation);
+}
+
+/*
+ * Finish parallel execution.  We wait for parallel workers to finish, and
+ * accumulate their buffer usage and instrumentation.
+ */
+void
+ExecParallelFinish(ParallelExecutorInfo *pei)
+{
+	int		i;
+
+	/* First, wait for the workers to finish. */
+	WaitForParallelWorkersToFinish(pei->pcxt);
+
+	/* Next, accumulate buffer usage. */
+	for (i = 0; i < pei->pcxt->nworkers; ++i)
+		InstrAccumParallelQuery(&pei->buffer_usage[i]);
+
+	/* Finally, accumulate instrumentation, if any. */
+	if (pei->instrumentation)
+		ExecParallelRetrieveInstrumentation(pei->planstate,
+											pei->instrumentation);
+}
+
+/*
+ * Create a DestReceiver to write tuples we produce to the shm_mq designated
+ * for that purpose.
+ */
+static DestReceiver *
+ExecParallelGetReceiver(dsm_segment *seg, shm_toc *toc)
+{
+	char	   *mqspace;
+	shm_mq	   *mq;
+
+	mqspace = shm_toc_lookup(toc, PARALLEL_KEY_TUPLE_QUEUE);
+	mqspace += ParallelWorkerNumber * PARALLEL_TUPLE_QUEUE_SIZE;
+	mq = (shm_mq *) mqspace;
+	shm_mq_set_sender(mq, MyProc);
+	return CreateTupleQueueDestReceiver(shm_mq_attach(mq, seg, NULL));
+}
+
+/*
+ * Create a QueryDesc for the PlannedStmt we are to execute, and return it.
+ */
+static QueryDesc *
+ExecParallelGetQueryDesc(shm_toc *toc, DestReceiver *receiver,
+						 int instrument_options)
+{
+	char	   *pstmtspace;
+	char	   *paramspace;
+	PlannedStmt *pstmt;
+	ParamListInfo paramLI;
+
+	/* Reconstruct leader-supplied PlannedStmt. */
+	pstmtspace = shm_toc_lookup(toc, PARALLEL_KEY_PLANNEDSTMT);
+	pstmt = (PlannedStmt *) stringToNode(pstmtspace);
+
+	/* Reconstruct ParamListInfo. */
+	paramspace = shm_toc_lookup(toc, PARALLEL_KEY_PARAMS);
+	paramLI = RestoreParamList(&paramspace);
+
+	/*
+	 * Create a QueryDesc for the query.
+	 *
+	 * It's not obvious how to obtain the query string from here; and even if
+	 * we could, copying it would take more cycles than not copying it. But
+	 * it's a bit unsatisfying to just use a dummy string here, so consider
+	 * revising this someday.
+	 */
+	return CreateQueryDesc(pstmt,
+						   "<parallel query>",
+						   GetActiveSnapshot(), InvalidSnapshot,
+						   receiver, paramLI, instrument_options);
+}
+
+/*
+ * Copy instrumentation information from this node and its descendents into
+ * dynamic shared memory, so that the parallel leader can retrieve it.
+ */
+static bool
+ExecParallelReportInstrumentation(PlanState *planstate,
+						  SharedExecutorInstrumentation *instrumentation)
+{
+	int		i;
+	int		plan_node_id = planstate->plan->plan_node_id;
+	SharedPlanStateInstrumentation *ps_instrument;
+
+	/*
+	 * If we shuffled the plan_node_id values in ps_instrument into sorted
+	 * order, we could use binary search here.  This might matter someday
+	 * if we're pushing down sufficiently large plan trees.  For now, do it
+	 * the slow, dumb way.
+	 */
+	for (i = 0; i < instrumentation->ps_ninstrument; ++i)
+		if (instrumentation->ps_instrument[i].plan_node_id == plan_node_id)
+			break;
+	if (i >= instrumentation->ps_ninstrument)
+		elog(ERROR, "plan node %d not found", plan_node_id);
+
+	/*
+	 * There's one SharedPlanStateInstrumentation per plan_node_id, so we
+	 * must use a spinlock in case multiple workers report at the same time.
+	 */
+	ps_instrument = &instrumentation->ps_instrument[i];
+	SpinLockAcquire(&ps_instrument->mutex);
+	InstrAggNode(&ps_instrument->instr, planstate->instrument);
+	SpinLockRelease(&ps_instrument->mutex);
+
+	return planstate_tree_walker(planstate, ExecParallelReportInstrumentation,
+								 instrumentation);
+}
+
+/*
+ * Main entrypoint for parallel query worker processes.
+ *
+ * We reach this function from ParallelMain, so the setup necessary to create
+ * a sensible parallel environment has already been done; ParallelMain worries
+ * about stuff like the transaction state, combo CID mappings, and GUC values,
+ * so we don't need to deal with any of that here.
+ *
+ * Our job is to deal with concerns specific to the executor.  The parallel
+ * group leader will have stored a serialized PlannedStmt, and it's our job
+ * to execute that plan and write the resulting tuples to the appropriate
+ * tuple queue.  Various bits of supporting information that we need in order
+ * to do this are also stored in the dsm_segment and can be accessed through
+ * the shm_toc.
+ */
+static void
+ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
+{
+	BufferUsage *buffer_usage;
+	DestReceiver *receiver;
+	QueryDesc  *queryDesc;
+	SharedExecutorInstrumentation *instrumentation;
+	int			instrument_options = 0;
+
+	/* Set up DestReceiver, SharedExecutorInstrumentation, and QueryDesc. */
+	receiver = ExecParallelGetReceiver(seg, toc);
+	instrumentation = shm_toc_lookup(toc, PARALLEL_KEY_INSTRUMENTATION);
+	if (instrumentation != NULL)
+		instrument_options = instrumentation->instrument_options;
+	queryDesc = ExecParallelGetQueryDesc(toc, receiver, instrument_options);
+
+	/* Prepare to track buffer usage during query execution. */
+	InstrStartParallelQuery();
+
+	/* Start up the executor, have it run the plan, and then shut it down. */
+	ExecutorStart(queryDesc, 0);
+	ExecutorRun(queryDesc, ForwardScanDirection, 0L);
+	ExecutorFinish(queryDesc);
+	ExecutorEnd(queryDesc);
+
+	/* Report buffer usage during parallel execution. */
+	buffer_usage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE);
+	InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber]);
+
+	/* Report instrumentation data if any instrumentation options are set. */
+	if (instrumentation != NULL)
+		ExecParallelReportInstrumentation(queryDesc->planstate,
+										  instrumentation);
+
+	/* Cleanup. */
+	FreeQueryDesc(queryDesc);
+	(*receiver->rDestroy) (receiver);
+}
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index f5351eb..bf509b1 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -18,7 +18,9 @@
 #include "executor/instrument.h"
 
 BufferUsage pgBufferUsage;
+static BufferUsage save_pgBufferUsage;
 
+static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
 static void BufferUsageAccumDiff(BufferUsage *dst,
 					 const BufferUsage *add, const BufferUsage *sub);
 
@@ -47,6 +49,15 @@ InstrAlloc(int n, int instrument_options)
 	return instr;
 }
 
+/* Initialize a pre-allocated instrumentation structure. */
+void
+InstrInit(Instrumentation *instr, int instrument_options)
+{
+	memset(instr, 0, sizeof(Instrumentation));
+	instr->need_bufusage = (instrument_options & INSTRUMENT_BUFFERS) != 0;
+	instr->need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
+}
+
 /* Entry to a plan node */
 void
 InstrStartNode(Instrumentation *instr)
@@ -127,6 +138,73 @@ InstrEndLoop(Instrumentation *instr)
 	instr->tuplecount = 0;
 }
 
+/* aggregate instrumentation information */
+void
+InstrAggNode(Instrumentation *dst, Instrumentation *add)
+{
+	if (!dst->running && add->running)
+	{
+		dst->running = true;
+		dst->firsttuple = add->firsttuple;
+	}
+	else if (dst->running && add->running && dst->firsttuple > add->firsttuple)
+		dst->firsttuple = add->firsttuple;
+
+	INSTR_TIME_ADD(dst->counter, add->counter);
+
+	dst->tuplecount += add->tuplecount;
+	dst->startup += add->startup;
+	dst->total += add->total;
+	dst->ntuples += add->ntuples;
+	dst->nloops += add->nloops;
+	dst->nfiltered1 += add->nfiltered1;
+	dst->nfiltered2 += add->nfiltered2;
+
+	/* Add the source node's buffer usage to the destination's totals */
+	if (dst->need_bufusage)
+		BufferUsageAdd(&dst->bufusage, &add->bufusage);
+}
+
+/* note current values during parallel executor startup */
+void
+InstrStartParallelQuery(void)
+{
+	save_pgBufferUsage = pgBufferUsage;
+}
+
+/* report usage after parallel executor shutdown */
+void
+InstrEndParallelQuery(BufferUsage *result)
+{
+	memset(result, 0, sizeof(BufferUsage));
+	BufferUsageAccumDiff(result, &pgBufferUsage, &save_pgBufferUsage);
+}
+
+/* accumulate work done by workers in leader's stats */
+void
+InstrAccumParallelQuery(BufferUsage *result)
+{
+	BufferUsageAdd(&pgBufferUsage, result);
+}
+
+/* dst += add */
+static void
+BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
+{
+	dst->shared_blks_hit += add->shared_blks_hit;
+	dst->shared_blks_read += add->shared_blks_read;
+	dst->shared_blks_dirtied += add->shared_blks_dirtied;
+	dst->shared_blks_written += add->shared_blks_written;
+	dst->local_blks_hit += add->local_blks_hit;
+	dst->local_blks_read += add->local_blks_read;
+	dst->local_blks_dirtied += add->local_blks_dirtied;
+	dst->local_blks_written += add->local_blks_written;
+	dst->temp_blks_read += add->temp_blks_read;
+	dst->temp_blks_written += add->temp_blks_written;
+	INSTR_TIME_ADD(dst->blk_read_time, add->blk_read_time);
+	INSTR_TIME_ADD(dst->blk_write_time, add->blk_write_time);
+}
+
 /* dst += add - sub */
 static void
 BufferUsageAccumDiff(BufferUsage *dst,
diff --git a/src/backend/executor/tqueue.c b/src/backend/executor/tqueue.c
index d0edf4e..67143d3 100644
--- a/src/backend/executor/tqueue.c
+++ b/src/backend/executor/tqueue.c
@@ -66,7 +66,9 @@ tqueueStartupReceiver(DestReceiver *self, int operation, TupleDesc typeinfo)
 static void
 tqueueShutdownReceiver(DestReceiver *self)
 {
-	/* do nothing */
+	TQueueDestReceiver *tqueue = (TQueueDestReceiver *) self;
+
+	shm_mq_detach(shm_mq_get_queue(tqueue->handle));
 }
 
 /*
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 62355aa..4b4ddec 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -112,6 +112,7 @@ CopyPlanFields(const Plan *from, Plan *newnode)
 	COPY_SCALAR_FIELD(total_cost);
 	COPY_SCALAR_FIELD(plan_rows);
 	COPY_SCALAR_FIELD(plan_width);
+	COPY_SCALAR_FIELD(plan_node_id);
 	COPY_NODE_FIELD(targetlist);
 	COPY_NODE_FIELD(qual);
 	COPY_NODE_FIELD(lefttree);
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index c91273c..ee9c360 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -271,6 +271,7 @@ _outPlanInfo(StringInfo str, const Plan *node)
 	WRITE_FLOAT_FIELD(total_cost, "%.2f");
 	WRITE_FLOAT_FIELD(plan_rows, "%.0f");
 	WRITE_INT_FIELD(plan_width);
+	WRITE_INT_FIELD(plan_node_id);
 	WRITE_NODE_FIELD(targetlist);
 	WRITE_NODE_FIELD(qual);
 	WRITE_NODE_FIELD(lefttree);
diff --git a/src/backend/nodes/params.c b/src/backend/nodes/params.c
index fb803f8..d093263 100644
--- a/src/backend/nodes/params.c
+++ b/src/backend/nodes/params.c
@@ -16,6 +16,7 @@
 #include "postgres.h"
 
 #include "nodes/params.h"
+#include "storage/shmem.h"
 #include "utils/datum.h"
 #include "utils/lsyscache.h"
 
@@ -73,3 +74,157 @@ copyParamList(ParamListInfo from)
 
 	return retval;
 }
+
+/*
+ * Estimate the amount of space required to serialize a ParamListInfo.
+ */
+Size
+EstimateParamListSpace(ParamListInfo paramLI)
+{
+	int		i;
+	Size	sz = sizeof(int);
+
+	if (paramLI == NULL || paramLI->numParams <= 0)
+		return sz;
+
+	for (i = 0; i < paramLI->numParams; i++)
+	{
+		ParamExternData *prm = &paramLI->params[i];
+		int16		typLen;
+		bool		typByVal;
+
+		/* give hook a chance in case parameter is dynamic */
+		if (!OidIsValid(prm->ptype) && paramLI->paramFetch != NULL)
+			(*paramLI->paramFetch) (paramLI, i + 1);
+
+		sz = add_size(sz, sizeof(Oid));			/* space for type OID */
+		sz = add_size(sz, sizeof(uint16));		/* space for pflags */
+
+		/* space for datum/isnull */
+		if (OidIsValid(prm->ptype))
+			get_typlenbyval(prm->ptype, &typLen, &typByVal);
+		else
+		{
+			/* If no type OID, assume by-value, like copyParamList does. */
+			typLen = sizeof(Datum);
+			typByVal = true;
+		}
+		sz = add_size(sz,
+			datumEstimateSpace(prm->value, prm->isnull, typByVal, typLen));
+	}
+
+	return sz;
+}
+
+/*
+ * Serialize a ParamListInfo structure into caller-provided storage.
+ *
+ * We write the number of parameters first, as a 4-byte integer, and then
+ * write details for each parameter in turn.  The details for each parameter
+ * consist of a 4-byte type OID, 2 bytes of flags, and then the datum as
+ * serialized by datumSerialize().  The caller is responsible for ensuring
+ * that there is enough storage to store the number of bytes that will be
+ * written; use EstimateParamListSpace to find out how many will be needed.
+ * *start_address is updated to point to the byte immediately following those
+ * written.
+ *
+ * RestoreParamList can be used to recreate a ParamListInfo based on the
+ * serialized representation; this will be a static, self-contained copy
+ * just as copyParamList would create.
+ */
+void
+SerializeParamList(ParamListInfo paramLI, char **start_address)
+{
+	int			nparams;
+	int			i;
+
+	/* Write number of parameters. */
+	if (paramLI == NULL || paramLI->numParams <= 0)
+		nparams = 0;
+	else
+		nparams = paramLI->numParams;
+	memcpy(*start_address, &nparams, sizeof(int));
+	*start_address += sizeof(int);
+
+	/* Write each parameter in turn. */
+	for (i = 0; i < nparams; i++)
+	{
+		ParamExternData *prm = &paramLI->params[i];
+		int16		typLen;
+		bool		typByVal;
+
+		/* give hook a chance in case parameter is dynamic */
+		if (!OidIsValid(prm->ptype) && paramLI->paramFetch != NULL)
+			(*paramLI->paramFetch) (paramLI, i + 1);
+
+		/* Write type OID. */
+		memcpy(*start_address, &prm->ptype, sizeof(Oid));
+		*start_address += sizeof(Oid);
+
+		/* Write flags. */
+		memcpy(*start_address, &prm->pflags, sizeof(uint16));
+		*start_address += sizeof(uint16);
+
+		/* Write datum/isnull. */
+		if (OidIsValid(prm->ptype))
+			get_typlenbyval(prm->ptype, &typLen, &typByVal);
+		else
+		{
+			/* If no type OID, assume by-value, like copyParamList does. */
+			typLen = sizeof(Datum);
+			typByVal = true;
+		}
+		datumSerialize(prm->value, prm->isnull, typByVal, typLen,
+					   start_address);
+	}
+}
+
+/*
+ * Restore a ParamListInfo structure from its serialized representation.
+ *
+ * The result is allocated in CurrentMemoryContext.
+ *
+ * Note: the result is a static, self-contained set of parameter values,
+ * just as copyParamList would create; no dynamic parameter hooks are set.
+ * The input must have been written by SerializeParamList.  *start_address
+ * is updated to point to the byte immediately following those read.
+ */
+ParamListInfo
+RestoreParamList(char **start_address)
+{
+	ParamListInfo paramLI;
+	Size		size;
+	int			i;
+	int			nparams;
+
+	memcpy(&nparams, *start_address, sizeof(int));
+	*start_address += sizeof(int);
+
+	size = offsetof(ParamListInfoData, params) +
+		nparams * sizeof(ParamExternData);
+
+	paramLI = (ParamListInfo) palloc(size);
+	paramLI->paramFetch = NULL;
+	paramLI->paramFetchArg = NULL;
+	paramLI->parserSetup = NULL;
+	paramLI->parserSetupArg = NULL;
+	paramLI->numParams = nparams;
+
+	for (i = 0; i < nparams; i++)
+	{
+		ParamExternData *prm = &paramLI->params[i];
+
+		/* Read type OID. */
+		memcpy(&prm->ptype, *start_address, sizeof(Oid));
+		*start_address += sizeof(Oid);
+
+		/* Read flags. */
+		memcpy(&prm->pflags, *start_address, sizeof(uint16));
+		*start_address += sizeof(uint16);
+
+		/* Read datum/isnull. */
+		prm->value = datumRestore(start_address, &prm->isnull);
+	}
+
+	return paramLI;
+}
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 08519ed..72368ab 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1413,6 +1413,7 @@ ReadCommonPlan(Plan *local_node)
 	READ_FLOAT_FIELD(total_cost);
 	READ_FLOAT_FIELD(plan_rows);
 	READ_INT_FIELD(plan_width);
+	READ_INT_FIELD(plan_node_id);
 	READ_NODE_FIELD(targetlist);
 	READ_NODE_FIELD(qual);
 	READ_NODE_FIELD(lefttree);
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 06be922..e1ee67c 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -196,6 +196,7 @@ standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
 	glob->nParamExec = 0;
 	glob->lastPHId = 0;
 	glob->lastRowMarkId = 0;
+	glob->lastPlanNodeId = 0;
 	glob->transientPlan = false;
 	glob->hasRowSecurity = false;
 
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index daeb584..3c81697 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -174,6 +174,8 @@ static bool extract_query_dependencies_walker(Node *node,
  * Currently, relations and user-defined functions are the only types of
  * objects that are explicitly tracked this way.
  *
+ * 7. We assign every plan node in the tree a unique ID.
+ *
  * We also perform one final optimization step, which is to delete
  * SubqueryScan plan nodes that aren't doing anything useful (ie, have
  * no qual and a no-op targetlist).  The reason for doing this last is that
@@ -436,6 +438,9 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 	if (plan == NULL)
 		return NULL;
 
+	/* Assign this node a unique ID. */
+	plan->plan_node_id = root->glob->lastPlanNodeId++;
+
 	/*
 	 * Plan-type-specific fixes
 	 */
diff --git a/src/backend/utils/adt/datum.c b/src/backend/utils/adt/datum.c
index e8af030..3d9e354 100644
--- a/src/backend/utils/adt/datum.c
+++ b/src/backend/utils/adt/datum.c
@@ -246,3 +246,121 @@ datumIsEqual(Datum value1, Datum value2, bool typByVal, int typLen)
 	}
 	return res;
 }
+
+/*-------------------------------------------------------------------------
+ * datumEstimateSpace
+ *
+ * Compute the amount of space that datumSerialize will require for a
+ * particular Datum.
+ *-------------------------------------------------------------------------
+ */
+Size
+datumEstimateSpace(Datum value, bool isnull, bool typByVal, int typLen)
+{
+	Size	sz = sizeof(int);
+
+	if (!isnull)
+	{
+		/* no need to use add_size, can't overflow */
+		if (typByVal)
+			sz += sizeof(Datum);
+		else
+			sz += datumGetSize(value, typByVal, typLen);
+	}
+
+	return sz;
+}
+
+/*-------------------------------------------------------------------------
+ * datumSerialize
+ *
+ * Serialize a possibly-NULL datum into caller-provided storage.
+ *
+ * The format is as follows: first, we write a 4-byte header word, which
+ * is either the length of a pass-by-reference datum, -1 for a
+ * pass-by-value datum, or -2 for a NULL.  If the value is NULL, nothing
+ * further is written.  If it is pass-by-value, sizeof(Datum) bytes
+ * follow.  Otherwise, the number of bytes indicated by the header word
+ * follow.  The caller is responsible for ensuring that there is enough
+ * storage to store the number of bytes that will be written; use
+ * datumEstimateSpace() to find out how many will be needed.
+ * *start_address is updated to point to the byte immediately following
+ * those written.
+ *-------------------------------------------------------------------------
+ */
+void
+datumSerialize(Datum value, bool isnull, bool typByVal, int typLen,
+			   char **start_address)
+{
+	int		header;
+
+	/* Write header word. */
+	if (isnull)
+		header = -2;
+	else if (typByVal)
+		header = -1;
+	else
+		header = datumGetSize(value, typByVal, typLen);
+	memcpy(*start_address, &header, sizeof(int));
+	*start_address += sizeof(int);
+
+	/* If not null, write payload bytes. */
+	if (!isnull)
+	{
+		if (typByVal)
+		{
+			memcpy(*start_address, &value, sizeof(Datum));
+			*start_address += sizeof(Datum);
+		}
+		else
+		{
+			memcpy(*start_address, DatumGetPointer(value), header);
+			*start_address += header;
+		}
+	}
+}
+
+/*-------------------------------------------------------------------------
+ * datumRestore
+ *
+ * Restore a possibly-NULL datum previously serialized by datumSerialize.
+ * *start_address is updated according to the number of bytes consumed.
+ *-------------------------------------------------------------------------
+ */
+Datum
+datumRestore(char **start_address, bool *isnull)
+{
+	int		header;
+	void   *d;
+
+	/* Read header word. */
+	memcpy(&header, *start_address, sizeof(int));
+	*start_address += sizeof(int);
+
+	/* If this datum is NULL, we can stop here. */
+	if (header == -2)
+	{
+		*isnull = true;
+		return (Datum) 0;
+	}
+
+	/* OK, datum is not null. */
+	*isnull = false;
+
+	/* If this datum is pass-by-value, sizeof(Datum) bytes follow. */
+	if (header == -1)
+	{
+		Datum		val;
+
+		memcpy(&val, *start_address, sizeof(Datum));
+		*start_address += sizeof(Datum);
+		return val;
+	}
+
+	/* Pass-by-reference case; copy indicated number of bytes. */
+	Assert(header > 0);
+	d = palloc(header);
+	memcpy(d, *start_address, header);
+	*start_address += header;
+	return PointerGetDatum(d);
+}
diff --git a/src/include/executor/execParallel.h b/src/include/executor/execParallel.h
new file mode 100644
index 0000000..4fc797a
--- /dev/null
+++ b/src/include/executor/execParallel.h
@@ -0,0 +1,36 @@
+/*--------------------------------------------------------------------
+ * execParallel.h
+ *		POSTGRES parallel execution interface
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/executor/execParallel.h
+ *--------------------------------------------------------------------
+ */
+
+#ifndef EXECPARALLEL_H
+#define EXECPARALLEL_H
+
+#include "access/parallel.h"
+#include "nodes/execnodes.h"
+#include "nodes/parsenodes.h"
+#include "nodes/plannodes.h"
+
+typedef struct SharedExecutorInstrumentation SharedExecutorInstrumentation;
+
+typedef struct ParallelExecutorInfo
+{
+	PlanState *planstate;
+	ParallelContext *pcxt;
+	BufferUsage *buffer_usage;
+	SharedExecutorInstrumentation *instrumentation;
+	shm_mq_handle **tqueue;
+}	ParallelExecutorInfo;
+
+extern ParallelExecutorInfo *ExecInitParallelPlan(PlanState *planstate,
+					 EState *estate, int nworkers);
+extern void ExecParallelFinish(ParallelExecutorInfo *pei);
+
+#endif   /* EXECPARALLEL_H */
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index c9a2129..f28e56c 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -66,8 +66,13 @@ typedef struct Instrumentation
 extern PGDLLIMPORT BufferUsage pgBufferUsage;
 
 extern Instrumentation *InstrAlloc(int n, int instrument_options);
+extern void InstrInit(Instrumentation *instr, int instrument_options);
 extern void InstrStartNode(Instrumentation *instr);
 extern void InstrStopNode(Instrumentation *instr, double nTuples);
 extern void InstrEndLoop(Instrumentation *instr);
+extern void InstrAggNode(Instrumentation *dst, Instrumentation *add);
+extern void InstrStartParallelQuery(void);
+extern void InstrEndParallelQuery(BufferUsage *result);
+extern void InstrAccumParallelQuery(BufferUsage *result);
 
 #endif   /* INSTRUMENT_H */
diff --git a/src/include/nodes/params.h b/src/include/nodes/params.h
index a0f7dd0..83bebde 100644
--- a/src/include/nodes/params.h
+++ b/src/include/nodes/params.h
@@ -102,5 +102,8 @@ typedef struct ParamExecData
 
 /* Functions found in src/backend/nodes/params.c */
 extern ParamListInfo copyParamList(ParamListInfo from);
+extern Size EstimateParamListSpace(ParamListInfo paramLI);
+extern void SerializeParamList(ParamListInfo paramLI, char **start_address);
+extern ParamListInfo RestoreParamList(char **start_address);
 
 #endif   /* PARAMS_H */
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index cc259f1..1e2d2bb 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -111,6 +111,7 @@ typedef struct Plan
 	/*
 	 * Common structural data for all Plan types.
 	 */
+	int			plan_node_id;	/* unique across entire final plan tree */
 	List	   *targetlist;		/* target list to be computed at this node */
 	List	   *qual;			/* implicitly-ANDed qual conditions */
 	struct Plan *lefttree;		/* input plan tree(s) */
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 79bed33..961b5d1 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -99,6 +99,8 @@ typedef struct PlannerGlobal
 
 	Index		lastRowMarkId;	/* highest PlanRowMark ID assigned */
 
+	int			lastPlanNodeId;	/* highest plan node ID assigned */
+
 	bool		transientPlan;	/* redo plan when TransactionXmin changes? */
 
 	bool		hasRowSecurity; /* row security applied? */
diff --git a/src/include/utils/datum.h b/src/include/utils/datum.h
index c572f79..e9d4be5 100644
--- a/src/include/utils/datum.h
+++ b/src/include/utils/datum.h
@@ -46,4 +46,14 @@ extern Datum datumTransfer(Datum value, bool typByVal, int typLen);
 extern bool datumIsEqual(Datum value1, Datum value2,
 			 bool typByVal, int typLen);
 
+/*
+ * Serialize and restore datums so that we can transfer them to parallel
+ * workers.
+ */
+extern Size datumEstimateSpace(Datum value, bool isnull, bool typByVal,
+				   int typLen);
+extern void datumSerialize(Datum value, bool isnull, bool typByVal,
+			   int typLen, char **start_address);
+extern Datum datumRestore(char **start_address, bool *isnull);
+
 #endif   /* DATUM_H */
-- 
2.3.8 (Apple Git-58)
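
The wire format described in the datumSerialize comment in the patch above (a 4-byte header word holding the payload length for a pass-by-reference datum, -1 for pass-by-value, or -2 for NULL, followed by the payload) can be sketched as a standalone program. This is only an illustration under stated assumptions, not the patch's code: `SketchDatum` is a hypothetical stand-in for Datum, the by-reference length is passed in explicitly instead of being computed by datumGetSize, and malloc replaces palloc.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for PostgreSQL's Datum; an assumption of this sketch. */
typedef uintptr_t SketchDatum;

/* Write header word (-2 NULL, -1 by-value, else length), then the payload. */
static void
sketch_datum_serialize(SketchDatum value, int isnull, int byval, int len,
					   char **start_address)
{
	int		header;

	if (isnull)
		header = -2;
	else if (byval)
		header = -1;
	else
		header = len;
	memcpy(*start_address, &header, sizeof(int));
	*start_address += sizeof(int);

	if (isnull)
		return;					/* nothing follows a NULL header */
	if (byval)
	{
		memcpy(*start_address, &value, sizeof(SketchDatum));
		*start_address += sizeof(SketchDatum);
	}
	else
	{
		memcpy(*start_address, (const void *) value, (size_t) header);
		*start_address += header;
	}
}

/* Read back one datum; by-reference payloads are copied into malloc'd memory. */
static SketchDatum
sketch_datum_restore(char **start_address, int *isnull)
{
	int		header;
	void   *copy;

	memcpy(&header, *start_address, sizeof(int));
	*start_address += sizeof(int);

	if (header == -2)
	{
		*isnull = 1;
		return 0;
	}
	*isnull = 0;

	if (header == -1)
	{
		SketchDatum val;

		memcpy(&val, *start_address, sizeof(SketchDatum));
		*start_address += sizeof(SketchDatum);
		return val;
	}

	assert(header > 0);
	copy = malloc((size_t) header);
	if (copy == NULL)
		abort();
	memcpy(copy, *start_address, (size_t) header);
	*start_address += header;
	return (SketchDatum) copy;
}

/* Round-trip one by-value datum, one NULL, and one by-reference datum. */
static int
sketch_roundtrip_ok(void)
{
	char		buf[64];
	char	   *w = buf;
	char	   *r = buf;
	const char	text[] = "hello";
	int			isnull;
	SketchDatum d;
	int			ok = 1;

	sketch_datum_serialize((SketchDatum) 42, 0, 1, (int) sizeof(SketchDatum), &w);
	sketch_datum_serialize(0, 1, 1, 0, &w);
	sketch_datum_serialize((SketchDatum) text, 0, 0, (int) sizeof(text), &w);

	d = sketch_datum_restore(&r, &isnull);
	ok = ok && !isnull && d == 42;
	d = sketch_datum_restore(&r, &isnull);
	ok = ok && isnull;
	d = sketch_datum_restore(&r, &isnull);
	ok = ok && !isnull && strcmp((char *) d, text) == 0;
	free((void *) d);

	/* The reader must consume exactly the bytes the writer produced. */
	return ok && r == w;
}
```

Calling sketch_roundtrip_ok() from a main() exercises all three header cases. Both sides advance the cursor by the same amounts, which is what lets the patch's SerializeParamList and RestoreParamList chain several datums in one buffer.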

#367Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#366)
1 attachment(s)
Re: Parallel Seq Scan

On Sun, Sep 27, 2015 at 1:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:

I intend to commit this patch (but not the crappy test code, of
course) pretty soon, and then I'm going to start working on the
portion of the patch that actually adds the Funnel node, which I think
you are working on renaming to Gather.

Attached is a patch for the Gather node, rebased on the latest commit
(d1b7c1ff).

- I had to reorganize the definitions in execParallel.h and .c. To keep
ParallelExecutorInfo in the GatherState node, execnodes.h would need to
include execParallel.h, which caused compilation issues because
execParallel.h also includes execnodes.h. So for now I have defined
ParallelExecutorInfo in execnodes.h and the instrumentation-related
structures in instrument.h.
- Renamed parallel_seqscan_degree to degree_of_parallelism
- Renamed Funnel to Gather
- Removed the PARAM_EXEC parameter handling code; I think we can do this
separately.
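
The include cycle in the first item above can be shown in miniature. This is a minimal sketch, not what this version of the patch does (the patch moves the full definition into execnodes.h). It assumes hypothetical names (`SketchExecInfo`, `SketchGatherState`, `sketch_attach`) and shows a possible alternative: when the state node only stores a pointer, a forward declaration of the incomplete struct type is enough, so the lower-level header would not need to include the higher-level one.

```c
#include <assert.h>
#include <stddef.h>

/* --- plays the role of the lower-level header (execnodes.h) --- */
struct SketchExecInfo;			/* forward declaration: incomplete type */

typedef struct SketchGatherState
{
	int			nworkers;
	struct SketchExecInfo *pei; /* a pointer to an incomplete type is legal */
} SketchGatherState;

/* --- plays the role of the higher-level header (execParallel.h) --- */
typedef struct SketchExecInfo
{
	SketchGatherState *owner;	/* may freely use the lower-level type */
	int			nqueues;
} SketchExecInfo;

/* Link the two structures and return the queue count through the pointer. */
static int
sketch_attach(SketchGatherState *gs, SketchExecInfo *pei)
{
	gs->pei = pei;
	pei->owner = gs;
	return gs->pei->nqueues;
}
```

Only code that dereferences the pointer, such as sketch_attach here, needs to see the complete definition.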

I still have to rebase the partial seq scan patch and address the review
comments on it, so for now I am sending only the first part of the patch
(which includes the Gather node functionality and some general support
for parallel-query execution).

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

parallel_seqscan_gather_v19.patch (application/octet-stream)
diff --git a/src/backend/access/common/printtup.c b/src/backend/access/common/printtup.c
index baed981..639451a 100644
--- a/src/backend/access/common/printtup.c
+++ b/src/backend/access/common/printtup.c
@@ -26,9 +26,9 @@
 
 static void printtup_startup(DestReceiver *self, int operation,
 				 TupleDesc typeinfo);
-static void printtup(TupleTableSlot *slot, DestReceiver *self);
-static void printtup_20(TupleTableSlot *slot, DestReceiver *self);
-static void printtup_internal_20(TupleTableSlot *slot, DestReceiver *self);
+static bool printtup(TupleTableSlot *slot, DestReceiver *self);
+static bool printtup_20(TupleTableSlot *slot, DestReceiver *self);
+static bool printtup_internal_20(TupleTableSlot *slot, DestReceiver *self);
 static void printtup_shutdown(DestReceiver *self);
 static void printtup_destroy(DestReceiver *self);
 
@@ -299,7 +299,7 @@ printtup_prepare_info(DR_printtup *myState, TupleDesc typeinfo, int numAttrs)
  *		printtup --- print a tuple in protocol 3.0
  * ----------------
  */
-static void
+static bool
 printtup(TupleTableSlot *slot, DestReceiver *self)
 {
 	TupleDesc	typeinfo = slot->tts_tupleDescriptor;
@@ -376,13 +376,15 @@ printtup(TupleTableSlot *slot, DestReceiver *self)
 	/* Return to caller's context, and flush row's temporary memory */
 	MemoryContextSwitchTo(oldcontext);
 	MemoryContextReset(myState->tmpcontext);
+
+	return true;
 }
 
 /* ----------------
  *		printtup_20 --- print a tuple in protocol 2.0
  * ----------------
  */
-static void
+static bool
 printtup_20(TupleTableSlot *slot, DestReceiver *self)
 {
 	TupleDesc	typeinfo = slot->tts_tupleDescriptor;
@@ -452,6 +454,8 @@ printtup_20(TupleTableSlot *slot, DestReceiver *self)
 	/* Return to caller's context, and flush row's temporary memory */
 	MemoryContextSwitchTo(oldcontext);
 	MemoryContextReset(myState->tmpcontext);
+
+	return true;
 }
 
 /* ----------------
@@ -528,7 +532,7 @@ debugStartup(DestReceiver *self, int operation, TupleDesc typeinfo)
  *		debugtup - print one tuple for an interactive backend
  * ----------------
  */
-void
+bool
 debugtup(TupleTableSlot *slot, DestReceiver *self)
 {
 	TupleDesc	typeinfo = slot->tts_tupleDescriptor;
@@ -553,6 +557,8 @@ debugtup(TupleTableSlot *slot, DestReceiver *self)
 		printatt((unsigned) i + 1, typeinfo->attrs[i], value);
 	}
 	printf("\t----\n");
+
+	return true;
 }
 
 /* ----------------
@@ -564,7 +570,7 @@ debugtup(TupleTableSlot *slot, DestReceiver *self)
  * This is largely same as printtup_20, except we use binary formatting.
  * ----------------
  */
-static void
+static bool
 printtup_internal_20(TupleTableSlot *slot, DestReceiver *self)
 {
 	TupleDesc	typeinfo = slot->tts_tupleDescriptor;
@@ -636,4 +642,6 @@ printtup_internal_20(TupleTableSlot *slot, DestReceiver *self)
 	/* Return to caller's context, and flush row's temporary memory */
 	MemoryContextSwitchTo(oldcontext);
 	MemoryContextReset(myState->tmpcontext);
+
+	return true;
 }
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index f409aa7..42d4a44 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -4411,7 +4411,7 @@ copy_dest_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
 /*
  * copy_dest_receive --- receive one tuple
  */
-static void
+static bool
 copy_dest_receive(TupleTableSlot *slot, DestReceiver *self)
 {
 	DR_copy    *myState = (DR_copy *) self;
@@ -4423,6 +4423,8 @@ copy_dest_receive(TupleTableSlot *slot, DestReceiver *self)
 	/* And send the data */
 	CopyOneRowTo(cstate, InvalidOid, slot->tts_values, slot->tts_isnull);
 	myState->processed++;
+
+	return true;
 }
 
 /*
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 41183f6..418b0f6 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -62,7 +62,7 @@ typedef struct
 static ObjectAddress CreateAsReladdr = {InvalidOid, InvalidOid, 0};
 
 static void intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo);
-static void intorel_receive(TupleTableSlot *slot, DestReceiver *self);
+static bool intorel_receive(TupleTableSlot *slot, DestReceiver *self);
 static void intorel_shutdown(DestReceiver *self);
 static void intorel_destroy(DestReceiver *self);
 
@@ -482,7 +482,7 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
 /*
  * intorel_receive --- receive one tuple
  */
-static void
+static bool
 intorel_receive(TupleTableSlot *slot, DestReceiver *self)
 {
 	DR_intorel *myState = (DR_intorel *) self;
@@ -507,6 +507,8 @@ intorel_receive(TupleTableSlot *slot, DestReceiver *self)
 				myState->bistate);
 
 	/* We know this is a newly created relation, so there are no indexes */
+
+	return true;
 }
 
 /*
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index f0d9e94..7f14fd9 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -20,6 +20,7 @@
 #include "commands/defrem.h"
 #include "commands/prepare.h"
 #include "executor/hashjoin.h"
+#include "executor/nodeGather.h"
 #include "foreign/fdwapi.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
@@ -730,6 +731,7 @@ ExplainPreScanNode(PlanState *planstate, Bitmapset **rels_used)
 	{
 		case T_SeqScan:
 		case T_SampleScan:
+		case T_Gather:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -853,6 +855,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_SampleScan:
 			pname = sname = "Sample Scan";
 			break;
+		case T_Gather:
+			pname = sname = "Gather";
+			break;
 		case T_IndexScan:
 			pname = sname = "Index Scan";
 			break;
@@ -1003,6 +1008,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	{
 		case T_SeqScan:
 		case T_SampleScan:
+		case T_Gather:
 		case T_BitmapHeapScan:
 		case T_TidScan:
 		case T_SubqueryScan:
@@ -1147,6 +1153,15 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	}
 
 	/*
+	 * Aggregate the instrumentation information of all backend workers for a
+	 * Gather node.  We already accumulate this information when the last
+	 * tuple is fetched from the Gather node; doing it here as well covers
+	 * cases where not all tuples are fetched, such as under a Limit node.
+	 */
+	if (es->analyze && IsA(plan, Gather))
+		DestroyParallelSetupAndAccumStats((GatherState *) planstate);
+
+	/*
 	 * We have to forcibly clean up the instrumentation state because we
 	 * haven't done ExecutorEnd yet.  This is pretty grotty ...
 	 *
@@ -1276,6 +1291,14 @@ ExplainNode(PlanState *planstate, List *ancestors,
 				show_instrumentation_count("Rows Removed by Filter", 1,
 										   planstate, es);
 			break;
+		case T_Gather:
+			show_scan_qual(plan->qual, "Filter", planstate, ancestors, es);
+			if (plan->qual)
+				show_instrumentation_count("Rows Removed by Filter", 1,
+										   planstate, es);
+			ExplainPropertyInteger("Number of Workers",
+								   ((Gather *) plan)->num_workers, es);
+			break;
 		case T_FunctionScan:
 			if (es->verbose)
 			{
@@ -2335,6 +2358,7 @@ ExplainTargetRel(Plan *plan, Index rti, ExplainState *es)
 	{
 		case T_SeqScan:
 		case T_SampleScan:
+		case T_Gather:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 5492e59..750a59c 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -56,7 +56,7 @@ typedef struct
 static int	matview_maintenance_depth = 0;
 
 static void transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo);
-static void transientrel_receive(TupleTableSlot *slot, DestReceiver *self);
+static bool transientrel_receive(TupleTableSlot *slot, DestReceiver *self);
 static void transientrel_shutdown(DestReceiver *self);
 static void transientrel_destroy(DestReceiver *self);
 static void refresh_matview_datafill(DestReceiver *dest, Query *query,
@@ -422,7 +422,7 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
 /*
  * transientrel_receive --- receive one tuple
  */
-static void
+static bool
 transientrel_receive(TupleTableSlot *slot, DestReceiver *self)
 {
 	DR_transientrel *myState = (DR_transientrel *) self;
@@ -441,6 +441,8 @@ transientrel_receive(TupleTableSlot *slot, DestReceiver *self)
 				myState->bistate);
 
 	/* We know this is a newly created relation, so there are no indexes */
+
+	return true;
 }
 
 /*
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index f5e1e1a..51edd4c 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -17,8 +17,8 @@ OBJS = execAmi.o execCurrent.o execGrouping.o execIndexing.o execJunk.o \
        execScan.o execTuples.o \
        execUtils.o functions.o instrument.o nodeAppend.o nodeAgg.o \
        nodeBitmapAnd.o nodeBitmapOr.o \
-       nodeBitmapHeapscan.o nodeBitmapIndexscan.o nodeCustom.o nodeHash.o \
-       nodeHashjoin.o nodeIndexscan.o nodeIndexonlyscan.o \
+       nodeBitmapHeapscan.o nodeBitmapIndexscan.o nodeCustom.o nodeGather.o \
+       nodeHash.o nodeHashjoin.o nodeIndexscan.o nodeIndexonlyscan.o \
        nodeLimit.o nodeLockRows.o \
        nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
        nodeNestloop.o nodeFunctionscan.o nodeRecursiveunion.o nodeResult.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 93e1e9a..163650c 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -24,6 +24,7 @@
 #include "executor/nodeCustom.h"
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeFunctionscan.h"
+#include "executor/nodeGather.h"
 #include "executor/nodeGroup.h"
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
@@ -160,6 +161,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSampleScan((SampleScanState *) node);
 			break;
 
+		case T_GatherState:
+			ExecReScanGather((GatherState *) node);
+			break;
+
 		case T_IndexScanState:
 			ExecReScanIndexScan((IndexScanState *) node);
 			break;
@@ -467,6 +472,9 @@ ExecSupportsBackwardScan(Plan *node)
 			/* Simplify life for tablesample methods by disallowing this */
 			return false;
 
+		case T_Gather:
+			return false;
+
 		case T_IndexScan:
 			return IndexSupportsBackwardScan(((IndexScan *) node)->indexid) &&
 				TargetListSupportsBackwardScan(node->targetlist);
diff --git a/src/backend/executor/execCurrent.c b/src/backend/executor/execCurrent.c
index bcd287f..fd89204 100644
--- a/src/backend/executor/execCurrent.c
+++ b/src/backend/executor/execCurrent.c
@@ -262,6 +262,7 @@ search_plan_tree(PlanState *node, Oid table_oid)
 			 */
 		case T_SeqScanState:
 		case T_SampleScanState:
+		case T_GatherState:
 		case T_IndexScanState:
 		case T_IndexOnlyScanState:
 		case T_BitmapHeapScanState:
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 85ff46b..66e015b 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -45,9 +45,11 @@
 #include "commands/matview.h"
 #include "commands/trigger.h"
 #include "executor/execdebug.h"
+#include "executor/execParallel.h"
 #include "foreign/fdwapi.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
+#include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
 #include "parser/parsetree.h"
 #include "storage/bufmgr.h"
@@ -354,7 +356,11 @@ standard_ExecutorRun(QueryDesc *queryDesc,
 		(*dest->rShutdown) (dest);
 
 	if (queryDesc->totaltime)
+	{
+		/* Accumulate stats from parallel workers before stopping node */
+		(void) ExecParallelBufferUsageAccum((Node *) queryDesc->planstate);
 		InstrStopNode(queryDesc->totaltime, estate->es_processed);
+	}
 
 	MemoryContextSwitchTo(oldcontext);
 }
@@ -1581,7 +1587,15 @@ ExecutePlan(EState *estate,
 		 * practice, this is probably always the case at this point.)
 		 */
 		if (sendTuples)
-			(*dest->receiveSlot) (slot, dest);
+		{
+			/*
+			 * If we are not able to send the tuple, we assume the destination
+			 * has closed and no more tuples can be sent. If that's the case,
+			 * end the loop.
+			 */
+			if (!((*dest->receiveSlot) (slot, dest)))
+				break;
+		}
 
 		/*
 		 * Count tuples processed, if this is a SELECT.  (For other operation
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index a409a9a..3a3e0de 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -17,6 +17,7 @@
 
 #include "executor/execParallel.h"
 #include "executor/executor.h"
+#include "executor/nodeGather.h"
 #include "executor/tqueue.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/planmain.h"
@@ -39,22 +40,6 @@
 
 #define PARALLEL_TUPLE_QUEUE_SIZE		65536
 
-/* DSM structure for accumulating per-PlanState instrumentation. */
-typedef struct SharedPlanStateInstrumentation
-{
-	int plan_node_id;
-	slock_t mutex;
-	Instrumentation	instr;
-} SharedPlanStateInstrumentation;
-
-/* DSM structure for accumulating per-PlanState instrumentation. */
-struct SharedExecutorInstrumentation
-{
-	int instrument_options;
-	int ps_ninstrument;			/* # of ps_instrument structures following */
-	SharedPlanStateInstrumentation ps_instrument[FLEXIBLE_ARRAY_MEMBER];
-};
-
 /* Context object for ExecParallelEstimate. */
 typedef struct ExecParallelEstimateContext
 {
@@ -531,6 +516,33 @@ ExecParallelReportInstrumentation(PlanState *planstate,
 }
 
 /*
+ * ExecParallelBufferUsageAccum
+ *
+ * Recursively accumulate the stats for all the Gather nodes in a plan
+ * state tree.
+ */
+bool
+ExecParallelBufferUsageAccum(Node *node)
+{
+	if (node == NULL)
+		return false;
+
+	switch (nodeTag(node))
+	{
+		case T_GatherState:
+			{
+				DestroyParallelSetupAndAccumStats((GatherState *) node);
+				return true;
+			}
+			break;
+		default:
+			break;
+	}
+
+	return planstate_tree_walker((PlanState *) node, ExecParallelBufferUsageAccum, NULL);
+}
+
+/*
  * Main entrypoint for parallel query worker processes.
  *
  * We reach this function from ParallelMain, so the setup necessary to create
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 03c2feb..f24d39e 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -100,6 +100,7 @@
 #include "executor/nodeMergejoin.h"
 #include "executor/nodeModifyTable.h"
 #include "executor/nodeNestloop.h"
+#include "executor/nodeGather.h"
 #include "executor/nodeRecursiveunion.h"
 #include "executor/nodeResult.h"
 #include "executor/nodeSamplescan.h"
@@ -196,6 +197,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 													  estate, eflags);
 			break;
 
+		case T_Gather:
+			result = (PlanState *) ExecInitGather((Gather *) node,
+												  estate, eflags);
+			break;
+
 		case T_IndexScan:
 			result = (PlanState *) ExecInitIndexScan((IndexScan *) node,
 													 estate, eflags);
@@ -416,6 +422,10 @@ ExecProcNode(PlanState *node)
 			result = ExecSampleScan((SampleScanState *) node);
 			break;
 
+		case T_GatherState:
+			result = ExecGather((GatherState *) node);
+			break;
+
 		case T_IndexScanState:
 			result = ExecIndexScan((IndexScanState *) node);
 			break;
@@ -658,6 +668,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSampleScan((SampleScanState *) node);
 			break;
 
+		case T_GatherState:
+			ExecEndGather((GatherState *) node);
+			break;
+
 		case T_IndexScanState:
 			ExecEndIndexScan((IndexScanState *) node);
 			break;
diff --git a/src/backend/executor/execTuples.c b/src/backend/executor/execTuples.c
index a05d8b1..d5619bd 100644
--- a/src/backend/executor/execTuples.c
+++ b/src/backend/executor/execTuples.c
@@ -1313,7 +1313,7 @@ do_tup_output(TupOutputState *tstate, Datum *values, bool *isnull)
 	ExecStoreVirtualTuple(slot);
 
 	/* send the tuple to the receiver */
-	(*tstate->dest->receiveSlot) (slot, tstate->dest);
+	(void) (*tstate->dest->receiveSlot) (slot, tstate->dest);
 
 	/* clean up */
 	ExecClearTuple(slot);
diff --git a/src/backend/executor/functions.c b/src/backend/executor/functions.c
index 812a610..863bd64 100644
--- a/src/backend/executor/functions.c
+++ b/src/backend/executor/functions.c
@@ -167,7 +167,7 @@ static Datum postquel_get_single_result(TupleTableSlot *slot,
 static void sql_exec_error_callback(void *arg);
 static void ShutdownSQLFunction(Datum arg);
 static void sqlfunction_startup(DestReceiver *self, int operation, TupleDesc typeinfo);
-static void sqlfunction_receive(TupleTableSlot *slot, DestReceiver *self);
+static bool sqlfunction_receive(TupleTableSlot *slot, DestReceiver *self);
 static void sqlfunction_shutdown(DestReceiver *self);
 static void sqlfunction_destroy(DestReceiver *self);
 
@@ -1903,7 +1903,7 @@ sqlfunction_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
 /*
  * sqlfunction_receive --- receive one tuple
  */
-static void
+static bool
 sqlfunction_receive(TupleTableSlot *slot, DestReceiver *self)
 {
 	DR_sqlfunction *myState = (DR_sqlfunction *) self;
@@ -1913,6 +1913,8 @@ sqlfunction_receive(TupleTableSlot *slot, DestReceiver *self)
 
 	/* Store the filtered tuple into the tuplestore */
 	tuplestore_puttupleslot(myState->tstore, slot);
+
+	return true;
 }
 
 /*
diff --git a/src/backend/executor/nodeGather.c b/src/backend/executor/nodeGather.c
new file mode 100644
index 0000000..9ec0474
--- /dev/null
+++ b/src/backend/executor/nodeGather.c
@@ -0,0 +1,313 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeGather.c
+ *	  Support routines for scanning a relation via multiple workers.
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeGather.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * INTERFACE ROUTINES
+ *		ExecGather				scans a relation using worker backends.
+ *		ExecInitGather			creates and initializes a Gather node.
+ *		ExecEndGather			releases any storage allocated.
+ *		ExecReScanGather		re-initializes the workers and rescans the relation.
+ */
+#include "postgres.h"
+
+#include "access/relscan.h"
+#include "executor/execdebug.h"
+#include "executor/execParallel.h"
+#include "executor/nodeGather.h"
+#include "executor/nodeSubplan.h"
+#include "utils/rel.h"
+
+
+static TupleTableSlot *gather_getnext(GatherState *gatherstate);
+
+
+/* ----------------------------------------------------------------
+ *		ExecInitGather
+ * ----------------------------------------------------------------
+ */
+GatherState *
+ExecInitGather(Gather *node, EState *estate, int eflags)
+{
+	GatherState *gatherstate;
+
+	/* Gather node doesn't have innerPlan node. */
+	Assert(innerPlan(node) == NULL);
+
+	/*
+	 * create state structure
+	 */
+	gatherstate = makeNode(GatherState);
+	gatherstate->ss.ps.plan = (Plan *) node;
+	gatherstate->ss.ps.state = estate;
+	gatherstate->fs_workersReady = false;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * create expression context for node
+	 */
+	ExecAssignExprContext(estate, &gatherstate->ss.ps);
+
+	/*
+	 * initialize child expressions
+	 */
+	gatherstate->ss.ps.targetlist = (List *)
+		ExecInitExpr((Expr *) node->scan.plan.targetlist,
+					 (PlanState *) gatherstate);
+	gatherstate->ss.ps.qual = (List *)
+		ExecInitExpr((Expr *) node->scan.plan.qual,
+					 (PlanState *) gatherstate);
+
+	/*
+	 * tuple table initialization
+	 */
+	ExecInitResultTupleSlot(estate, &gatherstate->ss.ps);
+	ExecInitScanTupleSlot(estate, &gatherstate->ss);
+
+	/*
+	 * now initialize outer plan
+	 */
+	outerPlanState(gatherstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+
+	gatherstate->ss.ps.ps_TupFromTlist = false;
+
+	/*
+	 * Initialize result tuple type and projection info.
+	 */
+	ExecAssignResultTypeFromTL(&gatherstate->ss.ps);
+	ExecAssignProjectionInfo(&gatherstate->ss.ps, NULL);
+
+	return gatherstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecGather(node)
+ *
+ *		Scans the relation via multiple workers and returns
+ *		the next qualifying tuple.
+ * ----------------------------------------------------------------
+ */
+TupleTableSlot *
+ExecGather(GatherState *node)
+{
+	int			i;
+	TupleTableSlot *slot;
+
+	/*
+	 * Initialize the parallel context and workers on first execution.  We do
+	 * this on first execution rather than during node initialization because
+	 * it needs to allocate a large dynamic shared memory segment, so it is
+	 * better to do it only if it is really needed.
+	 */
+	if (node->pei == NULL || node->pei->pcxt == NULL)
+	{
+		EState	   *estate = node->ss.ps.state;
+		bool		any_worker_launched = false;
+
+		/* Initialize the workers required to execute Gather node. */
+		node->pei = ExecInitParallelPlan(node->ss.ps.lefttree,
+										 estate,
+							   ((Gather *) (node->ss.ps.plan))->num_workers);
+
+		outerPlanState(node)->toc = node->pei->pcxt->toc;
+
+		/*
+		 * Register backend workers.  If fewer workers are available than
+		 * requested, we perform the scan with those that are available; if
+		 * no worker can be launched at all, the Gather node just scans
+		 * locally.
+		 */
+		LaunchParallelWorkers(node->pei->pcxt);
+
+		node->funnel = CreateTupleQueueFunnel();
+
+		for (i = 0; i < node->pei->pcxt->nworkers; ++i)
+		{
+			if (node->pei->pcxt->worker[i].bgwhandle)
+			{
+				shm_mq_set_handle((node->pei->tqueue)[i], node->pei->pcxt->worker[i].bgwhandle);
+				RegisterTupleQueueOnFunnel(node->funnel, (node->pei->tqueue)[i]);
+				any_worker_launched = true;
+			}
+		}
+
+		if (any_worker_launched)
+			node->fs_workersReady = true;
+	}
+
+	slot = gather_getnext(node);
+
+	if (TupIsNull(slot))
+	{
+		/*
+		 * Destroy the parallel context once we have fetched all the tuples.
+		 * This ensures that if the same statement needs Gather nodes for
+		 * multiple parts of the statement, it won't accumulate a lot of DSM
+		 * segments, and the workers become available for use by other parts
+		 * of the statement.
+		 */
+		DestroyParallelSetupAndAccumStats(node);
+	}
+	return slot;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndGather
+ *
+ *		frees any storage allocated through C routines.
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndGather(GatherState *node)
+{
+	Relation	relation;
+
+	relation = node->ss.ss_currentRelation;
+
+	/*
+	 * Free the exprcontext
+	 */
+	ExecFreeExprContext(&node->ss.ps);
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+	ExecEndNode(outerPlanState(node));
+
+	DestroyParallelSetupAndAccumStats(node);
+}
+
+/*
+ * gather_getnext
+ *
+ * Get the next tuple from the shared memory queues.  This function is
+ * responsible for fetching tuples from all the queues associated with the
+ * worker backends used in Gather node execution.  If no data is available
+ * from the queues, or no worker is available, it fetches the data from the
+ * local node.
+ */
+static TupleTableSlot *
+gather_getnext(GatherState *gatherstate)
+{
+	PlanState  *outerPlan;
+	TupleTableSlot *outerTupleSlot;
+	TupleTableSlot *slot;
+	HeapTuple	tup;
+
+	/*
+	 * We can use the projection info of Gather for the tuples received from
+	 * worker backends, as currently in all cases the worker backends send
+	 * the projected tuple as required by the Gather node.
+	 */
+	slot = gatherstate->ss.ps.ps_ProjInfo->pi_slot;
+
+	while ((!gatherstate->all_workers_done && gatherstate->fs_workersReady) ||
+		   !gatherstate->local_scan_done)
+	{
+		if (!gatherstate->all_workers_done && gatherstate->fs_workersReady)
+		{
+			/* wait only if local scan is done */
+			tup = TupleQueueFunnelNext(gatherstate->funnel,
+									   !gatherstate->local_scan_done,
+									   &gatherstate->all_workers_done);
+
+			if (HeapTupleIsValid(tup))
+			{
+				ExecStoreTuple(tup,		/* tuple to store */
+							   slot,	/* slot to store in */
+							   InvalidBuffer,	/* buffer associated with this
+												 * tuple */
+							   true);	/* pfree this pointer if not from heap */
+
+				return slot;
+			}
+		}
+		if (!gatherstate->local_scan_done)
+		{
+			outerPlan = outerPlanState(gatherstate);
+
+			outerTupleSlot = ExecProcNode(outerPlan);
+
+			if (!TupIsNull(outerTupleSlot))
+				return outerTupleSlot;
+
+			gatherstate->local_scan_done = true;
+		}
+	}
+
+	return ExecClearTuple(slot);
+}
+
+/* ----------------------------------------------------------------
+ *		DestroyParallelSetupAndAccumStats
+ *
+ *		Destroy the setup for parallel workers.  Collect all the
+ *		stats after workers are stopped, else some work done by
+ *		workers won't be accounted.
+ * ----------------------------------------------------------------
+ */
+void
+DestroyParallelSetupAndAccumStats(GatherState *node)
+{
+	if (node->pei && node->pei->pcxt)
+	{
+		/*
+		 * Ensure all workers have finished before destroying the parallel
+		 * context to ensure a clean exit.
+		 */
+		if (node->fs_workersReady)
+		{
+			DestroyTupleQueueFunnel(node->funnel);
+			node->funnel = NULL;
+		}
+
+		ExecParallelFinish(node->pei);
+
+		/* destroy parallel context. */
+		DestroyParallelContext(node->pei->pcxt);
+		node->pei->pcxt = NULL;
+
+		node->fs_workersReady = false;
+		node->all_workers_done = false;
+		node->local_scan_done = false;
+	}
+}
+
+/* ----------------------------------------------------------------
+ *						Join Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecReScanGather
+ *
+ *		Re-initializes the workers and rescans the relation via them.
+ * ----------------------------------------------------------------
+ */
+void
+ExecReScanGather(GatherState *node)
+{
+	/*
+	 * Re-initialize the parallel context and workers to perform a rescan of
+	 * the relation.  We want to gracefully shut down all the workers so that
+	 * they are able to propagate any error or other information to the master
+	 * backend before dying.
+	 */
+	DestroyParallelSetupAndAccumStats(node);
+
+	ExecReScan(node->ss.ps.lefttree);
+}
diff --git a/src/backend/executor/spi.c b/src/backend/executor/spi.c
index 300401e..a60f228 100644
--- a/src/backend/executor/spi.c
+++ b/src/backend/executor/spi.c
@@ -1774,7 +1774,7 @@ spi_dest_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
  *		store tuple retrieved by Executor into SPITupleTable
  *		of current SPI procedure
  */
-void
+bool
 spi_printtup(TupleTableSlot *slot, DestReceiver *self)
 {
 	SPITupleTable *tuptable;
@@ -1809,6 +1809,8 @@ spi_printtup(TupleTableSlot *slot, DestReceiver *self)
 	(tuptable->free)--;
 
 	MemoryContextSwitchTo(oldcxt);
+
+	return true;
 }
 
 /*
diff --git a/src/backend/executor/tqueue.c b/src/backend/executor/tqueue.c
index 67143d3..28ba49f 100644
--- a/src/backend/executor/tqueue.c
+++ b/src/backend/executor/tqueue.c
@@ -41,14 +41,24 @@ struct TupleQueueFunnel
 /*
  * Receive a tuple.
  */
-static void
+static bool
 tqueueReceiveSlot(TupleTableSlot *slot, DestReceiver *self)
 {
 	TQueueDestReceiver *tqueue = (TQueueDestReceiver *) self;
 	HeapTuple	tuple;
+	shm_mq_result result;
 
 	tuple = ExecMaterializeSlot(slot);
-	shm_mq_send(tqueue->handle, tuple->t_len, tuple->t_data, false);
+	result = shm_mq_send(tqueue->handle, tuple->t_len, tuple->t_data, false);
+
+	if (result == SHM_MQ_DETACHED)
+		return false;
+	else if (result != SHM_MQ_SUCCESS)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("unable to send tuples")));
+
+	return true;
 }
 
 /*
diff --git a/src/backend/executor/tstoreReceiver.c b/src/backend/executor/tstoreReceiver.c
index c1fdeb7..b0862ae 100644
--- a/src/backend/executor/tstoreReceiver.c
+++ b/src/backend/executor/tstoreReceiver.c
@@ -37,8 +37,8 @@ typedef struct
 } TStoreState;
 
 
-static void tstoreReceiveSlot_notoast(TupleTableSlot *slot, DestReceiver *self);
-static void tstoreReceiveSlot_detoast(TupleTableSlot *slot, DestReceiver *self);
+static bool tstoreReceiveSlot_notoast(TupleTableSlot *slot, DestReceiver *self);
+static bool tstoreReceiveSlot_detoast(TupleTableSlot *slot, DestReceiver *self);
 
 
 /*
@@ -90,19 +90,21 @@ tstoreStartupReceiver(DestReceiver *self, int operation, TupleDesc typeinfo)
  * Receive a tuple from the executor and store it in the tuplestore.
  * This is for the easy case where we don't have to detoast.
  */
-static void
+static bool
 tstoreReceiveSlot_notoast(TupleTableSlot *slot, DestReceiver *self)
 {
 	TStoreState *myState = (TStoreState *) self;
 
 	tuplestore_puttupleslot(myState->tstore, slot);
+
+	return true;
 }
 
 /*
  * Receive a tuple from the executor and store it in the tuplestore.
  * This is for the case where we have to detoast any toasted values.
  */
-static void
+static bool
 tstoreReceiveSlot_detoast(TupleTableSlot *slot, DestReceiver *self)
 {
 	TStoreState *myState = (TStoreState *) self;
@@ -152,6 +154,8 @@ tstoreReceiveSlot_detoast(TupleTableSlot *slot, DestReceiver *self)
 	/* And release any temporary detoasted values */
 	for (i = 0; i < nfree; i++)
 		pfree(DatumGetPointer(myState->tofree[i]));
+
+	return true;
 }
 
 /*
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 4b4ddec..f308063 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -383,6 +383,27 @@ _copySampleScan(const SampleScan *from)
 }
 
 /*
+ * _copyGather
+ */
+static Gather *
+_copyGather(const Gather *from)
+{
+	Gather	   *newnode = makeNode(Gather);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopyScanFields((const Scan *) from, (Scan *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(num_workers);
+
+	return newnode;
+}
+
+/*
  * _copyIndexScan
  */
 static IndexScan *
@@ -4241,6 +4262,9 @@ copyObject(const void *from)
 		case T_SampleScan:
 			retval = _copySampleScan(from);
 			break;
+		case T_Gather:
+			retval = _copyGather(from);
+			break;
 		case T_IndexScan:
 			retval = _copyIndexScan(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index ee9c360..bc1ba61 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -459,6 +459,16 @@ _outSampleScan(StringInfo str, const SampleScan *node)
 }
 
 static void
+_outGather(StringInfo str, const Gather *node)
+{
+	WRITE_NODE_TYPE("GATHER");
+
+	_outScanInfo(str, (const Scan *) node);
+
+	WRITE_UINT_FIELD(num_workers);
+}
+
+static void
 _outIndexScan(StringInfo str, const IndexScan *node)
 {
 	WRITE_NODE_TYPE("INDEXSCAN");
@@ -3009,6 +3019,9 @@ _outNode(StringInfo str, const void *obj)
 			case T_SampleScan:
 				_outSampleScan(str, obj);
 				break;
+			case T_Gather:
+				_outGather(str, obj);
+				break;
 			case T_IndexScan:
 				_outIndexScan(str, obj);
 				break;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index d107d76..4bb3a48 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -11,6 +11,8 @@
  *	cpu_tuple_cost		Cost of typical CPU time to process a tuple
  *	cpu_index_tuple_cost  Cost of typical CPU time to process an index tuple
  *	cpu_operator_cost	Cost of CPU time to execute an operator or function
+ *	cpu_tuple_comm_cost Cost of CPU time to pass a tuple from worker to master backend
+ *	parallel_setup_cost Cost of setting up shared memory for parallelism
  *
  * We expect that the kernel will typically do some amount of read-ahead
  * optimization; this in conjunction with seek costs means that seq_page_cost
@@ -102,11 +104,15 @@ double		random_page_cost = DEFAULT_RANDOM_PAGE_COST;
 double		cpu_tuple_cost = DEFAULT_CPU_TUPLE_COST;
 double		cpu_index_tuple_cost = DEFAULT_CPU_INDEX_TUPLE_COST;
 double		cpu_operator_cost = DEFAULT_CPU_OPERATOR_COST;
+double		cpu_tuple_comm_cost = DEFAULT_CPU_TUPLE_COMM_COST;
+double		parallel_setup_cost = DEFAULT_PARALLEL_SETUP_COST;
 
 int			effective_cache_size = DEFAULT_EFFECTIVE_CACHE_SIZE;
 
 Cost		disable_cost = 1.0e10;
 
+int			degree_of_parallelism = 0;
+
 bool		enable_seqscan = true;
 bool		enable_indexscan = true;
 bool		enable_indexonlyscan = true;
@@ -290,6 +296,42 @@ cost_samplescan(Path *path, PlannerInfo *root,
 }
 
 /*
+ * cost_gather
+ *	  Determines and returns the cost of a gather path.
+ *
+ * 'baserel' is the relation to be scanned
+ * 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
+ */
+void
+cost_gather(GatherPath *path, PlannerInfo *root,
+			RelOptInfo *baserel, ParamPathInfo *param_info)
+{
+	Cost		startup_cost = 0;
+	Cost		run_cost = 0;
+
+	/* Should only be applied to base relations */
+	Assert(baserel->relid > 0);
+	Assert(baserel->rtekind == RTE_RELATION);
+
+	/* Mark the path with the correct row estimate */
+	if (param_info)
+		path->path.rows = param_info->ppi_rows;
+	else
+		path->path.rows = baserel->rows;
+
+	startup_cost = path->subpath->startup_cost;
+
+	run_cost = path->subpath->total_cost - path->subpath->startup_cost;
+
+	/* Parallel setup and communication cost. */
+	startup_cost += parallel_setup_cost;
+	run_cost += cpu_tuple_comm_cost * baserel->tuples;
+
+	path->path.startup_cost = startup_cost;
+	path->path.total_cost = (startup_cost + run_cost);
+}
+
+/*
  * cost_index
  *	  Determines and returns the cost of scanning a relation using an index.
  *
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 404c6f5..183e77c 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -60,6 +60,8 @@ static SeqScan *create_seqscan_plan(PlannerInfo *root, Path *best_path,
 					List *tlist, List *scan_clauses);
 static SampleScan *create_samplescan_plan(PlannerInfo *root, Path *best_path,
 					   List *tlist, List *scan_clauses);
+static Gather *create_gather_plan(PlannerInfo *root,
+				   GatherPath *best_path);
 static Scan *create_indexscan_plan(PlannerInfo *root, IndexPath *best_path,
 					  List *tlist, List *scan_clauses, bool indexonly);
 static BitmapHeapScan *create_bitmap_scan_plan(PlannerInfo *root,
@@ -104,6 +106,9 @@ static void copy_plan_costsize(Plan *dest, Plan *src);
 static SeqScan *make_seqscan(List *qptlist, List *qpqual, Index scanrelid);
 static SampleScan *make_samplescan(List *qptlist, List *qpqual, Index scanrelid,
 				TableSampleClause *tsc);
+static Gather *make_gather(List *qptlist, List *qpqual,
+			Index scanrelid, int nworkers,
+			Plan *subplan);
 static IndexScan *make_indexscan(List *qptlist, List *qpqual, Index scanrelid,
 			   Oid indexid, List *indexqual, List *indexqualorig,
 			   List *indexorderby, List *indexorderbyorig,
@@ -273,6 +278,10 @@ create_plan_recurse(PlannerInfo *root, Path *best_path)
 			plan = create_unique_plan(root,
 									  (UniquePath *) best_path);
 			break;
+		case T_Gather:
+			plan = (Plan *) create_gather_plan(root,
+											   (GatherPath *) best_path);
+			break;
 		default:
 			elog(ERROR, "unrecognized node type: %d",
 				 (int) best_path->pathtype);
@@ -560,6 +569,7 @@ disuse_physical_tlist(PlannerInfo *root, Plan *plan, Path *path)
 	{
 		case T_SeqScan:
 		case T_SampleScan:
+		case T_Gather:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -1194,6 +1204,66 @@ create_samplescan_plan(PlannerInfo *root, Path *best_path,
 }
 
 /*
+ * create_gather_plan
+ *
+ * Returns a gather plan for the base relation scanned by
+ * 'best_path'.
+ */
+static Gather *
+create_gather_plan(PlannerInfo *root, GatherPath *best_path)
+{
+	Gather	   *gather_plan;
+	Plan	   *subplan;
+	List	   *tlist;
+	RelOptInfo *rel = best_path->path.parent;
+	Index		scan_relid = best_path->path.parent->relid;
+
+	/*
+	 * For table scans, rather than using the relation targetlist (which is
+	 * only those Vars actually needed by the query), we prefer to generate a
+	 * tlist containing all Vars in order.  This will allow the executor to
+	 * optimize away projection of the table tuples, if possible.  (Note that
+	 * planner.c may replace the tlist we generate here, forcing projection to
+	 * occur.)
+	 */
+	if (use_physical_tlist(root, rel))
+	{
+		tlist = build_physical_tlist(root, rel);
+		/* if fail because of dropped cols, use regular method */
+		if (tlist == NIL)
+			tlist = build_path_tlist(root, &best_path->path);
+	}
+	else
+	{
+		tlist = build_path_tlist(root, &best_path->path);
+	}
+
+	/* it should be a base rel... */
+	Assert(scan_relid > 0);
+	Assert(best_path->path.parent->rtekind == RTE_RELATION);
+
+	subplan = create_plan_recurse(root, best_path->subpath);
+
+	/*
+	 * quals for subplan and top level plan are same as either all the quals
+	 * are pushed to subplan (partialseqscan plan) or parallel plan won't be
+	 * chosen.
+	 */
+	gather_plan = make_gather(tlist,
+							  subplan->qual,
+							  scan_relid,
+							  best_path->num_workers,
+							  subplan);
+
+	copy_path_costsize(&gather_plan->scan.plan, &best_path->path);
+
+	/* use parallel mode for parallel plans. */
+	root->glob->parallelModeNeeded = true;
+
+	return gather_plan;
+}
+
+/*
  * create_indexscan_plan
  *	  Returns an indexscan plan for the base relation scanned by 'best_path'
  *	  with restriction clauses 'scan_clauses' and targetlist 'tlist'.
@@ -3462,6 +3532,27 @@ make_samplescan(List *qptlist,
 	return node;
 }
 
+static Gather *
+make_gather(List *qptlist,
+			List *qpqual,
+			Index scanrelid,
+			int nworkers,
+			Plan *subplan)
+{
+	Gather	   *node = makeNode(Gather);
+	Plan	   *plan = &node->scan.plan;
+
+	/* cost should be inserted by caller */
+	plan->targetlist = qptlist;
+	plan->qual = qpqual;
+	plan->lefttree = subplan;
+	plan->righttree = NULL;
+	node->scan.scanrelid = scanrelid;
+	node->num_workers = nworkers;
+
+	return node;
+}
+
 static IndexScan *
 make_indexscan(List *qptlist,
 			   List *qpqual,
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 3c81697..97a4156 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -470,6 +470,26 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 					fix_scan_expr(root, (Node *) splan->tablesample, rtoffset);
 			}
 			break;
+		case T_Gather:
+			{
+				Gather	   *splan = (Gather *) plan;
+
+				/*
+				 * target list for lefttree of gather plan should be same as
+				 * for gather scan as both nodes need to produce same
+				 * projection. We don't want to do this assignment after
+				 * fixing references as that will be done separately for
+				 * lefttree node.
+				 */
+				splan->scan.plan.lefttree->targetlist = splan->scan.plan.targetlist;
+
+				splan->scan.scanrelid += rtoffset;
+				splan->scan.plan.targetlist =
+					fix_scan_list(root, splan->scan.plan.targetlist, rtoffset);
+				splan->scan.plan.qual =
+					fix_scan_list(root, splan->scan.plan.qual, rtoffset);
+			}
+			break;
 		case T_IndexScan:
 			{
 				IndexScan  *splan = (IndexScan *) plan;
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index d0bc412..78f3ce1 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2243,6 +2243,10 @@ finalize_plan(PlannerInfo *root, Plan *plan, Bitmapset *valid_params,
 			context.paramids = bms_add_members(context.paramids, scan_params);
 			break;
 
+		case T_Gather:
+			context.paramids = bms_add_members(context.paramids, scan_params);
+			break;
+
 		case T_IndexScan:
 			finalize_primnode((Node *) ((IndexScan *) plan)->indexqual,
 							  &context);
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 935bc2b..18ef9dc 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -732,6 +732,32 @@ create_samplescan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer
 }
 
 /*
+ * create_gather_path
+ *
+ *	  Creates a path corresponding to a gather scan, returning the
+ *	  pathnode.
+ */
+GatherPath *
+create_gather_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
+				   Relids required_outer, int nworkers)
+{
+	GatherPath *pathnode = makeNode(GatherPath);
+
+	pathnode->path.pathtype = T_Gather;
+	pathnode->path.parent = rel;
+	pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
+														  required_outer);
+	pathnode->path.pathkeys = NIL;		/* Gather has unordered result */
+
+	pathnode->subpath = subpath;
+	pathnode->num_workers = nworkers;
+
+	cost_gather(pathnode, root, rel, pathnode->path.param_info);
+
+	return pathnode;
+}
+
+/*
  * create_index_path
  *	  Creates a path node for an index scan.
  *
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index baa43b2..58fc6d3 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -103,6 +103,7 @@
 #include "miscadmin.h"
 #include "pg_getopt.h"
 #include "pgstat.h"
+#include "optimizer/cost.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/bgworker_internals.h"
 #include "postmaster/fork_process.h"
diff --git a/src/backend/tcop/dest.c b/src/backend/tcop/dest.c
index d645751..bed1ef2 100644
--- a/src/backend/tcop/dest.c
+++ b/src/backend/tcop/dest.c
@@ -45,9 +45,10 @@
  *		dummy DestReceiver functions
  * ----------------
  */
-static void
+static bool
 donothingReceive(TupleTableSlot *slot, DestReceiver *self)
 {
+	return true;
 }
 
 static void
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index d1f43c5..3781d81 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -42,6 +42,8 @@
 #include "catalog/pg_type.h"
 #include "commands/async.h"
 #include "commands/prepare.h"
+#include "executor/execParallel.h"
+#include "executor/tqueue.h"
 #include "libpq/libpq.h"
 #include "libpq/pqformat.h"
 #include "libpq/pqsignal.h"
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 0df86a2..5eab231 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1117,7 +1117,13 @@ RunFromStore(Portal portal, ScanDirection direction, long count,
 			if (!ok)
 				break;
 
-			(*dest->receiveSlot) (slot, dest);
+			/*
+			 * If we are not able to send the tuple, we assume the destination
+			 * has closed and no more tuples can be sent. If that's the case,
+			 * end the loop.
+			 */
+			if (!((*dest->receiveSlot) (slot, dest)))
+				break;
 
 			ExecClearTuple(slot);
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 17053af..f14b24a 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -588,6 +588,8 @@ const char *const config_group_names[] =
 	gettext_noop("Statistics / Query and Index Statistics Collector"),
 	/* AUTOVACUUM */
 	gettext_noop("Autovacuum"),
+	/* PARALLEL_QUERY */
+	gettext_noop("degree_of_parallelism"),
 	/* CLIENT_CONN */
 	gettext_noop("Client Connection Defaults"),
 	/* CLIENT_CONN_STATEMENT */
@@ -2535,6 +2537,16 @@ static struct config_int ConfigureNamesInt[] =
 	},
 
 	{
+		{"degree_of_parallelism", PGC_SUSET, PARALLEL_QUERY,
+			gettext_noop("Sets the maximum number of simultaneously running backend worker processes."),
+			NULL
+		},
+		&degree_of_parallelism,
+		0, 0, MAX_BACKENDS,
+		NULL, NULL, NULL
+	},
+
+	{
 		{"autovacuum_work_mem", PGC_SIGHUP, RESOURCES_MEM,
 			gettext_noop("Sets the maximum memory to be used by each autovacuum worker process."),
 			NULL,
@@ -2711,6 +2723,26 @@ static struct config_real ConfigureNamesReal[] =
 		DEFAULT_CPU_OPERATOR_COST, 0, DBL_MAX,
 		NULL, NULL, NULL
 	},
+	{
+		{"cpu_tuple_comm_cost", PGC_USERSET, QUERY_TUNING_COST,
+			gettext_noop("Sets the planner's estimate of the cost of "
+				  "passing each tuple (row) from worker to master backend."),
+			NULL
+		},
+		&cpu_tuple_comm_cost,
+		DEFAULT_CPU_TUPLE_COMM_COST, 0, DBL_MAX,
+		NULL, NULL, NULL
+	},
+	{
+		{"parallel_setup_cost", PGC_USERSET, QUERY_TUNING_COST,
+			gettext_noop("Sets the planner's estimate of the cost of "
+				  "setting up environment (shared memory) for parallelism."),
+			NULL
+		},
+		&parallel_setup_cost,
+		DEFAULT_PARALLEL_SETUP_COST, 0, DBL_MAX,
+		NULL, NULL, NULL
+	},
 
 	{
 		{"cursor_tuple_fraction", PGC_USERSET, QUERY_TUNING_OTHER,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 8c65287..16b574c 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -290,6 +290,8 @@
 #cpu_tuple_cost = 0.01			# same scale as above
 #cpu_index_tuple_cost = 0.005		# same scale as above
 #cpu_operator_cost = 0.0025		# same scale as above
+#cpu_tuple_comm_cost = 0.1		# same scale as above
+#parallel_setup_cost = 1000.0	# same scale as above
 #effective_cache_size = 4GB
 
 # - Genetic Query Optimizer -
@@ -502,6 +504,11 @@
 					# autovacuum, -1 means use
 					# vacuum_cost_limit
 
+#------------------------------------------------------------------------------
+# PARALLEL_QUERY PARAMETERS
+#------------------------------------------------------------------------------
+
+#degree_of_parallelism = 0		# max number of worker backend subprocesses
 
 #------------------------------------------------------------------------------
 # CLIENT CONNECTION DEFAULTS
diff --git a/src/include/access/printtup.h b/src/include/access/printtup.h
index 46c4148..92ec882 100644
--- a/src/include/access/printtup.h
+++ b/src/include/access/printtup.h
@@ -25,11 +25,11 @@ extern void SendRowDescriptionMessage(TupleDesc typeinfo, List *targetlist,
 
 extern void debugStartup(DestReceiver *self, int operation,
 			 TupleDesc typeinfo);
-extern void debugtup(TupleTableSlot *slot, DestReceiver *self);
+extern bool debugtup(TupleTableSlot *slot, DestReceiver *self);
 
 /* XXX these are really in executor/spi.c */
 extern void spi_dest_startup(DestReceiver *self, int operation,
 				 TupleDesc typeinfo);
-extern void spi_printtup(TupleTableSlot *slot, DestReceiver *self);
+extern bool spi_printtup(TupleTableSlot *slot, DestReceiver *self);
 
 #endif   /* PRINTTUP_H */
diff --git a/src/include/executor/execParallel.h b/src/include/executor/execParallel.h
index 4fc797a..131df82 100644
--- a/src/include/executor/execParallel.h
+++ b/src/include/executor/execParallel.h
@@ -13,24 +13,12 @@
 #ifndef EXECPARALLEL_H
 #define EXECPARALLEL_H
 
-#include "access/parallel.h"
 #include "nodes/execnodes.h"
-#include "nodes/parsenodes.h"
-#include "nodes/plannodes.h"
 
-typedef struct SharedExecutorInstrumentation SharedExecutorInstrumentation;
-
-typedef struct ParallelExecutorInfo
-{
-	PlanState *planstate;
-	ParallelContext *pcxt;
-	BufferUsage *buffer_usage;
-	SharedExecutorInstrumentation *instrumentation;
-	shm_mq_handle **tqueue;
-}	ParallelExecutorInfo;
 
 extern ParallelExecutorInfo *ExecInitParallelPlan(PlanState *planstate,
 					 EState *estate, int nworkers);
 extern void ExecParallelFinish(ParallelExecutorInfo *pei);
+extern bool ExecParallelBufferUsageAccum(Node *node);
 
 #endif   /* EXECPARALLEL_H */
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index f28e56c..795564c 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -14,6 +14,7 @@
 #define INSTRUMENT_H
 
 #include "portability/instr_time.h"
+#include "storage/spin.h"
 
 
 typedef struct BufferUsage
@@ -63,6 +64,22 @@ typedef struct Instrumentation
 	BufferUsage bufusage;		/* Total buffer usage */
 } Instrumentation;
 
+/* DSM structure for accumulating per-PlanState instrumentation. */
+typedef struct SharedPlanStateInstrumentation
+{
+	int plan_node_id;
+	slock_t mutex;
+	Instrumentation	instr;
+} SharedPlanStateInstrumentation;
+
+/* DSM structure for accumulating per-PlanState instrumentation. */
+typedef struct SharedExecutorInstrumentation
+{
+	int instrument_options;
+	int ps_ninstrument;			/* # of ps_instrument structures following */
+	SharedPlanStateInstrumentation ps_instrument[FLEXIBLE_ARRAY_MEMBER];
+} SharedExecutorInstrumentation;
+
 extern PGDLLIMPORT BufferUsage pgBufferUsage;
 
 extern Instrumentation *InstrAlloc(int n, int instrument_options);
diff --git a/src/include/executor/nodeGather.h b/src/include/executor/nodeGather.h
new file mode 100644
index 0000000..fc99633
--- /dev/null
+++ b/src/include/executor/nodeGather.h
@@ -0,0 +1,25 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeGather.h
+ *		prototypes for nodeGather.c
+ *
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeGather.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEGATHER_H
+#define NODEGATHER_H
+
+#include "nodes/execnodes.h"
+
+extern GatherState *ExecInitGather(Gather *node, EState *estate, int eflags);
+extern TupleTableSlot *ExecGather(GatherState *node);
+extern void ExecEndGather(GatherState *node);
+extern void DestroyParallelSetupAndAccumStats(GatherState *node);
+extern void ExecReScanGather(GatherState *node);
+
+#endif   /* NODEGATHER_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 4ae2f3e..6a8c107 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -16,7 +16,9 @@
 
 #include "access/genam.h"
 #include "access/heapam.h"
+#include "access/parallel.h"
 #include "executor/instrument.h"
+#include "executor/tqueue.h"
 #include "lib/pairingheap.h"
 #include "nodes/params.h"
 #include "nodes/plannodes.h"
@@ -421,7 +423,6 @@ typedef struct EState
 	bool	   *es_epqScanDone; /* true if EPQ tuple has been fetched */
 } EState;
 
-
 /*
  * ExecRowMark -
  *	   runtime representation of FOR [KEY] UPDATE/SHARE clauses
@@ -1049,6 +1050,13 @@ typedef struct PlanState
 	Bitmapset  *chgParam;		/* set of IDs of changed Params */
 
 	/*
+	 * At execution time, parallel scan descriptor is initialized and stored
+	 * in dynamic shared memory segment by master backend and parallel workers
+	 * retrieve it from shared memory.
+	 */
+	shm_toc    *toc;
+
+	/*
 	 * Other run-time state needed by most if not all node types.
 	 */
 	TupleTableSlot *ps_ResultTupleSlot; /* slot for my result tuples */
@@ -1058,6 +1066,15 @@ typedef struct PlanState
 								 * functions in targetlist */
 } PlanState;
 
+typedef struct ParallelExecutorInfo
+{
+	PlanState *planstate;
+	ParallelContext *pcxt;
+	BufferUsage *buffer_usage;
+	SharedExecutorInstrumentation *instrumentation;
+	shm_mq_handle **tqueue;
+} ParallelExecutorInfo;
+
 /* ----------------
  *	these are defined to avoid confusion problems with "left"
  *	and "right" and "inner" and "outer".  The convention is that
@@ -1273,6 +1290,27 @@ typedef struct SampleScanState
 } SampleScanState;
 
 /*
+ * GatherState extends ScanState by storing additional information
+ * related to parallel workers.
+ *		ParallelExecutorInfo	parallel execution info for managing generic state information
+ *							required for parallelism.
+ *		funnel				maintains the runtime information about queues used to
+ *							receive data from parallel workers.
+ *		fs_workersReady		indicates that workers are launched.
+ *		all_workers_done	indicates that all the data from workers has been received.
+ *		local_scan_done		indicates that local scan is completed.
+ */
+typedef struct GatherState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	ParallelExecutorInfo *pei;
+	TupleQueueFunnel *funnel;
+	bool		fs_workersReady;
+	bool		all_workers_done;
+	bool		local_scan_done;
+} GatherState;
+
+/*
  * These structs store information about index quals that don't have simple
  * constant right-hand sides.  See comments for ExecIndexBuildScanKeys()
  * for discussion.
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 274480e..c014532 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -52,6 +52,7 @@ typedef enum NodeTag
 	T_Scan,
 	T_SeqScan,
 	T_SampleScan,
+	T_Gather,
 	T_IndexScan,
 	T_IndexOnlyScan,
 	T_BitmapIndexScan,
@@ -99,6 +100,7 @@ typedef enum NodeTag
 	T_ScanState,
 	T_SeqScanState,
 	T_SampleScanState,
+	T_GatherState,
 	T_IndexScanState,
 	T_IndexOnlyScanState,
 	T_BitmapIndexScanState,
@@ -223,6 +225,7 @@ typedef enum NodeTag
 	T_IndexOptInfo,
 	T_ParamPathInfo,
 	T_Path,
+	T_GatherPath,
 	T_IndexPath,
 	T_BitmapHeapPath,
 	T_BitmapAndPath,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 1e2d2bb..d1fea12 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -297,6 +297,16 @@ typedef struct SampleScan
 	struct TableSampleClause *tablesample;
 } SampleScan;
 
+/* ------------
+ *		Gather node
+ * ------------
+ */
+typedef struct Gather
+{
+	Scan		scan;
+	int			num_workers;
+} Gather;
+
 /* ----------------
  *		index scan node
  *
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 961b5d1..20192c9 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -763,6 +763,13 @@ typedef struct Path
 	/* pathkeys is a List of PathKey nodes; see above */
 } Path;
 
+typedef struct GatherPath
+{
+	Path		path;
+	Path	   *subpath;		/* path for each worker */
+	int			num_workers;
+} GatherPath;
+
 /* Macro for extracting a path's parameterization relids; beware double eval */
 #define PATH_REQ_OUTER(path)  \
 	((path)->param_info ? (path)->param_info->ppi_req_outer : (Relids) NULL)
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index dd43e45..7d536dc 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -26,6 +26,13 @@
 #define DEFAULT_CPU_TUPLE_COST	0.01
 #define DEFAULT_CPU_INDEX_TUPLE_COST 0.005
 #define DEFAULT_CPU_OPERATOR_COST  0.0025
+#define DEFAULT_CPU_TUPLE_COMM_COST 0.1
+/*
+ * XXX - We have kept reasonably high value for default parallel
+ * setup cost. In future we might want to change this value based
+ * on results.
+ */
+#define DEFAULT_PARALLEL_SETUP_COST  1000.0
 
 #define DEFAULT_EFFECTIVE_CACHE_SIZE  524288	/* measured in pages */
 
@@ -48,8 +55,11 @@ extern PGDLLIMPORT double random_page_cost;
 extern PGDLLIMPORT double cpu_tuple_cost;
 extern PGDLLIMPORT double cpu_index_tuple_cost;
 extern PGDLLIMPORT double cpu_operator_cost;
+extern PGDLLIMPORT double cpu_tuple_comm_cost;
+extern PGDLLIMPORT double parallel_setup_cost;
 extern PGDLLIMPORT int effective_cache_size;
 extern Cost disable_cost;
+extern int	degree_of_parallelism;
 extern bool enable_seqscan;
 extern bool enable_indexscan;
 extern bool enable_indexonlyscan;
@@ -70,6 +80,8 @@ extern void cost_seqscan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 			 ParamPathInfo *param_info);
 extern void cost_samplescan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 				ParamPathInfo *param_info);
+extern void cost_gather(GatherPath *path, PlannerInfo *root,
+			RelOptInfo *baserel, ParamPathInfo *param_info);
 extern void cost_index(IndexPath *path, PlannerInfo *root,
 		   double loop_count);
 extern void cost_bitmap_heap_scan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 161644c..cc00ba5 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -34,6 +34,9 @@ extern Path *create_seqscan_path(PlannerInfo *root, RelOptInfo *rel,
 					Relids required_outer);
 extern Path *create_samplescan_path(PlannerInfo *root, RelOptInfo *rel,
 					   Relids required_outer);
+extern GatherPath *create_gather_path(PlannerInfo *root,
+				   RelOptInfo *rel, Path *subpath, Relids required_outer,
+				   int nworkers);
 extern IndexPath *create_index_path(PlannerInfo *root,
 				  IndexOptInfo *index,
 				  List *indexclauses,
diff --git a/src/include/tcop/dest.h b/src/include/tcop/dest.h
index b560672..91acd60 100644
--- a/src/include/tcop/dest.h
+++ b/src/include/tcop/dest.h
@@ -104,7 +104,9 @@ typedef enum
  *		pointers that the executor must call.
  *
  * Note: the receiveSlot routine must be passed a slot containing a TupleDesc
- * identical to the one given to the rStartup routine.
+ * identical to the one given to the rStartup routine.  It returns bool where
+ * a "true" value means "continue processing" and a "false" value means
+ * "stop early, just as if we'd reached the end of the scan".
  * ----------------
  */
 typedef struct _DestReceiver DestReceiver;
@@ -112,7 +114,7 @@ typedef struct _DestReceiver DestReceiver;
 struct _DestReceiver
 {
 	/* Called for each tuple to be output: */
-	void		(*receiveSlot) (TupleTableSlot *slot,
+	bool		(*receiveSlot) (TupleTableSlot *slot,
 											DestReceiver *self);
 	/* Per-executor-run initialization and shutdown: */
 	void		(*rStartup) (DestReceiver *self,
diff --git a/src/include/tcop/tcopprot.h b/src/include/tcop/tcopprot.h
index 96c5b8b..8ae7a16 100644
--- a/src/include/tcop/tcopprot.h
+++ b/src/include/tcop/tcopprot.h
@@ -19,6 +19,7 @@
 #ifndef TCOPPROT_H
 #define TCOPPROT_H
 
+#include "executor/execParallel.h"
 #include "nodes/params.h"
 #include "nodes/parsenodes.h"
 #include "nodes/plannodes.h"
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 7a58ddb..3505d31 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -85,6 +85,7 @@ enum config_group
 	STATS_MONITORING,
 	STATS_COLLECTOR,
 	AUTOVACUUM,
+	PARALLEL_QUERY,
 	CLIENT_CONN,
 	CLIENT_CONN_STATEMENT,
 	CLIENT_CONN_LOCALE,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 0e149ea..6e1456b 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -707,6 +707,9 @@ FunctionParameterMode
 FunctionScan
 FunctionScanPerFuncState
 FunctionScanState
+Gather
+GatherPath
+GatherState
 FuzzyAttrMatchState
 GBT_NUMKEY
 GBT_NUMKEY_R
#368Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#367)
Re: Parallel Seq Scan

On Tue, Sep 29, 2015 at 12:39 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Attached patch is a rebased patch based on latest commit (d1b7c1ff)
for Gather node.

- I have to reorganize the defines in execParallel.h and .c. To keep
ParallelExecutorInfo, in GatherState node, we need to include execParallel.h
in execnodes.h which was creating compilation issues as execParallel.h
also includes execnodes.h, so for now I have defined ParallelExecutorInfo
in execnodes.h and instrumentation related structures in instrument.h.
- Renamed parallel_seqscan_degree to degree_of_parallelism
- Rename Funnel to Gather
- Removed PARAM_EXEC parameter handling code, I think we can do this
separately.

I have to work more on partial seq scan patch for rebasing it and handling
review comments for the same, so for now I am sending the first part of
patch (which included Gather node functionality and some general support
for parallel-query execution).

Thanks for the fast rebase.

This patch needs a bunch of cleanup:

- The formatting for the GatherState node's comment block is unlike
that of surrounding comment blocks. It lacks the ------- dividers,
and the indentation is not the same. Also, it refers to
ParallelExecutorInfo by the type name, but the other members by
structure member name. The convention is to refer to them by
structure member name, so please do that.

- The naming of fs_workersReady is inconsistent with the other
structure members. The other members use all lower-case names,
separating words with underscores, but this one uses a capital letter. The
other members also don't prefix the names with anything, but this uses
a "fs_" prefix which I assume is left over from when this was called
FunnelState. Finally, this doesn't actually tell you when workers are
ready, just whether they were launched. I suggest we rename this to
"any_worker_launched".

- Instead of moving the declaration of ParallelExecutorInfo, please
just refer to it as "struct ParallelExecutorInfo" in execnodes.h.
That way, you're not sucking these includes into all kinds of places
they don't really need to be.

- Let's not create a new PARALLEL_QUERY category of GUC. Instead,
let's put the GUC for the number of workers under resource usage ->
asynchronous behavior.

- I don't believe that shm_toc *toc has any business being part of a
generic PlanState node. At most, it should be part of an individual
type of PlanState, like a GatherState or PartialSeqScanState. But
really, I don't see why we need it there at all. It should, I think,
only be needed during startup to dig out the information we need. So
we should just dig that stuff out and keep pointers to whatever we
actually need - in this case the ParallelExecutorInfo, I think - in
the particular type of PlanState node that's at issue - here
GatherState. After that we don't need a pointer to the toc any more.

- I'd like to do some renaming of the new GUCs. I suggest we rename
cpu_tuple_comm_cost to parallel_tuple_cost and degree_of_parallelism
to max_parallel_degree.

- I think that a Gather node should inherit from Plan, not Scan. A
Gather node really shouldn't have a scanrelid. Now, admittedly, if
the only thing under the Gather is a Partial Seq Scan, it wouldn't be
totally bonkers to think of the Gather as scanning the same relation
that the Partial Seq Scan is scanning. But in any more complex case,
like where it's scanning a join, you're out of luck. You'll have to
set scanrelid == 0, I suppose, but then, for example, ExecScanReScan
is not going to work. In fact, as far as I can see, the only way
nodeGather.c is actually using any of the generic scan stuff is by
calling ExecInitScanTupleSlot, which is all of one line of code.
ExecEndGather fetches node->ss.ss_currentRelation but then does
nothing with it. So I think this is just a holdover from an early
version of this patch where what's now Gather and PartialSeqScan were
a single node, and I think we should rip it out.

- On a related note, the assertions in cost_gather() are both bogus
and should be removed. Similarly with create_gather_plan(). As
previously mentioned, the Gather node should not care what sort of
thing is under it; I am not interested in restricting it to baserels
and then undoing that later.

- For the same reason, cost_gather() should refer to its third
argument as "rel" not "baserel".

- Also, I think this stuff about physical tlists in
create_gather_plan() is bogus. use_physical_tlist is ignorant of the
possibility that the RelOptInfo passed to it might be anything other
than a baserel, and I think it won't be happy if it gets a joinrel.
Moreover, I think our plan here is that, at least for now, the
Gather's tlist will always match the tlist of its child. If that's
so, there's no point to this: it will end up with the same tlist
either way. If any projection is needed, it should be done by the
Gather node's child, not the Gather node itself.

- Let's rename DestroyParallelSetupAndAccumStats to
ExecShutdownGather. Instead of encasing the entire function in an if
statement, let's start with if (node->pei == NULL || node->pei->pcxt
== NULL) return.

- ExecParallelBufferUsageAccum should be declared to take an argument
of type PlanState, not Node. Then you don't have to cast what you are
passing to it, and it doesn't have to cast before calling itself. And,
let's also rename it to ExecShutdownNode and move it to
execProcnode.c. Having a "shutdown phase" that stops a node from
asynchronously consuming additional resources could be useful for
non-parallel node types - especially ForeignScan and CustomScan. And
we could eventually extend this to be called in other situations, like
when a Limit is filled, to give everything beneath it a chance to ease up.
We don't have to do those bits of work right now but it seems well
worth making this look like a generic facility.

- Calling DestroyParallelSetupAndAccumStats from ExplainNode when we
actually reach the Gather node is much too late. We should really be
shutting down parallel workers at the end of the ExecutorRun phase, or
certainly no later than ExecutorFinish. In fact, you have
standard_ExecutorRun calling ExecParallelBufferUsageAccum() but only
if queryDesc->totaltime is set. What I think you should do instead is
call ExecShutdownNode a few lines earlier, before shutting down the
tuple receiver, and do so unconditionally. That way, the workers are
always shut down in the ExecutorRun phase, which should eliminate the
need for this bit in explain.c.

- The changes to postmaster.c and postgres.c consist of only
additional #includes. Those can, presumably, be reverted.
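
As a rough illustration of the ExecShutdownGather and ExecShutdownNode
suggestions above, here is a minimal, self-contained sketch. It does not
use the real PostgreSQL types: DemoPlanState, DemoGatherState,
DemoParallelInfo, demo_shutdown_gather() and demo_shutdown_node() are all
hypothetical stand-ins, and the actual teardown work is reduced to
clearing a pointer. It only shows the early-return guard and a generic
walker that takes a plan-state pointer, so no casts are needed when it
recurses.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for NodeTag and PlanState. */
typedef enum DemoNodeTag { DEMO_SEQSCAN, DEMO_GATHER } DemoNodeTag;

typedef struct DemoPlanState
{
    DemoNodeTag type;
    struct DemoPlanState *lefttree;
    struct DemoPlanState *righttree;
} DemoPlanState;

/* Stand-in for ParallelExecutorInfo: pcxt is non-NULL while workers exist. */
typedef struct DemoParallelInfo
{
    void *pcxt;
} DemoParallelInfo;

typedef struct DemoGatherState
{
    DemoPlanState ps;       /* first member, so a DemoPlanState * can be cast */
    DemoParallelInfo *pei;
    int shutdown_count;     /* demo only: counts real teardowns */
} DemoGatherState;

/* Early return instead of wrapping the whole body in an if statement. */
static void
demo_shutdown_gather(DemoGatherState *node)
{
    if (node->pei == NULL || node->pei->pcxt == NULL)
        return;

    /* Real code would accumulate stats and destroy the parallel context. */
    node->pei->pcxt = NULL;
    node->shutdown_count++;
}

/* Generic walker: takes the plan-state type directly and recurses. */
static void
demo_shutdown_node(DemoPlanState *node)
{
    if (node == NULL)
        return;

    if (node->type == DEMO_GATHER)
        demo_shutdown_gather((DemoGatherState *) node);

    demo_shutdown_node(node->lefttree);
    demo_shutdown_node(node->righttree);
}
```

Because the guard makes the shutdown idempotent, calling
demo_shutdown_node() unconditionally at the end of a run is harmless even
when no workers were ever launched.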

Other than that, hah hah, it looks pretty cool.
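
For the "struct ParallelExecutorInfo" point in the list above, the idea is
that a header can hold a pointer to a struct it never defines, so it does
not need to include the header that does. A minimal sketch, with made-up
names (DemoExecutorInfo, DemoGatherNode, demo_gather_nworkers) standing in
for the real types, and with both "files" shown in one translation unit
for brevity:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* --- what a widely included header (like execnodes.h) would contain --- */

struct DemoExecutorInfo;            /* incomplete type: no definition needed */

typedef struct DemoGatherNode
{
    struct DemoExecutorInfo *pei;   /* pointer to incomplete type is legal */
    bool any_worker_launched;
} DemoGatherNode;

/* --- what only the .c file that needs the details would see --- */

struct DemoExecutorInfo
{
    int nworkers;
};

/* Only code that sees the full definition may dereference the pointer. */
static int
demo_gather_nworkers(const DemoGatherNode *node)
{
    return node->pei ? node->pei->nworkers : 0;
}
```

The cost is that code seeing only the forward declaration cannot look
inside the struct, which is usually what you want for a header that is
pulled into many places.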

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#369Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#352)
Re: Parallel Seq Scan

On Thu, Sep 24, 2015 at 2:31 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

[ parallel_seqscan_partialseqscan_v18.patch ]

I spent a bit of time reviewing the heapam.c changes in this patch
this evening, and I think that your attempt to add support for
synchronized scans has some problems.

- In both heapgettup() and heapgettup_pagemode(), you call
ss_report_location() on discovering that we're trying to initialize
after the scan is already complete. This seems wrong. For the
reasons noted in the existing comments, it's good for the backend that
finishes the scan to report the starting position as the current
position hint, but you don't want every parallel backend to do it in
turn. Unrelated, overlapping scans might be trying to continue
advancing the scan, and you don't want to drag the position hint
backward for no reason.

- heap_parallelscan_initialize_startblock() calls ss_get_location()
while holding a spinlock. This is clearly no good, because spinlocks
can only be held while executing straight-line code that does not
itself acquire locks - and ss_get_location() takes an *LWLock*. Among
other problems, an error anywhere inside ss_get_location() would leave
behind a stuck spinlock.

- There's no point that I can see in initializing rs_startblock at all
when a ParallelHeapScanDesc is in use. The ParallelHeapScanDesc, not
rs_startblock, is going to tell you what page to scan next.

I think heap_parallelscan_initialize_startblock() should basically do
this, in the synchronized scan case:

    SpinLockAcquire(&parallel_scan->phs_mutex);
    page = parallel_scan->phs_startblock;
    SpinLockRelease(&parallel_scan->phs_mutex);
    if (page != InvalidBlockNumber)
        return;     /* some other process already did this */
    page = ss_get_location(scan->rs_rd, scan->rs_nblocks);
    SpinLockAcquire(&parallel_scan->phs_mutex);
    /* even though we checked before, someone might have beaten us here */
    if (parallel_scan->phs_startblock == InvalidBlockNumber)
    {
        parallel_scan->phs_startblock = page;
        parallel_scan->phs_cblock = page;
    }
    SpinLockRelease(&parallel_scan->phs_mutex);
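
The same check, compute outside the lock, then re-check shape can be tried
outside the backend. The sketch below is only an approximation: a pthread
mutex stands in for the spinlock, and DemoSharedScan, demo_get_location(),
demo_hint and demo_init_startblock() are hypothetical names, not anything
from the patch. The point it illustrates is that the potentially blocking
lookup runs with no lock held, and that a later caller never overwrites a
start block that was already published.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/* Hypothetical stand-in for the shared parallel scan descriptor. */
typedef struct DemoSharedScan
{
    pthread_mutex_t mutex;      /* stands in for phs_mutex */
    BlockNumber startblock;     /* stands in for phs_startblock */
    BlockNumber cblock;         /* stands in for phs_cblock */
} DemoSharedScan;

/* Stand-in for ss_get_location(): assume it may take other locks. */
static BlockNumber demo_hint = 0;
static int demo_location_calls = 0;

static BlockNumber
demo_get_location(void)
{
    demo_location_calls++;
    return demo_hint;
}

static void
demo_shared_scan_init(DemoSharedScan *ss)
{
    pthread_mutex_init(&ss->mutex, NULL);
    ss->startblock = InvalidBlockNumber;
    ss->cblock = InvalidBlockNumber;
}

static void
demo_init_startblock(DemoSharedScan *ss)
{
    BlockNumber page;

    /* First check: has someone already chosen the start block? */
    pthread_mutex_lock(&ss->mutex);
    page = ss->startblock;
    pthread_mutex_unlock(&ss->mutex);
    if (page != InvalidBlockNumber)
        return;

    /* The lookup happens with no lock held. */
    page = demo_get_location();

    /* Second check: another process may have won the race meanwhile. */
    pthread_mutex_lock(&ss->mutex);
    if (ss->startblock == InvalidBlockNumber)
    {
        ss->startblock = page;
        ss->cblock = page;
    }
    pthread_mutex_unlock(&ss->mutex);
}
```

Two racing callers may both perform the lookup, but only the first one to
reacquire the lock publishes its result; the other's value is discarded.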

- heap_parallelscan_nextpage() seems to have gotten unnecessarily
complicated. I particularly dislike the way you increment phs_cblock
and then sometimes try to back it out later. Let's decide that
phs_cblock == InvalidBlockNumber means that the scan is finished,
while phs_cblock == phs_startblock means that we're just starting. We
then don't need phs_firstpass at all, and can write:

    SpinLockAcquire(&parallel_scan->phs_mutex);
    page = parallel_scan->phs_cblock;
    if (page != InvalidBlockNumber)
    {
        parallel_scan->phs_cblock++;
        if (parallel_scan->phs_cblock >= scan->rs_nblocks)
            parallel_scan->phs_cblock = 0;
        if (parallel_scan->phs_cblock == parallel_scan->phs_startblock)
        {
            parallel_scan->phs_cblock = InvalidBlockNumber;
            report_scan_done = true;
        }
    }
    SpinLockRelease(&parallel_scan->phs_mutex);

At this point, if page contains InvalidBlockNumber, then the scan is
done, and if it contains anything else, that's the next page that the
current process should scan. If report_scan_done is true, we are the
first to observe that the scan is done and should call
ss_report_location().
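
The page hand-out logic above can be tried outside the server with a small standalone sketch. Everything here is an assumption made for illustration: ParallelScanSketch is a simplified stand-in for ParallelHeapScanDesc, INVALID_BLOCK stands in for InvalidBlockNumber, and a C11 atomic_flag plays the role of the phs_mutex spinlock. It is not the patch's code, only the same wraparound rule in isolation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define INVALID_BLOCK UINT32_MAX	/* stand-in for InvalidBlockNumber */

/* Hypothetical, simplified stand-in for ParallelHeapScanDesc. */
typedef struct ParallelScanSketch
{
	atomic_flag lock;			/* plays the role of phs_mutex */
	uint32_t	nblocks;		/* rs_nblocks in the real code */
	uint32_t	startblock;		/* phs_startblock */
	uint32_t	cblock;			/* phs_cblock: next page to hand out */
} ParallelScanSketch;

static void
parallel_scan_sketch_init(ParallelScanSketch *ps, uint32_t nblocks,
						  uint32_t startblock)
{
	assert(nblocks > 0 && startblock < nblocks);
	atomic_flag_clear(&ps->lock);
	ps->nblocks = nblocks;
	ps->startblock = startblock;
	ps->cblock = startblock;	/* cblock == startblock: just starting */
}

/*
 * Return the next page to scan, or INVALID_BLOCK once the scan is finished.
 * *report_scan_done is set to true only for the caller that takes the last
 * page, i.e. the first one to observe that the scan is done.
 */
static uint32_t
parallel_scan_sketch_nextpage(ParallelScanSketch *ps, bool *report_scan_done)
{
	uint32_t	page;

	*report_scan_done = false;

	/* spin until we own the lock; nothing inside can fail or block */
	while (atomic_flag_test_and_set_explicit(&ps->lock, memory_order_acquire))
		;

	page = ps->cblock;
	if (page != INVALID_BLOCK)
	{
		ps->cblock++;
		if (ps->cblock >= ps->nblocks)
			ps->cblock = 0;		/* wrap around to the start of the relation */
		if (ps->cblock == ps->startblock)
		{
			ps->cblock = INVALID_BLOCK; /* back where we began: finished */
			*report_scan_done = true;
		}
	}

	atomic_flag_clear_explicit(&ps->lock, memory_order_release);
	return page;
}
```

With four blocks and a start block of 2, successive calls hand out pages 2, 3, 0 and 1, and only the call that returns page 1 sets the done flag; every later call returns INVALID_BLOCK.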

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#370Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#368)
1 attachment(s)
Re: Parallel Seq Scan

On Tue, Sep 29, 2015 at 10:49 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Sep 29, 2015 at 12:39 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Attached patch is a rebased patch based on latest commit (d1b7c1ff)
for Gather node.

- I had to reorganize the defines in execParallel.h and .c. To keep
ParallelExecutorInfo in the GatherState node, we need to include
execParallel.h in execnodes.h, which was creating compilation issues as
execParallel.h also includes execnodes.h, so for now I have defined
ParallelExecutorInfo in execnodes.h and the instrumentation-related
structures in instrument.h.
- Renamed parallel_seqscan_degree to degree_of_parallelism
- Rename Funnel to Gather
- Removed PARAM_EXEC parameter handling code, I think we can do this
separately.

I have to work more on partial seq scan patch for rebasing it and handling
review comments for the same, so for now I am sending the first part of
patch (which included Gather node functionality and some general support
for parallel-query execution).

Thanks for the fast rebase.

This patch needs a bunch of cleanup:

- The formatting for the GatherState node's comment block is unlike
that of surrounding comment blocks. It lacks the ------- dividers,
and the indentation is not the same. Also, it refers to
ParallelExecutorInfo by the type name, but the other members by
structure member name. The convention is to refer to them by
structure member name, so please do that.

- The naming of fs_workersReady is inconsistent with the other
structure members. The other members use all lower-case names,
separating words with underscores, but this one uses a capital letter. The
other members also don't prefix the names with anything, but this uses
a "fs_" prefix which I assume is left over from when this was called
FunnelState. Finally, this doesn't actually tell you when workers are
ready, just whether they were launched. I suggest we rename this to
"any_worker_launched".

- Instead of moving the declaration of ParallelExecutorInfo, please
just refer to it as "struct ParallelExecutorInfo" in execnodes.h.
That way, you're not sucking these includes into all kinds of places
they don't really need to be.

- Let's not create a new PARALLEL_QUERY category of GUC. Instead,
let's put the GUC for the number of workers under resource usage ->
asynchronous behavior.

Changed as per suggestion.
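
The earlier suggestion to refer to "struct ParallelExecutorInfo" in execnodes.h, rather than moving its declaration, relies on C allowing a pointer to an incomplete struct type. A minimal sketch of that idea follows, with the two headers collapsed into one file. GatherStateSketch and the nworkers field are hypothetical names used only for illustration; they are not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * "execnodes.h" side: a forward declaration is enough, because the state
 * node only stores a pointer and never looks inside the struct.
 */
struct ParallelExecutorInfo;

typedef struct GatherStateSketch
{
	struct ParallelExecutorInfo *pei;	/* incomplete type is fine here */
	int			any_worker_launched;
} GatherStateSketch;

/*
 * "execParallel.h" side: the full definition, needed only by the files that
 * actually dereference the pointer.  The field is a hypothetical example.
 */
struct ParallelExecutorInfo
{
	int			nworkers;
};

/* Code that includes the full definition can look inside the struct. */
static int
gather_sketch_nworkers(const GatherStateSketch *gs)
{
	assert(gs != NULL);
	return gs->pei ? gs->pei->nworkers : 0;
}
```

Only translation units that call something like gather_sketch_nworkers() need the complete type, so the header that declares the state node does not have to include the parallel-executor header at all.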

- I don't believe that shm_toc *toc has any business being part of a
generic PlanState node. At most, it should be part of an individual
type of PlanState, like a GatherState or PartialSeqScanState. But
really, I don't see why we need it there at all.

We need it to get the parallelheapscan descriptor in the case of a
partial sequence scan node. It doesn't seem like a good idea
to retrieve it at the beginning, as we would need to dig into the plan
tree to get the node_id used to look up the parallelheapscan descriptor
in the toc.

Now, I think we can surely keep it in PartialSeqScanState or any
other node state which might need it later, but I felt this is quite
generic and we might need to fetch node specific information from toc
going forward.

- I'd like to do some renaming of the new GUCs. I suggest we rename
cpu_tuple_comm_cost to parallel_tuple_cost and degree_of_parallelism
to max_parallel_degree.

Changed as per suggestion.

- I think that a Gather node should inherit from Plan, not Scan. A
Gather node really shouldn't have a scanrelid. Now, admittedly, if
the only thing under the Gather is a Partial Seq Scan, it wouldn't be
totally bonkers to think of the Gather as scanning the same relation
that the Partial Seq Scan is scanning. But in any more complex case,
like where it's scanning a join, you're out of luck. You'll have to
set scanrelid == 0, I suppose, but then, for example, ExecScanReScan
is not going to work. In fact, as far as I can see, the only way
nodeGather.c is actually using any of the generic scan stuff is by
calling ExecInitScanTupleSlot, which is all of one line of code.
ExecEndGather fetches node->ss.ss_currentRelation but then does
nothing with it. So I think this is just a holdover from an early
version of this patch where what's now Gather and PartialSeqScan were
a single node, and I think we should rip it out.

Makes sense, and I think GatherState should also inherit from PlanState
instead of ScanState, which I have changed in the attached patch.
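
To make the inheritance point concrete: "inheriting" here means embedding the parent struct as the first member, so a pointer to the child can be treated as a pointer to the parent. The sketch below uses simplified, hypothetical types (PlanSketch, ScanSketch, GatherSketch); only the num_workers field name is taken from the attached patch.

```c
#include <assert.h>
#include <stddef.h>

typedef enum NodeTagSketch
{
	T_SeqScanSketch,
	T_GatherSketch
} NodeTagSketch;

/* Simplified stand-in for Plan: the fields every plan node carries. */
typedef struct PlanSketch
{
	NodeTagSketch type;
	double		total_cost;
} PlanSketch;

/* A Scan adds scanrelid, which only makes sense for a single relation. */
typedef struct ScanSketch
{
	PlanSketch	plan;
	unsigned int scanrelid;
} ScanSketch;

/*
 * A Gather embeds the plan fields directly.  It has no scanrelid, so it can
 * sit on top of a join just as well as on top of a partial scan.
 */
typedef struct GatherSketch
{
	PlanSketch	plan;			/* must be first so the cast below is valid */
	int			num_workers;
} GatherSketch;

/* Generic code sees only the common fields. */
static double
plan_total_cost(const PlanSketch *plan)
{
	assert(plan != NULL);
	return plan->total_cost;
}
```

Because plan is the first member, a GatherSketch pointer can be cast to a PlanSketch pointer for generic code, while nothing in the struct pretends that a Gather scans one particular relation.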

- On a related note, the assertions in cost_gather() are both bogus
and should be removed. Similarly with create_gather_plan(). As
previously mentioned, the Gather node should not care what sort of
thing is under it; I am not interested in restricting it to baserels
and then undoing that later.

- For the same reason, cost_gather() should refer to its third
argument as "rel" not "baserel".

Changed as per suggestion.

- Also, I think this stuff about physical tlists in
create_gather_plan() is bogus. use_physical_tlist is ignorant of the
possibility that the RelOptInfo passed to it might be anything other
than a baserel, and I think it won't be happy if it gets a joinrel.
Moreover, I think our plan here is that, at least for now, the
Gather's tlist will always match the tlist of its child. If that's
so, there's no point to this: it will end up with the same tlist
either way. If any projection is needed, it should be done by the
Gather node's child, not the Gather node itself.

Yes, the Gather node itself doesn't need to do projection, but it
needs the projection info in order to store the tuple in a slot after
fetching it from the tuple queue. This is not required for the Gather
node itself, but it might be required for any node on top of the
Gather node.

Here, I think one thing we could do is use the subplan's target
list, as is currently being done for quals. The only risk is if a
Gating node is added on top of partialseqscan (subplan), but I have checked
that this is safe, because the Gating plan uses the same target list as its
child. Also, I don't think we need to process any quals at the Gather node,
so I will set them to NULL. I will make this change in the next version
unless you see any problem with it.

Yet another idea is that during set_plan_refs(), we can assign the left
child's target list to the parent in the case of a Gather node (right now
it's done the other way around, which would need to be changed).

What is your preference?

- Let's rename DestroyParallelSetupAndAccumStats to
ExecShutdownGather. Instead of encasing the entire function in an if
statement, let's start with if (node->pei == NULL || node->pei->pcxt
== NULL) return.

- ExecParallelBufferUsageAccum should be declared to take an argument
of type PlanState, not Node. Then you don't have to cast what you are
passing to it, and it doesn't have to cast before calling itself. And,
let's also rename it to ExecShutdownNode and move it to
execProcnode.c. Having a "shutdown phase" that stops a node from
asynchronously consuming additional resources could be useful for
non-parallel node types - especially ForeignScan and CustomScan. And
we could eventually extend this to be called in other situations, like
when a Limit is filled, to give everything beneath it a chance to ease up.
We don't have to do those bits of work right now but it seems well
worth making this look like a generic facility.

- Calling DestroyParallelSetupAndAccumStats from ExplainNode when we
actually reach the Gather node is much too late. We should really be
shutting down parallel workers at the end of the ExecutorRun phase, or
certainly no later than ExecutorFinish. In fact, you have
standard_ExecutorRun calling ExecParallelBufferUsageAccum() but only
if queryDesc->totaltime is set. What I think you should do instead is
call ExecShutdownNode a few lines earlier, before shutting down the
tuple receiver, and do so unconditionally. That way, the workers are
always shut down in the ExecutorRun phase, which should eliminate the
need for this bit in explain.c.

- The changes to postmaster.c and postgres.c consist of only
additional #includes. Those can, presumably, be reverted.

Changed as per suggestion.
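
The shape of the shutdown facility discussed above can be sketched with toy types: a guard at the top of the Gather-specific routine makes repeated calls harmless, and a generic routine walks the state tree. PlanStateSketch and its fields are hypothetical stand-ins, with has_parallel_context playing the role of "pei != NULL && pei->pcxt != NULL". One deliberate difference, which is an assumption about intent rather than a description of the patch: the attached ExecShutdownNode returns true at a Gather node, which ends the planstate_tree_walker traversal there, whereas this sketch keeps visiting the remaining nodes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum NodeKindSketch
{
	NODE_GATHER,
	NODE_OTHER
} NodeKindSketch;

/* Hypothetical, simplified stand-in for PlanState. */
typedef struct PlanStateSketch
{
	NodeKindSketch kind;
	struct PlanStateSketch *lefttree;
	struct PlanStateSketch *righttree;
	bool		has_parallel_context;	/* workers and shared memory live? */
	int			shutdown_calls; /* counts real teardowns, for the demo */
} PlanStateSketch;

/* Gather-specific shutdown: early return instead of one big if-block. */
static void
shutdown_gather_sketch(PlanStateSketch *node)
{
	assert(node->kind == NODE_GATHER);

	if (!node->has_parallel_context)
		return;					/* nothing launched, or already shut down */

	/* real code would stop workers and collect their statistics here */
	node->has_parallel_context = false;
	node->shutdown_calls++;
}

/* Generic entry point: visit every node, acting on those that need it. */
static void
shutdown_node_sketch(PlanStateSketch *node)
{
	if (node == NULL)
		return;

	if (node->kind == NODE_GATHER)
		shutdown_gather_sketch(node);

	shutdown_node_sketch(node->lefttree);
	shutdown_node_sketch(node->righttree);
}
```

Calling shutdown_node_sketch() on the root at the end of the run phase, and again during node teardown, tears each Gather down exactly once because the second call hits the guard.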

Note - You will see one or two unrelated changes due to pgindent, but they
are in files related to parallel seq scan, so I have retained them. It
might be better to do them separately; however, as I was running pgindent
on those files anyway, I thought it was okay to retain the changes for
related files.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

parallel_seqscan_gather_v20.patch (application/octet-stream)
diff --git a/src/backend/access/common/printtup.c b/src/backend/access/common/printtup.c
index baed981..639451a 100644
--- a/src/backend/access/common/printtup.c
+++ b/src/backend/access/common/printtup.c
@@ -26,9 +26,9 @@
 
 static void printtup_startup(DestReceiver *self, int operation,
 				 TupleDesc typeinfo);
-static void printtup(TupleTableSlot *slot, DestReceiver *self);
-static void printtup_20(TupleTableSlot *slot, DestReceiver *self);
-static void printtup_internal_20(TupleTableSlot *slot, DestReceiver *self);
+static bool printtup(TupleTableSlot *slot, DestReceiver *self);
+static bool printtup_20(TupleTableSlot *slot, DestReceiver *self);
+static bool printtup_internal_20(TupleTableSlot *slot, DestReceiver *self);
 static void printtup_shutdown(DestReceiver *self);
 static void printtup_destroy(DestReceiver *self);
 
@@ -299,7 +299,7 @@ printtup_prepare_info(DR_printtup *myState, TupleDesc typeinfo, int numAttrs)
  *		printtup --- print a tuple in protocol 3.0
  * ----------------
  */
-static void
+static bool
 printtup(TupleTableSlot *slot, DestReceiver *self)
 {
 	TupleDesc	typeinfo = slot->tts_tupleDescriptor;
@@ -376,13 +376,15 @@ printtup(TupleTableSlot *slot, DestReceiver *self)
 	/* Return to caller's context, and flush row's temporary memory */
 	MemoryContextSwitchTo(oldcontext);
 	MemoryContextReset(myState->tmpcontext);
+
+	return true;
 }
 
 /* ----------------
  *		printtup_20 --- print a tuple in protocol 2.0
  * ----------------
  */
-static void
+static bool
 printtup_20(TupleTableSlot *slot, DestReceiver *self)
 {
 	TupleDesc	typeinfo = slot->tts_tupleDescriptor;
@@ -452,6 +454,8 @@ printtup_20(TupleTableSlot *slot, DestReceiver *self)
 	/* Return to caller's context, and flush row's temporary memory */
 	MemoryContextSwitchTo(oldcontext);
 	MemoryContextReset(myState->tmpcontext);
+
+	return true;
 }
 
 /* ----------------
@@ -528,7 +532,7 @@ debugStartup(DestReceiver *self, int operation, TupleDesc typeinfo)
  *		debugtup - print one tuple for an interactive backend
  * ----------------
  */
-void
+bool
 debugtup(TupleTableSlot *slot, DestReceiver *self)
 {
 	TupleDesc	typeinfo = slot->tts_tupleDescriptor;
@@ -553,6 +557,8 @@ debugtup(TupleTableSlot *slot, DestReceiver *self)
 		printatt((unsigned) i + 1, typeinfo->attrs[i], value);
 	}
 	printf("\t----\n");
+
+	return true;
 }
 
 /* ----------------
@@ -564,7 +570,7 @@ debugtup(TupleTableSlot *slot, DestReceiver *self)
  * This is largely same as printtup_20, except we use binary formatting.
  * ----------------
  */
-static void
+static bool
 printtup_internal_20(TupleTableSlot *slot, DestReceiver *self)
 {
 	TupleDesc	typeinfo = slot->tts_tupleDescriptor;
@@ -636,4 +642,6 @@ printtup_internal_20(TupleTableSlot *slot, DestReceiver *self)
 	/* Return to caller's context, and flush row's temporary memory */
 	MemoryContextSwitchTo(oldcontext);
 	MemoryContextReset(myState->tmpcontext);
+
+	return true;
 }
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index f409aa7..42d4a44 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -4411,7 +4411,7 @@ copy_dest_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
 /*
  * copy_dest_receive --- receive one tuple
  */
-static void
+static bool
 copy_dest_receive(TupleTableSlot *slot, DestReceiver *self)
 {
 	DR_copy    *myState = (DR_copy *) self;
@@ -4423,6 +4423,8 @@ copy_dest_receive(TupleTableSlot *slot, DestReceiver *self)
 	/* And send the data */
 	CopyOneRowTo(cstate, InvalidOid, slot->tts_values, slot->tts_isnull);
 	myState->processed++;
+
+	return true;
 }
 
 /*
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 41183f6..418b0f6 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -62,7 +62,7 @@ typedef struct
 static ObjectAddress CreateAsReladdr = {InvalidOid, InvalidOid, 0};
 
 static void intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo);
-static void intorel_receive(TupleTableSlot *slot, DestReceiver *self);
+static bool intorel_receive(TupleTableSlot *slot, DestReceiver *self);
 static void intorel_shutdown(DestReceiver *self);
 static void intorel_destroy(DestReceiver *self);
 
@@ -482,7 +482,7 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
 /*
  * intorel_receive --- receive one tuple
  */
-static void
+static bool
 intorel_receive(TupleTableSlot *slot, DestReceiver *self)
 {
 	DR_intorel *myState = (DR_intorel *) self;
@@ -507,6 +507,8 @@ intorel_receive(TupleTableSlot *slot, DestReceiver *self)
 				myState->bistate);
 
 	/* We know this is a newly created relation, so there are no indexes */
+
+	return true;
 }
 
 /*
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index f0d9e94..4481505 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -730,6 +730,7 @@ ExplainPreScanNode(PlanState *planstate, Bitmapset **rels_used)
 	{
 		case T_SeqScan:
 		case T_SampleScan:
+		case T_Gather:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -853,6 +854,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_SampleScan:
 			pname = sname = "Sample Scan";
 			break;
+		case T_Gather:
+			pname = sname = "Gather";
+			break;
 		case T_IndexScan:
 			pname = sname = "Index Scan";
 			break;
@@ -1003,6 +1007,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	{
 		case T_SeqScan:
 		case T_SampleScan:
+		case T_Gather:
 		case T_BitmapHeapScan:
 		case T_TidScan:
 		case T_SubqueryScan:
@@ -1276,6 +1281,14 @@ ExplainNode(PlanState *planstate, List *ancestors,
 				show_instrumentation_count("Rows Removed by Filter", 1,
 										   planstate, es);
 			break;
+		case T_Gather:
+			show_scan_qual(plan->qual, "Filter", planstate, ancestors, es);
+			if (plan->qual)
+				show_instrumentation_count("Rows Removed by Filter", 1,
+										   planstate, es);
+			ExplainPropertyInteger("Number of Workers",
+								   ((Gather *) plan)->num_workers, es);
+			break;
 		case T_FunctionScan:
 			if (es->verbose)
 			{
@@ -2335,6 +2348,7 @@ ExplainTargetRel(Plan *plan, Index rti, ExplainState *es)
 	{
 		case T_SeqScan:
 		case T_SampleScan:
+		case T_Gather:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 5492e59..750a59c 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -56,7 +56,7 @@ typedef struct
 static int	matview_maintenance_depth = 0;
 
 static void transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo);
-static void transientrel_receive(TupleTableSlot *slot, DestReceiver *self);
+static bool transientrel_receive(TupleTableSlot *slot, DestReceiver *self);
 static void transientrel_shutdown(DestReceiver *self);
 static void transientrel_destroy(DestReceiver *self);
 static void refresh_matview_datafill(DestReceiver *dest, Query *query,
@@ -422,7 +422,7 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
 /*
  * transientrel_receive --- receive one tuple
  */
-static void
+static bool
 transientrel_receive(TupleTableSlot *slot, DestReceiver *self)
 {
 	DR_transientrel *myState = (DR_transientrel *) self;
@@ -441,6 +441,8 @@ transientrel_receive(TupleTableSlot *slot, DestReceiver *self)
 				myState->bistate);
 
 	/* We know this is a newly created relation, so there are no indexes */
+
+	return true;
 }
 
 /*
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index f5e1e1a..51edd4c 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -17,8 +17,8 @@ OBJS = execAmi.o execCurrent.o execGrouping.o execIndexing.o execJunk.o \
        execScan.o execTuples.o \
        execUtils.o functions.o instrument.o nodeAppend.o nodeAgg.o \
        nodeBitmapAnd.o nodeBitmapOr.o \
-       nodeBitmapHeapscan.o nodeBitmapIndexscan.o nodeCustom.o nodeHash.o \
-       nodeHashjoin.o nodeIndexscan.o nodeIndexonlyscan.o \
+       nodeBitmapHeapscan.o nodeBitmapIndexscan.o nodeCustom.o nodeGather.o \
+       nodeHash.o nodeHashjoin.o nodeIndexscan.o nodeIndexonlyscan.o \
        nodeLimit.o nodeLockRows.o \
        nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
        nodeNestloop.o nodeFunctionscan.o nodeRecursiveunion.o nodeResult.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 93e1e9a..163650c 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -24,6 +24,7 @@
 #include "executor/nodeCustom.h"
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeFunctionscan.h"
+#include "executor/nodeGather.h"
 #include "executor/nodeGroup.h"
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
@@ -160,6 +161,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSampleScan((SampleScanState *) node);
 			break;
 
+		case T_GatherState:
+			ExecReScanGather((GatherState *) node);
+			break;
+
 		case T_IndexScanState:
 			ExecReScanIndexScan((IndexScanState *) node);
 			break;
@@ -467,6 +472,9 @@ ExecSupportsBackwardScan(Plan *node)
 			/* Simplify life for tablesample methods by disallowing this */
 			return false;
 
+		case T_Gather:
+			return false;
+
 		case T_IndexScan:
 			return IndexSupportsBackwardScan(((IndexScan *) node)->indexid) &&
 				TargetListSupportsBackwardScan(node->targetlist);
diff --git a/src/backend/executor/execCurrent.c b/src/backend/executor/execCurrent.c
index bcd287f..fd89204 100644
--- a/src/backend/executor/execCurrent.c
+++ b/src/backend/executor/execCurrent.c
@@ -262,6 +262,7 @@ search_plan_tree(PlanState *node, Oid table_oid)
 			 */
 		case T_SeqScanState:
 		case T_SampleScanState:
+		case T_GatherState:
 		case T_IndexScanState:
 		case T_IndexOnlyScanState:
 		case T_BitmapHeapScanState:
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 85ff46b..1dc2ad7 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -348,6 +348,14 @@ standard_ExecutorRun(QueryDesc *queryDesc,
 					dest);
 
 	/*
+	 * Accumulate stats and shutdown parallel workers before stopping node.
+	 * Though we already accumulate this information when last tuple is
+	 * fetched from parallel workers, this is to cover cases when we don't
+	 * fetch all tuples from a node such as for Limit node.
+	 */
+	(void) ExecShutdownNode(queryDesc->planstate);
+
+	/*
 	 * shutdown tuple receiver, if we started it
 	 */
 	if (sendTuples)
@@ -1581,7 +1589,15 @@ ExecutePlan(EState *estate,
 		 * practice, this is probably always the case at this point.)
 		 */
 		if (sendTuples)
-			(*dest->receiveSlot) (slot, dest);
+		{
+			/*
+			 * If we are not able to send the tuple, we assume the destination
+			 * has closed and no more tuples can be sent. If that's the case,
+			 * end the loop.
+			 */
+			if (!((*dest->receiveSlot) (slot, dest)))
+				break;
+		}
 
 		/*
 		 * Count tuples processed, if this is a SELECT.  (For other operation
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 03c2feb..578e28b 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -100,6 +100,7 @@
 #include "executor/nodeMergejoin.h"
 #include "executor/nodeModifyTable.h"
 #include "executor/nodeNestloop.h"
+#include "executor/nodeGather.h"
 #include "executor/nodeRecursiveunion.h"
 #include "executor/nodeResult.h"
 #include "executor/nodeSamplescan.h"
@@ -113,6 +114,7 @@
 #include "executor/nodeValuesscan.h"
 #include "executor/nodeWindowAgg.h"
 #include "executor/nodeWorktablescan.h"
+#include "nodes/nodeFuncs.h"
 #include "miscadmin.h"
 
 
@@ -196,6 +198,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 													  estate, eflags);
 			break;
 
+		case T_Gather:
+			result = (PlanState *) ExecInitGather((Gather *) node,
+												  estate, eflags);
+			break;
+
 		case T_IndexScan:
 			result = (PlanState *) ExecInitIndexScan((IndexScan *) node,
 													 estate, eflags);
@@ -416,6 +423,10 @@ ExecProcNode(PlanState *node)
 			result = ExecSampleScan((SampleScanState *) node);
 			break;
 
+		case T_GatherState:
+			result = ExecGather((GatherState *) node);
+			break;
+
 		case T_IndexScanState:
 			result = ExecIndexScan((IndexScanState *) node);
 			break;
@@ -658,6 +669,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSampleScan((SampleScanState *) node);
 			break;
 
+		case T_GatherState:
+			ExecEndGather((GatherState *) node);
+			break;
+
 		case T_IndexScanState:
 			ExecEndIndexScan((IndexScanState *) node);
 			break;
@@ -769,3 +784,30 @@ ExecEndNode(PlanState *node)
 			break;
 	}
 }
+
+/*
+ * ExecShutdownNode
+ *
+ * Recursively shutdown and accumulate the stats for all the Gather nodes
+ * in a plan state tree.
+ */
+bool
+ExecShutdownNode(PlanState *node)
+{
+	if (node == NULL)
+		return false;
+
+	switch (nodeTag(node))
+	{
+		case T_GatherState:
+			{
+				ExecShutdownGather((GatherState *) node);
+				return true;
+			}
+			break;
+		default:
+			break;
+	}
+
+	return planstate_tree_walker(node, ExecShutdownNode, NULL);
+}
diff --git a/src/backend/executor/execTuples.c b/src/backend/executor/execTuples.c
index a05d8b1..d5619bd 100644
--- a/src/backend/executor/execTuples.c
+++ b/src/backend/executor/execTuples.c
@@ -1313,7 +1313,7 @@ do_tup_output(TupOutputState *tstate, Datum *values, bool *isnull)
 	ExecStoreVirtualTuple(slot);
 
 	/* send the tuple to the receiver */
-	(*tstate->dest->receiveSlot) (slot, tstate->dest);
+	(void) (*tstate->dest->receiveSlot) (slot, tstate->dest);
 
 	/* clean up */
 	ExecClearTuple(slot);
diff --git a/src/backend/executor/functions.c b/src/backend/executor/functions.c
index 812a610..863bd64 100644
--- a/src/backend/executor/functions.c
+++ b/src/backend/executor/functions.c
@@ -167,7 +167,7 @@ static Datum postquel_get_single_result(TupleTableSlot *slot,
 static void sql_exec_error_callback(void *arg);
 static void ShutdownSQLFunction(Datum arg);
 static void sqlfunction_startup(DestReceiver *self, int operation, TupleDesc typeinfo);
-static void sqlfunction_receive(TupleTableSlot *slot, DestReceiver *self);
+static bool sqlfunction_receive(TupleTableSlot *slot, DestReceiver *self);
 static void sqlfunction_shutdown(DestReceiver *self);
 static void sqlfunction_destroy(DestReceiver *self);
 
@@ -1903,7 +1903,7 @@ sqlfunction_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
 /*
  * sqlfunction_receive --- receive one tuple
  */
-static void
+static bool
 sqlfunction_receive(TupleTableSlot *slot, DestReceiver *self)
 {
 	DR_sqlfunction *myState = (DR_sqlfunction *) self;
@@ -1913,6 +1913,8 @@ sqlfunction_receive(TupleTableSlot *slot, DestReceiver *self)
 
 	/* Store the filtered tuple into the tuplestore */
 	tuplestore_puttupleslot(myState->tstore, slot);
+
+	return true;
 }
 
 /*
diff --git a/src/backend/executor/nodeGather.c b/src/backend/executor/nodeGather.c
new file mode 100644
index 0000000..6ec3c86
--- /dev/null
+++ b/src/backend/executor/nodeGather.c
@@ -0,0 +1,309 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeGather.c
+ *	  Support routines for scanning a relation via multiple workers.
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodeGather.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * INTERFACE ROUTINES
+ *		ExecGather				scans a relation using worker backends.
+ *		ExecInitGather			creates and initializes a Gather node.
+ *		ExecEndGather			releases any storage allocated.
+ *		ExecReScanGather		Re-initialize the workers and rescans a relation via them.
+ */
+#include "postgres.h"
+
+#include "access/relscan.h"
+#include "executor/execdebug.h"
+#include "executor/execParallel.h"
+#include "executor/nodeGather.h"
+#include "executor/nodeSubplan.h"
+#include "utils/rel.h"
+
+
+static TupleTableSlot *gather_getnext(GatherState *gatherstate);
+
+
+/* ----------------------------------------------------------------
+ *		ExecInitGather
+ * ----------------------------------------------------------------
+ */
+GatherState *
+ExecInitGather(Gather *node, EState *estate, int eflags)
+{
+	GatherState *gatherstate;
+
+	/* Gather node doesn't have innerPlan node. */
+	Assert(innerPlan(node) == NULL);
+
+	/*
+	 * create state structure
+	 */
+	gatherstate = makeNode(GatherState);
+	gatherstate->ps.plan = (Plan *) node;
+	gatherstate->ps.state = estate;
+	gatherstate->any_worker_launched = false;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * create expression context for node
+	 */
+	ExecAssignExprContext(estate, &gatherstate->ps);
+
+	/*
+	 * initialize child expressions
+	 */
+	gatherstate->ps.targetlist = (List *)
+		ExecInitExpr((Expr *) node->plan.targetlist,
+					 (PlanState *) gatherstate);
+	gatherstate->ps.qual = (List *)
+		ExecInitExpr((Expr *) node->plan.qual,
+					 (PlanState *) gatherstate);
+
+	/*
+	 * tuple table initialization
+	 */
+	ExecInitResultTupleSlot(estate, &gatherstate->ps);
+
+	/*
+	 * now initialize outer plan
+	 */
+	outerPlanState(gatherstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+
+	gatherstate->ps.ps_TupFromTlist = false;
+
+	/*
+	 * Initialize result tuple type and projection info.
+	 */
+	ExecAssignResultTypeFromTL(&gatherstate->ps);
+	ExecAssignProjectionInfo(&gatherstate->ps, NULL);
+
+	return gatherstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecGather(node)
+ *
+ *		Scans the relation via multiple workers and returns
+ *		the next qualifying tuple.
+ * ----------------------------------------------------------------
+ */
+TupleTableSlot *
+ExecGather(GatherState *node)
+{
+	int			i;
+	TupleTableSlot *slot;
+
+	/*
+	 * Initialize the parallel context and workers on first execution. We do
+	 * this on first execution rather than during node initialization, as it
+	 * needs to allocate large dynamic segement, so it is better to do if it
+	 * is really needed.
+	 */
+	if (!node->pei->pcxt)
+	{
+		EState	   *estate = node->ps.state;
+		bool		any_worker_launched = false;
+
+		/* Initialize the workers required to execute Gather node. */
+		node->pei = ExecInitParallelPlan(node->ps.lefttree,
+										 estate,
+								  ((Gather *) (node->ps.plan))->num_workers);
+
+		outerPlanState(node)->toc = node->pei->pcxt->toc;
+
+		/*
+		 * Register backend workers. If the required number of workers are not
+		 * available then we perform the scan with available workers and if
+		 * there are no more workers available, then the Gather node will just
+		 * scan locally.
+		 */
+		LaunchParallelWorkers(node->pei->pcxt);
+
+		node->funnel = CreateTupleQueueFunnel();
+
+		for (i = 0; i < node->pei->pcxt->nworkers; ++i)
+		{
+			if (node->pei->pcxt->worker[i].bgwhandle)
+			{
+				shm_mq_set_handle((node->pei->tqueue)[i], node->pei->pcxt->worker[i].bgwhandle);
+				RegisterTupleQueueOnFunnel(node->funnel, (node->pei->tqueue)[i]);
+				any_worker_launched = true;
+			}
+		}
+
+		if (any_worker_launched)
+			node->any_worker_launched = true;
+	}
+
+	slot = gather_getnext(node);
+
+	if (TupIsNull(slot))
+	{
+		/*
+		 * Destroy the parallel context once we complete fetching all the
+		 * tuples, this will ensure that if in the same statement we need to
+		 * have Gather node for multiple parts of statement, it won't
+		 * accumulate lot of dsm segments and workers can be made available to
+		 * use by other parts of statement.
+		 */
+		ExecShutdownGather(node);
+	}
+	return slot;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndGather
+ *
+ *		frees any storage allocated through C routines.
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndGather(GatherState *node)
+{
+	/*
+	 * Free the exprcontext
+	 */
+	ExecFreeExprContext(&node->ps);
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ps.ps_ResultTupleSlot);
+
+	ExecEndNode(outerPlanState(node));
+
+	ExecShutdownGather(node);
+}
+
+/*
+ * gather_getnext
+ *
+ * Get the next tuple from shared memory queue.  This function
+ * is reponsible for fetching tuples from all the queues associated
+ * with worker backends used in Gather node execution and if there is
+ * no data available from queues or no worker is available, it does
+ * fetch the data from local node.
+ */
+TupleTableSlot *
+gather_getnext(GatherState *gatherstate)
+{
+	PlanState  *outerPlan;
+	TupleTableSlot *outerTupleSlot;
+	TupleTableSlot *slot;
+	HeapTuple	tup;
+
+	/*
+	 * We can use projection info of Gather for the tuples received from
+	 * worker backends as currently for all cases worker backends sends the
+	 * projected tuple as required by Gather node.
+	 */
+	slot = gatherstate->ps.ps_ProjInfo->pi_slot;
+
+	while ((!gatherstate->all_workers_done &&
+			gatherstate->any_worker_launched) ||
+		   !gatherstate->local_scan_done)
+	{
+		if (!gatherstate->all_workers_done && gatherstate->any_worker_launched)
+		{
+			/* wait only if local scan is done */
+			tup = TupleQueueFunnelNext(gatherstate->funnel,
+									   !gatherstate->local_scan_done,
+									   &gatherstate->all_workers_done);
+
+			if (HeapTupleIsValid(tup))
+			{
+				ExecStoreTuple(tup,		/* tuple to store */
+							   slot,	/* slot to store in */
+							   InvalidBuffer,	/* buffer associated with this
+												 * tuple */
+							   true);	/* pfree this pointer if not from heap */
+
+				return slot;
+			}
+		}
+		if (!gatherstate->local_scan_done)
+		{
+			outerPlan = outerPlanState(gatherstate);
+
+			outerTupleSlot = ExecProcNode(outerPlan);
+
+			if (!TupIsNull(outerTupleSlot))
+				return outerTupleSlot;
+
+			gatherstate->local_scan_done = true;
+		}
+	}
+
+	return ExecClearTuple(slot);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecShutdownGather
+ *
+ *		Destroy the setup for parallel workers.  Collect all the
+ *		stats after workers are stopped, else some work done by
+ *		workers won't be accounted.
+ * ----------------------------------------------------------------
+ */
+void
+ExecShutdownGather(GatherState *node)
+{
+	if (node->pei == NULL || node->pei->pcxt == NULL)
+		return;
+
+	/*
+	 * Ensure all workers have finished before destroying the parallel context
+	 * to ensure a clean exit.
+	 */
+	if (node->any_worker_launched)
+	{
+		DestroyTupleQueueFunnel(node->funnel);
+		node->funnel = NULL;
+	}
+
+	ExecParallelFinish(node->pei);
+
+	/* destroy parallel context. */
+	DestroyParallelContext(node->pei->pcxt);
+	node->pei->pcxt = NULL;
+
+	node->any_worker_launched = false;
+	node->all_workers_done = false;
+	node->local_scan_done = false;
+}
+
+/* ----------------------------------------------------------------
+ *						Join Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecReScanGather
+ *
+ *		Re-initialize the workers and rescan a relation via them.
+ * ----------------------------------------------------------------
+ */
+void
+ExecReScanGather(GatherState *node)
+{
+	/*
+	 * Re-initialize the parallel context and workers to perform rescan of
+	 * relation.  We want to gracefully shutdown all the workers so that they
+	 * should be able to propagate any error or other information to master
+	 * backend before dying.
+	 */
+	ExecShutdownGather(node);
+
+	ExecReScan(node->ps.lefttree);
+}
diff --git a/src/backend/executor/spi.c b/src/backend/executor/spi.c
index 300401e..a60f228 100644
--- a/src/backend/executor/spi.c
+++ b/src/backend/executor/spi.c
@@ -1774,7 +1774,7 @@ spi_dest_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
  *		store tuple retrieved by Executor into SPITupleTable
  *		of current SPI procedure
  */
-void
+bool
 spi_printtup(TupleTableSlot *slot, DestReceiver *self)
 {
 	SPITupleTable *tuptable;
@@ -1809,6 +1809,8 @@ spi_printtup(TupleTableSlot *slot, DestReceiver *self)
 	(tuptable->free)--;
 
 	MemoryContextSwitchTo(oldcxt);
+
+	return true;
 }
 
 /*
diff --git a/src/backend/executor/tqueue.c b/src/backend/executor/tqueue.c
index 67143d3..28ba49f 100644
--- a/src/backend/executor/tqueue.c
+++ b/src/backend/executor/tqueue.c
@@ -41,14 +41,24 @@ struct TupleQueueFunnel
 /*
  * Receive a tuple.
  */
-static void
+static bool
 tqueueReceiveSlot(TupleTableSlot *slot, DestReceiver *self)
 {
 	TQueueDestReceiver *tqueue = (TQueueDestReceiver *) self;
 	HeapTuple	tuple;
+	shm_mq_result result;
 
 	tuple = ExecMaterializeSlot(slot);
-	shm_mq_send(tqueue->handle, tuple->t_len, tuple->t_data, false);
+	result = shm_mq_send(tqueue->handle, tuple->t_len, tuple->t_data, false);
+
+	if (result == SHM_MQ_DETACHED)
+		return false;
+	else if (result != SHM_MQ_SUCCESS)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("unable to send tuples")));
+
+	return true;
 }
 
 /*
diff --git a/src/backend/executor/tstoreReceiver.c b/src/backend/executor/tstoreReceiver.c
index c1fdeb7..b0862ae 100644
--- a/src/backend/executor/tstoreReceiver.c
+++ b/src/backend/executor/tstoreReceiver.c
@@ -37,8 +37,8 @@ typedef struct
 } TStoreState;
 
 
-static void tstoreReceiveSlot_notoast(TupleTableSlot *slot, DestReceiver *self);
-static void tstoreReceiveSlot_detoast(TupleTableSlot *slot, DestReceiver *self);
+static bool tstoreReceiveSlot_notoast(TupleTableSlot *slot, DestReceiver *self);
+static bool tstoreReceiveSlot_detoast(TupleTableSlot *slot, DestReceiver *self);
 
 
 /*
@@ -90,19 +90,21 @@ tstoreStartupReceiver(DestReceiver *self, int operation, TupleDesc typeinfo)
  * Receive a tuple from the executor and store it in the tuplestore.
  * This is for the easy case where we don't have to detoast.
  */
-static void
+static bool
 tstoreReceiveSlot_notoast(TupleTableSlot *slot, DestReceiver *self)
 {
 	TStoreState *myState = (TStoreState *) self;
 
 	tuplestore_puttupleslot(myState->tstore, slot);
+
+	return true;
 }
 
 /*
  * Receive a tuple from the executor and store it in the tuplestore.
  * This is for the case where we have to detoast any toasted values.
  */
-static void
+static bool
 tstoreReceiveSlot_detoast(TupleTableSlot *slot, DestReceiver *self)
 {
 	TStoreState *myState = (TStoreState *) self;
@@ -152,6 +154,8 @@ tstoreReceiveSlot_detoast(TupleTableSlot *slot, DestReceiver *self)
 	/* And release any temporary detoasted values */
 	for (i = 0; i < nfree; i++)
 		pfree(DatumGetPointer(myState->tofree[i]));
+
+	return true;
 }
 
 /*
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 4b4ddec..085e6bd 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -314,6 +314,27 @@ _copyBitmapOr(const BitmapOr *from)
 	return newnode;
 }
 
+/*
+ * _copyGather
+ */
+static Gather *
+_copyGather(const Gather *from)
+{
+	Gather	   *newnode = makeNode(Gather);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+	/*
+	 * copy remainder of node
+	 */
+	COPY_SCALAR_FIELD(num_workers);
+
+	return newnode;
+}
+
 
 /*
  * CopyScanFields
@@ -4235,6 +4256,9 @@ copyObject(const void *from)
 		case T_Scan:
 			retval = _copyScan(from);
 			break;
+		case T_Gather:
+			retval = _copyGather(from);
+			break;
 		case T_SeqScan:
 			retval = _copySeqScan(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index ee9c360..0b5600e 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -433,6 +433,16 @@ _outBitmapOr(StringInfo str, const BitmapOr *node)
 }
 
 static void
+_outGather(StringInfo str, const Gather *node)
+{
+	WRITE_NODE_TYPE("GATHER");
+
+	_outPlanInfo(str, (const Plan *) node);
+
+	WRITE_UINT_FIELD(num_workers);
+}
+
+static void
 _outScan(StringInfo str, const Scan *node)
 {
 	WRITE_NODE_TYPE("SCAN");
@@ -3000,6 +3010,9 @@ _outNode(StringInfo str, const void *obj)
 			case T_BitmapOr:
 				_outBitmapOr(str, obj);
 				break;
+			case T_Gather:
+				_outGather(str, obj);
+				break;
 			case T_Scan:
 				_outScan(str, obj);
 				break;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index d107d76..1b61fd9 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -11,6 +11,8 @@
  *	cpu_tuple_cost		Cost of typical CPU time to process a tuple
  *	cpu_index_tuple_cost  Cost of typical CPU time to process an index tuple
  *	cpu_operator_cost	Cost of CPU time to execute an operator or function
+ *	parallel_tuple_cost Cost of CPU time to pass a tuple from worker to master backend
+ *	parallel_setup_cost Cost of setting up shared memory for parallelism
  *
  * We expect that the kernel will typically do some amount of read-ahead
  * optimization; this in conjunction with seek costs means that seq_page_cost
@@ -102,11 +104,15 @@ double		random_page_cost = DEFAULT_RANDOM_PAGE_COST;
 double		cpu_tuple_cost = DEFAULT_CPU_TUPLE_COST;
 double		cpu_index_tuple_cost = DEFAULT_CPU_INDEX_TUPLE_COST;
 double		cpu_operator_cost = DEFAULT_CPU_OPERATOR_COST;
+double		parallel_tuple_cost = DEFAULT_PARALLEL_TUPLE_COST;
+double		parallel_setup_cost = DEFAULT_PARALLEL_SETUP_COST;
 
 int			effective_cache_size = DEFAULT_EFFECTIVE_CACHE_SIZE;
 
 Cost		disable_cost = 1.0e10;
 
+int			max_parallel_degree = 0;
+
 bool		enable_seqscan = true;
 bool		enable_indexscan = true;
 bool		enable_indexonlyscan = true;
@@ -290,6 +296,38 @@ cost_samplescan(Path *path, PlannerInfo *root,
 }
 
 /*
+ * cost_gather
+ *	  Determines and returns the cost of gather path.
+ *
+ * 'rel' is the relation to be operated upon
+ * 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
+ */
+void
+cost_gather(GatherPath *path, PlannerInfo *root,
+			RelOptInfo *rel, ParamPathInfo *param_info)
+{
+	Cost		startup_cost = 0;
+	Cost		run_cost = 0;
+
+	/* Mark the path with the correct row estimate */
+	if (param_info)
+		path->path.rows = param_info->ppi_rows;
+	else
+		path->path.rows = rel->rows;
+
+	startup_cost = path->subpath->startup_cost;
+
+	run_cost = path->subpath->total_cost - path->subpath->startup_cost;
+
+	/* Parallel setup and communication cost. */
+	startup_cost += parallel_setup_cost;
+	run_cost += parallel_tuple_cost * rel->tuples;
+
+	path->path.startup_cost = startup_cost;
+	path->path.total_cost = (startup_cost + run_cost);
+}
+
+/*
  * cost_index
  *	  Determines and returns the cost of scanning a relation using an index.
  *
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 404c6f5..6a1c7fc 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -60,6 +60,8 @@ static SeqScan *create_seqscan_plan(PlannerInfo *root, Path *best_path,
 					List *tlist, List *scan_clauses);
 static SampleScan *create_samplescan_plan(PlannerInfo *root, Path *best_path,
 					   List *tlist, List *scan_clauses);
+static Gather *create_gather_plan(PlannerInfo *root,
+				   GatherPath *best_path);
 static Scan *create_indexscan_plan(PlannerInfo *root, IndexPath *best_path,
 					  List *tlist, List *scan_clauses, bool indexonly);
 static BitmapHeapScan *create_bitmap_scan_plan(PlannerInfo *root,
@@ -104,6 +106,9 @@ static void copy_plan_costsize(Plan *dest, Plan *src);
 static SeqScan *make_seqscan(List *qptlist, List *qpqual, Index scanrelid);
 static SampleScan *make_samplescan(List *qptlist, List *qpqual, Index scanrelid,
 				TableSampleClause *tsc);
+static Gather *make_gather(List *qptlist, List *qpqual,
+			Index scanrelid, int nworkers,
+			Plan *subplan);
 static IndexScan *make_indexscan(List *qptlist, List *qpqual, Index scanrelid,
 			   Oid indexid, List *indexqual, List *indexqualorig,
 			   List *indexorderby, List *indexorderbyorig,
@@ -273,6 +278,10 @@ create_plan_recurse(PlannerInfo *root, Path *best_path)
 			plan = create_unique_plan(root,
 									  (UniquePath *) best_path);
 			break;
+		case T_Gather:
+			plan = (Plan *) create_gather_plan(root,
+											   (GatherPath *) best_path);
+			break;
 		default:
 			elog(ERROR, "unrecognized node type: %d",
 				 (int) best_path->pathtype);
@@ -560,6 +569,7 @@ disuse_physical_tlist(PlannerInfo *root, Plan *plan, Path *path)
 	{
 		case T_SeqScan:
 		case T_SampleScan:
+		case T_Gather:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -1194,6 +1204,62 @@ create_samplescan_plan(PlannerInfo *root, Path *best_path,
 }
 
 /*
+ * create_gather_plan
+ *
+ * Returns a gather plan for the base relation scanned by
+ * 'best_path'.
+ */
+static Gather *
+create_gather_plan(PlannerInfo *root, GatherPath *best_path)
+{
+	Gather	   *gather_plan;
+	Plan	   *subplan;
+	List	   *tlist;
+	RelOptInfo *rel = best_path->path.parent;
+	Index		scan_relid = best_path->path.parent->relid;
+
+	/*
+	 * For table scans, rather than using the relation targetlist (which is
+	 * only those Vars actually needed by the query), we prefer to generate a
+	 * tlist containing all Vars in order.  This will allow the executor to
+	 * optimize away projection of the table tuples, if possible.  (Note that
+	 * planner.c may replace the tlist we generate here, forcing projection to
+	 * occur.)
+	 */
+	if (use_physical_tlist(root, rel))
+	{
+		tlist = build_physical_tlist(root, rel);
+		/* if fail because of dropped cols, use regular method */
+		if (tlist == NIL)
+			tlist = build_path_tlist(root, &best_path->path);
+	}
+	else
+	{
+		tlist = build_path_tlist(root, &best_path->path);
+	}
+
+	subplan = create_plan_recurse(root, best_path->subpath);
+
+	/*
+	 * quals for subplan and top level plan are same as either all the quals
+	 * are pushed to subplan (partialseqscan plan) or parallel plan won't be
+	 * chosen.
+	 */
+	gather_plan = make_gather(tlist,
+							  subplan->qual,
+							  scan_relid,
+							  best_path->num_workers,
+							  subplan);
+
+	copy_path_costsize(&gather_plan->plan, &best_path->path);
+
+	/* use parallel mode for parallel plans. */
+	root->glob->parallelModeNeeded = true;
+
+	return gather_plan;
+}
+
+/*
  * create_indexscan_plan
  *	  Returns an indexscan plan for the base relation scanned by 'best_path'
  *	  with restriction clauses 'scan_clauses' and targetlist 'tlist'.
@@ -3462,6 +3528,26 @@ make_samplescan(List *qptlist,
 	return node;
 }
 
+static Gather *
+make_gather(List *qptlist,
+			List *qpqual,
+			Index scanrelid,
+			int nworkers,
+			Plan *subplan)
+{
+	Gather	   *node = makeNode(Gather);
+	Plan	   *plan = &node->plan;
+
+	/* cost should be inserted by caller */
+	plan->targetlist = qptlist;
+	plan->qual = qpqual;
+	plan->lefttree = subplan;
+	plan->righttree = NULL;
+	node->num_workers = nworkers;
+
+	return node;
+}
+
 static IndexScan *
 make_indexscan(List *qptlist,
 			   List *qpqual,
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 3c81697..9ea4007 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -470,6 +470,25 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 					fix_scan_expr(root, (Node *) splan->tablesample, rtoffset);
 			}
 			break;
+		case T_Gather:
+			{
+				Gather	   *splan = (Gather *) plan;
+
+				/*
+				 * target list for lefttree of gather plan should be same as
+				 * for gather scan as both nodes need to produce same
+				 * projection. We don't want to do this assignment after
+				 * fixing references as that will be done separately for
+				 * lefttree node.
+				 */
+				splan->plan.lefttree->targetlist = splan->plan.targetlist;
+
+				splan->plan.targetlist =
+					fix_scan_list(root, splan->plan.targetlist, rtoffset);
+				splan->plan.qual =
+					fix_scan_list(root, splan->plan.qual, rtoffset);
+			}
+			break;
 		case T_IndexScan:
 			{
 				IndexScan  *splan = (IndexScan *) plan;
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index d0bc412..78f3ce1 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2243,6 +2243,10 @@ finalize_plan(PlannerInfo *root, Plan *plan, Bitmapset *valid_params,
 			context.paramids = bms_add_members(context.paramids, scan_params);
 			break;
 
+		case T_Gather:
+			context.paramids = bms_add_members(context.paramids, scan_params);
+			break;
+
 		case T_IndexScan:
 			finalize_primnode((Node *) ((IndexScan *) plan)->indexqual,
 							  &context);
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 4336ca1..e3f3539 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -732,6 +732,32 @@ create_samplescan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer
 }
 
 /*
+ * create_gather_path
+ *
+ *	  Creates a path corresponding to a gather scan, returning the
+ *	  pathnode.
+ */
+GatherPath *
+create_gather_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
+				   Relids required_outer, int nworkers)
+{
+	GatherPath *pathnode = makeNode(GatherPath);
+
+	pathnode->path.pathtype = T_Gather;
+	pathnode->path.parent = rel;
+	pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
+														  required_outer);
+	pathnode->path.pathkeys = NIL;		/* Gather has unordered result */
+
+	pathnode->subpath = subpath;
+	pathnode->num_workers = nworkers;
+
+	cost_gather(pathnode, root, rel, pathnode->path.param_info);
+
+	return pathnode;
+}
+
+/*
  * create_index_path
  *	  Creates a path node for an index scan.
  *
diff --git a/src/backend/tcop/dest.c b/src/backend/tcop/dest.c
index d645751..bed1ef2 100644
--- a/src/backend/tcop/dest.c
+++ b/src/backend/tcop/dest.c
@@ -45,9 +45,10 @@
  *		dummy DestReceiver functions
  * ----------------
  */
-static void
+static bool
 donothingReceive(TupleTableSlot *slot, DestReceiver *self)
 {
+	return true;
 }
 
 static void
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 0df86a2..5eab231 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1117,7 +1117,13 @@ RunFromStore(Portal portal, ScanDirection direction, long count,
 			if (!ok)
 				break;
 
-			(*dest->receiveSlot) (slot, dest);
+			/*
+			 * If we are not able to send the tuple, we assume the destination
+			 * has closed and no more tuples can be sent. If that's the case,
+			 * end the loop.
+			 */
+			if (!((*dest->receiveSlot) (slot, dest)))
+				break;
 
 			ExecClearTuple(slot);
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 17053af..4b03e9e 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2535,6 +2535,16 @@ static struct config_int ConfigureNamesInt[] =
 	},
 
 	{
+		{"max_parallel_degree", PGC_SUSET, RESOURCES_ASYNCHRONOUS,
+			gettext_noop("Sets the maximum number of simultaneously running backend worker processes."),
+			NULL
+		},
+		&max_parallel_degree,
+		0, 0, MAX_BACKENDS,
+		NULL, NULL, NULL
+	},
+
+	{
 		{"autovacuum_work_mem", PGC_SIGHUP, RESOURCES_MEM,
 			gettext_noop("Sets the maximum memory to be used by each autovacuum worker process."),
 			NULL,
@@ -2711,6 +2721,26 @@ static struct config_real ConfigureNamesReal[] =
 		DEFAULT_CPU_OPERATOR_COST, 0, DBL_MAX,
 		NULL, NULL, NULL
 	},
+	{
+		{"parallel_tuple_cost", PGC_USERSET, QUERY_TUNING_COST,
+			gettext_noop("Sets the planner's estimate of the cost of "
+				  "passing each tuple (row) from worker to master backend."),
+			NULL
+		},
+		&parallel_tuple_cost,
+		DEFAULT_PARALLEL_TUPLE_COST, 0, DBL_MAX,
+		NULL, NULL, NULL
+	},
+	{
+		{"parallel_setup_cost", PGC_USERSET, QUERY_TUNING_COST,
+			gettext_noop("Sets the planner's estimate of the cost of "
+				  "setting up environment (shared memory) for parallelism."),
+			NULL
+		},
+		&parallel_setup_cost,
+		DEFAULT_PARALLEL_SETUP_COST, 0, DBL_MAX,
+		NULL, NULL, NULL
+	},
 
 	{
 		{"cursor_tuple_fraction", PGC_USERSET, QUERY_TUNING_OTHER,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 8c65287..e7f1250 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -164,6 +164,7 @@
 
 #effective_io_concurrency = 1		# 1-1000; 0 disables prefetching
 #max_worker_processes = 8
+#max_parallel_degree = 0		# max number of worker backend subprocesses
 
 
 #------------------------------------------------------------------------------
@@ -290,6 +291,8 @@
 #cpu_tuple_cost = 0.01			# same scale as above
 #cpu_index_tuple_cost = 0.005		# same scale as above
 #cpu_operator_cost = 0.0025		# same scale as above
+#parallel_tuple_cost = 0.1		# same scale as above
+#parallel_setup_cost = 1000.0	# same scale as above
 #effective_cache_size = 4GB
 
 # - Genetic Query Optimizer -
diff --git a/src/include/access/printtup.h b/src/include/access/printtup.h
index 46c4148..92ec882 100644
--- a/src/include/access/printtup.h
+++ b/src/include/access/printtup.h
@@ -25,11 +25,11 @@ extern void SendRowDescriptionMessage(TupleDesc typeinfo, List *targetlist,
 
 extern void debugStartup(DestReceiver *self, int operation,
 			 TupleDesc typeinfo);
-extern void debugtup(TupleTableSlot *slot, DestReceiver *self);
+extern bool debugtup(TupleTableSlot *slot, DestReceiver *self);
 
 /* XXX these are really in executor/spi.c */
 extern void spi_dest_startup(DestReceiver *self, int operation,
 				 TupleDesc typeinfo);
-extern void spi_printtup(TupleTableSlot *slot, DestReceiver *self);
+extern bool spi_printtup(TupleTableSlot *slot, DestReceiver *self);
 
 #endif   /* PRINTTUP_H */
diff --git a/src/include/executor/execParallel.h b/src/include/executor/execParallel.h
index 4fc797a..fd1142f 100644
--- a/src/include/executor/execParallel.h
+++ b/src/include/executor/execParallel.h
@@ -13,21 +13,18 @@
 #ifndef EXECPARALLEL_H
 #define EXECPARALLEL_H
 
-#include "access/parallel.h"
 #include "nodes/execnodes.h"
-#include "nodes/parsenodes.h"
-#include "nodes/plannodes.h"
 
 typedef struct SharedExecutorInstrumentation SharedExecutorInstrumentation;
 
 typedef struct ParallelExecutorInfo
 {
-	PlanState *planstate;
+	PlanState  *planstate;
 	ParallelContext *pcxt;
 	BufferUsage *buffer_usage;
 	SharedExecutorInstrumentation *instrumentation;
 	shm_mq_handle **tqueue;
-}	ParallelExecutorInfo;
+} ParallelExecutorInfo;
 
 extern ParallelExecutorInfo *ExecInitParallelPlan(PlanState *planstate,
 					 EState *estate, int nworkers);
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 226f905..4f77692 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -225,6 +225,7 @@ extern PlanState *ExecInitNode(Plan *node, EState *estate, int eflags);
 extern TupleTableSlot *ExecProcNode(PlanState *node);
 extern Node *MultiExecProcNode(PlanState *node);
 extern void ExecEndNode(PlanState *node);
+extern bool ExecShutdownNode(PlanState *node);
 
 /*
  * prototypes from functions in execQual.c
diff --git a/src/include/executor/nodeGather.h b/src/include/executor/nodeGather.h
new file mode 100644
index 0000000..9e5d8fc
--- /dev/null
+++ b/src/include/executor/nodeGather.h
@@ -0,0 +1,25 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeGather.h
+ *		prototypes for nodeGather.c
+ *
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeGather.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEGATHER_H
+#define NODEGATHER_H
+
+#include "nodes/execnodes.h"
+
+extern GatherState *ExecInitGather(Gather *node, EState *estate, int eflags);
+extern TupleTableSlot *ExecGather(GatherState *node);
+extern void ExecEndGather(GatherState *node);
+extern void ExecShutdownGather(GatherState *node);
+extern void ExecReScanGather(GatherState *node);
+
+#endif   /* NODEGATHER_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 4ae2f3e..a823f29 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -16,7 +16,9 @@
 
 #include "access/genam.h"
 #include "access/heapam.h"
+#include "access/parallel.h"
 #include "executor/instrument.h"
+#include "executor/tqueue.h"
 #include "lib/pairingheap.h"
 #include "nodes/params.h"
 #include "nodes/plannodes.h"
@@ -1049,6 +1051,13 @@ typedef struct PlanState
 	Bitmapset  *chgParam;		/* set of IDs of changed Params */
 
 	/*
+	 * At execution time, parallel scan descriptor is initialized and stored
+	 * in dynamic shared memory segment by master backend and parallel workers
+	 * retrieve it from shared memory.
+	 */
+	shm_toc    *toc;
+
+	/*
 	 * Other run-time state needed by most if not all node types.
 	 */
 	TupleTableSlot *ps_ResultTupleSlot; /* slot for my result tuples */
@@ -1221,6 +1230,29 @@ typedef struct BitmapOrState
 	int			nplans;			/* number of input plans */
 } BitmapOrState;
 
+/* ----------------
+ * GatherState extends PlanState by storing additional information
+ * related to parallel workers.
+ *
+ *		pei					parallel execution info for managing generic state information
+ *							required for parallelism.
+ *		funnel				maintains the runtime information about queues used to
+ *							receive data from parallel workers.
+ *		any_worker_launched indicates that workers are launched.
+ *		all_workers_done	indicates that all the data from workers has been received.
+ *		local_scan_done		indicates that local scan is completed.
+ * ----------------
+ */
+typedef struct GatherState
+{
+	PlanState	ps;				/* its first field is NodeTag */
+	struct ParallelExecutorInfo *pei;
+	TupleQueueFunnel *funnel;
+	bool		any_worker_launched;
+	bool		all_workers_done;
+	bool		local_scan_done;
+} GatherState;
+
 /* ----------------------------------------------------------------
  *				 Scan State Information
  * ----------------------------------------------------------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 274480e..c014532 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -52,6 +52,7 @@ typedef enum NodeTag
 	T_Scan,
 	T_SeqScan,
 	T_SampleScan,
+	T_Gather,
 	T_IndexScan,
 	T_IndexOnlyScan,
 	T_BitmapIndexScan,
@@ -99,6 +100,7 @@ typedef enum NodeTag
 	T_ScanState,
 	T_SeqScanState,
 	T_SampleScanState,
+	T_GatherState,
 	T_IndexScanState,
 	T_IndexOnlyScanState,
 	T_BitmapIndexScanState,
@@ -223,6 +225,7 @@ typedef enum NodeTag
 	T_IndexOptInfo,
 	T_ParamPathInfo,
 	T_Path,
+	T_GatherPath,
 	T_IndexPath,
 	T_BitmapHeapPath,
 	T_BitmapAndPath,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 1e2d2bb..844e0f8 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -269,6 +269,16 @@ typedef struct BitmapOr
 	List	   *bitmapplans;
 } BitmapOr;
 
+/* ------------
+ *		Gather node
+ * ------------
+ */
+typedef struct Gather
+{
+	Plan		plan;
+	int			num_workers;
+} Gather;
+
 /*
  * ==========
  * Scan nodes
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 961b5d1..20192c9 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -763,6 +763,13 @@ typedef struct Path
 	/* pathkeys is a List of PathKey nodes; see above */
 } Path;
 
+typedef struct GatherPath
+{
+	Path		path;
+	Path	   *subpath;		/* path for each worker */
+	int			num_workers;
+} GatherPath;
+
 /* Macro for extracting a path's parameterization relids; beware double eval */
 #define PATH_REQ_OUTER(path)  \
 	((path)->param_info ? (path)->param_info->ppi_req_outer : (Relids) NULL)
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index dd43e45..b36fbc1 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -26,6 +26,13 @@
 #define DEFAULT_CPU_TUPLE_COST	0.01
 #define DEFAULT_CPU_INDEX_TUPLE_COST 0.005
 #define DEFAULT_CPU_OPERATOR_COST  0.0025
+#define DEFAULT_PARALLEL_TUPLE_COST 0.1
+/*
+ * XXX - We have kept reasonably high value for default parallel
+ * setup cost. In future we might want to change this value based
+ * on results.
+ */
+#define DEFAULT_PARALLEL_SETUP_COST  1000.0
 
 #define DEFAULT_EFFECTIVE_CACHE_SIZE  524288	/* measured in pages */
 
@@ -48,8 +55,11 @@ extern PGDLLIMPORT double random_page_cost;
 extern PGDLLIMPORT double cpu_tuple_cost;
 extern PGDLLIMPORT double cpu_index_tuple_cost;
 extern PGDLLIMPORT double cpu_operator_cost;
+extern PGDLLIMPORT double parallel_tuple_cost;
+extern PGDLLIMPORT double parallel_setup_cost;
 extern PGDLLIMPORT int effective_cache_size;
 extern Cost disable_cost;
+extern int	max_parallel_degree;
 extern bool enable_seqscan;
 extern bool enable_indexscan;
 extern bool enable_indexonlyscan;
@@ -70,6 +80,8 @@ extern void cost_seqscan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 			 ParamPathInfo *param_info);
 extern void cost_samplescan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 				ParamPathInfo *param_info);
+extern void cost_gather(GatherPath *path, PlannerInfo *root,
+			RelOptInfo *baserel, ParamPathInfo *param_info);
 extern void cost_index(IndexPath *path, PlannerInfo *root,
 		   double loop_count);
 extern void cost_bitmap_heap_scan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 161644c..cc00ba5 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -34,6 +34,9 @@ extern Path *create_seqscan_path(PlannerInfo *root, RelOptInfo *rel,
 					Relids required_outer);
 extern Path *create_samplescan_path(PlannerInfo *root, RelOptInfo *rel,
 					   Relids required_outer);
+extern GatherPath *create_gather_path(PlannerInfo *root,
+				   RelOptInfo *rel, Path *subpath, Relids required_outer,
+				   int nworkers);
 extern IndexPath *create_index_path(PlannerInfo *root,
 				  IndexOptInfo *index,
 				  List *indexclauses,
diff --git a/src/include/tcop/dest.h b/src/include/tcop/dest.h
index b560672..91acd60 100644
--- a/src/include/tcop/dest.h
+++ b/src/include/tcop/dest.h
@@ -104,7 +104,9 @@ typedef enum
  *		pointers that the executor must call.
  *
  * Note: the receiveSlot routine must be passed a slot containing a TupleDesc
- * identical to the one given to the rStartup routine.
+ * identical to the one given to the rStartup routine.  It returns bool where
+ * a "true" value means "continue processing" and a "false" value means
+ * "stop early, just as if we'd reached the end of the scan".
  * ----------------
  */
 typedef struct _DestReceiver DestReceiver;
@@ -112,7 +114,7 @@ typedef struct _DestReceiver DestReceiver;
 struct _DestReceiver
 {
 	/* Called for each tuple to be output: */
-	void		(*receiveSlot) (TupleTableSlot *slot,
+	bool		(*receiveSlot) (TupleTableSlot *slot,
 											DestReceiver *self);
 	/* Per-executor-run initialization and shutdown: */
 	void		(*rStartup) (DestReceiver *self,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 0e149ea..feb821b 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -707,6 +707,9 @@ FunctionParameterMode
 FunctionScan
 FunctionScanPerFuncState
 FunctionScanState
+Gather
+GatherPath
+GatherState
 FuzzyAttrMatchState
 GBT_NUMKEY
 GBT_NUMKEY_R
@@ -1195,6 +1198,7 @@ OverrideSearchPath
 OverrideStackEntry
 PACE_HEADER
 PACL
+ParallelExecutorInfo
 PATH
 PBOOL
 PCtxtHandle
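
For readers following the costsize.c hunk in the patch above, the arithmetic
of cost_gather() can be sketched in isolation. This is only a minimal sketch
and not the patch's code: SketchPath and sketch_cost_gather are hypothetical
stand-ins for the planner's Path and cost_gather(), and the two constants are
the defaults the patch adds to cost.h.

```c
#include <assert.h>

/* Hypothetical stand-ins for the planner's types; not the PostgreSQL structs. */
typedef double Cost;

typedef struct SketchPath
{
	Cost		startup_cost;	/* cost before the first tuple is returned */
	Cost		total_cost;		/* cost to return all tuples */
} SketchPath;

/* Defaults taken from the patch's additions to cost.h. */
static const Cost parallel_setup_cost = 1000.0;
static const Cost parallel_tuple_cost = 0.1;

/*
 * Mirror the arithmetic of cost_gather() in the patch: start from the
 * subpath's costs, charge the setup cost once at startup, and charge the
 * per-tuple transfer cost for every tuple of the relation (the patch uses
 * rel->tuples here, not the estimated output rows).
 */
static SketchPath
sketch_cost_gather(SketchPath subpath, double rel_tuples)
{
	SketchPath	gather;
	Cost		startup_cost;
	Cost		run_cost;

	assert(rel_tuples >= 0.0);

	startup_cost = subpath.startup_cost + parallel_setup_cost;
	run_cost = (subpath.total_cost - subpath.startup_cost)
		+ parallel_tuple_cost * rel_tuples;

	gather.startup_cost = startup_cost;
	gather.total_cost = startup_cost + run_cost;
	return gather;
}
```

With a subpath costed at startup 0 and total 5000 over a relation of 100000
tuples, this gives a startup cost of 1000 and a total cost of 16000: the
per-tuple transfer charge (10000) is ten times the one-time setup charge in
that case.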
#371Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#370)
Re: Parallel Seq Scan

On Wed, Sep 30, 2015 at 11:23 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

- I don't believe that shm_toc *toc has any business being part of a
generic PlanState node. At most, it should be part of an individual
type of PlanState, like a GatherState or PartialSeqScanState. But
really, I don't see why we need it there at all.

We need it to get the parallelheapscan descriptor in the case of a
partial sequence scan node. It doesn't seem like a good idea
to retrieve it at the beginning, as we would need to dig into the plan
tree to get the node_id for looking up the parallelheapscan descriptor
in the toc.

Now, I think we can surely keep it in PartialSeqScanState or any
other node state which might need it later, but I felt this is quite
generic and we might need to fetch node specific information from toc
going forward.

It's true that the PartialSeqScanState will need a way to get at the
toc, but I don't think that means we should stash it in the PlanState.
I've taken that part out for now.

- I think that a Gather node should inherit from Plan, not Scan. A
Gather node really shouldn't have a scanrelid. Now, admittedly, if
the only thing under the Gather is a Partial Seq Scan, it wouldn't be
totally bonkers to think of the Gather as scanning the same relation
that the Partial Seq Scan is scanning. But in any more complex case,
like where it's scanning a join, you're out of luck. You'll have to
set scanrelid == 0, I suppose, but then, for example, ExecScanReScan
is not going to work. In fact, as far as I can see, the only way
nodeGather.c is actually using any of the generic scan stuff is by
calling ExecInitScanTupleSlot, which is all of one line of code.
ExecEndGather fetches node->ss.ss_currentRelation but then does
nothing with it. So I think this is just a holdover from an early
version of this patch where what's now Gather and PartialSeqScan were
a single node, and I think we should rip it out.

Makes sense, and I think GatherState should also inherit from PlanState
instead of ScanState, which I have changed in the attached patch.

You missed a number of things while doing this - I cleaned them up.

- Also, I think this stuff about physical tlists in
create_gather_plan() is bogus. use_physical_tlist is ignorant of the
possibility that the RelOptInfo passed to it might be anything other
than a baserel, and I think it won't be happy if it gets a joinrel.
Moreover, I think our plan here is that, at least for now, the
Gather's tlist will always match the tlist of its child. If that's
so, there's no point to this: it will end up with the same tlist
either way. If any projection is needed, it should be done by the
Gather node's child, not the Gather node itself.

Yes, the Gather node itself doesn't need to do projection, but it
needs the projection info to store the tuple in a slot after fetching
it from the tuple queue. This is not required for the Gather
node itself, but it might be required for any node on top of
the Gather node.

Here, I think one thing we could do is use the subplan's target
list, as is currently being done for quals. The only risk is that a
Gating node might be added on top of partialseqscan (the subplan), but I have
checked that this is safe, because the Gating plan uses the same target list
as its child. Also, I don't think we need to process any quals at the Gather
node, so I will set them to NULL. I will make this change in the next version
unless you see any problem with it.

Yet another idea is that during set_plan_refs() we can assign the left
child's target list to the parent in the case of a Gather node (right now
it's done the other way around, which would need to be changed).

What is your preference?

I made it work like other nodes that inherit their left child's target list.

I made a few other changes as well:

- I wrote documentation for the GUCs. This probably needs to be
expanded once we get the whole feature in, but it's something.

- I added a new single_copy option to the gather. A single-copy
gather never tries to execute the plan itself, unless it can't get any
workers. This is very handy for testing, since it lets you stick a
Gather node on top of an arbitrary plan and, if everything's working,
it should work just as if the Gather node weren't there. I did a bit
of minor fiddling with the contents of the GatherState to make this
work. It's also useful in real life, since somebody can stick a
single-copy Gather node into a plan someplace and run everything below
that in a worker.

- I fixed a bug in ExecGather - you were testing whether
node->pei->pcxt is NULL, which seg faults on the first time through.
The correct thing is to test node->pei.

- Assorted cosmetic changes.

- I again left out the early-executor-stop stuff, preferring to leave
that for a separate commit.
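
The ExecGather fix in the list above comes down to testing the outer pointer before anything hanging off it. A minimal sketch of that first-call check, assuming hypothetical stand-in structs (DemoGatherState, DemoParallelExecutorInfo) rather than the real executor types, and a setup_calls counter that exists only for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the real executor structures. */
typedef struct DemoParallelExecutorInfo
{
    void       *pcxt;           /* parallel context, set up later */
} DemoParallelExecutorInfo;

typedef struct DemoGatherState
{
    DemoParallelExecutorInfo *pei;  /* NULL until the first execution */
    int         setup_calls;        /* illustration only: counts setups */
} DemoGatherState;

static DemoParallelExecutorInfo demo_pei_storage;

static void
demo_exec_gather(DemoGatherState *node)
{
    assert(node != NULL);

    /*
     * Testing node->pei->pcxt here would dereference a NULL pointer on the
     * first call, because node->pei itself has not been set up yet.
     */
    if (node->pei == NULL)
    {
        demo_pei_storage.pcxt = NULL;
        node->pei = &demo_pei_storage;
        node->setup_calls++;
    }
}
```

Calling demo_exec_gather repeatedly on the same node performs the setup only once.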

That done, I have committed this.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#372Kouhei Kaigai
kaigai@ak.jp.nec.com
In reply to: Robert Haas (#368)
1 attachment(s)
Re: Parallel Seq Scan

Hi Robert,

The Gather node was overlooked in readfuncs.c, even though it is not
expected to be serialized in practice.
Also, outfuncs.c used incompatible WRITE_xxx_FIELD() macros for it.

The attached patch fixes both inconsistencies.
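
To see why the choice of macro matters, consider what happens when a boolean is written in unsigned-integer form and then read back as a boolean. The sketch below is a simplified imitation of the node out/read conventions (booleans stored as the text "true"/"false" and recognized on read by their first character). The helper names are hypothetical, and the real macros work on StringInfo buffers and tokens rather than plain char arrays.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Writes a value the way an unsigned-int writer would: "1" or "0". */
static void
demo_write_uint(char *buf, size_t len, unsigned int value)
{
    snprintf(buf, len, "%u", value);
}

/* Writes a bool as text, following the "true"/"false" convention. */
static void
demo_write_bool(char *buf, size_t len, bool value)
{
    snprintf(buf, len, "%s", value ? "true" : "false");
}

/* Reads a bool by looking only at the first character of the token. */
static bool
demo_read_bool(const char *token)
{
    return token[0] == 't';
}
```

With these helpers, a true value written by demo_write_bool round-trips correctly, while the same value written by demo_write_uint produces the token "1", which demo_read_bool interprets as false.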

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Robert Haas
Sent: Wednesday, September 30, 2015 2:19 AM
To: Amit Kapila
Cc: Kaigai Kouhei(海外 浩平); Haribabu Kommi; Gavin Flower; Jeff Davis; Andres
Freund; Amit Langote; Amit Langote; Fabrízio Mello; Thom Brown; Stephen Frost;
pgsql-hackers
Subject: Re: [HACKERS] Parallel Seq Scan

On Tue, Sep 29, 2015 at 12:39 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Attached patch is a rebased patch based on latest commit (d1b7c1ff)
for Gather node.

- I have to reorganize the defines in execParallel.h and .c. To keep
ParallelExecutorInfo, in GatherState node, we need to include execParallel.h
in execnodes.h which was creating compilation issues as execParallel.h
also includes execnodes.h, so for now I have defined ParallelExecutorInfo
in execnodes.h and instrumentation related structures in instrument.h.
- Renamed parallel_seqscan_degree to degree_of_parallelism
- Rename Funnel to Gather
- Removed PARAM_EXEC parameter handling code, I think we can do this
separately.

I have to work more on partial seq scan patch for rebasing it and handling
review comments for the same, so for now I am sending the first part of
patch (which included Gather node functionality and some general support
for parallel-query execution).

Thanks for the fast rebase.

This patch needs a bunch of cleanup:

- The formatting for the GatherState node's comment block is unlike
that of surrounding comment blocks. It lacks the ------- dividers,
and the indentation is not the same. Also, it refers to
ParallelExecutorInfo by the type name, but the other members by
structure member name. The convention is to refer to them by
structure member name, so please do that.

- The naming of fs_workersReady is inconsistent with the other
structure members. The other members use all lower-case names,
separating words with underscores, but this one uses a capital letter. The
other members also don't prefix the names with anything, but this uses
a "fs_" prefix which I assume is left over from when this was called
FunnelState. Finally, this doesn't actually tell you when workers are
ready, just whether they were launched. I suggest we rename this to
"any_worker_launched".

- Instead of moving the declaration of ParallelExecutorInfo, please
just refer to it as "struct ParallelExecutorInfo" in execnodes.h.
That way, you're not sucking these includes into all kinds of places
they don't really need to be.

- Let's not create a new PARALLEL_QUERY category of GUC. Instead,
let's put the GUC for the number of workers under resource usage ->
asynchronous behavior.

- I don't believe that shm_toc *toc has any business being part of a
generic PlanState node. At most, it should be part of an individual
type of PlanState, like a GatherState or PartialSeqScanState. But
really, I don't see why we need it there at all. It should, I think,
only be needed during startup to dig out the information we need. So
we should just dig that stuff out and keep pointers to whatever we
actually need - in this case the ParallelExecutorInfo, I think - in
the particular type of PlanState node that's at issue - here
GatherState. After that we don't need a pointer to the toc any more.

- I'd like to do some renaming of the new GUCs. I suggest we rename
cpu_tuple_comm_cost to parallel_tuple_cost and degree_of_parallelism
to max_parallel_degree.

- I think that a Gather node should inherit from Plan, not Scan. A
Gather node really shouldn't have a scanrelid. Now, admittedly, if
the only thing under the Gather is a Partial Seq Scan, it wouldn't be
totally bonkers to think of the Gather as scanning the same relation
that the Partial Seq Scan is scanning. But in any more complex case,
like where it's scanning a join, you're out of luck. You'll have to
set scanrelid == 0, I suppose, but then, for example, ExecScanReScan
is not going to work. In fact, as far as I can see, the only way
nodeGather.c is actually using any of the generic scan stuff is by
calling ExecInitScanTupleSlot, which is all of one line of code.
ExecEndGather fetches node->ss.ss_currentRelation but then does
nothing with it. So I think this is just a holdover from an early
version of this patch where what's now Gather and PartialSeqScan were
a single node, and I think we should rip it out.

- On a related note, the assertions in cost_gather() are both bogus
and should be removed. Similarly with create_gather_plan(). As
previously mentioned, the Gather node should not care what sort of
thing is under it; I am not interested in restricting it to baserels
and then undoing that later.

- For the same reason, cost_gather() should refer to its third
argument as "rel" not "baserel".

- Also, I think this stuff about physical tlists in
create_gather_plan() is bogus. use_physical_tlist is ignorant of the
possibility that the RelOptInfo passed to it might be anything other
than a baserel, and I think it won't be happy if it gets a joinrel.
Moreover, I think our plan here is that, at least for now, the
Gather's tlist will always match the tlist of its child. If that's
so, there's no point to this: it will end up with the same tlist
either way. If any projection is needed, it should be done by the
Gather node's child, not the Gather node itself.

- Let's rename DestroyParallelSetupAndAccumStats to
ExecShutdownGather. Instead of encasing the entire function in an if
statement, let's start with if (node->pei == NULL || node->pei->pcxt
== NULL) return.

- ExecParallelBufferUsageAccum should be declared to take an argument
of type PlanState, not Node. Then you don't have to cast what you are
passing to it, and it doesn't have to cast before calling itself. And,
let's also rename it to ExecShutdownNode and move it to
execProcnode.c. Having a "shutdown phase" that stops a node from
asynchronously consuming additional resources could be useful for
non-parallel node types - especially ForeignScan and CustomScan. And
we could eventually extend this to be called in other situations, like
when a Limit is filled, to give everything beneath it a chance to ease up.
We don't have to do those bits of work right now but it seems well
worth making this look like a generic facility.

- Calling DestroyParallelSetupAndAccumStats from ExplainNode when we
actually reach the Gather node is much too late. We should really be
shutting down parallel workers at the end of the ExecutorRun phase, or
certainly no later than ExecutorFinish. In fact, you have
standard_ExecutorRun calling ExecParallelBufferUsageAccum() but only
if queryDesc->totaltime is set. What I think you should do instead is
call ExecShutdownNode a few lines earlier, before shutting down the
tuple receiver, and do so unconditionally. That way, the workers are
always shut down in the ExecutorRun phase, which should eliminate the
need for this bit in explain.c.

- The changes to postmaster.c and postgres.c consist of only
additional #includes. Those can, presumably, be reverted.

Other than that, hah hah, it looks pretty cool.
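
The shutdown-phase suggestions in the review above (a generic ExecShutdownNode that takes a PlanState, plus an ExecShutdownGather that returns early) could look roughly like the following. This is only a sketch under stated assumptions: the types and function names (DemoPlanState, demo_shutdown_node, demo_shutdown_gather) are hypothetical, the tree has at most two children per node, and "shutting down" is reduced to clearing a pointer and bumping a counter.

```c
#include <assert.h>
#include <stddef.h>

typedef enum DemoTag
{
    DEMO_SEQSCAN,
    DEMO_GATHER,
    DEMO_LIMIT
} DemoTag;

typedef struct DemoPei
{
    void       *pcxt;           /* non-NULL while workers may be running */
} DemoPei;

typedef struct DemoPlanState
{
    DemoTag     tag;
    struct DemoPlanState *lefttree;
    struct DemoPlanState *righttree;
    DemoPei    *pei;            /* only meaningful for DEMO_GATHER */
    int         shutdowns;      /* how many times workers were released */
} DemoPlanState;

static void
demo_shutdown_gather(DemoPlanState *node)
{
    /* Early return instead of wrapping the whole body in an if. */
    if (node->pei == NULL || node->pei->pcxt == NULL)
        return;

    /* Stand-in for waiting on workers and destroying the parallel context. */
    node->pei->pcxt = NULL;
    node->shutdowns++;
}

/* Takes the state node type directly, so callers need no casts. */
static void
demo_shutdown_node(DemoPlanState *node)
{
    if (node == NULL)
        return;

    if (node->tag == DEMO_GATHER)
        demo_shutdown_gather(node);

    demo_shutdown_node(node->lefttree);
    demo_shutdown_node(node->righttree);
}
```

Because the walk is unconditional and idempotent, it can be called at the end of the run phase regardless of whether instrumentation was requested.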

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

pgsql-gather-on-readfunc.v1.patch (application/octet-stream)
 src/backend/nodes/outfuncs.c  |  4 ++--
 src/backend/nodes/readfuncs.c | 18 ++++++++++++++++++
 2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 4645ecb..61dbc75 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -439,8 +439,8 @@ _outGather(StringInfo str, const Gather *node)
 
 	_outPlanInfo(str, (const Plan *) node);
 
-	WRITE_UINT_FIELD(num_workers);
-	WRITE_UINT_FIELD(single_copy);
+	WRITE_INT_FIELD(num_workers);
+	WRITE_BOOL_FIELD(single_copy);
 }
 
 static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 72368ab..c6af1d5 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1999,6 +1999,22 @@ _readUnique(void)
 }
 
 /*
+ * _readGather
+ */
+static Gather *
+_readGather(void)
+{
+	READ_LOCALS(Gather);
+
+	ReadCommonPlan(&local_node->plan);
+
+	READ_INT_FIELD(num_workers);
+	READ_BOOL_FIELD(single_copy);
+
+	READ_DONE();
+}
+
+/*
  * _readHash
  */
 static Hash *
@@ -2365,6 +2381,8 @@ parseNodeString(void)
 		return_value = _readWindowAgg();
 	else if (MATCH("UNIQUE", 6))
 		return_value = _readUnique();
+	else if (MATCH("GATHER", 6))
+		return_value = _readGather();
 	else if (MATCH("HASH", 4))
 		return_value = _readHash();
 	else if (MATCH("SETOP", 5))
#373Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#369)
3 attachment(s)
Re: Parallel Seq Scan

On Wed, Sep 30, 2015 at 7:05 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Sep 24, 2015 at 2:31 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

[ parallel_seqscan_partialseqscan_v18.patch ]

I spent a bit of time reviewing the heapam.c changes in this patch
this evening, and I think that your attempt to add support for
synchronized scans has some problems.

Thanks for the review. I agree with all of your suggestions and have
fixed all of them in the attached patch
(parallel_seqscan_heapscan_v1.patch).

I have rebased the partial seq scan patch (attached to this mail) to
test synchronized scans with the parallelheapscan patch. I have also added
a log message (parallel_seqscan_heapscan_test_v1.patch) to show the start
positions during synchronized parallel heap scans. I have done various tests
with parallel scans and found that it works fine both with and without
synchronized scans.

Basic test to verify the patch:
CREATE TABLE t1(c1, c2) AS SELECT g, repeat('x', 5) FROM
generate_series(1, 10000000) g;

CREATE TABLE t2(c1, c2) AS SELECT g, repeat('x', 5) FROM
generate_series(1, 1000000) g;

set parallel_tuple_cost=0.001;

set max_parallel_degree=2;

set parallel_setup_cost=0;

SELECT count(*) FROM t1 JOIN t2 ON t1.c1 = t2.c1 AND t1.c1 BETWEEN 100 AND
200;

Run the above SELECT query from multiple clients and note the start scan
positions and the results of the query. It returns the expected result
(a count of 101 rows).
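
For readers who want to follow the page hand-out without reading the whole patch, here is a minimal stand-alone sketch modeled on the logic of heap_parallelscan_nextpage in the attached heapscan patch. It uses plain C types and hypothetical names (DemoParallelScan, demo_next_page), is single-threaded, and omits the spinlock (phs_mutex) that the patch holds around the shared state, so it illustrates the wraparound and end-of-scan detection only. It assumes the relation has at least one block and that the start block has already been chosen.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_INVALID_BLOCK UINT32_MAX   /* stand-in for InvalidBlockNumber */

typedef struct DemoParallelScan
{
    uint32_t    nblocks;        /* relation size in blocks */
    uint32_t    startblock;     /* block where the scan began */
    uint32_t    cblock;         /* next block to hand out */
} DemoParallelScan;

static void
demo_scan_init(DemoParallelScan *ps, uint32_t nblocks, uint32_t startblock)
{
    assert(nblocks > 0 && startblock < nblocks);
    ps->nblocks = nblocks;
    ps->startblock = startblock;
    ps->cblock = startblock;
}

/*
 * Hand out the next block, wrapping at the end of the relation.  Once the
 * position comes back around to the start block, every later call returns
 * DEMO_INVALID_BLOCK and sets *finished.
 */
static uint32_t
demo_next_page(DemoParallelScan *ps, bool *finished)
{
    uint32_t    page = ps->cblock;

    *finished = false;
    if (page != DEMO_INVALID_BLOCK)
    {
        ps->cblock++;
        if (ps->cblock >= ps->nblocks)
            ps->cblock = 0;
        if (ps->cblock == ps->startblock)
            ps->cblock = DEMO_INVALID_BLOCK;    /* wrapped: nothing left */
    }
    else
        *finished = true;

    return page;
}
```

With five blocks and a start position of 3, successive calls hand out blocks 3, 4, 0, 1 and 2, after which the scan reports that it is finished. Each participant simply calls the function until that happens, so a worker that starts late may get no blocks at all.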

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

parallel_seqscan_heapscan_v1.patch (application/octet-stream)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index bcf9871..a9aa1ed 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -63,6 +63,7 @@
 #include "storage/predicate.h"
 #include "storage/procarray.h"
 #include "storage/smgr.h"
+#include "storage/spin.h"
 #include "storage/standby.h"
 #include "utils/datum.h"
 #include "utils/inval.h"
@@ -80,12 +81,16 @@ bool		synchronize_seqscans = true;
 static HeapScanDesc heap_beginscan_internal(Relation relation,
 						Snapshot snapshot,
 						int nkeys, ScanKey key,
+						ParallelHeapScanDesc parallel_scan,
 						bool allow_strat,
 						bool allow_sync,
 						bool allow_pagemode,
 						bool is_bitmapscan,
 						bool is_samplescan,
 						bool temp_snap);
+static BlockNumber heap_parallelscan_nextpage(HeapScanDesc scan,
+						   bool *pscan_finished);
+static void heap_parallelscan_initialize_startblock(HeapScanDesc scan);
 static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
 					TransactionId xid, CommandId cid, int options);
 static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
@@ -226,7 +231,10 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
 	 * results for a non-MVCC snapshot, the caller must hold some higher-level
 	 * lock that ensures the interesting tuple(s) won't change.)
 	 */
-	scan->rs_nblocks = RelationGetNumberOfBlocks(scan->rs_rd);
+	if (scan->rs_parallel != NULL)
+		scan->rs_nblocks = scan->rs_parallel->phs_nblocks;
+	else
+		scan->rs_nblocks = RelationGetNumberOfBlocks(scan->rs_rd);
 
 	/*
 	 * If the table is large relative to NBuffers, use a bulk-read access
@@ -272,7 +280,10 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
 	else if (allow_sync && synchronize_seqscans)
 	{
 		scan->rs_syncscan = true;
-		scan->rs_startblock = ss_get_location(scan->rs_rd, scan->rs_nblocks);
+		if (scan->rs_parallel != NULL)
+			heap_parallelscan_initialize_startblock(scan);
+		else
+			scan->rs_startblock = ss_get_location(scan->rs_rd, scan->rs_nblocks);
 	}
 	else
 	{
@@ -496,7 +507,27 @@ heapgettup(HeapScanDesc scan,
 				tuple->t_data = NULL;
 				return;
 			}
-			page = scan->rs_startblock; /* first page */
+			if (scan->rs_parallel != NULL)
+			{
+				bool		pscan_finished;
+
+				page = heap_parallelscan_nextpage(scan, &pscan_finished);
+
+				/*
+				 * Return NULL if the scan is finished. It can so happen that
+				 * by the time one of workers started the scan, others have
+				 * already completed scanning the relation, so this worker
+				 * won't need to perform scan.
+				 */
+				if (pscan_finished)
+				{
+					Assert(!BufferIsValid(scan->rs_cbuf));
+					tuple->t_data = NULL;
+					return;
+				}
+			}
+			else
+				page = scan->rs_startblock;		/* first page */
 			heapgetpage(scan, page);
 			lineoff = FirstOffsetNumber;		/* first offnum */
 			scan->rs_inited = true;
@@ -519,6 +550,9 @@ heapgettup(HeapScanDesc scan,
 	}
 	else if (backward)
 	{
+		/* backward parallel scan not supported */
+		Assert(scan->rs_parallel == NULL);
+
 		if (!scan->rs_inited)
 		{
 			/*
@@ -671,11 +705,22 @@ heapgettup(HeapScanDesc scan,
 		}
 		else
 		{
-			page++;
-			if (page >= scan->rs_nblocks)
-				page = 0;
-			finished = (page == scan->rs_startblock) ||
-				(scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks == 0 : false);
+			if (scan->rs_parallel != NULL)
+			{
+				bool		pscan_finished = false;
+
+				page = heap_parallelscan_nextpage(scan, &pscan_finished);
+				finished = pscan_finished;
+			}
+			else
+			{
+				page++;
+				if (page >= scan->rs_nblocks)
+					page = 0;
+
+				finished = (page == scan->rs_startblock) ||
+					(scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
+			}
 
 			/*
 			 * Report our new scan position for synchronization purposes. We
@@ -773,7 +818,27 @@ heapgettup_pagemode(HeapScanDesc scan,
 				tuple->t_data = NULL;
 				return;
 			}
-			page = scan->rs_startblock; /* first page */
+			if (scan->rs_parallel != NULL)
+			{
+				bool		pscan_finished;
+
+				page = heap_parallelscan_nextpage(scan, &pscan_finished);
+
+				/*
+				 * Return NULL if the scan is finished. It can so happen that
+				 * by the time one of workers started the scan, others have
+				 * already completed scanning the relation, so this worker
+				 * won't need to perform scan.
+				 */
+				if (pscan_finished)
+				{
+					Assert(!BufferIsValid(scan->rs_cbuf));
+					tuple->t_data = NULL;
+					return;
+				}
+			}
+			else
+				page = scan->rs_startblock;		/* first page */
 			heapgetpage(scan, page);
 			lineindex = 0;
 			scan->rs_inited = true;
@@ -793,6 +858,9 @@ heapgettup_pagemode(HeapScanDesc scan,
 	}
 	else if (backward)
 	{
+		/* backward parallel scan not supported */
+		Assert(scan->rs_parallel == NULL);
+
 		if (!scan->rs_inited)
 		{
 			/*
@@ -934,11 +1002,22 @@ heapgettup_pagemode(HeapScanDesc scan,
 		}
 		else
 		{
-			page++;
-			if (page >= scan->rs_nblocks)
-				page = 0;
-			finished = (page == scan->rs_startblock) ||
-				(scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks == 0 : false);
+			if (scan->rs_parallel != NULL)
+			{
+				bool		pscan_finished = false;
+
+				page = heap_parallelscan_nextpage(scan, &pscan_finished);
+				finished = pscan_finished;
+			}
+			else
+			{
+				page++;
+				if (page >= scan->rs_nblocks)
+					page = 0;
+
+				finished = (page == scan->rs_startblock) ||
+					(scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
+			}
 
 			/*
 			 * Report our new scan position for synchronization purposes. We
@@ -1341,7 +1420,7 @@ HeapScanDesc
 heap_beginscan(Relation relation, Snapshot snapshot,
 			   int nkeys, ScanKey key)
 {
-	return heap_beginscan_internal(relation, snapshot, nkeys, key,
+	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
 								   true, true, true, false, false, false);
 }
 
@@ -1351,7 +1430,7 @@ heap_beginscan_catalog(Relation relation, int nkeys, ScanKey key)
 	Oid			relid = RelationGetRelid(relation);
 	Snapshot	snapshot = RegisterSnapshot(GetCatalogSnapshot(relid));
 
-	return heap_beginscan_internal(relation, snapshot, nkeys, key,
+	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
 								   true, true, true, false, false, true);
 }
 
@@ -1360,7 +1439,7 @@ heap_beginscan_strat(Relation relation, Snapshot snapshot,
 					 int nkeys, ScanKey key,
 					 bool allow_strat, bool allow_sync)
 {
-	return heap_beginscan_internal(relation, snapshot, nkeys, key,
+	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
 								   allow_strat, allow_sync, true,
 								   false, false, false);
 }
@@ -1369,7 +1448,7 @@ HeapScanDesc
 heap_beginscan_bm(Relation relation, Snapshot snapshot,
 				  int nkeys, ScanKey key)
 {
-	return heap_beginscan_internal(relation, snapshot, nkeys, key,
+	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
 								   false, false, true, true, false, false);
 }
 
@@ -1378,7 +1457,7 @@ heap_beginscan_sampling(Relation relation, Snapshot snapshot,
 						int nkeys, ScanKey key,
 					  bool allow_strat, bool allow_sync, bool allow_pagemode)
 {
-	return heap_beginscan_internal(relation, snapshot, nkeys, key,
+	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
 								   allow_strat, allow_sync, allow_pagemode,
 								   false, true, false);
 }
@@ -1386,6 +1465,7 @@ heap_beginscan_sampling(Relation relation, Snapshot snapshot,
 static HeapScanDesc
 heap_beginscan_internal(Relation relation, Snapshot snapshot,
 						int nkeys, ScanKey key,
+						ParallelHeapScanDesc parallel_scan,
 						bool allow_strat,
 						bool allow_sync,
 						bool allow_pagemode,
@@ -1418,6 +1498,7 @@ heap_beginscan_internal(Relation relation, Snapshot snapshot,
 	scan->rs_allow_strat = allow_strat;
 	scan->rs_allow_sync = allow_sync;
 	scan->rs_temp_snap = temp_snap;
+	scan->rs_parallel = parallel_scan;
 
 	/*
 	 * we can use page-at-a-time mode if it's an MVCC-safe snapshot
@@ -1532,6 +1613,166 @@ heap_endscan(HeapScanDesc scan)
 }
 
 /* ----------------
+ *		heap_parallelscan_estimate - estimate storage for ParallelHeapScanDesc
+ *
+ *		Sadly, this doesn't reduce to a constant, because the size required
+ *		to serialize the snapshot can vary.
+ * ----------------
+ */
+Size
+heap_parallelscan_estimate(Snapshot snapshot)
+{
+	return add_size(offsetof(ParallelHeapScanDescData, phs_snapshot_data),
+					EstimateSnapshotSpace(snapshot));
+}
+
+/* ----------------
+ *		heap_parallelscan_initialize - initialize ParallelHeapScanDesc
+ *
+ *		Must allow as many bytes of shared memory as returned by
+ *		heap_parallelscan_estimate.  Call this just once in the leader
+ *		process; then, individual workers attach via heap_beginscan_parallel.
+ * ----------------
+ */
+void
+heap_parallelscan_initialize(ParallelHeapScanDesc target, Relation relation,
+							 Snapshot snapshot)
+{
+	target->phs_relid = RelationGetRelid(relation);
+	target->phs_nblocks = RelationGetNumberOfBlocks(relation);
+	SpinLockInit(&target->phs_mutex);
+	target->phs_cblock = InvalidBlockNumber;
+	target->phs_startblock = InvalidBlockNumber;
+	SerializeSnapshot(snapshot, target->phs_snapshot_data);
+}
+
+/* ----------------
+ *		heap_parallelscan_initialize_startblock - initialize the startblock for
+ *					parallel scan.
+ *
+ *		Only the first worker of parallel scan will initialize the start
+ *		block for scan and others will use that information to indicate
+ *		the end of scan.
+ * ----------------
+ */
+static void
+heap_parallelscan_initialize_startblock(HeapScanDesc scan)
+{
+	ParallelHeapScanDesc parallel_scan;
+	BlockNumber page;
+
+	Assert(scan->rs_parallel);
+
+	parallel_scan = scan->rs_parallel;
+
+	SpinLockAcquire(&parallel_scan->phs_mutex);
+	page = parallel_scan->phs_startblock;
+	SpinLockRelease(&parallel_scan->phs_mutex);
+
+	if (page != InvalidBlockNumber)
+		return;					/* some other process already did this */
+
+	page = ss_get_location(scan->rs_rd, scan->rs_nblocks);
+
+	SpinLockAcquire(&parallel_scan->phs_mutex);
+	/* even though we checked before, someone might have beaten us here */
+	if (parallel_scan->phs_startblock == InvalidBlockNumber)
+	{
+		parallel_scan->phs_startblock = page;
+		parallel_scan->phs_cblock = page;
+	}
+	SpinLockRelease(&parallel_scan->phs_mutex);
+}
+
+/* ----------------
+ *		heap_parallelscan_nextpage - get the next page to scan
+ *
+ *		Scanning till the position from where the parallel scan has started
+ *		indicates end of scan.  Note, however, that other backends could still
+ *		be scanning if they grabbed a page to scan and aren't done with it yet.
+ *		Resets the current position for parallel scan to the beginning of
+ *		relation, if next page to scan is greater than total number of pages in
+ *		relation.
+ * ----------------
+ */
+static BlockNumber
+heap_parallelscan_nextpage(HeapScanDesc scan,
+						   bool *pscan_finished)
+{
+	BlockNumber page = InvalidBlockNumber;
+	ParallelHeapScanDesc parallel_scan;
+	bool		report_scan_done = false;
+
+	Assert(scan->rs_parallel);
+
+	parallel_scan = scan->rs_parallel;
+
+	*pscan_finished = false;
+
+	SpinLockAcquire(&parallel_scan->phs_mutex);
+	page = parallel_scan->phs_cblock;
+	if (page != InvalidBlockNumber)
+	{
+		parallel_scan->phs_cblock++;
+		if (parallel_scan->phs_cblock >= scan->rs_nblocks)
+			parallel_scan->phs_cblock = 0;
+		if (parallel_scan->phs_cblock == parallel_scan->phs_startblock)
+		{
+			parallel_scan->phs_cblock = InvalidBlockNumber;
+			report_scan_done = true;
+		}
+	}
+	SpinLockRelease(&parallel_scan->phs_mutex);
+
+	if (page == InvalidBlockNumber)
+		*pscan_finished = true;
+
+	/*
+	 * Report scan location for the first parallel scan to observe the end of
+	 * scan, so that the final state of the position hint is back at the start
+	 * of the rel.
+	 */
+	if (report_scan_done && scan->rs_syncscan)
+		ss_report_location(scan->rs_rd, page);
+
+	return page;
+}
+
+/* ----------------
+ *		heap_beginscan_parallel - join a parallel scan
+ *
+ *		Caller must hold a suitable lock on the correct relation.
+ * ----------------
+ */
+HeapScanDesc
+heap_beginscan_parallel(Relation relation, ParallelHeapScanDesc parallel_scan)
+{
+	Snapshot	snapshot;
+
+	Assert(RelationGetRelid(relation) == parallel_scan->phs_relid);
+	snapshot = RestoreSnapshot(parallel_scan->phs_snapshot_data);
+	RegisterSnapshot(snapshot);
+
+	return heap_beginscan_internal(relation, snapshot, 0, NULL, parallel_scan,
+								   true, true, true, false, false, true);
+}
+
+/* ----------------
+ *		heap_parallel_rescan		- restart a parallel relation scan
+ * ----------------
+ */
+void
+heap_parallel_rescan(ParallelHeapScanDesc pscan,
+					 HeapScanDesc scan)
+{
+	if (pscan != NULL)
+		scan->rs_parallel = pscan;
+
+	heap_rescan(scan,			/* scan desc */
+				NULL);			/* new scan keys */
+}
+
+/* ----------------
  *		heap_getnext	- retrieve next tuple in scan
  *
  *		Fix to work with index relations.
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 75e6b72..ead8411 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -96,8 +96,9 @@ extern Relation heap_openrv_extended(const RangeVar *relation,
 
 #define heap_close(r,l)  relation_close(r,l)
 
-/* struct definition appears in relscan.h */
+/* struct definitions appear in relscan.h */
 typedef struct HeapScanDescData *HeapScanDesc;
+typedef struct ParallelHeapScanDescData *ParallelHeapScanDesc;
 
 /*
  * HeapScanIsValid
@@ -121,11 +122,17 @@ extern void heap_setscanlimits(HeapScanDesc scan, BlockNumber startBlk,
 				   BlockNumber endBlk);
 extern void heapgetpage(HeapScanDesc scan, BlockNumber page);
 extern void heap_rescan(HeapScanDesc scan, ScanKey key);
+extern void heap_parallel_rescan(ParallelHeapScanDesc pscan, HeapScanDesc scan);
 extern void heap_rescan_set_params(HeapScanDesc scan, ScanKey key,
 					 bool allow_strat, bool allow_sync, bool allow_pagemode);
 extern void heap_endscan(HeapScanDesc scan);
 extern HeapTuple heap_getnext(HeapScanDesc scan, ScanDirection direction);
 
+extern Size heap_parallelscan_estimate(Snapshot snapshot);
+extern void heap_parallelscan_initialize(ParallelHeapScanDesc target,
+							 Relation relation, Snapshot snapshot);
+extern HeapScanDesc heap_beginscan_parallel(Relation, ParallelHeapScanDesc);
+
 extern bool heap_fetch(Relation relation, Snapshot snapshot,
 		   HeapTuple tuple, Buffer *userbuf, bool keep_buf,
 		   Relation stats_relation);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 6e62319..1c9ffda 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -20,6 +20,16 @@
 #include "access/itup.h"
 #include "access/tupdesc.h"
 
+/* Struct for parallel scan setup */
+typedef struct ParallelHeapScanDescData
+{
+	Oid			phs_relid;
+	BlockNumber phs_nblocks;
+	slock_t		phs_mutex;
+	BlockNumber phs_cblock;
+	BlockNumber phs_startblock;
+	char		phs_snapshot_data[FLEXIBLE_ARRAY_MEMBER];
+}	ParallelHeapScanDescData;
 
 typedef struct HeapScanDescData
 {
@@ -49,6 +59,7 @@ typedef struct HeapScanDescData
 	BlockNumber rs_cblock;		/* current block # in scan, if any */
 	Buffer		rs_cbuf;		/* current buffer in scan, if any */
 	/* NB: if rs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
+	ParallelHeapScanDesc rs_parallel;	/* parallel scan information */
 
 	/* these fields only used in page-at-a-time mode and for bitmap scans */
 	int			rs_cindex;		/* current tuple's index in vistuples */
parallel_seqscan_heapscan_test_v1.patch (application/octet-stream)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index c873e04..9ba2020 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -281,7 +281,10 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
 	{
 		scan->rs_syncscan = true;
 		if (scan->rs_parallel != NULL)
+		{
 			heap_parallelscan_initialize_startblock(scan);
+			elog(LOG, "heap_parallelscan: start pos %u", scan->rs_parallel->phs_startblock);
+		}
 		else
 			scan->rs_startblock = ss_get_location(scan->rs_rd, scan->rs_nblocks);
 	}
parallel_seqscan_partialseqscan_v19.patch (application/octet-stream)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 7fb8a14..c76bfb0 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -730,6 +730,7 @@ ExplainPreScanNode(PlanState *planstate, Bitmapset **rels_used)
 	{
 		case T_SeqScan:
 		case T_SampleScan:
+		case T_PartialSeqScan:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -853,6 +854,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_SampleScan:
 			pname = sname = "Sample Scan";
 			break;
+		case T_PartialSeqScan:
+			pname = sname = "Partial Seq Scan";
+			break;
 		case T_Gather:
 			pname = sname = "Gather";
 			break;
@@ -1006,6 +1010,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	{
 		case T_SeqScan:
 		case T_SampleScan:
+		case T_PartialSeqScan:
 		case T_BitmapHeapScan:
 		case T_TidScan:
 		case T_SubqueryScan:
@@ -1270,6 +1275,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 							 planstate, ancestors, es);
 			/* FALL THRU to print additional fields the same as SeqScan */
 		case T_SeqScan:
+		case T_PartialSeqScan:
 		case T_ValuesScan:
 		case T_CteScan:
 		case T_WorkTableScan:
@@ -2354,6 +2360,7 @@ ExplainTargetRel(Plan *plan, Index rti, ExplainState *es)
 	{
 		case T_SeqScan:
 		case T_SampleScan:
+		case T_PartialSeqScan:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 51edd4c..38a92fe 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -21,8 +21,8 @@ OBJS = execAmi.o execCurrent.o execGrouping.o execIndexing.o execJunk.o \
        nodeHash.o nodeHashjoin.o nodeIndexscan.o nodeIndexonlyscan.o \
        nodeLimit.o nodeLockRows.o \
        nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
-       nodeNestloop.o nodeFunctionscan.o nodeRecursiveunion.o nodeResult.o \
-       nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
+       nodeNestloop.o nodeFunctionscan.o nodePartialSeqscan.o nodeRecursiveunion.o \
+       nodeResult.o nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
        nodeValuesscan.o nodeCtescan.o nodeWorktablescan.o \
        nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
        nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 163650c..f2f9c30 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -38,6 +38,7 @@
 #include "executor/nodeMergejoin.h"
 #include "executor/nodeModifyTable.h"
 #include "executor/nodeNestloop.h"
+#include "executor/nodePartialSeqscan.h"
 #include "executor/nodeRecursiveunion.h"
 #include "executor/nodeResult.h"
 #include "executor/nodeSamplescan.h"
@@ -161,6 +162,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSampleScan((SampleScanState *) node);
 			break;
 
+		case T_PartialSeqScanState:
+			ExecReScanPartialSeqScan((PartialSeqScanState *) node);
+			break;
+
 		case T_GatherState:
 			ExecReScanGather((GatherState *) node);
 			break;
@@ -472,6 +477,7 @@ ExecSupportsBackwardScan(Plan *node)
 			/* Simplify life for tablesample methods by disallowing this */
 			return false;
 
+		case T_PartialSeqScan:
 		case T_Gather:
 			return false;
 
diff --git a/src/backend/executor/execCurrent.c b/src/backend/executor/execCurrent.c
index bcd287f..5bd00cc 100644
--- a/src/backend/executor/execCurrent.c
+++ b/src/backend/executor/execCurrent.c
@@ -262,6 +262,7 @@ search_plan_tree(PlanState *node, Oid table_oid)
 			 */
 		case T_SeqScanState:
 		case T_SampleScanState:
+		case T_PartialSeqScanState:
 		case T_IndexScanState:
 		case T_IndexOnlyScanState:
 		case T_BitmapHeapScanState:
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index e6930c1..7db4e41 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -17,6 +17,7 @@
 
 #include "executor/execParallel.h"
 #include "executor/executor.h"
+#include "executor/nodePartialSeqscan.h"
 #include "executor/tqueue.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/planmain.h"
@@ -59,6 +60,7 @@ struct SharedExecutorInstrumentation
 typedef struct ExecParallelEstimateContext
 {
 	ParallelContext *pcxt;
+	Size pscan_len;
 	int nnodes;
 } ExecParallelEstimateContext;
 
@@ -67,9 +69,16 @@ typedef struct ExecParallelInitializeDSMContext
 {
 	ParallelContext *pcxt;
 	SharedExecutorInstrumentation *instrumentation;
+	Size pscan_len;
 	int nnodes;
 } ExecParallelInitializeDSMContext;
 
+/*
+ * This is required for parallel plan execution to fetch the information
+ * from dsm.
+ */
+static shm_toc *parallel_shm_toc = NULL;
+
 /* Helper functions that run in the parallel leader. */
 static char *ExecSerializePlan(Plan *plan, EState *estate);
 static bool ExecParallelEstimate(PlanState *node,
@@ -158,10 +167,24 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 	/* Count this node. */
 	e->nnodes++;
 
-	/*
-	 * XXX. Call estimators for parallel-aware nodes here, when we have
-	 * some.
-	 */
+	/* Call estimators for parallel-aware nodes. */
+	switch (nodeTag(planstate))
+	{
+		case T_PartialSeqScanState:
+			{
+				EState	   *estate = ((PartialSeqScanState *) planstate)->ss.ps.state;
+
+				e->pscan_len = heap_parallelscan_estimate(estate->es_snapshot);
+				shm_toc_estimate_chunk(&e->pcxt->estimator, e->pscan_len);
+
+				/* key for partial scan information. */
+				shm_toc_estimate_keys(&e->pcxt->estimator, 1);
+				return true;
+			}
+			break;
+		default:
+			break;
+	}
 
 	return planstate_tree_walker(planstate, ExecParallelEstimate, e);
 }
@@ -177,6 +200,8 @@ static bool
 ExecParallelInitializeDSM(PlanState *planstate,
 						  ExecParallelInitializeDSMContext *d)
 {
+	ParallelHeapScanDesc pscan;
+
 	if (planstate == NULL)
 		return false;
 
@@ -196,10 +221,29 @@ ExecParallelInitializeDSM(PlanState *planstate,
 	/* Count this node. */
 	d->nnodes++;
 
-	/*
-	 * XXX. Call initializers for parallel-aware plan nodes, when we have
-	 * some.
-	 */
+	/* Call initializers for parallel-aware plan nodes. */
+	switch (nodeTag(planstate))
+	{
+		case T_PartialSeqScanState:
+			{
+				EState	   *estate = ((PartialSeqScanState *) planstate)->ss.ps.state;
+
+				/*
+				 * Store parallel heap scan descriptor in dynamic shared
+				 * memory.
+				 */
+				pscan = shm_toc_allocate(d->pcxt->toc,
+										 d->pscan_len);
+				heap_parallelscan_initialize(pscan,
+					   ((PartialSeqScanState *) planstate)->ss.ss_currentRelation,
+											 estate->es_snapshot);
+				shm_toc_insert(d->pcxt->toc, PARALLEL_KEY_SCAN, pscan);
+				return true;
+			}
+			break;
+		default:
+			break;
+	}
 
 	return planstate_tree_walker(planstate, ExecParallelInitializeDSM, d);
 }
@@ -314,6 +358,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate, int nworkers)
 	 * a count of how many PlanState nodes there are.
 	 */
 	e.pcxt = pcxt;
+	e.pscan_len = 0;
 	e.nnodes = 0;
 	ExecParallelEstimate(planstate, &e);
 
@@ -379,6 +424,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate, int nworkers)
 	 */
 	d.pcxt = pcxt;
 	d.instrumentation = instrumentation;
+	d.pscan_len = e.pscan_len;
 	d.nnodes = 0;
 	ExecParallelInitializeDSM(planstate, &d);
 
@@ -531,6 +577,15 @@ ExecParallelReportInstrumentation(PlanState *planstate,
 }
 
 /*
+ * GetParallelShmToc
+ */
+shm_toc *
+GetParallelShmToc(void)
+{
+	return parallel_shm_toc;
+}
+
+/*
  * Main entrypoint for parallel query worker processes.
  *
  * We reach this function from ParallelMain, so the setup necessary to create
@@ -561,6 +616,8 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
 		instrument_options = instrumentation->instrument_options;
 	queryDesc = ExecParallelGetQueryDesc(toc, receiver, instrument_options);
 
+	parallel_shm_toc = toc;
+
 	/* Prepare to track buffer usage during query execution. */
 	InstrStartParallelQuery();
 
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 5bc1d48..87b022d 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -100,6 +100,7 @@
 #include "executor/nodeMergejoin.h"
 #include "executor/nodeModifyTable.h"
 #include "executor/nodeNestloop.h"
+#include "executor/nodePartialSeqscan.h"
 #include "executor/nodeGather.h"
 #include "executor/nodeRecursiveunion.h"
 #include "executor/nodeResult.h"
@@ -309,6 +310,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												  estate, eflags);
 			break;
 
+		case T_PartialSeqScan:
+			result = (PlanState *) ExecInitPartialSeqScan((PartialSeqScan *) node,
+														  estate, eflags);
+			break;
+
 		case T_Gather:
 			result = (PlanState *) ExecInitGather((Gather *) node,
 												  estate, eflags);
@@ -511,6 +517,10 @@ ExecProcNode(PlanState *node)
 			result = ExecUnique((UniqueState *) node);
 			break;
 
+		case T_PartialSeqScanState:
+			result = ExecPartialSeqScan((PartialSeqScanState *) node);
+			break;
+
 		case T_GatherState:
 			result = ExecGather((GatherState *) node);
 			break;
@@ -669,6 +679,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSampleScan((SampleScanState *) node);
 			break;
 
+		case T_PartialSeqScanState:
+			ExecEndPartialSeqScan((PartialSeqScanState *) node);
+			break;
+
 		case T_GatherState:
 			ExecEndGather((GatherState *) node);
 			break;
diff --git a/src/backend/executor/nodeGather.c b/src/backend/executor/nodeGather.c
index 735dbaa..74b3985 100644
--- a/src/backend/executor/nodeGather.c
+++ b/src/backend/executor/nodeGather.c
@@ -115,6 +115,8 @@ ExecGather(GatherState *node)
 										 estate,
 								  ((Gather *) (node->ps.plan))->num_workers);
 
+		outerPlanState(node)->toc = node->pei->pcxt->toc;
+
 		/*
 		 * Register backend workers. If the required number of workers are not
 		 * available then we perform the scan with available workers and if
diff --git a/src/backend/executor/nodePartialSeqscan.c b/src/backend/executor/nodePartialSeqscan.c
new file mode 100644
index 0000000..e4a125a
--- /dev/null
+++ b/src/backend/executor/nodePartialSeqscan.c
@@ -0,0 +1,308 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodePartialSeqscan.c
+ *	  Support routines for partial sequential scans of relations.
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodePartialSeqscan.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * INTERFACE ROUTINES
+ *		ExecPartialSeqScan				scans a relation partially.
+ *		PartialSeqNext					retrieve next tuple from heap.
+ *		ExecInitPartialSeqScan			creates and initializes a partial seqscan node.
+ *		ExecEndPartialSeqScan			releases any storage allocated.
+ */
+#include "postgres.h"
+
+#include "access/relscan.h"
+#include "executor/execdebug.h"
+#include "executor/execParallel.h"
+#include "executor/nodePartialSeqscan.h"
+#include "utils/rel.h"
+
+
+
+/* ----------------------------------------------------------------
+ *						Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		PartialSeqNext
+ *
+ *		This is a workhorse for ExecPartialSeqScan
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+PartialSeqNext(PartialSeqScanState *node)
+{
+	HeapTuple	tuple;
+	HeapScanDesc scandesc;
+	EState	   *estate;
+	ScanDirection direction;
+	TupleTableSlot *slot;
+
+	/*
+	 * get information from the estate and scan state
+	 */
+	scandesc = node->ss.ss_currentScanDesc;
+	estate = node->ss.ps.state;
+	direction = estate->es_direction;
+	slot = node->ss.ss_ScanTupleSlot;
+
+	/*
+	 * get the next tuple from the table
+	 */
+	tuple = heap_getnext(scandesc, direction);
+
+	/*
+	 * save the tuple and the buffer returned to us by the access methods in
+	 * our scan tuple slot and return the slot.  Note: we pass 'false' because
+	 * tuples returned by heap_getnext() are pointers onto disk pages and were
+	 * not created with palloc() and so should not be pfree()'d.  Note also
+	 * that ExecStoreTuple will increment the refcount of the buffer; the
+	 * refcount will not be dropped until the tuple table slot is cleared.
+	 */
+	if (tuple)
+		ExecStoreTuple(tuple,	/* tuple to store */
+					   slot,	/* slot to store in */
+					   scandesc->rs_cbuf,		/* buffer associated with this
+												 * tuple */
+					   false);	/* don't pfree this pointer */
+	else
+		ExecClearTuple(slot);
+
+	return slot;
+}
+
+/*
+ * PartialSeqRecheck -- access method routine to recheck a tuple in EvalPlanQual
+ */
+static bool
+PartialSeqRecheck(PartialSeqScanState *node, TupleTableSlot *slot)
+{
+	/*
+	 * Note that unlike IndexScan, PartialSeqScan never uses keys in
+	 * heap_beginscan (and this is very bad), so there are no scan keys
+	 * to recheck here.
+	 */
+	return true;
+}
+
+/* ----------------------------------------------------------------
+ *		InitPartialScanRelation
+ *
+ *		Set up to access the scan relation.
+ * ----------------------------------------------------------------
+ */
+static void
+InitPartialScanRelation(PartialSeqScanState *node, EState *estate, int eflags)
+{
+	Relation	currentRelation;
+	shm_toc    *toc;
+
+	/*
+	 * get the relation object id from the relid'th entry in the range table,
+	 * open that relation and acquire appropriate lock on it.
+	 */
+	currentRelation = ExecOpenScanRelation(estate,
+									  ((Scan *) node->ss.ps.plan)->scanrelid,
+										   eflags);
+
+	/*
+	 * Parallel scan descriptor is initialized and stored in dynamic shared
+	 * memory segment by master backend and parallel workers retrieve it from
+	 * shared memory.  We set 'toc' (place to lookup parallel scan descriptor)
+	 * as retrieved by attaching to dsm for parallel workers whereas master
+	 * backend stores it directly in partial scan state node after
+	 * initializing workers.
+	 */
+	toc = GetParallelShmToc();
+	if (toc)
+		node->ss.ps.toc = toc;
+
+	node->ss.ss_currentRelation = currentRelation;
+
+	/* and report the scan tuple slot's rowtype */
+	ExecAssignScanType(&node->ss, RelationGetDescr(currentRelation));
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitPartialSeqScan
+ * ----------------------------------------------------------------
+ */
+PartialSeqScanState *
+ExecInitPartialSeqScan(PartialSeqScan *node, EState *estate, int eflags)
+{
+	PartialSeqScanState *scanstate;
+
+	/*
+	 * Once upon a time it was possible to have an outerPlan of a SeqScan, but
+	 * not any more.
+	 */
+	Assert(outerPlan(node) == NULL);
+	Assert(innerPlan(node) == NULL);
+
+	/*
+	 * create state structure
+	 */
+	scanstate = makeNode(PartialSeqScanState);
+	scanstate->ss.ps.plan = (Plan *) node;
+	scanstate->ss.ps.state = estate;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * create expression context for node
+	 */
+	ExecAssignExprContext(estate, &scanstate->ss.ps);
+
+	/*
+	 * initialize child expressions
+	 */
+	scanstate->ss.ps.targetlist = (List *)
+		ExecInitExpr((Expr *) node->plan.targetlist,
+					 (PlanState *) scanstate);
+	scanstate->ss.ps.qual = (List *)
+		ExecInitExpr((Expr *) node->plan.qual,
+					 (PlanState *) scanstate);
+
+	/*
+	 * tuple table initialization
+	 */
+	ExecInitResultTupleSlot(estate, &scanstate->ss.ps);
+	ExecInitScanTupleSlot(estate, &scanstate->ss);
+
+	/*
+	 * initialize scan relation
+	 */
+	InitPartialScanRelation(scanstate, estate, eflags);
+
+	scanstate->ss.ps.ps_TupFromTlist = false;
+
+	/*
+	 * Initialize result tuple type and projection info.
+	 */
+	ExecAssignResultTypeFromTL(&scanstate->ss.ps);
+	ExecAssignScanProjectionInfo(&scanstate->ss);
+
+	return scanstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecPartialSeqScan(node)
+ *
+ *		Scans the relation and returns the next qualifying tuple.
+ *		We call the ExecScan() routine and pass it the appropriate
+ *		access method functions.
+ * ----------------------------------------------------------------
+ */
+TupleTableSlot *
+ExecPartialSeqScan(PartialSeqScanState *node)
+{
+	/*
+	 * Initialize the scan on first execution.  Normally we initialize it
+	 * during the ExecutorStart phase; however, this node needs the
+	 * ParallelHeapScanDesc to initialize the scan, and that is set up by
+	 * the Gather node during the ExecutorRun phase.
+	 */
+	if (!node->scan_initialized)
+	{
+		ParallelHeapScanDesc pscan;
+
+		/*
+		 * Parallel scan descriptor is initialized and stored in dynamic
+		 * shared memory segment by master backend, parallel workers and local
+		 * scan by master backend retrieve it from shared memory.  If the scan
+		 * descriptor is available on first execution, then we need to
+		 * re-initialize for rescan.
+		 */
+		Assert(node->ss.ps.toc);
+
+		pscan = shm_toc_lookup(node->ss.ps.toc, PARALLEL_KEY_SCAN);
+
+		if (!node->ss.ss_currentScanDesc)
+		{
+			node->ss.ss_currentScanDesc =
+				heap_beginscan_parallel(node->ss.ss_currentRelation, pscan);
+		}
+		else
+		{
+			heap_parallel_rescan(pscan, node->ss.ss_currentScanDesc);
+		}
+
+		node->scan_initialized = true;
+	}
+
+	return ExecScan((ScanState *) node,
+					(ExecScanAccessMtd) PartialSeqNext,
+					(ExecScanRecheckMtd) PartialSeqRecheck);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndPartialSeqScan
+ *
+ *		frees any storage allocated through C routines.
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndPartialSeqScan(PartialSeqScanState *node)
+{
+	Relation	relation;
+	HeapScanDesc scanDesc;
+
+	/*
+	 * get information from node
+	 */
+	relation = node->ss.ss_currentRelation;
+	scanDesc = node->ss.ss_currentScanDesc;
+
+	/*
+	 * Free the exprcontext
+	 */
+	ExecFreeExprContext(&node->ss.ps);
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+
+	/*
+	 * close heap scan
+	 */
+	if (scanDesc)
+		heap_endscan(scanDesc);
+
+	/*
+	 * close the heap relation.
+	 */
+	ExecCloseScanRelation(relation);
+}
+
+/* ----------------------------------------------------------------
+ *						Join Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecReScanPartialSeqScan
+ *
+ *		Rescans the relation.
+ * ----------------------------------------------------------------
+ */
+void
+ExecReScanPartialSeqScan(PartialSeqScanState *node)
+{
+	if (node->scan_initialized)
+		node->scan_initialized = false;
+
+	ExecScanReScan((ScanState *) node);
+}
diff --git a/src/backend/executor/nodeResult.c b/src/backend/executor/nodeResult.c
index 8d3dde0..b348bfd 100644
--- a/src/backend/executor/nodeResult.c
+++ b/src/backend/executor/nodeResult.c
@@ -75,6 +75,13 @@ ExecResult(ResultState *node)
 	econtext = node->ps.ps_ExprContext;
 
 	/*
+	 * Result node can be added as a gating node on top of PartialSeqScan
+	 * node, so we need to propagate the toc information to the outer node.
+	 */
+	if (node->ps.toc)
+		outerPlanState(node)->toc = node->ps.toc;
+
+	/*
 	 * check constant qualifications like (2 > 1), if not already done
 	 */
 	if (node->rs_checkqual)
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 88dc085..c1b21b4 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -405,6 +405,22 @@ _copySampleScan(const SampleScan *from)
 }
 
 /*
+ * _copyPartialSeqScan
+ */
+static PartialSeqScan *
+_copyPartialSeqScan(const PartialSeqScan *from)
+{
+	PartialSeqScan    *newnode = makeNode(PartialSeqScan);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopyScanFields((const Scan *) from, (Scan *) newnode);
+
+	return newnode;
+}
+
+/*
  * _copyIndexScan
  */
 static IndexScan *
@@ -4257,6 +4273,9 @@ copyObject(const void *from)
 		case T_Scan:
 			retval = _copyScan(from);
 			break;
+		case T_PartialSeqScan:
+			retval = _copyPartialSeqScan(from);
+			break;
 		case T_Gather:
 			retval = _copyGather(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 4645ecb..c69bf08 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -433,6 +433,14 @@ _outBitmapOr(StringInfo str, const BitmapOr *node)
 }
 
 static void
+_outPartialSeqScan(StringInfo str, const PartialSeqScan *node)
+{
+	WRITE_NODE_TYPE("PARTIALSEQSCAN");
+
+	_outScanInfo(str, (const Scan *) node);
+}
+
+static void
 _outGather(StringInfo str, const Gather *node)
 {
 	WRITE_NODE_TYPE("GATHER");
@@ -3011,6 +3019,9 @@ _outNode(StringInfo str, const void *obj)
 			case T_BitmapOr:
 				_outBitmapOr(str, obj);
 				break;
+			case T_PartialSeqScan:
+				_outPartialSeqScan(str, obj);
+				break;
 			case T_Gather:
 				_outGather(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 72368ab..3eb5a48 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1622,6 +1622,19 @@ _readSampleScan(void)
 }
 
 /*
+ * _readPartialSeqScan
+ */
+static PartialSeqScan *
+_readPartialSeqScan(void)
+{
+	READ_LOCALS_NO_FIELDS(PartialSeqScan);
+
+	ReadCommonScan(local_node);
+
+	READ_DONE();
+}
+
+/*
  * _readIndexScan
  */
 static IndexScan *
@@ -2323,6 +2336,8 @@ parseNodeString(void)
 		return_value = _readSeqScan();
 	else if (MATCH("SAMPLESCAN", 10))
 		return_value = _readSampleScan();
+	else if (MATCH("PARTIALSEQSCAN", 14))
+		return_value = _readPartialSeqScan();
 	else if (MATCH("INDEXSCAN", 9))
 		return_value = _readIndexScan();
 	else if (MATCH("INDEXONLYSCAN", 13))
diff --git a/src/backend/optimizer/path/Makefile b/src/backend/optimizer/path/Makefile
index 6864a62..6e462b1 100644
--- a/src/backend/optimizer/path/Makefile
+++ b/src/backend/optimizer/path/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = allpaths.o clausesel.o costsize.o equivclass.o indxpath.o \
-       joinpath.o joinrels.o pathkeys.o tidpath.o
+       joinpath.o joinrels.o pathkeys.o parallelpath.o tidpath.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 8fc1cfd..c2ae95d 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -477,6 +477,9 @@ set_plain_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
 	/* Consider sequential scan */
 	add_path(rel, create_seqscan_path(root, rel, required_outer));
 
+	/* Consider parallel scans */
+	create_parallelscan_paths(root, rel, required_outer);
+
 	/* Consider index scans */
 	create_index_paths(root, rel);
 
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 1b61fd9..9b0bb8c 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -296,6 +296,50 @@ cost_samplescan(Path *path, PlannerInfo *root,
 }
 
 /*
+ * cost_partialseqscan
+ *	  Determines and returns the cost of scanning a relation partially.
+ *
+ * 'baserel' is the relation to be scanned
+ * 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
+ * 'nworkers' is the number of workers among which the work will be
+ *			distributed
+ */
+void
+cost_partialseqscan(Path *path, PlannerInfo *root,
+					RelOptInfo *baserel, ParamPathInfo *param_info,
+					int nworkers)
+{
+	Cost		startup_cost = 0;
+	Cost		run_cost = 0;
+
+	cost_seqscan(path, root, baserel, param_info);
+
+	startup_cost = path->startup_cost;
+
+	run_cost = path->total_cost - startup_cost;
+
+	/*
+	 * Account for small cost for communication related to scan
+	 * via the ParallelHeapScanDesc.
+	 */
+	run_cost += 0.01;
+
+	/*
+	 * Runtime cost will be equally shared by all workers.
+	 * The assumption here is that disk access cost will also be
+	 * equally shared between workers, which is generally true
+	 * unless there are too many workers working on a relatively
+	 * small number of blocks.  If we come across any such case,
+	 * then we can think of changing the current cost model for
+	 * partial sequential scan.
+	 */
+	run_cost = run_cost / (nworkers + 1);
+
+	path->startup_cost = startup_cost;
+	path->total_cost = startup_cost + run_cost;
+}
+
+/*
  * cost_gather
  *	  Determines and returns the cost of gather path.
  *
diff --git a/src/backend/optimizer/path/parallelpath.c b/src/backend/optimizer/path/parallelpath.c
new file mode 100644
index 0000000..7dce2a1
--- /dev/null
+++ b/src/backend/optimizer/path/parallelpath.c
@@ -0,0 +1,90 @@
+/*-------------------------------------------------------------------------
+ *
+ * parallelpath.c
+ *	  Routines to determine parallel paths for scanning a given relation.
+ *
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/optimizer/path/parallelpath.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heapam.h"
+#include "optimizer/cost.h"
+#include "optimizer/pathnode.h"
+#include "optimizer/paths.h"
+#include "parser/parsetree.h"
+#include "utils/rel.h"
+
+
+/*
+ * create_parallelscan_paths
+ *	  Create paths corresponding to parallel scans of the given rel.
+ *	  Currently we only support partial sequential scan.
+ *
+ *	  Candidate paths are added to the rel's pathlist (using add_path).
+ */
+void
+create_parallelscan_paths(PlannerInfo *root, RelOptInfo *rel,
+						  Relids required_outer)
+{
+	int			num_parallel_workers = 0;
+	int			estimated_parallel_workers = 0;
+	Oid			reloid;
+	Relation	relation;
+	Path	   *subpath;
+
+	/*
+	 * parallel scan is possible only if user has set max_parallel_degree
+	 * to a value greater than 0 and the query is parallel-safe.
+	 */
+	if (max_parallel_degree <= 0 || !root->glob->parallelModeOK)
+		return;
+
+	/*
+	 * There should be at least a thousand pages to scan for each worker. This
+	 * number is somewhat arbitrary; however, we don't want to spawn workers
+	 * to scan smaller relations as that will be costly.
+	 */
+	estimated_parallel_workers = rel->pages / 1000;
+
+	if (estimated_parallel_workers <= 0)
+		return;
+
+	reloid = planner_rt_fetch(rel->relid, root)->relid;
+
+	relation = heap_open(reloid, NoLock);
+
+	/*
+	 * Temporary relations can't be scanned by parallel workers as they are
+	 * visible only to local sessions.
+	 */
+	if (RelationUsesLocalBuffers(relation))
+	{
+		heap_close(relation, NoLock);
+		return;
+	}
+
+	heap_close(relation, NoLock);
+
+	num_parallel_workers = Min(max_parallel_degree,
+							   estimated_parallel_workers);
+
+	/*
+	 * Create the partial scan path which each worker backend needs to
+	 * execute.
+	 */
+	subpath = create_partialseqscan_path(root, rel, required_outer,
+										 num_parallel_workers);
+
+	/* Create the gather path which master backend needs to execute. */
+	add_path(rel, (Path *) create_gather_path(root, rel, subpath,
+											  required_outer,
+											  num_parallel_workers));
+}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 0ee7392..a334155 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -60,6 +60,8 @@ static SeqScan *create_seqscan_plan(PlannerInfo *root, Path *best_path,
 					List *tlist, List *scan_clauses);
 static SampleScan *create_samplescan_plan(PlannerInfo *root, Path *best_path,
 					   List *tlist, List *scan_clauses);
+static Scan *create_partialseqscan_plan(PlannerInfo *root, Path *best_path,
+						List *tlist, List *scan_clauses);
 static Gather *create_gather_plan(PlannerInfo *root,
 				   GatherPath *best_path);
 static Scan *create_indexscan_plan(PlannerInfo *root, IndexPath *best_path,
@@ -106,6 +108,8 @@ static void copy_plan_costsize(Plan *dest, Plan *src);
 static SeqScan *make_seqscan(List *qptlist, List *qpqual, Index scanrelid);
 static SampleScan *make_samplescan(List *qptlist, List *qpqual, Index scanrelid,
 				TableSampleClause *tsc);
+static PartialSeqScan *make_partialseqscan(List *qptlist, List *qpqual,
+									Index scanrelid);
 static Gather *make_gather(List *qptlist, List *qpqual,
 			int nworkers, bool single_copy, Plan *subplan);
 static IndexScan *make_indexscan(List *qptlist, List *qpqual, Index scanrelid,
@@ -238,6 +242,7 @@ create_plan_recurse(PlannerInfo *root, Path *best_path)
 	{
 		case T_SeqScan:
 		case T_SampleScan:
+		case T_PartialSeqScan:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -364,6 +369,13 @@ create_scan_plan(PlannerInfo *root, Path *best_path)
 												   scan_clauses);
 			break;
 
+		case T_PartialSeqScan:
+			plan = (Plan *) create_partialseqscan_plan(root,
+													   best_path,
+													   tlist,
+													   scan_clauses);
+			break;
+
 		case T_IndexScan:
 			plan = (Plan *) create_indexscan_plan(root,
 												  (IndexPath *) best_path,
@@ -568,6 +580,7 @@ disuse_physical_tlist(PlannerInfo *root, Plan *plan, Path *path)
 	{
 		case T_SeqScan:
 		case T_SampleScan:
+		case T_PartialSeqScan:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -1110,6 +1123,46 @@ create_unique_plan(PlannerInfo *root, UniquePath *best_path)
 }
 
 /*
+ * create_partialseqscan_plan
+ *
+ * Returns a partial seqscan plan for the base relation scanned by
+ * 'best_path' with restriction clauses 'scan_clauses' and targetlist
+ * 'tlist'.
+ */
+static Scan *
+create_partialseqscan_plan(PlannerInfo *root, Path *best_path,
+						   List *tlist, List *scan_clauses)
+{
+	Scan		*scan_plan;
+	Index		scan_relid = best_path->parent->relid;
+
+	/* it should be a base rel... */
+	Assert(scan_relid > 0);
+	Assert(best_path->parent->rtekind == RTE_RELATION);
+
+	/* Sort clauses into best execution order */
+	scan_clauses = order_qual_clauses(root, scan_clauses);
+
+	/* Reduce RestrictInfo list to bare expressions; ignore pseudoconstants */
+	scan_clauses = extract_actual_clauses(scan_clauses, false);
+
+	/* Replace any outer-relation variables with nestloop params */
+	if (best_path->param_info)
+	{
+		scan_clauses = (List *)
+			replace_nestloop_params(root, (Node *) scan_clauses);
+	}
+
+	scan_plan = (Scan *) make_partialseqscan(tlist,
+											 scan_clauses,
+											 scan_relid);
+
+	copy_path_costsize(&scan_plan->plan, best_path);
+
+	return scan_plan;
+}
+
+/*
  * create_gather_plan
  *
  *	  Create a Gather plan for 'best_path' and (recursively) plans
@@ -4771,6 +4824,24 @@ make_unique(Plan *lefttree, List *distinctList)
 	return node;
 }
 
+static PartialSeqScan *
+make_partialseqscan(List *qptlist,
+					List *qpqual,
+					Index scanrelid)
+{
+	PartialSeqScan *node = makeNode(PartialSeqScan);
+	Plan	   *plan = &node->plan;
+
+	/* cost should be inserted by caller */
+	plan->targetlist = qptlist;
+	plan->qual = qpqual;
+	plan->lefttree = NULL;
+	plan->righttree = NULL;
+	node->scanrelid = scanrelid;
+
+	return node;
+}
+
 static Gather *
 make_gather(List *qptlist,
 			List *qpqual,
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index b1cede2..6a82d4b 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -447,6 +447,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
 			{
 				SeqScan    *splan = (SeqScan *) plan;
 
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 6b32f85..b85e4f6 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2234,6 +2234,7 @@ finalize_plan(PlannerInfo *root, Plan *plan, Bitmapset *valid_params,
 			break;
 
 		case T_SeqScan:
+		case T_PartialSeqScan:
 			context.paramids = bms_add_members(context.paramids, scan_params);
 			break;
 
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 1895a68..68863b9 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1308,6 +1308,28 @@ create_unique_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 }
 
 /*
+ * create_partialseqscan_path
+ *	  Creates a path corresponding to a partial sequential scan, returning the
+ *	  pathnode.
+ */
+Path *
+create_partialseqscan_path(PlannerInfo *root, RelOptInfo *rel,
+						   Relids required_outer, int nworkers)
+{
+	Path	   *pathnode = makeNode(Path);
+
+	pathnode->pathtype = T_PartialSeqScan;
+	pathnode->parent = rel;
+	pathnode->param_info = get_baserel_parampathinfo(root, rel,
+													 required_outer);
+	pathnode->pathkeys = NIL;	/* partialseqscan has unordered result */
+
+	cost_partialseqscan(pathnode, root, rel, pathnode->param_info, nworkers);
+
+	return pathnode;
+}
+
+/*
  * create_gather_path
  *
  *	  Creates a path corresponding to a gather scan, returning the
diff --git a/src/include/executor/execParallel.h b/src/include/executor/execParallel.h
index 4fc797a..d171285 100644
--- a/src/include/executor/execParallel.h
+++ b/src/include/executor/execParallel.h
@@ -18,6 +18,8 @@
 #include "nodes/parsenodes.h"
 #include "nodes/plannodes.h"
 
+#define PARALLEL_KEY_SCAN				UINT64CONST(0xE000000000000006)
+
 typedef struct SharedExecutorInstrumentation SharedExecutorInstrumentation;
 
 typedef struct ParallelExecutorInfo
@@ -32,5 +34,6 @@ typedef struct ParallelExecutorInfo
 extern ParallelExecutorInfo *ExecInitParallelPlan(PlanState *planstate,
 					 EState *estate, int nworkers);
 extern void ExecParallelFinish(ParallelExecutorInfo *pei);
+extern shm_toc *GetParallelShmToc(void);
 
 #endif   /* EXECPARALLEL_H */
diff --git a/src/include/executor/nodePartialSeqscan.h b/src/include/executor/nodePartialSeqscan.h
new file mode 100644
index 0000000..cec09ad
--- /dev/null
+++ b/src/include/executor/nodePartialSeqscan.h
@@ -0,0 +1,25 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodePartialSeqscan.h
+ *		prototypes for nodePartialSeqscan.c
+ *
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodePartialSeqscan.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEPARTIALSEQSCAN_H
+#define NODEPARTIALSEQSCAN_H
+
+#include "nodes/execnodes.h"
+
+extern PartialSeqScanState *ExecInitPartialSeqScan(PartialSeqScan *node,
+					   EState *estate, int eflags);
+extern TupleTableSlot *ExecPartialSeqScan(PartialSeqScanState *node);
+extern void ExecEndPartialSeqScan(PartialSeqScanState *node);
+extern void ExecReScanPartialSeqScan(PartialSeqScanState *node);
+
+#endif   /* NODEPARTIALSEQSCAN_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index b6895f9..9ff8216 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -16,6 +16,7 @@
 
 #include "access/genam.h"
 #include "access/heapam.h"
+#include "access/parallel.h"
 #include "executor/instrument.h"
 #include "lib/pairingheap.h"
 #include "nodes/params.h"
@@ -1049,6 +1050,13 @@ typedef struct PlanState
 	Bitmapset  *chgParam;		/* set of IDs of changed Params */
 
 	/*
+	 * At execution time, parallel scan descriptor is initialized and stored
+	 * in dynamic shared memory segment by master backend and parallel workers
+	 * retrieve it from shared memory.
+	 */
+	shm_toc    *toc;
+
+	/*
 	 * Other run-time state needed by most if not all node types.
 	 */
 	TupleTableSlot *ps_ResultTupleSlot; /* slot for my result tuples */
@@ -1273,6 +1281,17 @@ typedef struct SampleScanState
 } SampleScanState;
 
 /*
+ * PartialSeqScanState extends ScanState by storing additional information
+ * related to scan.
+ */
+typedef struct PartialSeqScanState
+{
+	ScanState		ss;				/* its first field is NodeTag */
+	bool			scan_initialized; /* used to determine if the scan is initialized */
+} PartialSeqScanState;
+
+
+/*
  * These structs store information about index quals that don't have simple
  * constant right-hand sides.  See comments for ExecIndexBuildScanKeys()
  * for discussion.
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 94bdb7c..d1feab2 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -52,6 +52,7 @@ typedef enum NodeTag
 	T_Scan,
 	T_SeqScan,
 	T_SampleScan,
+	T_PartialSeqScan,
 	T_IndexScan,
 	T_IndexOnlyScan,
 	T_BitmapIndexScan,
@@ -100,6 +101,7 @@ typedef enum NodeTag
 	T_ScanState,
 	T_SeqScanState,
 	T_SampleScanState,
+	T_PartialSeqScanState,
 	T_IndexScanState,
 	T_IndexOnlyScanState,
 	T_BitmapIndexScanState,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 1f9213c..1e91ce8 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -287,6 +287,12 @@ typedef struct Scan
 typedef Scan SeqScan;
 
 /* ----------------
+ *		partial sequential scan node
+ * ----------------
+ */
+typedef SeqScan PartialSeqScan;
+
+/* ----------------
  *		table sample scan node
  * ----------------
  */
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 25a7303..f08241a 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -75,6 +75,9 @@ extern void cost_seqscan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 			 ParamPathInfo *param_info);
 extern void cost_samplescan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 				ParamPathInfo *param_info);
+extern void cost_partialseqscan(Path *path, PlannerInfo *root,
+						RelOptInfo *baserel, ParamPathInfo *param_info,
+						int nworkers);
 extern void cost_index(IndexPath *path, PlannerInfo *root,
 		   double loop_count);
 extern void cost_bitmap_heap_scan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 7a4940c..cb8ce98 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -34,6 +34,8 @@ extern Path *create_seqscan_path(PlannerInfo *root, RelOptInfo *rel,
 					Relids required_outer);
 extern Path *create_samplescan_path(PlannerInfo *root, RelOptInfo *rel,
 					   Relids required_outer);
+extern Path *create_partialseqscan_path(PlannerInfo *root, RelOptInfo *rel,
+						Relids required_outer, int nworkers);
 extern IndexPath *create_index_path(PlannerInfo *root,
 				  IndexOptInfo *index,
 				  List *indexclauses,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 87123a5..e7db9ab 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -55,6 +55,14 @@ extern void debug_print_rel(PlannerInfo *root, RelOptInfo *rel);
 #endif
 
 /*
+ * parallelpath.c
+ *	  routines to generate parallel scan paths
+ */
+
+extern void create_parallelscan_paths(PlannerInfo *root, RelOptInfo *rel,
+								Relids required_outer);
+
+/*
  * indxpath.c
  *	  routines to generate index paths
  */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index feb821b..5492ba0 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1199,6 +1199,8 @@ OverrideStackEntry
 PACE_HEADER
 PACL
 ParallelExecutorInfo
+PartialSeqScan
+PartialSeqScanState
 PATH
 PBOOL
 PCtxtHandle
#374Robert Haas
robertmhaas@gmail.com
In reply to: Kouhei Kaigai (#372)
Re: Parallel Seq Scan

On Thu, Oct 1, 2015 at 2:35 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

The Gather node was overlooked in readfuncs.c, even though it should
never actually be serialized.
Also, outfuncs.c used an incompatible WRITE_xxx_FIELD() macro for it.

The attached patch fixes both inconsistencies.

Thanks. You missed READ_DONE() but fortunately my compiler noticed
that oversight. Committed with that fix.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#375Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#373)
Re: Parallel Seq Scan

On Thu, Oct 1, 2015 at 7:52 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Sep 30, 2015 at 7:05 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Sep 24, 2015 at 2:31 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

[ parallel_seqscan_partialseqscan_v18.patch ]

I spent a bit of time reviewing the heapam.c changes in this patch
this evening, and I think that your attempt to add support for
synchronized scans has some problems.

Thanks for the review and I agree with all the suggestions provided
by you. Fixed all of them in attached patch
(parallel_seqscan_heapscan_v1.patch).

Thanks.

Does heap_parallelscan_nextpage really need a pscan_finished output
parameter, or can it just return InvalidBlockNumber to indicate end of
scan? I think the latter can be done and would be cleaner.
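
To illustrate the sentinel-return style suggested above, here is a minimal sketch. It is not the patch's code: DemoParallelScan and demo_nextpage are hypothetical names, the fields are simplified assumptions, and the locking that real shared state would need is left out.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/* Hypothetical, simplified stand-in for the shared parallel scan state. */
typedef struct DemoParallelScan
{
    BlockNumber nblocks;     /* total blocks in the relation */
    BlockNumber startblock;  /* block at which the scan began */
    BlockNumber nallocated;  /* number of blocks handed out so far */
} DemoParallelScan;

/*
 * Hand out the next block to scan, wrapping around at the end of the
 * relation.  Returns InvalidBlockNumber once every block has been handed
 * out, so no separate "finished" output parameter is needed.
 * (A real implementation would hold a lock around this update.)
 */
BlockNumber
demo_nextpage(DemoParallelScan *pscan)
{
    BlockNumber page;

    assert(pscan != 0);
    if (pscan->nallocated >= pscan->nblocks)
        return InvalidBlockNumber;

    /* 64-bit arithmetic avoids overflow for very large relations */
    page = (BlockNumber) (((uint64_t) pscan->startblock +
                           pscan->nallocated) % pscan->nblocks);
    pscan->nallocated++;
    return page;
}
```

A caller would then loop until the function returns InvalidBlockNumber, which is the same test it would otherwise make on the output flag.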

There doesn't seem to be anything that ensures that everybody who's
scanning the relation agrees on whether we're doing a synchronized
scan. I think that heap_parallelscan_initialize() should take an
additional Boolean argument, allow_sync. It should decide whether to
actually perform a syncscan using the logic from initscan(), and then
it should store a phs_syncscan flag into the ParallelHeapScanDesc.
heap_beginscan_internal should set rs_syncscan based on phs_syncscan,
regardless of anything else.
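
As a rough sketch of that idea, the decision could be made once when the shared descriptor is set up and then copied by every participant. The type and function names below are hypothetical, and the size test (relation larger than a quarter of the buffer pool) is an assumption modelled on initscan(), not code from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical shared descriptor: holds the one authoritative decision. */
typedef struct DemoParallelDesc
{
    uint32_t phs_nblocks;
    bool     phs_syncscan;
} DemoParallelDesc;

/* Hypothetical per-backend scan descriptor. */
typedef struct DemoScanDesc
{
    bool rs_syncscan;
} DemoScanDesc;

/*
 * Decide exactly once, in the backend that sets up the shared state,
 * whether this parallel scan is synchronized.
 */
void
demo_parallelscan_initialize(DemoParallelDesc *pdesc, uint32_t nblocks,
                             uint32_t nbuffers, bool allow_sync,
                             bool guc_synchronize_seqscans)
{
    assert(pdesc != 0);
    pdesc->phs_nblocks = nblocks;
    pdesc->phs_syncscan = allow_sync && guc_synchronize_seqscans &&
        nblocks > nbuffers / 4;
}

/*
 * Every participant copies the shared flag and never re-evaluates local
 * settings, so all participants agree by construction.
 */
void
demo_beginscan(DemoScanDesc *scan, const DemoParallelDesc *pdesc)
{
    assert(scan != 0 && pdesc != 0);
    scan->rs_syncscan = pdesc->phs_syncscan;
}
```

With this shape, a worker whose local settings differ from the leader's still ends up with the leader's decision.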

I think heap_parallel_rescan() is an unworkable API. When rescanning
a synchronized scan, the existing logic keeps the original start-block
so that the scan's results are reproducible, but no longer reports the
scan position since we're presumably out of step with other backends.
This isn't going to work at all with this API. I don't think you can
swap out the ParallelHeapScanDesc for another one once you've
installed it; the point of a rescan is that you are letting the
HeapScanDesc (or ParallelHeapScanDesc) hold onto some state from the
first time, and that doesn't work at all here. So, I think this
function should just go away, and callers should be able to just use
heap_rescan().

Now this presents a bit of a problem for PartialSeqScan, because, on a
rescan, nodeGather.c completely destroys the DSM and creates a new
one. I think we're going to need to change that. I think we can
adapt the parallel context machinery so that after
WaitForParallelWorkersToFinish(), you can again call
LaunchParallelWorkers(). (That might already work, but I wouldn't be
surprised if it doesn't quite work.) This would make rescans somewhat
more efficient because we wouldn't have to destroy and re-create the
DSM each time. It means that the DSM would have to stick around until
we're totally done with the query, rather than going away when
ExecGather() returns the last tuple, but that doesn't sound too bad.
We can still clean up the workers when we've returned all the tuples,
which I think is the most important thing.

This is obviously going to present some design complications for the
as-yet-uncommitted code to push down PARAM_EXEC parameters, because if
the new value takes more bytes to store than the old value, there
won't be room to update the existing DSM in place. There are several
possible solutions to that problem, but the one that appeals to me
most right now is just don't generate plans that would require that
feature. It doesn't seem particularly appealing to me to put a Gather
node on the inner side of a NestLoop -- at least not until we can
execute that without restarting workers, which we're certainly some
distance from today. And maybe not even then. For initPlans, the
existing code might be adequate, because I think we never re-evaluate
those. And for subPlans, we can potentially handle cases where each
worker can evaluate the subPlan separately below the Gather; we just
can't handle cases where the subPlan attaches above the Gather and is
used below it. Or, we can get around these limitations by redesigning
the PARAM_EXEC pushdown mechanism in some way. But even if we don't,
it's not crippling.
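
The in-place update problem can be pictured with a small sketch. DemoParamSlot and demo_update_param_in_place are hypothetical names and the fixed 16-byte slot is an assumption for illustration only; the actual PARAM_EXEC pushdown code is not shown in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical fixed-size slot reserved in shared memory for one value. */
typedef struct DemoParamSlot
{
    size_t capacity;   /* bytes reserved when the segment was created */
    size_t len;        /* bytes currently in use */
    char   data[16];
} DemoParamSlot;

/*
 * Try to overwrite the stored value in place.  Returns false, leaving the
 * slot untouched, when the new value needs more room than was reserved;
 * that is the case in which an existing segment cannot simply be reused.
 */
bool
demo_update_param_in_place(DemoParamSlot *slot, const void *value, size_t len)
{
    assert(slot != NULL && value != NULL);
    assert(slot->capacity <= sizeof(slot->data));

    if (len > slot->capacity)
        return false;

    memcpy(slot->data, value, len);
    slot->len = len;
    return true;
}
```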

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#376Kouhei Kaigai
kaigai@ak.jp.nec.com
In reply to: Robert Haas (#375)
Re: Parallel Seq Scan

On Thu, Oct 1, 2015 at 2:35 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

The Gather node was overlooked in readfuncs.c, even though it should
never actually be serialized.
Also, outfuncs.c used an incompatible WRITE_xxx_FIELD() macro for it.

The attached patch fixes both inconsistencies.

Thanks. You missed READ_DONE() but fortunately my compiler noticed
that oversight. Committed with that fix.

I found one other oddity, in explain.c.

    case T_Gather:
        {
            Gather *gather = (Gather *) plan;

            show_scan_qual(plan->qual, "Filter", planstate, ancestors, es);
            if (plan->qual)
                show_instrumentation_count("Rows Removed by Filter", 1,
                                           planstate, es);
            ExplainPropertyInteger("Number of Workers",
                                   gather->num_workers, es);
            if (gather->single_copy)
                ExplainPropertyText("Single Copy",
                                    gather->single_copy ? "true" : "false",
                                    es);
        }
        break;

What is the intention of the last if-check?
The single_copy is checked in the argument of ExplainPropertyText().

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


#377Robert Haas
robertmhaas@gmail.com
In reply to: Kouhei Kaigai (#376)
Re: Parallel Seq Scan

On Fri, Oct 2, 2015 at 4:27 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

On Thu, Oct 1, 2015 at 2:35 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

The Gather node was overlooked in readfuncs.c, even though it should
never actually be serialized.
Also, outfuncs.c used an incompatible WRITE_xxx_FIELD() macro for it.

The attached patch fixes both inconsistencies.

Thanks. You missed READ_DONE() but fortunately my compiler noticed
that oversight. Committed with that fix.

I found one other oddity, in explain.c.

    case T_Gather:
        {
            Gather *gather = (Gather *) plan;

            show_scan_qual(plan->qual, "Filter", planstate, ancestors, es);
            if (plan->qual)
                show_instrumentation_count("Rows Removed by Filter", 1,
                                           planstate, es);
            ExplainPropertyInteger("Number of Workers",
                                   gather->num_workers, es);
            if (gather->single_copy)
                ExplainPropertyText("Single Copy",
                                    gather->single_copy ? "true" : "false",
                                    es);
        }
        break;

What is the intention of the last if-check?
The single_copy is checked in the argument of ExplainPropertyText().

Oops, that was dumb. single_copy only makes sense if num_workers ==
1, so I intended the if-test to be based on num_workers, not
single_copy. Not sure if we should just make that change now or if
there's a better way to display it.

I'm sort of tempted to try to come up with a shorthand that only uses
one line in text mode - e.g. set pname to something like "Gather 3" if
there are 3 workers, "Gather 1" if there is one worker, or "Gather Single"
if there is one worker and we're in single_copy mode. In non-text
mode, of course, the properties should be displayed separately, for
machine parse-ability.

But maybe I'm getting ahead of myself and we should just change it to
if (gather->num_workers == 1) for now.
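
For what it's worth, the text-mode shorthand described above might look roughly like this. It is only a sketch: demo_gather_label is a hypothetical helper, not an existing explain.c function, and it covers only the label, not the separate properties that non-text formats would still emit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Build a one-line node label for text-mode output:
 *   "Gather Single"  when there is one worker in single-copy mode
 *   "Gather N"       otherwise, where N is the number of workers
 * Returns the length that snprintf reports for the label.
 */
int
demo_gather_label(char *buf, size_t buflen, int num_workers, bool single_copy)
{
    assert(buf != NULL && buflen > 0);

    if (num_workers == 1 && single_copy)
        return snprintf(buf, buflen, "Gather Single");

    return snprintf(buf, buflen, "Gather %d", num_workers);
}
```

Note that single_copy is only consulted when num_workers == 1, matching the observation that the flag makes no sense otherwise.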

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#378Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#375)
Re: Parallel Seq Scan

On Thu, Oct 1, 2015 at 7:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Does heap_parallelscan_nextpage really need a pscan_finished output
parameter, or can it just return InvalidBlockNumber to indicate end of
scan? I think the latter can be done and would be cleaner.

I think we can do that way as well, however we have to keep the check
of page == InvalidBlockNumber at all the callers to indicate finish
of scan which made me write the function as it exists in patch. However,
I don't mind changing it the way you have suggested if you feel that would
be cleaner.

There doesn't seem to be anything that ensures that everybody who's
scanning the relation agrees on whether we're doing a synchronized
scan.

I think that is implicitly ensured. We perform sync scans based
on the GUC synchronize_seqscans and a few other conditions in initscan,
which I think will ensure that all workers perform sync scans
on the relation. Do you see any problem with the existing logic that would
break the guarantee of sync scans on a relation among parallel
workers?

I think that heap_parallelscan_initialize() should take an
additional Boolean argument, allow_sync. It should decide whether to
actually perform a syncscan using the logic from initscan(), and then
it should store a phs_syncscan flag into the ParallelHeapScanDesc.
heap_beginscan_internal should set rs_syncscan based on phs_syncscan,
regardless of anything else.

I think this will ensure that any future change in this area won't break the
guarantee for sync scans for parallel workers, is that the reason you
prefer to implement this function in the way suggested by you?

I think heap_parallel_rescan() is an unworkable API. When rescanning
a synchronized scan, the existing logic keeps the original start-block
so that the scan's results are reproducible,

It seems from the code comments in initscan, the reason for keeping
previous startblock is to allow rewinding the cursor which doesn't hold for
parallel scan. We might or might not want to support such cases with
parallel query, but even if we want to there is a way we can do with
current logic (as mentioned in one of my replies below).

but no longer reports the
scan position since we're presumably out of step with other backends.

Is it true for all forms of rescan, or are you talking about some
case (like SampleScan) in particular? As per my reading of code
(and verified by debugging that code path), it doesn't seem to be true
for rescan in case of seqscan.

This isn't going to work at all with this API. I don't think you can
swap out the ParallelHeapScanDesc for another one once you've
installed it;

Sure, but if we really need some such parameters like startblock position,
then we can preserve those in ScanDesc.

This is obviously going to present some design complications for the
as-yet-uncommitted code to push down PARAM_EXEC parameters, because if
the new value takes more bytes to store than the old value, there
won't be room to update the existing DSM in place.

PARAM_EXEC parameters will be used to support initPlan in parallel query (as
it is done in the initial patch), so it seems to me this is the main
downside of this approach. I think rather than trying to come up with
various possible solutions for this problem, let's prohibit sync scans with
parallel query if you see some problem with the suggestions made by me
above. Do you see any main use case getting hit due to lack of support for
sync scans with parallel seq. scan?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#379Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#378)
Re: Parallel Seq Scan

On Fri, Oct 2, 2015 at 11:44 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Oct 1, 2015 at 7:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Does heap_parallelscan_nextpage really need a pscan_finished output
parameter, or can it just return InvalidBlockNumber to indicate end of
scan? I think the latter can be done and would be cleaner.

I think we can do that way as well, however we have to keep the check
of page == InvalidBlockNumber at all the callers to indicate finish
of scan which made me write the function as it exists in patch. However,
I don't mind changing it the way you have suggested if you feel that would
be cleaner.

I think it would. I mean, you just end up testing the other thing instead.

I think that heap_parallelscan_initialize() should take an
additional Boolean argument, allow_sync. It should decide whether to
actually perform a syncscan using the logic from initscan(), and then
it should store a phs_syncscan flag into the ParallelHeapScanDesc.
heap_beginscan_internal should set rs_syncscan based on phs_syncscan,
regardless of anything else.

I think this will ensure that any future change in this area won't break the
guarantee for sync scans for parallel workers, is that the reason you
prefer to implement this function in the way suggested by you?

Basically. It seems pretty fragile the way you have it now.

I think heap_parallel_rescan() is an unworkable API. When rescanning
a synchronized scan, the existing logic keeps the original start-block
so that the scan's results are reproducible,

It seems from the code comments in initscan, the reason for keeping
previous startblock is to allow rewinding the cursor which doesn't hold for
parallel scan. We might or might not want to support such cases with
parallel query, but even if we want to there is a way we can do with
current logic (as mentioned in one of my replies below).

You don't need to rewind a cursor; you just need to restart the scan.
So for example if we were on the inner side of a NestLoop, this would
be a real issue.
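
A small sketch of why the start block matters on a restart: if a rescan picked a fresh start block, the same scan could return rows in a different order each time it is restarted. The names below are hypothetical and the logic is a simplified assumption loosely following the shape of initscan(), not the actual function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified scan descriptor. */
typedef struct DemoScan
{
    uint32_t nblocks;     /* blocks in the relation, must be > 0 */
    uint32_t startblock;  /* block at which the scan begins */
    bool     syncscan;    /* report position to other scans? */
} DemoScan;

/*
 * (Re)initialize a scan.  sync_hint stands in for the position that
 * concurrent scans last reported.  On a rescan (keep_startblock = true)
 * the start block is deliberately left alone so that the restarted scan
 * visits blocks in the same order as the first pass.
 */
void
demo_initscan(DemoScan *scan, uint32_t sync_hint, bool allow_sync,
              bool keep_startblock)
{
    assert(scan != 0 && scan->nblocks > 0);

    if (keep_startblock)
        scan->syncscan = allow_sync;    /* start block untouched */
    else if (allow_sync)
    {
        scan->syncscan = true;
        scan->startblock = sync_hint % scan->nblocks;
    }
    else
    {
        scan->syncscan = false;
        scan->startblock = 0;
    }
}
```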

but no longer reports the
scan position since we're presumably out of step with other backends.

Is it true for all forms of rescan, or are you talking about some
case (like SampleScan) in particular? As per my reading of code
(and verified by debugging that code path), it doesn't seem to be true
for rescan in case of seqscan.

I think it is:

    if (keep_startblock)
    {
        /*
         * When rescanning, we want to keep the previous startblock setting,
         * so that rewinding a cursor doesn't generate surprising results.
         * Reset the active syncscan setting, though.
         */
        scan->rs_syncscan = (allow_sync && synchronize_seqscans);
    }

This isn't going to work at all with this API. I don't think you can
swap out the ParallelHeapScanDesc for another one once you've
installed it;

Sure, but if we really need some such parameters like startblock position,
then we can preserve those in ScanDesc.

Sure, we could transfer the information out of the
ParallelHeapScanDesc and then transfer it back into the new one, but I
have a hard time thinking that's a good design.

PARAM_EXEC parameters will be used to support initPlan in parallel query (as
it is done in the initial patch), so it seems to me this is the main
downside of this approach. I think rather than trying to come up with
various possible solutions for this problem, let's prohibit sync scans with
parallel query if you see some problem with the suggestions made by me
above. Do you see any main use case getting hit due to lack of support for
sync scans with parallel seq. scan?

Yes. Synchronized scans are particularly important with large tables,
and those are the kind you're likely to want to use a parallel
sequential scan on.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#380Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#379)
Re: Parallel Seq Scan

On Sat, Oct 3, 2015 at 11:35 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Oct 2, 2015 at 11:44 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Oct 1, 2015 at 7:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Does heap_parallelscan_nextpage really need a pscan_finished output
parameter, or can it just return InvalidBlockNumber to indicate end of
scan? I think the latter can be done and would be cleaner.

I think we can do that way as well, however we have to keep the check
of page == InvalidBlockNumber at all the callers to indicate finish
of scan which made me write the function as it exists in patch. However,
I don't mind changing it the way you have suggested if you feel that would
be cleaner.

I think it would. I mean, you just end up testing the other thing instead.

No issues, will change in next version of patch.

I think that heap_parallelscan_initialize() should take an
additional Boolean argument, allow_sync. It should decide whether to
actually perform a syncscan using the logic from initscan(), and then
it should store a phs_syncscan flag into the ParallelHeapScanDesc.
heap_beginscan_internal should set rs_syncscan based on phs_syncscan,
regardless of anything else.

I think this will ensure that any future change in this area won't break the
guarantee for sync scans for parallel workers, is that the reason you
prefer to implement this function in the way suggested by you?

Basically. It seems pretty fragile the way you have it now.

Okay, in that case I will change it as per your suggestion.

I think heap_parallel_rescan() is an unworkable API. When rescanning
a synchronized scan, the existing logic keeps the original start-block
so that the scan's results are reproducible,

It seems from the code comments in initscan, the reason for keeping
previous startblock is to allow rewinding the cursor which doesn't hold for
parallel scan. We might or might not want to support such cases with
parallel query, but even if we want to there is a way we can do with
current logic (as mentioned in one of my replies below).

You don't need to rewind a cursor; you just need to restart the scan.
So for example if we were on the inner side of a NestLoop, this would
be a real issue.

Sorry, but I am not able to see the exact issue you have in mind for
NestLoop if we don't preserve the start block position for a parallel scan.
The code under discussion was added by the commit below, which doesn't
indicate any such problem.

******
commit 61dd4185ffb034a22b4b40425d56fe37e7178488
Author: Tom Lane <tgl@sss.pgh.pa.us> Thu Jun 11 00:24:16 2009
Committer: Tom Lane <tgl@sss.pgh.pa.us> Thu Jun 11 00:24:16 2009

Keep rs_startblock the same during heap_rescan, so that a rescan of a SeqScan
node starts from the same place as the first scan did. This avoids surprising
behavior of scrollable and WITH HOLD cursors, as seen in Mark Kirkwood's bug
report of yesterday.

It's not entirely clear whether a rescan should be forced to drop out of the
syncscan mode, but for the moment I left the code behaving the same on that
point. Any change there would only be a performance and not a correctness
issue, anyway.
******

but no longer reports the
scan position since we're presumably out of step with other backends.

Is it true for all forms of rescan, or are you talking about some
case (like SampleScan) in particular? As per my reading of code
(and verified by debugging that code path), it doesn't seem to be true
for rescan in case of seqscan.

I think it is:

    if (keep_startblock)
    {
        /*
         * When rescanning, we want to keep the previous startblock setting,
         * so that rewinding a cursor doesn't generate surprising results.
         * Reset the active syncscan setting, though.
         */
        scan->rs_syncscan = (allow_sync && synchronize_seqscans);
    }

Okay, but this code doesn't indicate that the scan position will not be
reported. It just resets the flag based on which the scan position is
reported. The flag won't change during a rescan as compared to the first scan
unless the value of synchronize_seqscans is changed in between. So under
some special circumstances it may change, but not as a general rule.

This isn't going to work at all with this API. I don't think you can
swap out the ParallelHeapScanDesc for another one once you've
installed it;

Sure, but if we really need some such parameters like startblock position,
then we can preserve those in ScanDesc.

Sure, we could transfer the information out of the
ParallelHeapScanDesc and then transfer it back into the new one, but I
have a hard time thinking that's a good design.

I also don't think that is the best way to achieve it, but on the other hand
doing it the other way will pose some restrictions which don't matter much
for the current cases, but have the potential to pose similar restrictions
on supporting rescan for future operations (assume a case where the scan has
to be done by a worker based on some key which will be regenerated for each
rescan). I am not saying that we can't find some way to support the same,
but the need for doing so doesn't seem to be big enough.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#381Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#380)
2 attachment(s)
Re: Parallel Seq Scan

On Sun, Oct 4, 2015 at 10:21 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Sat, Oct 3, 2015 at 11:35 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Oct 2, 2015 at 11:44 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Oct 1, 2015 at 7:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Does heap_parallelscan_nextpage really need a pscan_finished output
parameter, or can it just return InvalidBlockNumber to indicate end of
scan? I think the latter can be done and would be cleaner.

I think we can do that way as well, however we have to keep the check
of page == InvalidBlockNumber at all the callers to indicate finish
of scan which made me write the function as it exists in patch. However,
I don't mind changing it the way you have suggested if you feel that would
be cleaner.

I think it would. I mean, you just end up testing the other thing instead.

No issues, will change in next version of patch.

Changed as per suggestion.

I think that heap_parallelscan_initialize() should take an
additional Boolean argument, allow_sync. It should decide whether to
actually perform a syncscan using the logic from initscan(), and then
it should store a phs_syncscan flag into the ParallelHeapScanDesc.
heap_beginscan_internal should set rs_syncscan based on phs_syncscan,
regardless of anything else.

I think this will ensure that any future change in this area won't break the
guarantee for sync scans for parallel workers, is that the reason you
prefer to implement this function in the way suggested by you?

Basically. It seems pretty fragile the way you have it now.

Okay, in that case I will change it as per your suggestion.

Changed as per suggestion.

I think heap_parallel_rescan() is an unworkable API. When rescanning
a synchronized scan, the existing logic keeps the original start-block
so that the scan's results are reproducible,

It seems from the code comments in initscan, the reason for keeping
previous startblock is to allow rewinding the cursor which doesn't hold for
parallel scan. We might or might not want to support such cases with
parallel query, but even if we want to there is a way we can do with
current logic (as mentioned in one of my replies below).

You don't need to rewind a cursor; you just need to restart the scan.
So for example if we were on the inner side of a NestLoop, this would
be a real issue.

Sorry, but I am not able to see the exact issue you have in mind for
NestLoop, if we don't preserve the start block position for parallel scan.

For now, I have fixed this by not preserving the startblock in case of rescan
for parallel scan. Note that I have created a separate patch
(parallel_seqscan_heaprescan_v1.patch) for support of rescan (for parallel
scan).

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

parallel_seqscan_heapscan_v2.patch (application/octet-stream)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index bcf9871..4e913bd 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -63,6 +63,7 @@
 #include "storage/predicate.h"
 #include "storage/procarray.h"
 #include "storage/smgr.h"
+#include "storage/spin.h"
 #include "storage/standby.h"
 #include "utils/datum.h"
 #include "utils/inval.h"
@@ -80,12 +81,15 @@ bool		synchronize_seqscans = true;
 static HeapScanDesc heap_beginscan_internal(Relation relation,
 						Snapshot snapshot,
 						int nkeys, ScanKey key,
+						ParallelHeapScanDesc parallel_scan,
 						bool allow_strat,
 						bool allow_sync,
 						bool allow_pagemode,
 						bool is_bitmapscan,
 						bool is_samplescan,
 						bool temp_snap);
+static BlockNumber heap_parallelscan_nextpage(HeapScanDesc scan);
+static void heap_parallelscan_initialize_startblock(HeapScanDesc scan);
 static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
 					TransactionId xid, CommandId cid, int options);
 static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
@@ -226,16 +230,21 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
 	 * results for a non-MVCC snapshot, the caller must hold some higher-level
 	 * lock that ensures the interesting tuple(s) won't change.)
 	 */
-	scan->rs_nblocks = RelationGetNumberOfBlocks(scan->rs_rd);
+	if (scan->rs_parallel != NULL)
+		scan->rs_nblocks = scan->rs_parallel->phs_nblocks;
+	else
+		scan->rs_nblocks = RelationGetNumberOfBlocks(scan->rs_rd);
 
 	/*
 	 * If the table is large relative to NBuffers, use a bulk-read access
-	 * strategy and enable synchronized scanning (see syncscan.c).  Although
-	 * the thresholds for these features could be different, we make them the
-	 * same so that there are only two behaviors to tune rather than four.
-	 * (However, some callers need to be able to disable one or both of these
-	 * behaviors, independently of the size of the table; also there is a GUC
-	 * variable that can disable synchronized scanning.)
+	 * strategy and enable synchronized scanning (see syncscan.c, if the
+	 * condition to enable sync syncscans is changed here, then do the same
+	 * change in heap_parallelscan_initialize()).  Although the thresholds for
+	 * these features could be different, we make them the same so that there
+	 * are only two behaviors to tune rather than four.  (However, some
+	 * callers need to be able to disable one or both of these behaviors,
+	 * independently of the size of the table; also there is a GUC variable
+	 * that can disable synchronized scanning.)
 	 *
 	 * During a rescan, don't make a new strategy object if we don't have to.
 	 */
@@ -272,7 +281,10 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
 	else if (allow_sync && synchronize_seqscans)
 	{
 		scan->rs_syncscan = true;
-		scan->rs_startblock = ss_get_location(scan->rs_rd, scan->rs_nblocks);
+		if (scan->rs_parallel != NULL)
+			heap_parallelscan_initialize_startblock(scan);
+		else
+			scan->rs_startblock = ss_get_location(scan->rs_rd, scan->rs_nblocks);
 	}
 	else
 	{
@@ -496,7 +508,25 @@ heapgettup(HeapScanDesc scan,
 				tuple->t_data = NULL;
 				return;
 			}
-			page = scan->rs_startblock; /* first page */
+			if (scan->rs_parallel != NULL)
+			{
+				page = heap_parallelscan_nextpage(scan);
+
+				/*
+				 * Return NULL if the scan is finished. It can so happen that
+				 * by the time one of workers started the scan, others have
+				 * already completed scanning the relation, so this worker
+				 * won't need to perform scan.
+				 */
+				if (page == InvalidBlockNumber)
+				{
+					Assert(!BufferIsValid(scan->rs_cbuf));
+					tuple->t_data = NULL;
+					return;
+				}
+			}
+			else
+				page = scan->rs_startblock;		/* first page */
 			heapgetpage(scan, page);
 			lineoff = FirstOffsetNumber;		/* first offnum */
 			scan->rs_inited = true;
@@ -519,6 +549,9 @@ heapgettup(HeapScanDesc scan,
 	}
 	else if (backward)
 	{
+		/* backward parallel scan not supported */
+		Assert(scan->rs_parallel == NULL);
+
 		if (!scan->rs_inited)
 		{
 			/*
@@ -671,11 +704,20 @@ heapgettup(HeapScanDesc scan,
 		}
 		else
 		{
-			page++;
-			if (page >= scan->rs_nblocks)
-				page = 0;
-			finished = (page == scan->rs_startblock) ||
-				(scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks == 0 : false);
+			if (scan->rs_parallel != NULL)
+			{
+				page = heap_parallelscan_nextpage(scan);
+				finished = (page == InvalidBlockNumber);
+			}
+			else
+			{
+				page++;
+				if (page >= scan->rs_nblocks)
+					page = 0;
+
+				finished = (page == scan->rs_startblock) ||
+					(scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
+			}
 
 			/*
 			 * Report our new scan position for synchronization purposes. We
@@ -773,7 +815,25 @@ heapgettup_pagemode(HeapScanDesc scan,
 				tuple->t_data = NULL;
 				return;
 			}
-			page = scan->rs_startblock; /* first page */
+			if (scan->rs_parallel != NULL)
+			{
+				page = heap_parallelscan_nextpage(scan);
+
+				/*
+				 * Return NULL if the scan is finished. It can so happen that
+				 * by the time one of workers started the scan, others have
+				 * already completed scanning the relation, so this worker
+				 * won't need to perform scan.
+				 */
+				if (page == InvalidBlockNumber)
+				{
+					Assert(!BufferIsValid(scan->rs_cbuf));
+					tuple->t_data = NULL;
+					return;
+				}
+			}
+			else
+				page = scan->rs_startblock;		/* first page */
 			heapgetpage(scan, page);
 			lineindex = 0;
 			scan->rs_inited = true;
@@ -793,6 +853,9 @@ heapgettup_pagemode(HeapScanDesc scan,
 	}
 	else if (backward)
 	{
+		/* backward parallel scan not supported */
+		Assert(scan->rs_parallel == NULL);
+
 		if (!scan->rs_inited)
 		{
 			/*
@@ -934,11 +997,20 @@ heapgettup_pagemode(HeapScanDesc scan,
 		}
 		else
 		{
-			page++;
-			if (page >= scan->rs_nblocks)
-				page = 0;
-			finished = (page == scan->rs_startblock) ||
-				(scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks == 0 : false);
+			if (scan->rs_parallel != NULL)
+			{
+				page = heap_parallelscan_nextpage(scan);
+				finished = (page == InvalidBlockNumber);
+			}
+			else
+			{
+				page++;
+				if (page >= scan->rs_nblocks)
+					page = 0;
+
+				finished = (page == scan->rs_startblock) ||
+					(scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
+			}
 
 			/*
 			 * Report our new scan position for synchronization purposes. We
@@ -1341,7 +1413,7 @@ HeapScanDesc
 heap_beginscan(Relation relation, Snapshot snapshot,
 			   int nkeys, ScanKey key)
 {
-	return heap_beginscan_internal(relation, snapshot, nkeys, key,
+	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
 								   true, true, true, false, false, false);
 }
 
@@ -1351,7 +1423,7 @@ heap_beginscan_catalog(Relation relation, int nkeys, ScanKey key)
 	Oid			relid = RelationGetRelid(relation);
 	Snapshot	snapshot = RegisterSnapshot(GetCatalogSnapshot(relid));
 
-	return heap_beginscan_internal(relation, snapshot, nkeys, key,
+	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
 								   true, true, true, false, false, true);
 }
 
@@ -1360,7 +1432,7 @@ heap_beginscan_strat(Relation relation, Snapshot snapshot,
 					 int nkeys, ScanKey key,
 					 bool allow_strat, bool allow_sync)
 {
-	return heap_beginscan_internal(relation, snapshot, nkeys, key,
+	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
 								   allow_strat, allow_sync, true,
 								   false, false, false);
 }
@@ -1369,7 +1441,7 @@ HeapScanDesc
 heap_beginscan_bm(Relation relation, Snapshot snapshot,
 				  int nkeys, ScanKey key)
 {
-	return heap_beginscan_internal(relation, snapshot, nkeys, key,
+	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
 								   false, false, true, true, false, false);
 }
 
@@ -1378,7 +1450,7 @@ heap_beginscan_sampling(Relation relation, Snapshot snapshot,
 						int nkeys, ScanKey key,
 					  bool allow_strat, bool allow_sync, bool allow_pagemode)
 {
-	return heap_beginscan_internal(relation, snapshot, nkeys, key,
+	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
 								   allow_strat, allow_sync, allow_pagemode,
 								   false, true, false);
 }
@@ -1386,6 +1458,7 @@ heap_beginscan_sampling(Relation relation, Snapshot snapshot,
 static HeapScanDesc
 heap_beginscan_internal(Relation relation, Snapshot snapshot,
 						int nkeys, ScanKey key,
+						ParallelHeapScanDesc parallel_scan,
 						bool allow_strat,
 						bool allow_sync,
 						bool allow_pagemode,
@@ -1418,6 +1491,7 @@ heap_beginscan_internal(Relation relation, Snapshot snapshot,
 	scan->rs_allow_strat = allow_strat;
 	scan->rs_allow_sync = allow_sync;
 	scan->rs_temp_snap = temp_snap;
+	scan->rs_parallel = parallel_scan;
 
 	/*
 	 * we can use page-at-a-time mode if it's an MVCC-safe snapshot
@@ -1452,6 +1526,13 @@ heap_beginscan_internal(Relation relation, Snapshot snapshot,
 
 	initscan(scan, key, false);
 
+	/*
+	 * Ensure all the backends participating in parallel scan must share the
+	 * syncscan property.
+	 */
+	if (parallel_scan)
+		scan->rs_syncscan = parallel_scan->phs_syncscan;
+
 	return scan;
 }
 
@@ -1532,6 +1613,165 @@ heap_endscan(HeapScanDesc scan)
 }
 
 /* ----------------
+ *		heap_parallelscan_estimate - estimate storage for ParallelHeapScanDesc
+ *
+ *		Sadly, this doesn't reduce to a constant, because the size required
+ *		to serialize the snapshot can vary.
+ * ----------------
+ */
+Size
+heap_parallelscan_estimate(Snapshot snapshot)
+{
+	return add_size(offsetof(ParallelHeapScanDescData, phs_snapshot_data),
+					EstimateSnapshotSpace(snapshot));
+}
+
+/* ----------------
+ *		heap_parallelscan_initialize - initialize ParallelHeapScanDesc
+ *
+ *		Must allow as many bytes of shared memory as returned by
+ *		heap_parallelscan_estimate.  Call this just once in the leader
+ *		process; then, individual workers attach via heap_beginscan_parallel.
+ * ----------------
+ */
+void
+heap_parallelscan_initialize(ParallelHeapScanDesc target, Relation relation,
+							 Snapshot snapshot, bool allow_sync)
+{
+	bool		check_sync_allowed;
+
+	target->phs_relid = RelationGetRelid(relation);
+	target->phs_nblocks = RelationGetNumberOfBlocks(relation);
+	SpinLockInit(&target->phs_mutex);
+	target->phs_cblock = InvalidBlockNumber;
+	target->phs_startblock = InvalidBlockNumber;
+
+	/*
+	 * If the table is large relative to NBuffers, enable synchronized
+	 * scanning (see syncscan.c).
+	 */
+	if (!RelationUsesLocalBuffers(relation) &&
+		target->phs_nblocks > NBuffers / 4)
+		check_sync_allowed = allow_sync;
+	else
+		check_sync_allowed = false;
+
+	if (check_sync_allowed && synchronize_seqscans)
+		target->phs_syncscan = true;
+	else
+		target->phs_syncscan = false;
+
+	SerializeSnapshot(snapshot, target->phs_snapshot_data);
+}
+
+/* ----------------
+ *		heap_parallelscan_initialize_startblock - initialize the startblock for
+ *					parallel scan.
+ *
+ *		Only the first worker of parallel scan will initialize the start
+ *		block for scan and others will use that information to indicate
+ *		the end of scan.
+ * ----------------
+ */
+static void
+heap_parallelscan_initialize_startblock(HeapScanDesc scan)
+{
+	ParallelHeapScanDesc parallel_scan;
+	BlockNumber page;
+
+	Assert(scan->rs_parallel);
+
+	parallel_scan = scan->rs_parallel;
+
+	SpinLockAcquire(&parallel_scan->phs_mutex);
+	page = parallel_scan->phs_startblock;
+	SpinLockRelease(&parallel_scan->phs_mutex);
+
+	if (page != InvalidBlockNumber)
+		return;					/* some other process already did this */
+
+	page = ss_get_location(scan->rs_rd, scan->rs_nblocks);
+
+	SpinLockAcquire(&parallel_scan->phs_mutex);
+	/* even though we checked before, someone might have beaten us here */
+	if (parallel_scan->phs_startblock == InvalidBlockNumber)
+	{
+		parallel_scan->phs_startblock = page;
+		parallel_scan->phs_cblock = page;
+	}
+	SpinLockRelease(&parallel_scan->phs_mutex);
+}
+
+/* ----------------
+ *		heap_parallelscan_nextpage - get the next page to scan
+ *
+ *		Scanning till the position from where the parallel scan has started
+ *		indicates end of scan.  Note, however, that other backends could still
+ *		be scanning if they grabbed a page to scan and aren't done with it yet.
+ *		Resets the current position for parallel scan to the beginning of
+ *		relation, if next page to scan is greater than total number of pages in
+ *		relation.
+ *
+ *		Return value InvalidBlockNumber indicates end of scan.
+ * ----------------
+ */
+static BlockNumber
+heap_parallelscan_nextpage(HeapScanDesc scan)
+{
+	BlockNumber page = InvalidBlockNumber;
+	ParallelHeapScanDesc parallel_scan;
+	bool		report_scan_done = false;
+
+	Assert(scan->rs_parallel);
+
+	parallel_scan = scan->rs_parallel;
+
+	SpinLockAcquire(&parallel_scan->phs_mutex);
+	page = parallel_scan->phs_cblock;
+	if (page != InvalidBlockNumber)
+	{
+		parallel_scan->phs_cblock++;
+		if (parallel_scan->phs_cblock >= scan->rs_nblocks)
+			parallel_scan->phs_cblock = 0;
+		if (parallel_scan->phs_cblock == parallel_scan->phs_startblock)
+		{
+			parallel_scan->phs_cblock = InvalidBlockNumber;
+			report_scan_done = true;
+		}
+	}
+	SpinLockRelease(&parallel_scan->phs_mutex);
+
+	/*
+	 * Report scan location for the first parallel scan to observe the end of
+	 * scan, so that the final state of the position hint is back at the start
+	 * of the rel.
+	 */
+	if (report_scan_done && scan->rs_syncscan)
+		ss_report_location(scan->rs_rd, page);
+
+	return page;
+}
+
+/* ----------------
+ *		heap_beginscan_parallel - join a parallel scan
+ *
+ *		Caller must hold a suitable lock on the correct relation.
+ * ----------------
+ */
+HeapScanDesc
+heap_beginscan_parallel(Relation relation, ParallelHeapScanDesc parallel_scan)
+{
+	Snapshot	snapshot;
+
+	Assert(RelationGetRelid(relation) == parallel_scan->phs_relid);
+	snapshot = RestoreSnapshot(parallel_scan->phs_snapshot_data);
+	RegisterSnapshot(snapshot);
+
+	return heap_beginscan_internal(relation, snapshot, 0, NULL, parallel_scan,
+								   true, true, true, false, false, true);
+}
+
+/* ----------------
  *		heap_getnext	- retrieve next tuple in scan
  *
  *		Fix to work with index relations.
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 75e6b72..98a586d 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -96,8 +96,9 @@ extern Relation heap_openrv_extended(const RangeVar *relation,
 
 #define heap_close(r,l)  relation_close(r,l)
 
-/* struct definition appears in relscan.h */
+/* struct definitions appear in relscan.h */
 typedef struct HeapScanDescData *HeapScanDesc;
+typedef struct ParallelHeapScanDescData *ParallelHeapScanDesc;
 
 /*
  * HeapScanIsValid
@@ -126,6 +127,12 @@ extern void heap_rescan_set_params(HeapScanDesc scan, ScanKey key,
 extern void heap_endscan(HeapScanDesc scan);
 extern HeapTuple heap_getnext(HeapScanDesc scan, ScanDirection direction);
 
+extern Size heap_parallelscan_estimate(Snapshot snapshot);
+extern void heap_parallelscan_initialize(ParallelHeapScanDesc target,
+							 Relation relation, Snapshot snapshot,
+							 bool allow_sync);
+extern HeapScanDesc heap_beginscan_parallel(Relation, ParallelHeapScanDesc);
+
 extern bool heap_fetch(Relation relation, Snapshot snapshot,
 		   HeapTuple tuple, Buffer *userbuf, bool keep_buf,
 		   Relation stats_relation);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 6e62319..ecc6934 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -20,6 +20,17 @@
 #include "access/itup.h"
 #include "access/tupdesc.h"
 
+/* Struct for parallel scan setup */
+typedef struct ParallelHeapScanDescData
+{
+	Oid			phs_relid;
+	BlockNumber phs_nblocks;
+	slock_t		phs_mutex;
+	BlockNumber phs_cblock;
+	BlockNumber phs_startblock;
+	bool		phs_syncscan;
+	char		phs_snapshot_data[FLEXIBLE_ARRAY_MEMBER];
+}	ParallelHeapScanDescData;
 
 typedef struct HeapScanDescData
 {
@@ -49,6 +60,7 @@ typedef struct HeapScanDescData
 	BlockNumber rs_cblock;		/* current block # in scan, if any */
 	Buffer		rs_cbuf;		/* current buffer in scan, if any */
 	/* NB: if rs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
+	ParallelHeapScanDesc rs_parallel;	/* parallel scan information */
 
 	/* these fields only used in page-at-a-time mode and for bitmap scans */
 	int			rs_cindex;		/* current tuple's index in vistuples */
parallel_seqscan_heaprescan_v1.patch (application/octet-stream)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 4e913bd..0f0cd1f 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1542,7 +1542,8 @@ heap_beginscan_internal(Relation relation, Snapshot snapshot,
  */
 void
 heap_rescan(HeapScanDesc scan,
-			ScanKey key)
+			ScanKey key,
+			bool keep_startblock)
 {
 	/*
 	 * unpin scan buffers
@@ -1553,7 +1554,7 @@ heap_rescan(HeapScanDesc scan,
 	/*
 	 * reinitialize scan descriptor
 	 */
-	initscan(scan, key, true);
+	initscan(scan, key, keep_startblock);
 }
 
 /* ----------------
@@ -1574,7 +1575,7 @@ heap_rescan_set_params(HeapScanDesc scan, ScanKey key,
 	scan->rs_allow_sync = allow_sync;
 	scan->rs_pageatatime = allow_pagemode && IsMVCCSnapshot(scan->rs_snapshot);
 	/* ... and rescan */
-	heap_rescan(scan, key);
+	heap_rescan(scan, key, true);
 }
 
 /* ----------------
@@ -1772,6 +1773,29 @@ heap_beginscan_parallel(Relation relation, ParallelHeapScanDesc parallel_scan)
 }
 
 /* ----------------
+ *		heap_parallel_rescan		- restart a parallel relation scan
+ * ----------------
+ */
+void
+heap_parallel_rescan(ParallelHeapScanDesc pscan,
+					 HeapScanDesc scan)
+{
+	if (pscan != NULL)
+		scan->rs_parallel = pscan;
+
+	heap_rescan(scan,			/* scan desc */
+				NULL,			/* new scan keys */
+				false);			/* don't preserve start block */
+
+	/*
+	 * Ensure all the backends participating in parallel scan must share the
+	 * syncscan property.
+	 */
+	if (pscan)
+		scan->rs_syncscan = pscan->phs_syncscan;
+}
+
+/* ----------------
  *		heap_getnext	- retrieve next tuple in scan
  *
  *		Fix to work with index relations.
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index c784b9e..9131dae 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -452,7 +452,7 @@ ExecReScanBitmapHeapScan(BitmapHeapScanState *node)
 	PlanState  *outerPlan = outerPlanState(node);
 
 	/* rescan to release any page pin */
-	heap_rescan(node->ss.ss_currentScanDesc, NULL);
+	heap_rescan(node->ss.ss_currentScanDesc, NULL, true);
 
 	if (node->tbmiterator)
 		tbm_end_iterate(node->tbmiterator);
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 3cb81fc..75607b2 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -268,7 +268,8 @@ ExecReScanSeqScan(SeqScanState *node)
 	scan = node->ss_currentScanDesc;
 
 	heap_rescan(scan,			/* scan desc */
-				NULL);			/* new scan keys */
+				NULL,			/* new scan keys */
+				true);			/* preserve start block */
 
 	ExecScanReScan((ScanState *) node);
 }
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 98a586d..894ee3f 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -121,9 +121,10 @@ extern HeapScanDesc heap_beginscan_sampling(Relation relation,
 extern void heap_setscanlimits(HeapScanDesc scan, BlockNumber startBlk,
 				   BlockNumber endBlk);
 extern void heapgetpage(HeapScanDesc scan, BlockNumber page);
-extern void heap_rescan(HeapScanDesc scan, ScanKey key);
+extern void heap_rescan(HeapScanDesc scan, ScanKey key, bool keep_startblock);
 extern void heap_rescan_set_params(HeapScanDesc scan, ScanKey key,
 					 bool allow_strat, bool allow_sync, bool allow_pagemode);
+extern void heap_parallel_rescan(ParallelHeapScanDesc pscan, HeapScanDesc scan);
 extern void heap_endscan(HeapScanDesc scan);
 extern HeapTuple heap_getnext(HeapScanDesc scan, ScanDirection direction);
 
#382Noah Misch
noah@leadboat.com
In reply to: Robert Haas (#366)
Re: Parallel Seq Scan

On Sat, Sep 26, 2015 at 04:09:12PM -0400, Robert Haas wrote:

+/*-------------------------------------------------------------------------
+ * datumSerialize
+ *
+ * Serialize a possibly-NULL datum into caller-provided storage.
+void
+datumSerialize(Datum value, bool isnull, bool typByVal, int typLen,
+			   char **start_address)
+{
+	int		header;
+
+	/* Write header word. */
+	if (isnull)
+		header = -2;
+	else if (typByVal)
+		header = -1;
+	else
+		header = datumGetSize(value, typByVal, typLen);
+	memcpy(*start_address, &header, sizeof(int));
+	*start_address += sizeof(int);
+
+	/* If not null, write payload bytes. */
+	if (!isnull)
+	{
+		if (typByVal)
+		{
+			memcpy(*start_address, &value, sizeof(Datum));
+			*start_address += sizeof(Datum);
+		}
+		else
+		{
+			memcpy(*start_address, DatumGetPointer(value), header);
+			*start_address += header;
+		}
+	}
+}

I see no mention in this thread of varatt_indirect, but I anticipated
datumSerialize() reacting to it the same way datumCopy() reacts. If
datumSerialize() can get away without doing so, why is that?
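
For readers following along, here is a minimal sketch of the header-word layout used by the quoted code. It is a simplified model and not the PostgreSQL implementation: `ModelDatum`, `model_serialize` and `model_read_header` are hypothetical names, and the by-reference length is passed in directly instead of being computed by `datumGetSize()`. The by-reference branch copies the bytes behind the pointer as they are, which is the step the question above is about.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for PostgreSQL's Datum; not the real typedef. */
typedef uintptr_t ModelDatum;

/*
 * Simplified model of the quoted datumSerialize(): write an int header
 * (-2 = NULL, -1 = pass-by-value, otherwise the payload length in bytes),
 * then the payload, advancing *cursor past everything written.
 * Assumption: the caller supplies the by-reference length (len).
 */
static void
model_serialize(ModelDatum value, int isnull, int by_val, int len,
				char **cursor)
{
	int			header;

	if (isnull)
		header = -2;
	else if (by_val)
		header = -1;
	else
		header = len;
	memcpy(*cursor, &header, sizeof(int));
	*cursor += sizeof(int);

	if (isnull)
		return;					/* a NULL has no payload */

	if (by_val)
	{
		/* The value itself is the payload. */
		memcpy(*cursor, &value, sizeof(ModelDatum));
		*cursor += sizeof(ModelDatum);
	}
	else
	{
		/* The bytes behind the pointer are copied verbatim. */
		memcpy(*cursor, (const void *) value, (size_t) header);
		*cursor += header;
	}
}

/* Read back only the header word at the start of a serialized value. */
static int
model_read_header(const char *buf)
{
	int			header;

	memcpy(&header, buf, sizeof(int));
	return header;
}
```

In this model a NULL occupies `sizeof(int)` bytes, a by-value datum occupies `sizeof(int) + sizeof(ModelDatum)` bytes, and a by-reference datum occupies `sizeof(int)` plus its length.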

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#383Haribabu Kommi
kommi.haribabu@gmail.com
In reply to: Amit Kapila (#381)
Re: Parallel Seq Scan

On Mon, Oct 5, 2015 at 11:20 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

For now, I have fixed this by not preserving the startblock in case of rescan
for parallel scan. Note that I have created a separate patch
(parallel_seqscan_heaprescan_v1.patch) for support of rescan (for parallel
scan).

While testing parallel seqscan, my colleague Jing Wang found a problem in
parallel_seqscan_heapscan_v2.patch.

In function initscan, the allow_sync flag is set to false because the
number of pages in the table does not exceed NBuffers / 4:

if (!RelationUsesLocalBuffers(scan->rs_rd) &&
scan->rs_nblocks > NBuffers / 4)

As the allow_sync flag is false, heap_parallelscan_initialize_startblock
is not called in initscan, so parallel_scan->phs_cblock is never
initialized. Because of this, heap_parallelscan_nextpage returns
InvalidBlockNumber when asked for the next page, and the scan ends
without returning any results.
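
A minimal sketch of the failure, assuming a stripped-down model of the shared scan state: the `Model*` names and `MODEL_INVALID_BLOCK` are hypothetical, the spinlock is omitted, and `sync_location` stands in for the value `ss_get_location` would return. It only illustrates how the page allocator behaves when the start block is never set.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for InvalidBlockNumber; the model uses plain uint32_t blocks. */
#define MODEL_INVALID_BLOCK UINT32_MAX

/* Hypothetical reduction of ParallelHeapScanDescData to the fields involved. */
typedef struct ModelParallelScan
{
	uint32_t	nblocks;		/* number of blocks in the relation */
	uint32_t	cblock;			/* next block to hand out */
	uint32_t	startblock;		/* block at which the scan began */
} ModelParallelScan;

/* Mirrors heap_parallelscan_initialize(): both positions start out invalid. */
static void
model_setup(ModelParallelScan *ps, uint32_t nblocks)
{
	ps->nblocks = nblocks;
	ps->cblock = MODEL_INVALID_BLOCK;
	ps->startblock = MODEL_INVALID_BLOCK;
}

/*
 * Mirrors the v2 initscan() behavior described above: the start block is
 * set up only on the synchronized-scan branch.
 */
static void
model_initscan_v2(ModelParallelScan *ps, bool use_syncscan,
				  uint32_t sync_location)
{
	if (use_syncscan && ps->startblock == MODEL_INVALID_BLOCK)
	{
		ps->startblock = sync_location;
		ps->cblock = sync_location;
	}
	/* Otherwise nothing touches cblock, so it stays invalid. */
}

/* Mirrors heap_parallelscan_nextpage(), without the spinlock. */
static uint32_t
model_nextpage(ModelParallelScan *ps)
{
	uint32_t	page = ps->cblock;

	if (page != MODEL_INVALID_BLOCK)
	{
		ps->cblock++;
		if (ps->cblock >= ps->nblocks)
			ps->cblock = 0;		/* wrap around to the first block */
		if (ps->cblock == ps->startblock)
			ps->cblock = MODEL_INVALID_BLOCK;	/* back at the start: done */
	}
	return page;
}
```

With `use_syncscan` false, the very first call to `model_nextpage` returns `MODEL_INVALID_BLOCK`, which matches the empty result seen in testing. With `use_syncscan` true, the same allocator hands out every block once, starting at `sync_location` and wrapping around.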

Regards,
Hari Babu
Fujitsu Australia


#384Amit Kapila
amit.kapila16@gmail.com
In reply to: Haribabu Kommi (#383)
2 attachment(s)
Re: Parallel Seq Scan

On Mon, Oct 12, 2015 at 11:51 AM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:

On Mon, Oct 5, 2015 at 11:20 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

For now, I have fixed this by not preserving the startblock in case of rescan
for parallel scan. Note that I have created a separate patch
(parallel_seqscan_heaprescan_v1.patch) for support of rescan (for parallel
scan).

While testing parallel seqscan, my colleague Jing Wang found a problem in
parallel_seqscan_heapscan_v2.patch.

Thanks for spotting the issue.

In function initscan, the allow_sync flag is set to false because the
number of pages in the table does not exceed NBuffers / 4:

if (!RelationUsesLocalBuffers(scan->rs_rd) &&
scan->rs_nblocks > NBuffers / 4)

As the allow_sync flag is false, heap_parallelscan_initialize_startblock
is not called in initscan, so parallel_scan->phs_cblock is never
initialized. Because of this, heap_parallelscan_nextpage returns
InvalidBlockNumber when asked for the next page, and the scan ends
without returning any results.

Right, it should initialize the parallel scan properly even for
non-synchronized scans. Fixed the issue in the attached patch. The rebased
heap rescan patch is attached as well.
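
As a rough illustration of the fix, here is a minimal sketch, assuming a simplified model and not the patch code itself: the `Sketch*` names and `SKETCH_INVALID_BLOCK` are hypothetical, locking is omitted, and `sync_location` stands in for the result of `ss_get_location`. The idea is that a start block is always chosen, and only its source depends on `allow_sync`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for InvalidBlockNumber in this model. */
#define SKETCH_INVALID_BLOCK UINT32_MAX

/* Hypothetical reduction of the shared parallel scan state. */
typedef struct SketchScan
{
	uint32_t	nblocks;
	uint32_t	cblock;
	uint32_t	startblock;
} SketchScan;

static void
sketch_setup(SketchScan *s, uint32_t nblocks)
{
	s->nblocks = nblocks;
	s->cblock = SKETCH_INVALID_BLOCK;
	s->startblock = SKETCH_INVALID_BLOCK;
}

/*
 * v3-style initialization: the first participant always sets a start
 * block, taken from the syncscan hint when allow_sync is true and block 0
 * otherwise.  Later participants leave the shared state alone.
 */
static void
sketch_initialize_startblock(SketchScan *s, bool allow_sync,
							 uint32_t sync_location)
{
	uint32_t	page;

	if (s->startblock != SKETCH_INVALID_BLOCK)
		return;					/* some other participant already did this */

	page = allow_sync ? sync_location : 0;
	s->startblock = page;
	s->cblock = page;
}

/* Hand out the next block, wrapping at nblocks and stopping at startblock. */
static uint32_t
sketch_nextpage(SketchScan *s)
{
	uint32_t	page = s->cblock;

	if (page != SKETCH_INVALID_BLOCK)
	{
		s->cblock++;
		if (s->cblock >= s->nblocks)
			s->cblock = 0;
		if (s->cblock == s->startblock)
			s->cblock = SKETCH_INVALID_BLOCK;
	}
	return page;
}

/* Count the blocks handed out until the scan reports completion. */
static uint32_t
sketch_count_pages(SketchScan *s)
{
	uint32_t	n = 0;

	while (sketch_nextpage(s) != SKETCH_INVALID_BLOCK)
		n++;
	return n;
}
```

In this model, a four-block relation yields four pages whether or not `allow_sync` is set, which is the behavior the small-table case was missing.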

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

parallel_seqscan_heapscan_v3.patch (application/octet-stream)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index bcf9871..c4413e6 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -63,6 +63,7 @@
 #include "storage/predicate.h"
 #include "storage/procarray.h"
 #include "storage/smgr.h"
+#include "storage/spin.h"
 #include "storage/standby.h"
 #include "utils/datum.h"
 #include "utils/inval.h"
@@ -80,12 +81,16 @@ bool		synchronize_seqscans = true;
 static HeapScanDesc heap_beginscan_internal(Relation relation,
 						Snapshot snapshot,
 						int nkeys, ScanKey key,
+						ParallelHeapScanDesc parallel_scan,
 						bool allow_strat,
 						bool allow_sync,
 						bool allow_pagemode,
 						bool is_bitmapscan,
 						bool is_samplescan,
 						bool temp_snap);
+static BlockNumber heap_parallelscan_nextpage(HeapScanDesc scan);
+static void heap_parallelscan_initialize_startblock(HeapScanDesc scan,
+										bool allow_sync);
 static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
 					TransactionId xid, CommandId cid, int options);
 static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
@@ -226,16 +231,21 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
 	 * results for a non-MVCC snapshot, the caller must hold some higher-level
 	 * lock that ensures the interesting tuple(s) won't change.)
 	 */
-	scan->rs_nblocks = RelationGetNumberOfBlocks(scan->rs_rd);
+	if (scan->rs_parallel != NULL)
+		scan->rs_nblocks = scan->rs_parallel->phs_nblocks;
+	else
+		scan->rs_nblocks = RelationGetNumberOfBlocks(scan->rs_rd);
 
 	/*
 	 * If the table is large relative to NBuffers, use a bulk-read access
-	 * strategy and enable synchronized scanning (see syncscan.c).  Although
-	 * the thresholds for these features could be different, we make them the
-	 * same so that there are only two behaviors to tune rather than four.
-	 * (However, some callers need to be able to disable one or both of these
-	 * behaviors, independently of the size of the table; also there is a GUC
-	 * variable that can disable synchronized scanning.)
+	 * strategy and enable synchronized scanning (see syncscan.c, if the
+	 * condition to enable sync syncscans is changed here, then do the same
+	 * change in heap_parallelscan_initialize()).  Although the thresholds for
+	 * these features could be different, we make them the same so that there
+	 * are only two behaviors to tune rather than four.  (However, some
+	 * callers need to be able to disable one or both of these behaviors,
+	 * independently of the size of the table; also there is a GUC variable
+	 * that can disable synchronized scanning.)
 	 *
 	 * During a rescan, don't make a new strategy object if we don't have to.
 	 */
@@ -272,12 +282,18 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
 	else if (allow_sync && synchronize_seqscans)
 	{
 		scan->rs_syncscan = true;
-		scan->rs_startblock = ss_get_location(scan->rs_rd, scan->rs_nblocks);
+		if (scan->rs_parallel != NULL)
+			heap_parallelscan_initialize_startblock(scan, allow_sync);
+		else
+			scan->rs_startblock = ss_get_location(scan->rs_rd, scan->rs_nblocks);
 	}
 	else
 	{
 		scan->rs_syncscan = false;
-		scan->rs_startblock = 0;
+		if (scan->rs_parallel != NULL)
+			heap_parallelscan_initialize_startblock(scan, allow_sync);
+		else
+			scan->rs_startblock = 0;
 	}
 
 	scan->rs_numblocks = InvalidBlockNumber;
@@ -496,7 +512,25 @@ heapgettup(HeapScanDesc scan,
 				tuple->t_data = NULL;
 				return;
 			}
-			page = scan->rs_startblock; /* first page */
+			if (scan->rs_parallel != NULL)
+			{
+				page = heap_parallelscan_nextpage(scan);
+
+				/*
+				 * Return NULL if the scan is finished. It can so happen that
+				 * by the time one of workers started the scan, others have
+				 * already completed scanning the relation, so this worker
+				 * won't need to perform scan.
+				 */
+				if (page == InvalidBlockNumber)
+				{
+					Assert(!BufferIsValid(scan->rs_cbuf));
+					tuple->t_data = NULL;
+					return;
+				}
+			}
+			else
+				page = scan->rs_startblock;		/* first page */
 			heapgetpage(scan, page);
 			lineoff = FirstOffsetNumber;		/* first offnum */
 			scan->rs_inited = true;
@@ -519,6 +553,9 @@ heapgettup(HeapScanDesc scan,
 	}
 	else if (backward)
 	{
+		/* backward parallel scan not supported */
+		Assert(scan->rs_parallel == NULL);
+
 		if (!scan->rs_inited)
 		{
 			/*
@@ -671,11 +708,20 @@ heapgettup(HeapScanDesc scan,
 		}
 		else
 		{
-			page++;
-			if (page >= scan->rs_nblocks)
-				page = 0;
-			finished = (page == scan->rs_startblock) ||
-				(scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks == 0 : false);
+			if (scan->rs_parallel != NULL)
+			{
+				page = heap_parallelscan_nextpage(scan);
+				finished = (page == InvalidBlockNumber);
+			}
+			else
+			{
+				page++;
+				if (page >= scan->rs_nblocks)
+					page = 0;
+
+				finished = (page == scan->rs_startblock) ||
+					(scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
+			}
 
 			/*
 			 * Report our new scan position for synchronization purposes. We
@@ -773,7 +819,25 @@ heapgettup_pagemode(HeapScanDesc scan,
 				tuple->t_data = NULL;
 				return;
 			}
-			page = scan->rs_startblock; /* first page */
+			if (scan->rs_parallel != NULL)
+			{
+				page = heap_parallelscan_nextpage(scan);
+
+				/*
+				 * Return NULL if the scan is finished. It can so happen that
+				 * by the time one of workers started the scan, others have
+				 * already completed scanning the relation, so this worker
+				 * won't need to perform scan.
+				 */
+				if (page == InvalidBlockNumber)
+				{
+					Assert(!BufferIsValid(scan->rs_cbuf));
+					tuple->t_data = NULL;
+					return;
+				}
+			}
+			else
+				page = scan->rs_startblock;		/* first page */
 			heapgetpage(scan, page);
 			lineindex = 0;
 			scan->rs_inited = true;
@@ -793,6 +857,9 @@ heapgettup_pagemode(HeapScanDesc scan,
 	}
 	else if (backward)
 	{
+		/* backward parallel scan not supported */
+		Assert(scan->rs_parallel == NULL);
+
 		if (!scan->rs_inited)
 		{
 			/*
@@ -934,11 +1001,20 @@ heapgettup_pagemode(HeapScanDesc scan,
 		}
 		else
 		{
-			page++;
-			if (page >= scan->rs_nblocks)
-				page = 0;
-			finished = (page == scan->rs_startblock) ||
-				(scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks == 0 : false);
+			if (scan->rs_parallel != NULL)
+			{
+				page = heap_parallelscan_nextpage(scan);
+				finished = (page == InvalidBlockNumber);
+			}
+			else
+			{
+				page++;
+				if (page >= scan->rs_nblocks)
+					page = 0;
+
+				finished = (page == scan->rs_startblock) ||
+					(scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
+			}
 
 			/*
 			 * Report our new scan position for synchronization purposes. We
@@ -1341,7 +1417,7 @@ HeapScanDesc
 heap_beginscan(Relation relation, Snapshot snapshot,
 			   int nkeys, ScanKey key)
 {
-	return heap_beginscan_internal(relation, snapshot, nkeys, key,
+	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
 								   true, true, true, false, false, false);
 }
 
@@ -1351,7 +1427,7 @@ heap_beginscan_catalog(Relation relation, int nkeys, ScanKey key)
 	Oid			relid = RelationGetRelid(relation);
 	Snapshot	snapshot = RegisterSnapshot(GetCatalogSnapshot(relid));
 
-	return heap_beginscan_internal(relation, snapshot, nkeys, key,
+	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
 								   true, true, true, false, false, true);
 }
 
@@ -1360,7 +1436,7 @@ heap_beginscan_strat(Relation relation, Snapshot snapshot,
 					 int nkeys, ScanKey key,
 					 bool allow_strat, bool allow_sync)
 {
-	return heap_beginscan_internal(relation, snapshot, nkeys, key,
+	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
 								   allow_strat, allow_sync, true,
 								   false, false, false);
 }
@@ -1369,7 +1445,7 @@ HeapScanDesc
 heap_beginscan_bm(Relation relation, Snapshot snapshot,
 				  int nkeys, ScanKey key)
 {
-	return heap_beginscan_internal(relation, snapshot, nkeys, key,
+	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
 								   false, false, true, true, false, false);
 }
 
@@ -1378,7 +1454,7 @@ heap_beginscan_sampling(Relation relation, Snapshot snapshot,
 						int nkeys, ScanKey key,
 					  bool allow_strat, bool allow_sync, bool allow_pagemode)
 {
-	return heap_beginscan_internal(relation, snapshot, nkeys, key,
+	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
 								   allow_strat, allow_sync, allow_pagemode,
 								   false, true, false);
 }
@@ -1386,6 +1462,7 @@ heap_beginscan_sampling(Relation relation, Snapshot snapshot,
 static HeapScanDesc
 heap_beginscan_internal(Relation relation, Snapshot snapshot,
 						int nkeys, ScanKey key,
+						ParallelHeapScanDesc parallel_scan,
 						bool allow_strat,
 						bool allow_sync,
 						bool allow_pagemode,
@@ -1418,6 +1495,7 @@ heap_beginscan_internal(Relation relation, Snapshot snapshot,
 	scan->rs_allow_strat = allow_strat;
 	scan->rs_allow_sync = allow_sync;
 	scan->rs_temp_snap = temp_snap;
+	scan->rs_parallel = parallel_scan;
 
 	/*
 	 * we can use page-at-a-time mode if it's an MVCC-safe snapshot
@@ -1452,6 +1530,13 @@ heap_beginscan_internal(Relation relation, Snapshot snapshot,
 
 	initscan(scan, key, false);
 
+	/*
+	 * Ensure all the backends participating in parallel scan must share the
+	 * syncscan property.
+	 */
+	if (parallel_scan)
+		scan->rs_syncscan = parallel_scan->phs_syncscan;
+
 	return scan;
 }
 
@@ -1532,6 +1617,168 @@ heap_endscan(HeapScanDesc scan)
 }
 
 /* ----------------
+ *		heap_parallelscan_estimate - estimate storage for ParallelHeapScanDesc
+ *
+ *		Sadly, this doesn't reduce to a constant, because the size required
+ *		to serialize the snapshot can vary.
+ * ----------------
+ */
+Size
+heap_parallelscan_estimate(Snapshot snapshot)
+{
+	return add_size(offsetof(ParallelHeapScanDescData, phs_snapshot_data),
+					EstimateSnapshotSpace(snapshot));
+}
+
+/* ----------------
+ *		heap_parallelscan_initialize - initialize ParallelHeapScanDesc
+ *
+ *		Must allow as many bytes of shared memory as returned by
+ *		heap_parallelscan_estimate.  Call this just once in the leader
+ *		process; then, individual workers attach via heap_beginscan_parallel.
+ * ----------------
+ */
+void
+heap_parallelscan_initialize(ParallelHeapScanDesc target, Relation relation,
+							 Snapshot snapshot, bool allow_sync)
+{
+	bool		check_sync_allowed;
+
+	target->phs_relid = RelationGetRelid(relation);
+	target->phs_nblocks = RelationGetNumberOfBlocks(relation);
+	SpinLockInit(&target->phs_mutex);
+	target->phs_cblock = InvalidBlockNumber;
+	target->phs_startblock = InvalidBlockNumber;
+
+	/*
+	 * If the table is large relative to NBuffers, enable synchronized
+	 * scanning (see syncscan.c).
+	 */
+	if (!RelationUsesLocalBuffers(relation) &&
+		target->phs_nblocks > NBuffers / 4)
+		check_sync_allowed = allow_sync;
+	else
+		check_sync_allowed = false;
+
+	if (check_sync_allowed && synchronize_seqscans)
+		target->phs_syncscan = true;
+	else
+		target->phs_syncscan = false;
+
+	SerializeSnapshot(snapshot, target->phs_snapshot_data);
+}
+
+/* ----------------
+ *		heap_parallelscan_initialize_startblock - initialize the startblock for
+ *					parallel scan.
+ *
+ *		Only the first worker of parallel scan will initialize the start
+ *		block for scan and others will use that information to indicate
+ *		the end of scan.
+ * ----------------
+ */
+static void
+heap_parallelscan_initialize_startblock(HeapScanDesc scan, bool allow_sync)
+{
+	ParallelHeapScanDesc parallel_scan;
+	BlockNumber page;
+
+	Assert(scan->rs_parallel);
+
+	parallel_scan = scan->rs_parallel;
+
+	SpinLockAcquire(&parallel_scan->phs_mutex);
+	page = parallel_scan->phs_startblock;
+	SpinLockRelease(&parallel_scan->phs_mutex);
+
+	if (page != InvalidBlockNumber)
+		return;					/* some other process already did this */
+
+	if (allow_sync)
+		page = ss_get_location(scan->rs_rd, scan->rs_nblocks);
+	else
+		page = 0;
+
+	SpinLockAcquire(&parallel_scan->phs_mutex);
+	/* even though we checked before, someone might have beaten us here */
+	if (parallel_scan->phs_startblock == InvalidBlockNumber)
+	{
+		parallel_scan->phs_startblock = page;
+		parallel_scan->phs_cblock = page;
+	}
+	SpinLockRelease(&parallel_scan->phs_mutex);
+}
+
+/* ----------------
+ *		heap_parallelscan_nextpage - get the next page to scan
+ *
+ *		Scanning till the position from where the parallel scan has started
+ *		indicates end of scan.  Note, however, that other backends could still
+ *		be scanning if they grabbed a page to scan and aren't done with it yet.
+	 *		Resets the current position for parallel scan to the beginning of
+ *		relation, if next page to scan is greater than total number of pages in
+ *		relation.
+ *
+ *		Return value InvalidBlockNumber indicates end of scan.
+ * ----------------
+ */
+static BlockNumber
+heap_parallelscan_nextpage(HeapScanDesc scan)
+{
+	BlockNumber page = InvalidBlockNumber;
+	ParallelHeapScanDesc parallel_scan;
+	bool		report_scan_done = false;
+
+	Assert(scan->rs_parallel);
+
+	parallel_scan = scan->rs_parallel;
+
+	SpinLockAcquire(&parallel_scan->phs_mutex);
+	page = parallel_scan->phs_cblock;
+	if (page != InvalidBlockNumber)
+	{
+		parallel_scan->phs_cblock++;
+		if (parallel_scan->phs_cblock >= scan->rs_nblocks)
+			parallel_scan->phs_cblock = 0;
+		if (parallel_scan->phs_cblock == parallel_scan->phs_startblock)
+		{
+			parallel_scan->phs_cblock = InvalidBlockNumber;
+			report_scan_done = true;
+		}
+	}
+	SpinLockRelease(&parallel_scan->phs_mutex);
+
+	/*
+	 * Report scan location for the first parallel scan to observe the end of
+	 * scan, so that the final state of the position hint is back at the start
+	 * of the rel.
+	 */
+	if (report_scan_done && scan->rs_syncscan)
+		ss_report_location(scan->rs_rd, page);
+
+	return page;
+}
+
+/* ----------------
+ *		heap_beginscan_parallel - join a parallel scan
+ *
+ *		Caller must hold a suitable lock on the correct relation.
+ * ----------------
+ */
+HeapScanDesc
+heap_beginscan_parallel(Relation relation, ParallelHeapScanDesc parallel_scan)
+{
+	Snapshot	snapshot;
+
+	Assert(RelationGetRelid(relation) == parallel_scan->phs_relid);
+	snapshot = RestoreSnapshot(parallel_scan->phs_snapshot_data);
+	RegisterSnapshot(snapshot);
+
+	return heap_beginscan_internal(relation, snapshot, 0, NULL, parallel_scan,
+								   true, true, true, false, false, true);
+}
+
+/* ----------------
  *		heap_getnext	- retrieve next tuple in scan
  *
  *		Fix to work with index relations.
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 75e6b72..98a586d 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -96,8 +96,9 @@ extern Relation heap_openrv_extended(const RangeVar *relation,
 
 #define heap_close(r,l)  relation_close(r,l)
 
-/* struct definition appears in relscan.h */
+/* struct definitions appear in relscan.h */
 typedef struct HeapScanDescData *HeapScanDesc;
+typedef struct ParallelHeapScanDescData *ParallelHeapScanDesc;
 
 /*
  * HeapScanIsValid
@@ -126,6 +127,12 @@ extern void heap_rescan_set_params(HeapScanDesc scan, ScanKey key,
 extern void heap_endscan(HeapScanDesc scan);
 extern HeapTuple heap_getnext(HeapScanDesc scan, ScanDirection direction);
 
+extern Size heap_parallelscan_estimate(Snapshot snapshot);
+extern void heap_parallelscan_initialize(ParallelHeapScanDesc target,
+							 Relation relation, Snapshot snapshot,
+							 bool allow_sync);
+extern HeapScanDesc heap_beginscan_parallel(Relation, ParallelHeapScanDesc);
+
 extern bool heap_fetch(Relation relation, Snapshot snapshot,
 		   HeapTuple tuple, Buffer *userbuf, bool keep_buf,
 		   Relation stats_relation);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 6e62319..ecc6934 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -20,6 +20,17 @@
 #include "access/itup.h"
 #include "access/tupdesc.h"
 
+/* Struct for parallel scan setup */
+typedef struct ParallelHeapScanDescData
+{
+	Oid			phs_relid;
+	BlockNumber phs_nblocks;
+	slock_t		phs_mutex;
+	BlockNumber phs_cblock;
+	BlockNumber phs_startblock;
+	bool		phs_syncscan;
+	char		phs_snapshot_data[FLEXIBLE_ARRAY_MEMBER];
+}	ParallelHeapScanDescData;
 
 typedef struct HeapScanDescData
 {
@@ -49,6 +60,7 @@ typedef struct HeapScanDescData
 	BlockNumber rs_cblock;		/* current block # in scan, if any */
 	Buffer		rs_cbuf;		/* current buffer in scan, if any */
 	/* NB: if rs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
+	ParallelHeapScanDesc rs_parallel;	/* parallel scan information */
 
 	/* these fields only used in page-at-a-time mode and for bitmap scans */
 	int			rs_cindex;		/* current tuple's index in vistuples */
parallel_seqscan_heaprescan_v2.patch (application/octet-stream)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 94d5a70..025475e 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1546,7 +1546,8 @@ heap_beginscan_internal(Relation relation, Snapshot snapshot,
  */
 void
 heap_rescan(HeapScanDesc scan,
-			ScanKey key)
+			ScanKey key,
+			bool keep_startblock)
 {
 	/*
 	 * unpin scan buffers
@@ -1557,7 +1558,7 @@ heap_rescan(HeapScanDesc scan,
 	/*
 	 * reinitialize scan descriptor
 	 */
-	initscan(scan, key, true);
+	initscan(scan, key, keep_startblock);
 }
 
 /* ----------------
@@ -1578,7 +1579,7 @@ heap_rescan_set_params(HeapScanDesc scan, ScanKey key,
 	scan->rs_allow_sync = allow_sync;
 	scan->rs_pageatatime = allow_pagemode && IsMVCCSnapshot(scan->rs_snapshot);
 	/* ... and rescan */
-	heap_rescan(scan, key);
+	heap_rescan(scan, key, true);
 }
 
 /* ----------------
@@ -1779,6 +1780,29 @@ heap_beginscan_parallel(Relation relation, ParallelHeapScanDesc parallel_scan)
 }
 
 /* ----------------
+ *		heap_parallel_rescan		- restart a parallel relation scan
+ * ----------------
+ */
+void
+heap_parallel_rescan(ParallelHeapScanDesc pscan,
+					 HeapScanDesc scan)
+{
+	if (pscan != NULL)
+		scan->rs_parallel = pscan;
+
+	heap_rescan(scan,			/* scan desc */
+				NULL,			/* new scan keys */
+				false);			/* don't preserve start block */
+
+	/*
+	 * Ensure all the backends participating in parallel scan must share the
+	 * syncscan property.
+	 */
+	if (pscan)
+		scan->rs_syncscan = pscan->phs_syncscan;
+}
+
+/* ----------------
  *		heap_getnext	- retrieve next tuple in scan
  *
  *		Fix to work with index relations.
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index c784b9e..9131dae 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -452,7 +452,7 @@ ExecReScanBitmapHeapScan(BitmapHeapScanState *node)
 	PlanState  *outerPlan = outerPlanState(node);
 
 	/* rescan to release any page pin */
-	heap_rescan(node->ss.ss_currentScanDesc, NULL);
+	heap_rescan(node->ss.ss_currentScanDesc, NULL, true);
 
 	if (node->tbmiterator)
 		tbm_end_iterate(node->tbmiterator);
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 3cb81fc..75607b2 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -268,7 +268,8 @@ ExecReScanSeqScan(SeqScanState *node)
 	scan = node->ss_currentScanDesc;
 
 	heap_rescan(scan,			/* scan desc */
-				NULL);			/* new scan keys */
+				NULL,			/* new scan keys */
+				true);			/* preserve start block */
 
 	ExecScanReScan((ScanState *) node);
 }
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 98a586d..894ee3f 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -121,9 +121,10 @@ extern HeapScanDesc heap_beginscan_sampling(Relation relation,
 extern void heap_setscanlimits(HeapScanDesc scan, BlockNumber startBlk,
 				   BlockNumber endBlk);
 extern void heapgetpage(HeapScanDesc scan, BlockNumber page);
-extern void heap_rescan(HeapScanDesc scan, ScanKey key);
+extern void heap_rescan(HeapScanDesc scan, ScanKey key, bool keep_startblock);
 extern void heap_rescan_set_params(HeapScanDesc scan, ScanKey key,
 					 bool allow_strat, bool allow_sync, bool allow_pagemode);
+extern void heap_parallel_rescan(ParallelHeapScanDesc pscan, HeapScanDesc scan);
 extern void heap_endscan(HeapScanDesc scan);
 extern HeapTuple heap_getnext(HeapScanDesc scan, ScanDirection direction);
 
#385Robert Haas
robertmhaas@gmail.com
In reply to: Noah Misch (#382)
1 attachment(s)
Re: Parallel Seq Scan

On Sun, Oct 11, 2015 at 7:56 PM, Noah Misch <noah@leadboat.com> wrote:

I see no mention in this thread of varatt_indirect, but I anticipated
datumSerialize() reacting to it the same way datumCopy() reacts. If
datumSerialize() can get away without doing so, why is that?

Good point. I don't think it can. Attached is a patch to fix that.
This patch also includes some somewhat-related changes to
plpgsql_param_fetch() upon which I would appreciate any input you can
provide.

plpgsql_param_fetch() assumes that it can detect whether it's being
called from copyParamList() by checking whether params !=
estate->paramLI. I don't know why this works, but I do know that this
test fails to detect the case where it's being called from
SerializeParamList(), which causes failures in exec_eval_datum() as
predicted. Calls from SerializeParamList() need the same treatment as
calls from copyParamList() because it, too, will try to evaluate every
parameter in the list. Here, I've taken the approach of making that
check unconditional, which seems to work, but I'm not sure if some
other approach would be better, such as adding an additional Boolean
(or enum context?) argument to ParamFetchHook. I *think* that
skipping this check is merely a performance optimization rather than
anything that affects correctness, and bms_is_member() is pretty
cheap, so perhaps the way I've done it is OK.
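
To make the effect of the unconditional check concrete, here is a minimal, self-contained model of it. This is only a sketch, not the plpgsql code: `ExprModel`, `ParamSlot`, `model_param_fetch()` and `model_materialize_all()` are hypothetical names, and a 64-bit mask stands in for the `expr->paramnos` bitmapset. It shows a caller that walks every slot, as copyParamList() and SerializeParamList() are described as doing, while the fetch hook quietly skips any datum the expression does not reference.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for one parameter slot in a param list. */
typedef struct ParamSlot
{
	bool		filled;			/* has the hook materialized this slot? */
	int			value;			/* materialized value, if filled */
} ParamSlot;

/* Hypothetical stand-in for an expression plus the datums it may read. */
typedef struct ExprModel
{
	uint64_t	paramnos;		/* bit i set => datum i is referenced */
	const int  *datums;			/* all datums of the function */
	int			ndatums;		/* number of datums (at most 64 here) */
} ExprModel;

/* Models bms_is_member(dno, expr->paramnos) with a plain bit test. */
static bool
param_is_used(const ExprModel *expr, int dno)
{
	return dno >= 0 && dno < 64 && ((expr->paramnos >> dno) & 1u) != 0;
}

/* Models the fetch hook: do nothing for datums the expression never uses. */
static void
model_param_fetch(const ExprModel *expr, ParamSlot *slots, int paramid)
{
	int			dno = paramid - 1;	/* parameter IDs are 1-based */

	assert(dno >= 0 && dno < expr->ndatums);

	if (!param_is_used(expr, dno))
		return;					/* unconditional check, whoever the caller is */

	slots[dno].value = expr->datums[dno];
	slots[dno].filled = true;
}

/* Models a caller that asks for every slot; returns how many got filled. */
static int
model_materialize_all(const ExprModel *expr, ParamSlot *slots)
{
	int			filled = 0;

	for (int paramid = 1; paramid <= expr->ndatums; paramid++)
	{
		model_param_fetch(expr, slots, paramid);
		if (slots[paramid - 1].filled)
			filled++;
	}
	return filled;
}
```

With four datums of which only the second and fourth are referenced, `model_materialize_all()` fills exactly those two slots and leaves the others untouched, which is the behaviour the check is meant to guarantee for both callers.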

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

copy-paramlistinfo-fixes.patch (application/x-patch)
diff --git a/src/backend/utils/adt/datum.c b/src/backend/utils/adt/datum.c
index 3d9e354..0d61950 100644
--- a/src/backend/utils/adt/datum.c
+++ b/src/backend/utils/adt/datum.c
@@ -264,6 +264,11 @@ datumEstimateSpace(Datum value, bool isnull, bool typByVal, int typLen)
 		/* no need to use add_size, can't overflow */
 		if (typByVal)
 			sz += sizeof(Datum);
+		else if (VARATT_IS_EXTERNAL_EXPANDED(value))
+		{
+			ExpandedObjectHeader *eoh = DatumGetEOHP(value);
+			sz += EOH_get_flat_size(eoh);
+		}
 		else
 			sz += datumGetSize(value, typByVal, typLen);
 	}
@@ -292,6 +297,7 @@ void
 datumSerialize(Datum value, bool isnull, bool typByVal, int typLen,
 			   char **start_address)
 {
+	ExpandedObjectHeader *eoh = NULL;
 	int		header;
 
 	/* Write header word. */
@@ -299,6 +305,11 @@ datumSerialize(Datum value, bool isnull, bool typByVal, int typLen,
 		header = -2;
 	else if (typByVal)
 		header = -1;
+	else if (VARATT_IS_EXTERNAL_EXPANDED(value))
+	{
+		eoh = DatumGetEOHP(value);
+		header = EOH_get_flat_size(eoh);
+	}
 	else
 		header = datumGetSize(value, typByVal, typLen);
 	memcpy(*start_address, &header, sizeof(int));
@@ -312,6 +323,11 @@ datumSerialize(Datum value, bool isnull, bool typByVal, int typLen,
 			memcpy(*start_address, &value, sizeof(Datum));
 			*start_address += sizeof(Datum);
 		}
+		else if (eoh)
+		{
+			EOH_flatten_into(eoh, (void *) *start_address, header);
+			*start_address += header;
+		}
 		else
 		{
 			memcpy(*start_address, DatumGetPointer(value), header);
diff --git a/src/pl/plpgsql/src/pl_exec.c b/src/pl/plpgsql/src/pl_exec.c
index c73f20b..346e8f8 100644
--- a/src/pl/plpgsql/src/pl_exec.c
+++ b/src/pl/plpgsql/src/pl_exec.c
@@ -5696,21 +5696,17 @@ plpgsql_param_fetch(ParamListInfo params, int paramid)
 	/* now we can access the target datum */
 	datum = estate->datums[dno];
 
-	/* need to behave slightly differently for shared and unshared arrays */
-	if (params != estate->paramLI)
-	{
-		/*
-		 * We're being called, presumably from copyParamList(), for cursor
-		 * parameters.  Since copyParamList() will try to materialize every
-		 * single parameter slot, it's important to do nothing when asked for
-		 * a datum that's not supposed to be used by this SQL expression.
-		 * Otherwise we risk failures in exec_eval_datum(), not to mention
-		 * possibly copying a lot more data than the cursor actually uses.
-		 */
-		if (!bms_is_member(dno, expr->paramnos))
-			return;
-	}
-	else
+	/*
+	 * Since copyParamList() and SerializeParamList() will try to materialize
+	 * every single parameter slot, it's important to do nothing when asked for
+	 * a datum that's not supposed to be used by this SQL expression.
+	 * Otherwise we risk failures in exec_eval_datum(), not to mention
+	 * possibly copying a lot more data than the cursor actually uses.
+	 */
+	if (!bms_is_member(dno, expr->paramnos))
+		return;
+
+	if (params == estate->paramLI)
 	{
 		/*
 		 * Normal evaluation cases.  We don't need to sanity-check dno, but we
#386Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#384)
1 attachment(s)
Re: Parallel Seq Scan

On Mon, Oct 12, 2015 at 5:15 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

Right, it should initialize parallel scan properly even for non-synchronized
scans. Fixed the issue in attached patch. Rebased heap rescan is
attached as well.

Attached is the rebased patch for partial seqscan support. The major
change in this patch is to prohibit generation of a parallel path for a
relation if its quals contain restricted functions and/or initplans/subplans.
Also, since the Gather node is not itself projection-capable, a Result node
is added on top of it if the target list contains any expression.
I think this takes care of the cases where the target list contains
parallel-unsafe expressions (like restricted functions and/or
initplans/subplans): those won't be pushed to backend workers.
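
The rule described above can be summarized with a small model. This is only a sketch under stated assumptions, not planner code: `ExprKind`, `ScanDecision` and `decide_partial_scan()` are hypothetical names, and each qual or target entry is reduced to a single tag saying what kind of expression it is.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical classification of one qual or target-list entry. */
typedef enum
{
	EXPR_VAR,					/* plain column reference */
	EXPR_SAFE_FUNC,				/* expression that is safe to run in a worker */
	EXPR_RESTRICTED_FUNC,		/* restricted function call */
	EXPR_SUBPLAN				/* initplan or subplan reference */
} ExprKind;

/* Hypothetical outcome for one base relation. */
typedef struct
{
	bool		allow_parallel_path;	/* may a parallel path be generated? */
	bool		need_result_node;		/* put a Result on top of Gather? */
} ScanDecision;

static bool
is_parallel_unsafe(ExprKind kind)
{
	return kind == EXPR_RESTRICTED_FUNC || kind == EXPR_SUBPLAN;
}

/*
 * Quals decide whether a parallel path is generated at all; the target list
 * only decides whether a Result node has to evaluate expressions above
 * Gather, which keeps those expressions out of the workers.
 */
static ScanDecision
decide_partial_scan(const ExprKind *quals, size_t nquals,
					const ExprKind *tlist, size_t ntlist)
{
	ScanDecision d = {true, false};

	assert(nquals == 0 || quals != NULL);
	assert(ntlist == 0 || tlist != NULL);

	for (size_t i = 0; i < nquals; i++)
		if (is_parallel_unsafe(quals[i]))
			d.allow_parallel_path = false;

	if (d.allow_parallel_path)
		for (size_t i = 0; i < ntlist; i++)
			if (tlist[i] != EXPR_VAR)
				d.need_result_node = true;

	return d;
}
```

In this model, a query with no quals whose target list holds one column and one restricted function call still gets a parallel path, with the function evaluated by the Result node in the master backend, whereas a subplan in the quals rules out the parallel path entirely.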

Other options I have considered for the target list are:
1. Assess the tlist passed to query_planner() to see if it contains any
parallel-unsafe expression; if so, don't generate any parallel path for that
subquery. Though this idea deals with the prohibition at the subquery level,
I still think it is not the best way, as the subquery could contain a join and
we could have parallel paths for some of the relations participating in the
join, but doing it this way would restrict parallel paths for all the
relations participating in the subquery.

2. To handle the join case in a subquery, we can pass the tlist given to
query_planner() down to create_parallelscan_paths() and then check whether any
target entry contains an unsafe expression; if that expression has a Var that
belongs to the current relation, then don't allow a parallel path, else allow
it. Doing it this way, we might not be able to catch cases like the one below,
where the expression in the target list doesn't belong to any relation.

select c1, pg_restricted() from t1;

We can think of other ways to handle a target list containing parallel-unsafe
expressions if what the patch does is not sufficient.

We might want to support initplans/subplans and restricted function
evaluation once the infrastructure required to support them is
in place. I think those could be done as separate patches.

Notes -
1. This eventually needs to be rebased on top of the bug fixes posted by
Robert for parallelism [1]. One temporary fix has been made
in ExecReScanGather() to allow rescan of the Gather node; the actual fix
will be incorporated by the bug-fix patches.

2. Ran pgindent on the changed files, so you might see some indentation
changes that are not directly related to this patch but come from previous
parallel seq scan work, especially in execParallel.c.

3. Apply this patch on top of the parallel heap scan patches [2].
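
For readers without the parallel heap scan patches at hand, the way they hand out blocks to workers can be modeled in a few lines. This is a single-process sketch of the logic in heap_parallelscan_nextpage(), under stated assumptions: `SharedCursor`, `cursor_init()` and `cursor_next()` are hypothetical names, `INVALID_BLOCK` stands in for InvalidBlockNumber, and the spinlock that protects the shared state in the real patch is left out.

```c
#include <assert.h>
#include <stdint.h>

#define INVALID_BLOCK UINT32_MAX	/* stand-in for InvalidBlockNumber */

/* Hypothetical model of the shared scan position. */
typedef struct
{
	uint32_t	nblocks;		/* number of blocks in the relation */
	uint32_t	cblock;			/* next block to hand out */
	uint32_t	startblock;		/* block at which the scan started */
} SharedCursor;

/* Start the scan at 'start'; an empty relation is finished immediately. */
static void
cursor_init(SharedCursor *c, uint32_t nblocks, uint32_t start)
{
	assert(nblocks == 0 || start < nblocks);

	c->nblocks = nblocks;
	c->startblock = start;
	c->cblock = (nblocks > 0) ? start : INVALID_BLOCK;
}

/*
 * Hand out the next block.  The position wraps to block 0 after the last
 * block, and the scan ends once the position comes back to the start block.
 * Returns INVALID_BLOCK when there is nothing left to hand out.
 */
static uint32_t
cursor_next(SharedCursor *c)
{
	uint32_t	page = c->cblock;

	if (page != INVALID_BLOCK)
	{
		c->cblock++;
		if (c->cblock >= c->nblocks)
			c->cblock = 0;
		if (c->cblock == c->startblock)
			c->cblock = INVALID_BLOCK;
	}
	return page;
}
```

Each caller gets a distinct block, so with five blocks and a start block of 3 the blocks come out in the order 3, 4, 0, 1, 2, followed by the end-of-scan marker. In the patch every participating backend calls the real function under phs_mutex, which is what makes the same sequence safe across processes.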

[1]: /messages/by-id/CA+TgmoapgKdy_Z0W9mHqZcGSo2t_t-4_V36DXaKim+X_fYp0oQ@mail.gmail.com
[2]: /messages/by-id/CAA4eK1KCymW+-vJuAgSxf-s4K-0X3dBxDcw5Hem+qSgergxY4A@mail.gmail.com

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

parallel_seqscan_partialseqscan_v20.patch (application/octet-stream)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 7fb8a14..c76bfb0 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -730,6 +730,7 @@ ExplainPreScanNode(PlanState *planstate, Bitmapset **rels_used)
 	{
 		case T_SeqScan:
 		case T_SampleScan:
+		case T_PartialSeqScan:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -853,6 +854,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_SampleScan:
 			pname = sname = "Sample Scan";
 			break;
+		case T_PartialSeqScan:
+			pname = sname = "Partial Seq Scan";
+			break;
 		case T_Gather:
 			pname = sname = "Gather";
 			break;
@@ -1006,6 +1010,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	{
 		case T_SeqScan:
 		case T_SampleScan:
+		case T_PartialSeqScan:
 		case T_BitmapHeapScan:
 		case T_TidScan:
 		case T_SubqueryScan:
@@ -1270,6 +1275,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 							 planstate, ancestors, es);
 			/* FALL THRU to print additional fields the same as SeqScan */
 		case T_SeqScan:
+		case T_PartialSeqScan:
 		case T_ValuesScan:
 		case T_CteScan:
 		case T_WorkTableScan:
@@ -2354,6 +2360,7 @@ ExplainTargetRel(Plan *plan, Index rti, ExplainState *es)
 	{
 		case T_SeqScan:
 		case T_SampleScan:
+		case T_PartialSeqScan:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 51edd4c..38a92fe 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -21,8 +21,8 @@ OBJS = execAmi.o execCurrent.o execGrouping.o execIndexing.o execJunk.o \
        nodeHash.o nodeHashjoin.o nodeIndexscan.o nodeIndexonlyscan.o \
        nodeLimit.o nodeLockRows.o \
        nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
-       nodeNestloop.o nodeFunctionscan.o nodeRecursiveunion.o nodeResult.o \
-       nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
+       nodeNestloop.o nodeFunctionscan.o nodePartialSeqscan.o nodeRecursiveunion.o \
+       nodeResult.o nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
        nodeValuesscan.o nodeCtescan.o nodeWorktablescan.o \
        nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
        nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 163650c..f2f9c30 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -38,6 +38,7 @@
 #include "executor/nodeMergejoin.h"
 #include "executor/nodeModifyTable.h"
 #include "executor/nodeNestloop.h"
+#include "executor/nodePartialSeqscan.h"
 #include "executor/nodeRecursiveunion.h"
 #include "executor/nodeResult.h"
 #include "executor/nodeSamplescan.h"
@@ -161,6 +162,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSampleScan((SampleScanState *) node);
 			break;
 
+		case T_PartialSeqScanState:
+			ExecReScanPartialSeqScan((PartialSeqScanState *) node);
+			break;
+
 		case T_GatherState:
 			ExecReScanGather((GatherState *) node);
 			break;
@@ -472,6 +477,7 @@ ExecSupportsBackwardScan(Plan *node)
 			/* Simplify life for tablesample methods by disallowing this */
 			return false;
 
+		case T_PartialSeqScan:
 		case T_Gather:
 			return false;
 
diff --git a/src/backend/executor/execCurrent.c b/src/backend/executor/execCurrent.c
index bcd287f..5bd00cc 100644
--- a/src/backend/executor/execCurrent.c
+++ b/src/backend/executor/execCurrent.c
@@ -262,6 +262,7 @@ search_plan_tree(PlanState *node, Oid table_oid)
 			 */
 		case T_SeqScanState:
 		case T_SampleScanState:
+		case T_PartialSeqScanState:
 		case T_IndexScanState:
 		case T_IndexOnlyScanState:
 		case T_BitmapHeapScanState:
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index e6930c1..966ffb5 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -17,6 +17,7 @@
 
 #include "executor/execParallel.h"
 #include "executor/executor.h"
+#include "executor/nodePartialSeqscan.h"
 #include "executor/tqueue.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/planmain.h"
@@ -42,16 +43,16 @@
 /* DSM structure for accumulating per-PlanState instrumentation. */
 typedef struct SharedPlanStateInstrumentation
 {
-	int plan_node_id;
-	slock_t mutex;
-	Instrumentation	instr;
-} SharedPlanStateInstrumentation;
+	int			plan_node_id;
+	slock_t		mutex;
+	Instrumentation instr;
+}	SharedPlanStateInstrumentation;
 
 /* DSM structure for accumulating per-PlanState instrumentation. */
 struct SharedExecutorInstrumentation
 {
-	int instrument_options;
-	int ps_ninstrument;			/* # of ps_instrument structures following */
+	int			instrument_options;
+	int			ps_ninstrument; /* # of ps_instrument structures following */
 	SharedPlanStateInstrumentation ps_instrument[FLEXIBLE_ARRAY_MEMBER];
 };
 
@@ -59,26 +60,34 @@ struct SharedExecutorInstrumentation
 typedef struct ExecParallelEstimateContext
 {
 	ParallelContext *pcxt;
-	int nnodes;
-} ExecParallelEstimateContext;
+	Size		pscan_len;
+	int			nnodes;
+}	ExecParallelEstimateContext;
 
 /* Context object for ExecParallelEstimate. */
 typedef struct ExecParallelInitializeDSMContext
 {
 	ParallelContext *pcxt;
 	SharedExecutorInstrumentation *instrumentation;
-	int nnodes;
-} ExecParallelInitializeDSMContext;
+	Size		pscan_len;
+	int			nnodes;
+}	ExecParallelInitializeDSMContext;
+
+/*
+ * This is required for parallel plan execution to fetch the information
+ * from dsm.
+ */
+static shm_toc *parallel_shm_toc = NULL;
 
 /* Helper functions that run in the parallel leader. */
 static char *ExecSerializePlan(Plan *plan, EState *estate);
 static bool ExecParallelEstimate(PlanState *node,
-					 ExecParallelEstimateContext *e);
+					 ExecParallelEstimateContext * e);
 static bool ExecParallelInitializeDSM(PlanState *node,
-					 ExecParallelInitializeDSMContext *d);
+						  ExecParallelInitializeDSMContext * d);
 static shm_mq_handle **ExecParallelSetupTupleQueues(ParallelContext *pcxt);
 static bool ExecParallelRetrieveInstrumentation(PlanState *planstate,
-						  SharedExecutorInstrumentation *instrumentation);
+							SharedExecutorInstrumentation * instrumentation);
 
 /* Helper functions that run in the parallel worker. */
 static void ParallelQueryMain(dsm_segment *seg, shm_toc *toc);
@@ -150,7 +159,7 @@ ExecSerializePlan(Plan *plan, EState *estate)
  * we know how many SharedPlanStateInstrumentation structures we need.
  */
 static bool
-ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
+ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext * e)
 {
 	if (planstate == NULL)
 		return false;
@@ -158,10 +167,24 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 	/* Count this node. */
 	e->nnodes++;
 
-	/*
-	 * XXX. Call estimators for parallel-aware nodes here, when we have
-	 * some.
-	 */
+	/* Call estimators for parallel-aware nodes. */
+	switch (nodeTag(planstate))
+	{
+		case T_PartialSeqScanState:
+			{
+				EState	   *estate = ((PartialSeqScanState *) planstate)->ss.ps.state;
+
+				e->pscan_len = heap_parallelscan_estimate(estate->es_snapshot);
+				shm_toc_estimate_chunk(&e->pcxt->estimator, e->pscan_len);
+
+				/* key for partial scan information. */
+				shm_toc_estimate_keys(&e->pcxt->estimator, 1);
+				return true;
+			}
+			break;
+		default:
+			break;
+	}
 
 	return planstate_tree_walker(planstate, ExecParallelEstimate, e);
 }
@@ -175,8 +198,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
  */
 static bool
 ExecParallelInitializeDSM(PlanState *planstate,
-						  ExecParallelInitializeDSMContext *d)
+						  ExecParallelInitializeDSMContext * d)
 {
+	ParallelHeapScanDesc pscan;
+
 	if (planstate == NULL)
 		return false;
 
@@ -196,10 +221,32 @@ ExecParallelInitializeDSM(PlanState *planstate,
 	/* Count this node. */
 	d->nnodes++;
 
-	/*
-	 * XXX. Call initializers for parallel-aware plan nodes, when we have
-	 * some.
-	 */
+	/* Call initializers for parallel-aware plan nodes. */
+	switch (nodeTag(planstate))
+	{
+		case T_PartialSeqScanState:
+			{
+				EState	   *estate = ((PartialSeqScanState *) planstate)->ss.ps.state;
+
+				/*
+				 * Store parallel heap scan descriptor in dynamic shared
+				 * memory.
+				 */
+				pscan = shm_toc_allocate(d->pcxt->toc,
+										 d->pscan_len);
+				heap_parallelscan_initialize(pscan,
+				  ((PartialSeqScanState *) planstate)->ss.ss_currentRelation,
+											 estate->es_snapshot,
+											 true);
+				shm_toc_insert(d->pcxt->toc,
+				((PartialSeqScanState *) planstate)->ss.ps.plan->plan_node_id,
+							   pscan);
+				return true;
+			}
+			break;
+		default:
+			break;
+	}
 
 	return planstate_tree_walker(planstate, ExecParallelInitializeDSM, d);
 }
@@ -310,10 +357,11 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate, int nworkers)
 	shm_toc_estimate_keys(&pcxt->estimator, 1);
 
 	/*
-	 * Give parallel-aware nodes a chance to add to the estimates, and get
-	 * a count of how many PlanState nodes there are.
+	 * Give parallel-aware nodes a chance to add to the estimates, and get a
+	 * count of how many PlanState nodes there are.
 	 */
 	e.pcxt = pcxt;
+	e.pscan_len = 0;
 	e.nnodes = 0;
 	ExecParallelEstimate(planstate, &e);
 
@@ -358,9 +406,9 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate, int nworkers)
 	pei->tqueue = ExecParallelSetupTupleQueues(pcxt);
 
 	/*
-	 * If instrumentation options were supplied, allocate space for the
-	 * data.  It only gets partially initialized here; the rest happens
-	 * during ExecParallelInitializeDSM.
+	 * If instrumentation options were supplied, allocate space for the data.
+	 * It only gets partially initialized here; the rest happens during
+	 * ExecParallelInitializeDSM.
 	 */
 	if (estate->es_instrument)
 	{
@@ -379,6 +427,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate, int nworkers)
 	 */
 	d.pcxt = pcxt;
 	d.instrumentation = instrumentation;
+	d.pscan_len = e.pscan_len;
 	d.nnodes = 0;
 	ExecParallelInitializeDSM(planstate, &d);
 
@@ -399,10 +448,10 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate, int nworkers)
  */
 static bool
 ExecParallelRetrieveInstrumentation(PlanState *planstate,
-						  SharedExecutorInstrumentation *instrumentation)
+							 SharedExecutorInstrumentation * instrumentation)
 {
-	int		i;
-	int		plan_node_id = planstate->plan->plan_node_id;
+	int			i;
+	int			plan_node_id = planstate->plan->plan_node_id;
 	SharedPlanStateInstrumentation *ps_instrument;
 
 	/* Find the instumentation for this node. */
@@ -427,7 +476,7 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 void
 ExecParallelFinish(ParallelExecutorInfo *pei)
 {
-	int		i;
+	int			i;
 
 	/* First, wait for the workers to finish. */
 	WaitForParallelWorkersToFinish(pei->pcxt);
@@ -499,17 +548,17 @@ ExecParallelGetQueryDesc(shm_toc *toc, DestReceiver *receiver,
  */
 static bool
 ExecParallelReportInstrumentation(PlanState *planstate,
-						  SharedExecutorInstrumentation *instrumentation)
+							 SharedExecutorInstrumentation * instrumentation)
 {
-	int		i;
-	int		plan_node_id = planstate->plan->plan_node_id;
+	int			i;
+	int			plan_node_id = planstate->plan->plan_node_id;
 	SharedPlanStateInstrumentation *ps_instrument;
 
 	/*
 	 * If we shuffled the plan_node_id values in ps_instrument into sorted
-	 * order, we could use binary search here.  This might matter someday
-	 * if we're pushing down sufficiently large plan trees.  For now, do it
-	 * the slow, dumb way.
+	 * order, we could use binary search here.  This might matter someday if
+	 * we're pushing down sufficiently large plan trees.  For now, do it the
+	 * slow, dumb way.
 	 */
 	for (i = 0; i < instrumentation->ps_ninstrument; ++i)
 		if (instrumentation->ps_instrument[i].plan_node_id == plan_node_id)
@@ -518,8 +567,8 @@ ExecParallelReportInstrumentation(PlanState *planstate,
 		elog(ERROR, "plan node %d not found", plan_node_id);
 
 	/*
-	 * There's one SharedPlanStateInstrumentation per plan_node_id, so we
-	 * must use a spinlock in case multiple workers report at the same time.
+	 * There's one SharedPlanStateInstrumentation per plan_node_id, so we must
+	 * use a spinlock in case multiple workers report at the same time.
 	 */
 	ps_instrument = &instrumentation->ps_instrument[i];
 	SpinLockAcquire(&ps_instrument->mutex);
@@ -531,6 +580,15 @@ ExecParallelReportInstrumentation(PlanState *planstate,
 }
 
 /*
+ * GetParallelShmToc
+ */
+shm_toc *
+GetParallelShmToc(void)
+{
+	return parallel_shm_toc;
+}
+
+/*
  * Main entrypoint for parallel query worker processes.
  *
  * We reach this function from ParallelMain, so the setup necessary to create
@@ -561,6 +619,8 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
 		instrument_options = instrumentation->instrument_options;
 	queryDesc = ExecParallelGetQueryDesc(toc, receiver, instrument_options);
 
+	parallel_shm_toc = toc;
+
 	/* Prepare to track buffer usage during query execution. */
 	InstrStartParallelQuery();
 
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 5bc1d48..87b022d 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -100,6 +100,7 @@
 #include "executor/nodeMergejoin.h"
 #include "executor/nodeModifyTable.h"
 #include "executor/nodeNestloop.h"
+#include "executor/nodePartialSeqscan.h"
 #include "executor/nodeGather.h"
 #include "executor/nodeRecursiveunion.h"
 #include "executor/nodeResult.h"
@@ -309,6 +310,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												  estate, eflags);
 			break;
 
+		case T_PartialSeqScan:
+			result = (PlanState *) ExecInitPartialSeqScan((PartialSeqScan *) node,
+														  estate, eflags);
+			break;
+
 		case T_Gather:
 			result = (PlanState *) ExecInitGather((Gather *) node,
 												  estate, eflags);
@@ -511,6 +517,10 @@ ExecProcNode(PlanState *node)
 			result = ExecUnique((UniqueState *) node);
 			break;
 
+		case T_PartialSeqScanState:
+			result = ExecPartialSeqScan((PartialSeqScanState *) node);
+			break;
+
 		case T_GatherState:
 			result = ExecGather((GatherState *) node);
 			break;
@@ -669,6 +679,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSampleScan((SampleScanState *) node);
 			break;
 
+		case T_PartialSeqScanState:
+			ExecEndPartialSeqScan((PartialSeqScanState *) node);
+			break;
+
 		case T_GatherState:
 			ExecEndGather((GatherState *) node);
 			break;
diff --git a/src/backend/executor/nodeGather.c b/src/backend/executor/nodeGather.c
index c689a4d..57e4cd4 100644
--- a/src/backend/executor/nodeGather.c
+++ b/src/backend/executor/nodeGather.c
@@ -115,6 +115,8 @@ ExecGather(GatherState *node)
 										 estate,
 								  ((Gather *) (node->ps.plan))->num_workers);
 
+		outerPlanState(node)->toc = node->pei->pcxt->toc;
+
 		/*
 		 * Register backend workers. If the required number of workers are not
 		 * available then we perform the scan with available workers and if
@@ -247,7 +249,7 @@ gather_getnext(GatherState *gatherstate)
 void
 ExecShutdownGather(GatherState *node)
 {
-	Gather *gather;
+	Gather	   *gather;
 
 	if (node->pei == NULL || node->pei->pcxt == NULL)
 		return;
@@ -295,5 +297,15 @@ ExecReScanGather(GatherState *node)
 	 */
 	ExecShutdownGather(node);
 
+	/*
+	 * Free the parallel executor information so that the parallel context
+	 * and workers can be initialized again during the next execution.
+	 */
+	if (node->pei)
+	{
+		pfree(node->pei);
+		node->pei = NULL;
+	}
+
 	ExecReScan(node->ps.lefttree);
 }
diff --git a/src/backend/executor/nodePartialSeqscan.c b/src/backend/executor/nodePartialSeqscan.c
new file mode 100644
index 0000000..f68a449
--- /dev/null
+++ b/src/backend/executor/nodePartialSeqscan.c
@@ -0,0 +1,310 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodePartialSeqscan.c
+ *	  Support routines for partial sequential scans of relations.
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodePartialSeqscan.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * INTERFACE ROUTINES
+ *		ExecPartialSeqScan				scans a relation partially.
+ *		PartialSeqNext					retrieve next tuple from heap.
+ *		ExecInitPartialSeqScan			creates and initializes a partial seqscan node.
+ *		ExecEndPartialSeqScan			releases any storage allocated.
+ */
+#include "postgres.h"
+
+#include "access/relscan.h"
+#include "executor/execdebug.h"
+#include "executor/execParallel.h"
+#include "executor/nodePartialSeqscan.h"
+#include "utils/rel.h"
+
+
+
+/* ----------------------------------------------------------------
+ *						Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		PartialSeqNext
+ *
+ *		This is a workhorse for ExecPartialSeqScan
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+PartialSeqNext(PartialSeqScanState *node)
+{
+	HeapTuple	tuple;
+	HeapScanDesc scandesc;
+	EState	   *estate;
+	ScanDirection direction;
+	TupleTableSlot *slot;
+
+	/*
+	 * get information from the estate and scan state
+	 */
+	scandesc = node->ss.ss_currentScanDesc;
+	estate = node->ss.ps.state;
+	direction = estate->es_direction;
+	slot = node->ss.ss_ScanTupleSlot;
+
+	/*
+	 * get the next tuple from the table
+	 */
+	tuple = heap_getnext(scandesc, direction);
+
+	/*
+	 * save the tuple and the buffer returned to us by the access methods in
+	 * our scan tuple slot and return the slot.  Note: we pass 'false' because
+	 * tuples returned by heap_getnext() are pointers onto disk pages and were
+	 * not created with palloc() and so should not be pfree()'d.  Note also
+	 * that ExecStoreTuple will increment the refcount of the buffer; the
+	 * refcount will not be dropped until the tuple table slot is cleared.
+	 */
+	if (tuple)
+		ExecStoreTuple(tuple,	/* tuple to store */
+					   slot,	/* slot to store in */
+					   scandesc->rs_cbuf,		/* buffer associated with this
+												 * tuple */
+					   false);	/* don't pfree this pointer */
+	else
+		ExecClearTuple(slot);
+
+	return slot;
+}
+
+/*
+ * PartialSeqRecheck -- access method routine to recheck a tuple in EvalPlanQual
+ */
+static bool
+PartialSeqRecheck(PartialSeqScanState *node, TupleTableSlot *slot)
+{
+	/*
+	 * Note that unlike IndexScan, PartialSeqScan never uses keys in
+	 * heap_beginscan (and this is very bad), so here we do not check whether
+	 * the keys are ok or not.
+	 */
+	return true;
+}
+
+/* ----------------------------------------------------------------
+ *		InitPartialScanRelation
+ *
+ *		Set up to access the scan relation.
+ * ----------------------------------------------------------------
+ */
+static void
+InitPartialScanRelation(PartialSeqScanState *node, EState *estate, int eflags)
+{
+	Relation	currentRelation;
+	shm_toc    *toc;
+
+	/*
+	 * get the relation object id from the relid'th entry in the range table,
+	 * open that relation and acquire appropriate lock on it.
+	 */
+	currentRelation = ExecOpenScanRelation(estate,
+									  ((Scan *) node->ss.ps.plan)->scanrelid,
+										   eflags);
+
+	/*
+	 * Parallel scan descriptor is initialized and stored in dynamic shared
+	 * memory segment by master backend and parallel workers retrieve it from
+	 * shared memory.  We set 'toc' (place to lookup parallel scan descriptor)
+	 * as retrieved by attaching to dsm for parallel workers whereas master
+	 * backend stores it directly in partial scan state node after
+	 * initializing workers.
+	 */
+	toc = GetParallelShmToc();
+	if (toc)
+		node->ss.ps.toc = toc;
+
+	node->ss.ss_currentRelation = currentRelation;
+
+	/* and report the scan tuple slot's rowtype */
+	ExecAssignScanType(&node->ss, RelationGetDescr(currentRelation));
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitPartialSeqScan
+ * ----------------------------------------------------------------
+ */
+PartialSeqScanState *
+ExecInitPartialSeqScan(PartialSeqScan *node, EState *estate, int eflags)
+{
+	PartialSeqScanState *scanstate;
+
+	/*
+	 * Once upon a time it was possible to have an outerPlan of a SeqScan, but
+	 * not any more.
+	 */
+	Assert(outerPlan(node) == NULL);
+	Assert(innerPlan(node) == NULL);
+
+	/*
+	 * create state structure
+	 */
+	scanstate = makeNode(PartialSeqScanState);
+	scanstate->ss.ps.plan = (Plan *) node;
+	scanstate->ss.ps.state = estate;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * create expression context for node
+	 */
+	ExecAssignExprContext(estate, &scanstate->ss.ps);
+
+	/*
+	 * initialize child expressions
+	 */
+	scanstate->ss.ps.targetlist = (List *)
+		ExecInitExpr((Expr *) node->plan.targetlist,
+					 (PlanState *) scanstate);
+	scanstate->ss.ps.qual = (List *)
+		ExecInitExpr((Expr *) node->plan.qual,
+					 (PlanState *) scanstate);
+
+	/*
+	 * tuple table initialization
+	 */
+	ExecInitResultTupleSlot(estate, &scanstate->ss.ps);
+	ExecInitScanTupleSlot(estate, &scanstate->ss);
+
+	/*
+	 * initialize scan relation
+	 */
+	InitPartialScanRelation(scanstate, estate, eflags);
+
+	scanstate->ss.ps.ps_TupFromTlist = false;
+
+	/*
+	 * Initialize result tuple type and projection info.
+	 */
+	ExecAssignResultTypeFromTL(&scanstate->ss.ps);
+	ExecAssignScanProjectionInfo(&scanstate->ss);
+
+	scanstate->scan_initialized = false;
+
+	return scanstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecPartialSeqScan(node)
+ *
+ *		Scans the relation and returns the next qualifying tuple.
+ *		We call the ExecScan() routine and pass it the appropriate
+ *		access method functions.
+ * ----------------------------------------------------------------
+ */
+TupleTableSlot *
+ExecPartialSeqScan(PartialSeqScanState *node)
+{
+	/*
+	 * Initialize the scan on first execution.  Normally we initialize it
+	 * during ExecutorStart phase, however we need ParallelHeapScanDesc to
+	 * initialize the scan in case of this node and the same is initialized by
+	 * the Gather node during ExecutorRun phase.
+	 */
+	if (!node->scan_initialized)
+	{
+		ParallelHeapScanDesc pscan;
+
+		/*
+		 * Parallel scan descriptor is initialized and stored in dynamic
+		 * shared memory segment by master backend, parallel workers and local
+		 * scan by master backend retrieve it from shared memory.  If the scan
+		 * descriptor is available on first execution, then we need to
+		 * re-initialize for rescan.
+		 */
+		Assert(node->ss.ps.toc);
+
+		pscan = shm_toc_lookup(node->ss.ps.toc, node->ss.ps.plan->plan_node_id);
+
+		if (!node->ss.ss_currentScanDesc)
+		{
+			node->ss.ss_currentScanDesc =
+				heap_beginscan_parallel(node->ss.ss_currentRelation, pscan);
+		}
+		else
+		{
+			heap_parallel_rescan(pscan, node->ss.ss_currentScanDesc);
+		}
+
+		node->scan_initialized = true;
+	}
+
+	return ExecScan((ScanState *) node,
+					(ExecScanAccessMtd) PartialSeqNext,
+					(ExecScanRecheckMtd) PartialSeqRecheck);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndPartialSeqScan
+ *
+ *		frees any storage allocated through C routines.
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndPartialSeqScan(PartialSeqScanState *node)
+{
+	Relation	relation;
+	HeapScanDesc scanDesc;
+
+	/*
+	 * get information from node
+	 */
+	relation = node->ss.ss_currentRelation;
+	scanDesc = node->ss.ss_currentScanDesc;
+
+	/*
+	 * Free the exprcontext
+	 */
+	ExecFreeExprContext(&node->ss.ps);
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+
+	/*
+	 * close heap scan
+	 */
+	if (scanDesc)
+		heap_endscan(scanDesc);
+
+	/*
+	 * close the heap relation.
+	 */
+	ExecCloseScanRelation(relation);
+}
+
+/* ----------------------------------------------------------------
+ *						Join Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecReScanPartialSeqScan
+ *
+ *		Rescans the relation.
+ * ----------------------------------------------------------------
+ */
+void
+ExecReScanPartialSeqScan(PartialSeqScanState *node)
+{
+	if (node->scan_initialized)
+		node->scan_initialized = false;
+
+	ExecScanReScan((ScanState *) node);
+}
diff --git a/src/backend/executor/nodeResult.c b/src/backend/executor/nodeResult.c
index 8d3dde0..b348bfd 100644
--- a/src/backend/executor/nodeResult.c
+++ b/src/backend/executor/nodeResult.c
@@ -75,6 +75,13 @@ ExecResult(ResultState *node)
 	econtext = node->ps.ps_ExprContext;
 
 	/*
+	 * Result node can be added as a gating node on top of PartialSeqScan
+	 * node, so we need to pass the toc information down to the outer node.
+	 */
+	if (node->ps.toc)
+		outerPlanState(node)->toc = node->ps.toc;
+
+	/*
 	 * check constant qualifications like (2 > 1), if not already done
 	 */
 	if (node->rs_checkqual)
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 0b4ab23..c7194d5 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -405,6 +405,22 @@ _copySampleScan(const SampleScan *from)
 }
 
 /*
+ * _copyPartialSeqScan
+ */
+static PartialSeqScan *
+_copyPartialSeqScan(const PartialSeqScan *from)
+{
+	PartialSeqScan    *newnode = makeNode(PartialSeqScan);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopyScanFields((const Scan *) from, (Scan *) newnode);
+
+	return newnode;
+}
+
+/*
  * _copyIndexScan
  */
 static IndexScan *
@@ -4257,6 +4273,9 @@ copyObject(const void *from)
 		case T_Scan:
 			retval = _copyScan(from);
 			break;
+		case T_PartialSeqScan:
+			retval = _copyPartialSeqScan(from);
+			break;
 		case T_Gather:
 			retval = _copyGather(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index df7f6e1..20e9ef7 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -433,6 +433,14 @@ _outBitmapOr(StringInfo str, const BitmapOr *node)
 }
 
 static void
+_outPartialSeqScan(StringInfo str, const PartialSeqScan *node)
+{
+	WRITE_NODE_TYPE("PARTIALSEQSCAN");
+
+	_outScanInfo(str, (const Scan *) node);
+}
+
+static void
 _outGather(StringInfo str, const Gather *node)
 {
 	WRITE_NODE_TYPE("GATHER");
@@ -3010,6 +3018,9 @@ _outNode(StringInfo str, const void *obj)
 			case T_BitmapOr:
 				_outBitmapOr(str, obj);
 				break;
+			case T_PartialSeqScan:
+				_outPartialSeqScan(str, obj);
+				break;
 			case T_Gather:
 				_outGather(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 5802a73..398e7ce 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1621,6 +1621,19 @@ _readSampleScan(void)
 }
 
 /*
+ * _readPartialSeqScan
+ */
+static PartialSeqScan *
+_readPartialSeqScan(void)
+{
+	READ_LOCALS_NO_FIELDS(PartialSeqScan);
+
+	ReadCommonScan(local_node);
+
+	READ_DONE();
+}
+
+/*
  * _readIndexScan
  */
 static IndexScan *
@@ -2338,6 +2351,8 @@ parseNodeString(void)
 		return_value = _readSeqScan();
 	else if (MATCH("SAMPLESCAN", 10))
 		return_value = _readSampleScan();
+	else if (MATCH("PARTIALSEQSCAN", 14))
+		return_value = _readPartialSeqScan();
 	else if (MATCH("INDEXSCAN", 9))
 		return_value = _readIndexScan();
 	else if (MATCH("INDEXONLYSCAN", 13))
diff --git a/src/backend/optimizer/path/Makefile b/src/backend/optimizer/path/Makefile
index 6864a62..6e462b1 100644
--- a/src/backend/optimizer/path/Makefile
+++ b/src/backend/optimizer/path/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = allpaths.o clausesel.o costsize.o equivclass.o indxpath.o \
-       joinpath.o joinrels.o pathkeys.o tidpath.o
+       joinpath.o joinrels.o pathkeys.o parallelpath.o tidpath.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 8fc1cfd..c2ae95d 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -477,6 +477,9 @@ set_plain_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
 	/* Consider sequential scan */
 	add_path(rel, create_seqscan_path(root, rel, required_outer));
 
+	/* Consider parallel scans */
+	create_parallelscan_paths(root, rel, required_outer);
+
 	/* Consider index scans */
 	create_index_paths(root, rel);
 
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 1b61fd9..0a4a904 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -296,6 +296,49 @@ cost_samplescan(Path *path, PlannerInfo *root,
 }
 
 /*
+ * cost_partialseqscan
+ *	  Determines and returns the cost of scanning a relation partially.
+ *
+ * 'baserel' is the relation to be scanned
+ * 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
+ * 'nworkers' is the number of workers among which the work will be
+ *			distributed
+ */
+void
+cost_partialseqscan(Path *path, PlannerInfo *root,
+					RelOptInfo *baserel, ParamPathInfo *param_info,
+					int nworkers)
+{
+	Cost		startup_cost = 0;
+	Cost		run_cost = 0;
+
+	cost_seqscan(path, root, baserel, param_info);
+
+	startup_cost = path->startup_cost;
+
+	run_cost = path->total_cost - startup_cost;
+
+	/*
+	 * Account for small cost for communication related to scan via the
+	 * ParallelHeapScanDesc.
+	 */
+	run_cost += 0.01;
+
+	/*
+	 * Runtime cost will be equally shared by all workers and the master
+	 * backend (hence nworkers + 1).  The assumption here is that disk access
+	 * cost will also be equally shared, which is generally true unless there
+	 * are too many workers working on a relatively small number of blocks.
+	 * If we come across any such case, then we can think of changing the
+	 * current cost model for partial sequential scan.
+	 */
+	run_cost = run_cost / (nworkers + 1);
+
+	path->startup_cost = startup_cost;
+	path->total_cost = startup_cost + run_cost;
+}
+
+/*
  * cost_gather
  *	  Determines and returns the cost of gather path.
  *
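
To sanity-check the arithmetic in cost_partialseqscan above, here is a
minimal stand-alone sketch.  It assumes the plain sequential scan's startup
and total cost are already known.  The helper name
partial_seqscan_total_cost is hypothetical and is not part of the patch;
only the 0.01 communication charge and the division by (nworkers + 1) are
taken from the hunk.

```c
#include <assert.h>

typedef double Cost;

/*
 * Hypothetical stand-alone version of the cost arithmetic: take the
 * sequential scan cost, add the small communication charge, and divide
 * the run cost among the workers plus the master backend.
 */
static Cost
partial_seqscan_total_cost(Cost startup_cost, Cost seqscan_total_cost,
						   int nworkers)
{
	Cost		run_cost;

	assert(nworkers >= 0);

	/* run cost of the plain sequential scan */
	run_cost = seqscan_total_cost - startup_cost;

	/* small charge for coordination via the shared scan descriptor */
	run_cost += 0.01;

	/* share the run cost among nworkers workers and the master */
	run_cost = run_cost / (nworkers + 1);

	return startup_cost + run_cost;
}
```

For example, with a startup cost of 0, a sequential scan total cost of 100
and 3 workers, the sketch yields (100 + 0.01) / 4 = 25.0025.  Note that the
startup cost is not divided, so worker startup and dynamic shared memory
setup would have to be charged elsewhere (for instance in the Gather cost).
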
diff --git a/src/backend/optimizer/path/parallelpath.c b/src/backend/optimizer/path/parallelpath.c
new file mode 100644
index 0000000..02d2392
--- /dev/null
+++ b/src/backend/optimizer/path/parallelpath.c
@@ -0,0 +1,132 @@
+/*-------------------------------------------------------------------------
+ *
+ * parallelpath.c
+ *	  Routines to determine parallel paths for scanning a given relation.
+ *
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/optimizer/path/parallelpath.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heapam.h"
+#include "optimizer/clauses.h"
+#include "optimizer/cost.h"
+#include "optimizer/pathnode.h"
+#include "optimizer/paths.h"
+#include "parser/parsetree.h"
+#include "utils/rel.h"
+
+
+/*
+ * expr_is_parallel_safe
+ *	  Is a particular expression parallel safe?
+ *
+ * Conditions checked here:
+ *
+ * 1. The expression must not contain any parallel unsafe or parallel
+ * restricted functions.
+ *
+ * 2. The expression must not contain any initplan or subplan.  We can
+ * probably remove this restriction once we have support of infrastructure
+ * for execution of initplans and subplans at parallel (Gather) nodes.
+ */
+bool
+expr_is_parallel_safe(Node *node)
+{
+	if (check_parallel_safety(node, false))
+		return false;
+
+	if (contain_subplans_or_initplans(node))
+		return false;
+
+	return true;
+}
+
+/*
+ * create_parallelscan_paths
+ *	  Create paths corresponding to parallel scans of the given rel.
+ *	  Currently we only support partial sequential scan.
+ *
+ *	  Candidate paths are added to the rel's pathlist (using add_path).
+ */
+void
+create_parallelscan_paths(PlannerInfo *root, RelOptInfo *rel,
+						  Relids required_outer)
+{
+	int			num_parallel_workers = 0;
+	int			estimated_parallel_workers = 0;
+	Oid			reloid;
+	Relation	relation;
+	Path	   *subpath;
+	ListCell   *l;
+
+	/*
+	 * parallel scan is possible only if user has set max_parallel_degree to
+	 * a value greater than 0 and the query is parallel-safe.
+	 */
+	if (max_parallel_degree <= 0 || !root->glob->parallelModeOK)
+		return;
+
+	/*
+	 * There should be at least a thousand pages to scan for each worker. This
+	 * number is somewhat arbitrary, however we don't want to spawn workers to
+	 * scan smaller relations as that will be costly.
+	 */
+	estimated_parallel_workers = rel->pages / 1000;
+
+	if (estimated_parallel_workers <= 0)
+		return;
+
+	reloid = planner_rt_fetch(rel->relid, root)->relid;
+
+	relation = heap_open(reloid, NoLock);
+
+	/*
+	 * Temporary relations can't be scanned by parallel workers as they are
+	 * visible only to local sessions.
+	 */
+	if (RelationUsesLocalBuffers(relation))
+	{
+		heap_close(relation, NoLock);
+		return;
+	}
+
+	heap_close(relation, NoLock);
+
+	/*
+	 * Allow parallel paths only if all the clauses for relation are parallel
+	 * safe.  We can allow execution of parallel restricted clauses in master
+	 * backend, but for that planner should have infrastructure to pull all
+	 * the parallel restricted clauses from below nodes to the Gather node
+	 * which will then execute such clauses in master backend.
+	 */
+	foreach(l, rel->baserestrictinfo)
+	{
+		RestrictInfo *rinfo = (RestrictInfo *) lfirst(l);
+
+		if (!expr_is_parallel_safe((Node *) rinfo->clause))
+			return;
+	}
+
+	num_parallel_workers = Min(max_parallel_degree,
+							   estimated_parallel_workers);
+
+	/*
+	 * Create the partial scan path which each worker backend needs to
+	 * execute.
+	 */
+	subpath = create_partialseqscan_path(root, rel, required_outer,
+										 num_parallel_workers);
+
+	/* Create the gather path which master backend needs to execute. */
+	add_path(rel, (Path *) create_gather_path(root, rel, subpath,
+											  required_outer,
+											  num_parallel_workers));
+}
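
The worker-count decision made in create_parallelscan_paths above can be
summarized in a few lines.  This is only a sketch under stated assumptions:
the function name choose_parallel_workers and the PAGES_PER_WORKER macro
are hypothetical and do not exist in the patch.  The 1000-page threshold
and the Min() against max_parallel_degree are taken from the hunk.  The
checks for temporary relations and parallel-unsafe clauses are left out.

```c
#include <assert.h>

/* Threshold used in the patch: at least this many pages per worker. */
#define PAGES_PER_WORKER 1000

/*
 * Hypothetical helper: returns the number of workers to plan for, or 0
 * when no parallel path should be generated at all.
 */
static int
choose_parallel_workers(unsigned int rel_pages, int max_parallel_degree)
{
	int			estimated;

	/* parallel scan is disabled unless the degree is positive */
	if (max_parallel_degree <= 0)
		return 0;

	/* integer division: relations under 1000 pages get no workers */
	estimated = (int) (rel_pages / PAGES_PER_WORKER);
	if (estimated <= 0)
		return 0;

	/* never plan for more workers than the configured degree */
	return (estimated < max_parallel_degree) ? estimated : max_parallel_degree;
}
```

With max_parallel_degree = 4, a 999-page relation gets no parallel path, a
2500-page relation is planned with 2 workers, and a 100000-page relation is
capped at 4 workers.
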
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 0ee7392..47758a3 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -60,6 +60,8 @@ static SeqScan *create_seqscan_plan(PlannerInfo *root, Path *best_path,
 					List *tlist, List *scan_clauses);
 static SampleScan *create_samplescan_plan(PlannerInfo *root, Path *best_path,
 					   List *tlist, List *scan_clauses);
+static Scan *create_partialseqscan_plan(PlannerInfo *root, Path *best_path,
+						   List *tlist, List *scan_clauses);
 static Gather *create_gather_plan(PlannerInfo *root,
 				   GatherPath *best_path);
 static Scan *create_indexscan_plan(PlannerInfo *root, IndexPath *best_path,
@@ -106,6 +108,8 @@ static void copy_plan_costsize(Plan *dest, Plan *src);
 static SeqScan *make_seqscan(List *qptlist, List *qpqual, Index scanrelid);
 static SampleScan *make_samplescan(List *qptlist, List *qpqual, Index scanrelid,
 				TableSampleClause *tsc);
+static PartialSeqScan *make_partialseqscan(List *qptlist, List *qpqual,
+					Index scanrelid);
 static Gather *make_gather(List *qptlist, List *qpqual,
 			int nworkers, bool single_copy, Plan *subplan);
 static IndexScan *make_indexscan(List *qptlist, List *qpqual, Index scanrelid,
@@ -238,6 +242,7 @@ create_plan_recurse(PlannerInfo *root, Path *best_path)
 	{
 		case T_SeqScan:
 		case T_SampleScan:
+		case T_PartialSeqScan:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -364,6 +369,13 @@ create_scan_plan(PlannerInfo *root, Path *best_path)
 												   scan_clauses);
 			break;
 
+		case T_PartialSeqScan:
+			plan = (Plan *) create_partialseqscan_plan(root,
+													   best_path,
+													   tlist,
+													   scan_clauses);
+			break;
+
 		case T_IndexScan:
 			plan = (Plan *) create_indexscan_plan(root,
 												  (IndexPath *) best_path,
@@ -568,6 +580,7 @@ disuse_physical_tlist(PlannerInfo *root, Plan *plan, Path *path)
 	{
 		case T_SeqScan:
 		case T_SampleScan:
+		case T_PartialSeqScan:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
 		case T_BitmapHeapScan:
@@ -1110,6 +1123,46 @@ create_unique_plan(PlannerInfo *root, UniquePath *best_path)
 }
 
 /*
+ * create_partialseqscan_plan
+ *
+ * Returns a partial seqscan plan for the base relation scanned by
+ * 'best_path' with restriction clauses 'scan_clauses' and targetlist
+ * 'tlist'.
+ */
+static Scan *
+create_partialseqscan_plan(PlannerInfo *root, Path *best_path,
+						   List *tlist, List *scan_clauses)
+{
+	Scan	   *scan_plan;
+	Index		scan_relid = best_path->parent->relid;
+
+	/* it should be a base rel... */
+	Assert(scan_relid > 0);
+	Assert(best_path->parent->rtekind == RTE_RELATION);
+
+	/* Sort clauses into best execution order */
+	scan_clauses = order_qual_clauses(root, scan_clauses);
+
+	/* Reduce RestrictInfo list to bare expressions; ignore pseudoconstants */
+	scan_clauses = extract_actual_clauses(scan_clauses, false);
+
+	/* Replace any outer-relation variables with nestloop params */
+	if (best_path->param_info)
+	{
+		scan_clauses = (List *)
+			replace_nestloop_params(root, (Node *) scan_clauses);
+	}
+
+	scan_plan = (Scan *) make_partialseqscan(tlist,
+											 scan_clauses,
+											 scan_relid);
+
+	copy_path_costsize(&scan_plan->plan, best_path);
+
+	return scan_plan;
+}
+
+/*
  * create_gather_plan
  *
  *	  Create a Gather plan for 'best_path' and (recursively) plans
@@ -4771,6 +4824,24 @@ make_unique(Plan *lefttree, List *distinctList)
 	return node;
 }
 
+static PartialSeqScan *
+make_partialseqscan(List *qptlist,
+					List *qpqual,
+					Index scanrelid)
+{
+	PartialSeqScan *node = makeNode(PartialSeqScan);
+	Plan	   *plan = &node->plan;
+
+	/* cost should be inserted by caller */
+	plan->targetlist = qptlist;
+	plan->qual = qpqual;
+	plan->lefttree = NULL;
+	plan->righttree = NULL;
+	node->scanrelid = scanrelid;
+
+	return node;
+}
+
 static Gather *
 make_gather(List *qptlist,
 			List *qpqual,
@@ -5169,6 +5240,7 @@ is_projection_capable_plan(Plan *plan)
 		case T_Append:
 		case T_MergeAppend:
 		case T_RecursiveUnion:
+		case T_Gather:
 			return false;
 		default:
 			break;
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index e1ee67c..4ebbabf 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -201,13 +201,13 @@ standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
 	glob->hasRowSecurity = false;
 
 	/*
-	 * Assess whether it's feasible to use parallel mode for this query.
-	 * We can't do this in a standalone backend, or if the command will
-	 * try to modify any data, or if this is a cursor operation, or if any
+	 * Assess whether it's feasible to use parallel mode for this query. We
+	 * can't do this in a standalone backend, or if the command will try to
+	 * modify any data, or if this is a cursor operation, or if any
 	 * parallel-unsafe functions are present in the query tree.
 	 *
-	 * For now, we don't try to use parallel mode if we're running inside
-	 * a parallel worker.  We might eventually be able to relax this
+	 * For now, we don't try to use parallel mode if we're running inside a
+	 * parallel worker.  We might eventually be able to relax this
 	 * restriction, but for now it seems best not to have parallel workers
 	 * trying to create their own parallel workers.
 	 */
@@ -215,7 +215,7 @@ standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
 		IsUnderPostmaster && dynamic_shared_memory_type != DSM_IMPL_NONE &&
 		parse->commandType == CMD_SELECT && !parse->hasModifyingCTE &&
 		parse->utilityStmt == NULL && !IsParallelWorker() &&
-		!contain_parallel_unsafe((Node *) parse);
+		!check_parallel_safety((Node *) parse, true);
 
 	/*
 	 * glob->parallelModeOK should tell us whether it's necessary to impose
@@ -228,9 +228,9 @@ standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
 	 *
 	 * (It's been suggested that we should always impose these restrictions
 	 * whenever glob->parallelModeOK is true, so that it's easier to notice
-	 * incorrectly-labeled functions sooner.  That might be the right thing
-	 * to do, but for now I've taken this approach.  We could also control
-	 * this with a GUC.)
+	 * incorrectly-labeled functions sooner.  That might be the right thing to
+	 * do, but for now I've taken this approach.  We could also control this
+	 * with a GUC.)
 	 *
 	 * FIXME: It's assumed that code further down will set parallelModeNeeded
 	 * to true if a parallel path is actually chosen.  Since the core
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 9392d61..aff78fc 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -447,6 +447,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
 			{
 				SeqScan    *splan = (SeqScan *) plan;
 
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 6b32f85..b85e4f6 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2234,6 +2234,7 @@ finalize_plan(PlannerInfo *root, Plan *plan, Bitmapset *valid_params,
 			break;
 
 		case T_SeqScan:
+		case T_PartialSeqScan:
 			context.paramids = bms_add_members(context.paramids, scan_params);
 			break;
 
diff --git a/src/backend/optimizer/util/clauses.c b/src/backend/optimizer/util/clauses.c
index f2c8551..2355cc6 100644
--- a/src/backend/optimizer/util/clauses.c
+++ b/src/backend/optimizer/util/clauses.c
@@ -87,16 +87,25 @@ typedef struct
 	char	   *prosrc;
 } inline_error_callback_arg;
 
+typedef struct
+{
+	bool		allow_restricted;
+}	check_parallel_safety_arg;
+
 static bool contain_agg_clause_walker(Node *node, void *context);
 static bool count_agg_clauses_walker(Node *node,
 						 count_agg_clauses_context *context);
 static bool find_window_functions_walker(Node *node, WindowFuncLists *lists);
 static bool expression_returns_set_rows_walker(Node *node, double *count);
 static bool contain_subplans_walker(Node *node, void *context);
+static bool contain_subplans_or_initplans_walker(Node *node, void *context);
 static bool contain_mutable_functions_walker(Node *node, void *context);
 static bool contain_volatile_functions_walker(Node *node, void *context);
 static bool contain_volatile_functions_not_nextval_walker(Node *node, void *context);
-static bool contain_parallel_unsafe_walker(Node *node, void *context);
+static bool check_parallel_safety_walker(Node *node,
+							 check_parallel_safety_arg * context);
+static bool parallel_too_dangerous(char proparallel,
+					   check_parallel_safety_arg * context);
 static bool contain_nonstrict_functions_walker(Node *node, void *context);
 static bool contain_leaked_vars_walker(Node *node, void *context);
 static Relids find_nonnullable_rels_walker(Node *node, bool top_level);
@@ -1204,13 +1213,16 @@ contain_volatile_functions_not_nextval_walker(Node *node, void *context)
  *****************************************************************************/
 
 bool
-contain_parallel_unsafe(Node *node)
+check_parallel_safety(Node *node, bool allow_restricted)
 {
-	return contain_parallel_unsafe_walker(node, NULL);
+	check_parallel_safety_arg context;
+
+	context.allow_restricted = allow_restricted;
+	return check_parallel_safety_walker(node, &context);
 }
 
 static bool
-contain_parallel_unsafe_walker(Node *node, void *context)
+check_parallel_safety_walker(Node *node, check_parallel_safety_arg * context)
 {
 	if (node == NULL)
 		return false;
@@ -1218,7 +1230,7 @@ contain_parallel_unsafe_walker(Node *node, void *context)
 	{
 		FuncExpr   *expr = (FuncExpr *) node;
 
-		if (func_parallel(expr->funcid) == PROPARALLEL_UNSAFE)
+		if (parallel_too_dangerous(func_parallel(expr->funcid), context))
 			return true;
 		/* else fall through to check args */
 	}
@@ -1227,7 +1239,7 @@ contain_parallel_unsafe_walker(Node *node, void *context)
 		OpExpr	   *expr = (OpExpr *) node;
 
 		set_opfuncid(expr);
-		if (func_parallel(expr->opfuncid) == PROPARALLEL_UNSAFE)
+		if (parallel_too_dangerous(func_parallel(expr->opfuncid), context))
 			return true;
 		/* else fall through to check args */
 	}
@@ -1236,7 +1248,7 @@ contain_parallel_unsafe_walker(Node *node, void *context)
 		DistinctExpr *expr = (DistinctExpr *) node;
 
 		set_opfuncid((OpExpr *) expr);	/* rely on struct equivalence */
-		if (func_parallel(expr->opfuncid) == PROPARALLEL_UNSAFE)
+		if (parallel_too_dangerous(func_parallel(expr->opfuncid), context))
 			return true;
 		/* else fall through to check args */
 	}
@@ -1245,7 +1257,7 @@ contain_parallel_unsafe_walker(Node *node, void *context)
 		NullIfExpr *expr = (NullIfExpr *) node;
 
 		set_opfuncid((OpExpr *) expr);	/* rely on struct equivalence */
-		if (func_parallel(expr->opfuncid) == PROPARALLEL_UNSAFE)
+		if (parallel_too_dangerous(func_parallel(expr->opfuncid), context))
 			return true;
 		/* else fall through to check args */
 	}
@@ -1254,7 +1266,7 @@ contain_parallel_unsafe_walker(Node *node, void *context)
 		ScalarArrayOpExpr *expr = (ScalarArrayOpExpr *) node;
 
 		set_sa_opfuncid(expr);
-		if (func_parallel(expr->opfuncid) == PROPARALLEL_UNSAFE)
+		if (parallel_too_dangerous(func_parallel(expr->opfuncid), context))
 			return true;
 		/* else fall through to check args */
 	}
@@ -1268,12 +1280,12 @@ contain_parallel_unsafe_walker(Node *node, void *context)
 		/* check the result type's input function */
 		getTypeInputInfo(expr->resulttype,
 						 &iofunc, &typioparam);
-		if (func_parallel(iofunc) == PROPARALLEL_UNSAFE)
+		if (parallel_too_dangerous(func_parallel(iofunc), context))
 			return true;
 		/* check the input type's output function */
 		getTypeOutputInfo(exprType((Node *) expr->arg),
 						  &iofunc, &typisvarlena);
-		if (func_parallel(iofunc) == PROPARALLEL_UNSAFE)
+		if (parallel_too_dangerous(func_parallel(iofunc), context))
 			return true;
 		/* else fall through to check args */
 	}
@@ -1282,7 +1294,7 @@ contain_parallel_unsafe_walker(Node *node, void *context)
 		ArrayCoerceExpr *expr = (ArrayCoerceExpr *) node;
 
 		if (OidIsValid(expr->elemfuncid) &&
-			func_parallel(expr->elemfuncid) == PROPARALLEL_UNSAFE)
+			parallel_too_dangerous(func_parallel(expr->elemfuncid), context))
 			return true;
 		/* else fall through to check args */
 	}
@@ -1294,28 +1306,77 @@ contain_parallel_unsafe_walker(Node *node, void *context)
 
 		foreach(opid, rcexpr->opnos)
 		{
-			if (op_volatile(lfirst_oid(opid)) == PROPARALLEL_UNSAFE)
+			if (parallel_too_dangerous(op_volatile(lfirst_oid(opid)), context))
 				return true;
 		}
 		/* else fall through to check args */
 	}
 	else if (IsA(node, Query))
 	{
-		Query *query = (Query *) node;
+		Query	   *query = (Query *) node;
 
 		if (query->rowMarks != NULL)
 			return true;
 
 		/* Recurse into subselects */
 		return query_tree_walker(query,
-								 contain_parallel_unsafe_walker,
+								 check_parallel_safety_walker,
 								 context, 0);
 	}
 	return expression_tree_walker(node,
-								  contain_parallel_unsafe_walker,
+								  check_parallel_safety_walker,
 								  context);
 }
 
+static bool
+parallel_too_dangerous(char proparallel, check_parallel_safety_arg * context)
+{
+	if (context->allow_restricted)
+		return proparallel == PROPARALLEL_UNSAFE;
+	else
+		return proparallel != PROPARALLEL_SAFE;
+}
+
+/*
+ * contain_subplans_or_initplans
+ *	  Recursively search for initplan or subplan nodes within a clause.
+ *
+ * A special purpose function for prohibiting subplan or initplan clauses
+ * in parallel query constructs.
+ *
+ * If we see any form of SubPlan node, we will return TRUE.  For InitPlan's,
+ * we return true when we see the Param node, apart from that InitPlan
+ * can contain a simple NULL constant for MULTIEXPR subquery (see comments
+ * in make_subplan), however it is okay not to care about the same as that
+ * is only possible for Update statement which is anyway prohibited.
+ *
+ * Returns true if any subplan or initplan is found.
+ */
+bool
+contain_subplans_or_initplans(Node *clause)
+{
+	return contain_subplans_or_initplans_walker(clause, NULL);
+}
+
+static bool
+contain_subplans_or_initplans_walker(Node *node, void *context)
+{
+	if (node == NULL)
+		return false;
+	if (IsA(node, SubPlan) ||
+		IsA(node, AlternativeSubPlan) ||
+		IsA(node, SubLink))
+		return true;			/* abort the tree traversal and return true */
+	else if (IsA(node, Param))
+	{
+		Param	   *paramval = (Param *) node;
+
+		if (paramval->paramkind == PARAM_EXEC)
+			return true;
+	}
+	return expression_tree_walker(node, contain_subplans_or_initplans_walker, context);
+}
+
 /*****************************************************************************
  *		Check clauses for nonstrict functions
  *****************************************************************************/
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 1895a68..68863b9 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1308,6 +1308,28 @@ create_unique_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 }
 
 /*
+ * create_partialseqscan_path
+ *	  Creates a path corresponding to a partial sequential scan, returning the
+ *	  pathnode.
+ */
+Path *
+create_partialseqscan_path(PlannerInfo *root, RelOptInfo *rel,
+						   Relids required_outer, int nworkers)
+{
+	Path	   *pathnode = makeNode(Path);
+
+	pathnode->pathtype = T_PartialSeqScan;
+	pathnode->parent = rel;
+	pathnode->param_info = get_baserel_parampathinfo(root, rel,
+													 required_outer);
+	pathnode->pathkeys = NIL;	/* partialseqscan has unordered result */
+
+	cost_partialseqscan(pathnode, root, rel, pathnode->param_info, nworkers);
+
+	return pathnode;
+}
+
+/*
  * create_gather_path
  *
  *	  Creates a path corresponding to a gather scan, returning the
diff --git a/src/include/executor/execParallel.h b/src/include/executor/execParallel.h
index 4fc797a..ccc6c9b 100644
--- a/src/include/executor/execParallel.h
+++ b/src/include/executor/execParallel.h
@@ -22,15 +22,16 @@ typedef struct SharedExecutorInstrumentation SharedExecutorInstrumentation;
 
 typedef struct ParallelExecutorInfo
 {
-	PlanState *planstate;
+	PlanState  *planstate;
 	ParallelContext *pcxt;
 	BufferUsage *buffer_usage;
 	SharedExecutorInstrumentation *instrumentation;
 	shm_mq_handle **tqueue;
-}	ParallelExecutorInfo;
+} ParallelExecutorInfo;
 
 extern ParallelExecutorInfo *ExecInitParallelPlan(PlanState *planstate,
 					 EState *estate, int nworkers);
 extern void ExecParallelFinish(ParallelExecutorInfo *pei);
+extern shm_toc *GetParallelShmToc(void);
 
 #endif   /* EXECPARALLEL_H */
diff --git a/src/include/executor/nodePartialSeqscan.h b/src/include/executor/nodePartialSeqscan.h
new file mode 100644
index 0000000..cec09ad
--- /dev/null
+++ b/src/include/executor/nodePartialSeqscan.h
@@ -0,0 +1,25 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodePartialSeqscan.h
+ *		prototypes for nodePartialSeqscan.c
+ *
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodePartialSeqscan.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEPARTIALSEQSCAN_H
+#define NODEPARTIALSEQSCAN_H
+
+#include "nodes/execnodes.h"
+
+extern PartialSeqScanState *ExecInitPartialSeqScan(PartialSeqScan *node,
+					   EState *estate, int eflags);
+extern TupleTableSlot *ExecPartialSeqScan(PartialSeqScanState *node);
+extern void ExecEndPartialSeqScan(PartialSeqScanState *node);
+extern void ExecReScanPartialSeqScan(PartialSeqScanState *node);
+
+#endif   /* NODEPARTIALSEQSCAN_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index b6895f9..1916357 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -16,6 +16,7 @@
 
 #include "access/genam.h"
 #include "access/heapam.h"
+#include "access/parallel.h"
 #include "executor/instrument.h"
 #include "lib/pairingheap.h"
 #include "nodes/params.h"
@@ -1049,6 +1050,13 @@ typedef struct PlanState
 	Bitmapset  *chgParam;		/* set of IDs of changed Params */
 
 	/*
+	 * At execution time, parallel scan descriptor is initialized and stored
+	 * in dynamic shared memory segment by master backend and parallel workers
+	 * retrieve it from shared memory.
+	 */
+	shm_toc    *toc;
+
+	/*
 	 * Other run-time state needed by most if not all node types.
 	 */
 	TupleTableSlot *ps_ResultTupleSlot; /* slot for my result tuples */
@@ -1273,6 +1281,18 @@ typedef struct SampleScanState
 } SampleScanState;
 
 /*
+ * PartialSeqScanState extends ScanState by storing additional information
+ * related to scan.
+ */
+typedef struct PartialSeqScanState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	bool		scan_initialized;		/* used to determine if the scan is
+										 * initialized */
+} PartialSeqScanState;
+
+
+/*
  * These structs store information about index quals that don't have simple
  * constant right-hand sides.  See comments for ExecIndexBuildScanKeys()
  * for discussion.
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 94bdb7c..d1feab2 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -52,6 +52,7 @@ typedef enum NodeTag
 	T_Scan,
 	T_SeqScan,
 	T_SampleScan,
+	T_PartialSeqScan,
 	T_IndexScan,
 	T_IndexOnlyScan,
 	T_BitmapIndexScan,
@@ -100,6 +101,7 @@ typedef enum NodeTag
 	T_ScanState,
 	T_SeqScanState,
 	T_SampleScanState,
+	T_PartialSeqScanState,
 	T_IndexScanState,
 	T_IndexOnlyScanState,
 	T_BitmapIndexScanState,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 1f9213c..a65cd22 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -72,7 +72,7 @@ typedef struct PlannedStmt
 
 	bool		hasRowSecurity; /* row security applied? */
 
-	bool		parallelModeNeeded; /* parallel mode required to execute? */
+	bool		parallelModeNeeded;		/* parallel mode required to execute? */
 } PlannedStmt;
 
 /* macro for fetching the Plan associated with a SubPlan node */
@@ -287,6 +287,12 @@ typedef struct Scan
 typedef Scan SeqScan;
 
 /* ----------------
+ *		partial sequential scan node
+ * ----------------
+ */
+typedef SeqScan PartialSeqScan;
+
+/* ----------------
  *		table sample scan node
  * ----------------
  */
diff --git a/src/include/optimizer/clauses.h b/src/include/optimizer/clauses.h
index 5ac79b1..747b05b 100644
--- a/src/include/optimizer/clauses.h
+++ b/src/include/optimizer/clauses.h
@@ -62,7 +62,8 @@ extern bool contain_subplans(Node *clause);
 extern bool contain_mutable_functions(Node *clause);
 extern bool contain_volatile_functions(Node *clause);
 extern bool contain_volatile_functions_not_nextval(Node *clause);
-extern bool contain_parallel_unsafe(Node *node);
+extern bool check_parallel_safety(Node *node, bool allow_restricted);
+extern bool contain_subplans_or_initplans(Node *clause);
 extern bool contain_nonstrict_functions(Node *clause);
 extern bool contain_leaked_vars(Node *clause);
 
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 25a7303..e8c87dd 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -75,6 +75,9 @@ extern void cost_seqscan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 			 ParamPathInfo *param_info);
 extern void cost_samplescan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 				ParamPathInfo *param_info);
+extern void cost_partialseqscan(Path *path, PlannerInfo *root,
+					RelOptInfo *baserel, ParamPathInfo *param_info,
+					int nworkers);
 extern void cost_index(IndexPath *path, PlannerInfo *root,
 		   double loop_count);
 extern void cost_bitmap_heap_scan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 7a4940c..06ae16f 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -34,6 +34,8 @@ extern Path *create_seqscan_path(PlannerInfo *root, RelOptInfo *rel,
 					Relids required_outer);
 extern Path *create_samplescan_path(PlannerInfo *root, RelOptInfo *rel,
 					   Relids required_outer);
+extern Path *create_partialseqscan_path(PlannerInfo *root, RelOptInfo *rel,
+						   Relids required_outer, int nworkers);
 extern IndexPath *create_index_path(PlannerInfo *root,
 				  IndexOptInfo *index,
 				  List *indexclauses,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 87123a5..6cd4479 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -55,6 +55,15 @@ extern void debug_print_rel(PlannerInfo *root, RelOptInfo *rel);
 #endif
 
 /*
+ * parallelpath.c
+ *	  routines to generate parallel scan paths
+ */
+
+extern void create_parallelscan_paths(PlannerInfo *root, RelOptInfo *rel,
+						  Relids required_outer);
+extern bool expr_is_parallel_safe(Node *node);
+
+/*
  * indxpath.c
  *	  routines to generate index paths
  */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index feb821b..5492ba0 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1199,6 +1199,8 @@ OverrideStackEntry
 PACE_HEADER
 PACL
 ParallelExecutorInfo
+PartialSeqScan
+PartialSeqScanState
 PATH
 PBOOL
 PCtxtHandle
#387Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#386)
Re: Parallel Seq Scan

On Tue, Oct 13, 2015 at 2:45 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Attached is rebased patch for partial seqscan support.

Review comments:

- If you're going to pgindent execParallel.c, you need to add some
entries to typedefs.list so it doesn't mangle the formatting.
ExecParallelEstimate's parameter list is misformatted, for example.
Also, I think if we're going to do this we had better extract the
pgindent changes and commit those first. It's pretty distracting the
way you have it.

- Instead of inlining the work needed by each parallel mode in
ExecParallelEstimate(), I think you should mimic the style of
ExecProcNode and call a node-type specific function that is part of
that node's public interface - here, ExecPartialSeqScanEstimate,
perhaps. Similarly for ExecParallelInitializeDSM. Perhaps
ExecPartialSeqScanInitializeDSM.

- I continue to think GetParallelShmToc is the wrong approach.
Instead, each time ExecParallelEstimate or
ExecParallelInitializeDSM calls a nodetype-specific initialization
function (as described in the previous point), have it pass d->pcxt as
an argument. The node can get the toc from there if it needs it. I
suppose it could store a pointer to the toc in its scanstate if it
needs it, but it really shouldn't. Instead, it should store a pointer
to, say, the ParallelHeapScanDesc in the scanstate. It really should
only care about its own shared data, so once it finds that, the toc
shouldn't be needed any more. Then ExecPartialSeqScan doesn't need to
look up pscan; it's already recorded in the scanstate.

- ExecParallelInitializeDSMContext's new pscan_len member is 100%
wrong. Individual scan nodes don't get to add stuff to that context
object. They should store details like this in their own ScanState as
needed.

- The positioning of the new nodes in various lists doesn't seem to
be entirely consistent. nodes.h adds them after SampleScan which isn't
terrible, though maybe immediately after SeqScan would be better, but
_outNode has it right after BitmapOr and the switch in _copyObject has
it somewhere else again.

- Although the changes in parallelpath.c are in a good direction, I'm
pretty sure this is not yet up to scratch. I am less sure exactly
what needs to be fixed, so I'll have to give some more thought to
that.
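
To make the dispatch style suggested in the review points above concrete, here is a rough, self-contained sketch. It is not code from the patch: every Toy* name, the struct layouts, and the 64-byte size are invented for illustration. The generic walker only counts nodes and dispatches; the node-specific function requests shared-memory space and remembers its own length in its own state rather than in the walker's context.

```c
#include <stddef.h>

/* Toy stand-ins; none of these are the real PostgreSQL definitions. */
typedef enum
{
	TOY_SEQSCAN,
	TOY_PARTIAL_SEQSCAN
} ToyNodeTag;

typedef struct ToyParallelContext
{
	size_t		space_needed;	/* bytes of shared memory requested so far */
	int			nkeys;			/* number of toc keys requested so far */
} ToyParallelContext;

typedef struct ToyPlanState
{
	ToyNodeTag	type;
	struct ToyPlanState *left;
	struct ToyPlanState *right;
	size_t		pscan_len;		/* node-private: size of its shared state */
} ToyPlanState;

/* Node-type-specific estimator, part of the node's own interface. */
static void
ToyPartialSeqScanEstimate(ToyPlanState *node, ToyParallelContext *pcxt)
{
	node->pscan_len = 64;		/* hypothetical descriptor size */
	pcxt->space_needed += node->pscan_len;
	pcxt->nkeys += 1;
}

/*
 * Generic walker: dispatch on node type, then recurse into children.
 * Returns the number of nodes visited.
 */
static int
ToyParallelEstimate(ToyPlanState *node, ToyParallelContext *pcxt)
{
	if (node == NULL)
		return 0;

	switch (node->type)
	{
		case TOY_PARTIAL_SEQSCAN:
			ToyPartialSeqScanEstimate(node, pcxt);
			break;
		default:
			break;				/* not parallel-aware: nothing to estimate */
	}

	return 1 + ToyParallelEstimate(node->left, pcxt)
		+ ToyParallelEstimate(node->right, pcxt);
}
```

A matching ToyPartialSeqScanInitializeDSM would follow the same shape, receiving the same context argument.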

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#388Noah Misch
noah@leadboat.com
In reply to: Robert Haas (#385)
Re: Parallel Seq Scan

On Mon, Oct 12, 2015 at 11:46:08AM -0400, Robert Haas wrote:

plpgsql_param_fetch() assumes that it can detect whether it's being
called from copyParamList() by checking whether params !=
estate->paramLI. I don't know why this works, but I do know that this

It works because PL/pgSQL creates an unshared list whenever copyParamList() is
forthcoming. (This in turn relies on intimate knowledge of how the rest of
the system processes param lists.) The comments at setup_param_list() and
setup_unshared_param_list() are most pertinent.

test fails to detect the case where it's being called from
SerializeParamList(), which causes failures in exec_eval_datum() as
predicted. Calls from SerializeParamList() need the same treatment as
calls from copyParamList() because it, too, will try to evaluate every
parameter in the list. Here, I've taken the approach of making that
check unconditional, which seems to work, but I'm not sure if some
other approach would be better, such as adding an additional Boolean
(or enum context?) argument to ParamFetchHook. I *think* that
skipping this check is merely a performance optimization rather than
anything that affects correctness, and bms_is_member() is pretty
cheap, so perhaps the way I've done it is OK.

Like you, I don't expect bms_is_member() to be expensive relative to the task
at hand. However, copyParamList() and SerializeParamList() copy non-dynamic
params without calling plpgsql_param_fetch(). Given the shared param list,
they will copy non-dynamic params the current query doesn't use. That cost is
best avoided, not being well-bounded; consider the case of an unrelated
variable containing a TOAST pointer to a 1-GiB value. One approach is to have
setup_param_list() copy the paramnos pointer to a new ParamListInfoData field:

Bitmapset *paramMask; /* if non-NULL, ignore params lacking a 1-bit */

Test it directly in copyParamList() and SerializeParamList(). As a bonus,
that would allow use of the shared param list for more cases involving
cursors. Furthermore, plpgsql_param_fetch() would never need to test
paramnos. A more-general alternative is to have a distinct "paramIsUsed"
callback, but I don't know how one would exploit the extra generality.
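
A minimal sketch of how the copy routines might test such a mask, under loud assumptions: the Toy* types are invented, the mask is a single 64-bit word instead of a Bitmapset, and the list holds at most 8 parameters. It only illustrates "ignore params lacking a 1-bit"; it is not the PL/pgSQL or core code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct ToyParam
{
	bool		isnull;
	int			value;
} ToyParam;

typedef struct ToyParamList
{
	int			numParams;		/* assumed to be at most 8 in this toy */
	const uint64_t *paramMask;	/* if non-NULL, ignore params lacking a 1-bit */
	ToyParam	params[8];
} ToyParamList;

/* A NULL mask means every parameter is wanted (today's behavior). */
static bool
toy_param_is_wanted(const ToyParamList *list, int i)
{
	if (list->paramMask == NULL)
		return true;
	return ((*list->paramMask >> i) & 1) != 0;
}

/*
 * Copy "from" into "to", skipping masked-out parameters so that a large
 * unrelated value is never touched.  Returns how many params were copied.
 */
static int
toy_copy_param_list(const ToyParamList *from, ToyParamList *to)
{
	int			copied = 0;

	to->numParams = from->numParams;
	to->paramMask = NULL;		/* the copy contains only wanted values */

	for (int i = 0; i < from->numParams; i++)
	{
		if (!toy_param_is_wanted(from, i))
		{
			/* leave an empty slot so parameter numbering is preserved */
			to->params[i].isnull = true;
			to->params[i].value = 0;
			continue;
		}
		to->params[i] = from->params[i];
		copied++;
	}
	return copied;
}
```

Callers that never set the mask see no change, since a NULL paramMask copies everything.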

#389Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#387)
Re: Parallel Seq Scan

On Wed, Oct 14, 2015 at 3:29 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Oct 13, 2015 at 2:45 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Attached is rebased patch for partial seqscan support.

Review comments:

- I continue to think GetParallelShmToc is the wrong approach.
Instead, each time ExecParallelEstimate or
ExecParallelInitializeDSM calls a nodetype-specific initialization
function (as described in the previous point), have it pass d->pcxt as
an argument. The node can get the toc from there if it needs it. I
suppose it could store a pointer to the toc in its scanstate if it
needs it, but it really shouldn't. Instead, it should store a pointer
to, say, the ParallelHeapScanDesc in the scanstate.

How will this idea work for a worker backend? Basically, in the worker,
if we want something like this to work, the toc has to be passed via
QueryDesc to EState, and then we can retrieve ParallelHeapScanDesc
during PartialSeqScan initialization (ExecInitPartialSeqScan).
Do you have something else in mind?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#390Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#389)
Re: Parallel Seq Scan

On Wed, Oct 14, 2015 at 12:30 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

- I continue to think GetParallelShmToc is the wrong approach.
Instead, each time ExecParallelEstimate or
ExecParallelInitializeDSM calls a nodetype-specific initialization
function (as described in the previous point), have it pass d->pcxt as
an argument. The node can get the toc from there if it needs it. I
suppose it could store a pointer to the toc in its scanstate if it
needs it, but it really shouldn't. Instead, it should store a pointer
to, say, the ParallelHeapScanDesc in the scanstate.

How will this idea work for a worker backend? Basically, in the worker,
if we want something like this to work, the toc has to be passed via
QueryDesc to EState, and then we can retrieve ParallelHeapScanDesc
during PartialSeqScan initialization (ExecInitPartialSeqScan).
Do you have something else in mind?

Good question. I think when the worker starts up it should call yet
another planstate-walker, e.g. ExecParallelInitializeWorker, which can
call nodetype-specific functions for parallel-aware nodes and give
each of them a chance to access the toc and store a pointer to their
parallel shared state (ParallelHeapScanDesc in this case) in their
scanstate. I think this should get called from ParallelQueryMain
after ExecutorStart and before ExecutorRun:
ExecParallelInitializeWorker(queryDesc->planstate, toc).
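
Roughly, the worker-side walker described above could look like the following toy sketch. Everything here is an assumption made for illustration: the ToyToc array stands in for the real shm_toc, the use of a per-node id as the toc key is hypothetical, and pscan stands in for a pointer to the node's shared state such as a ParallelHeapScanDesc.

```c
#include <stddef.h>

/* Toy table of contents: a tiny key -> address map (not the real shm_toc). */
typedef struct ToyToc
{
	int			nentries;
	struct
	{
		int			key;
		void	   *addr;
	}			entries[4];
} ToyToc;

static void *
toy_toc_lookup(const ToyToc *toc, int key)
{
	for (int i = 0; i < toc->nentries; i++)
	{
		if (toc->entries[i].key == key)
			return toc->entries[i].addr;
	}
	return NULL;
}

typedef struct ToyPlanState
{
	int			is_partial_seqscan; /* stand-in for a node tag check */
	int			plan_node_id;	/* hypothetical toc key for this node */
	struct ToyPlanState *left;
	struct ToyPlanState *right;
	void	   *pscan;			/* pointer to this node's shared state */
} ToyPlanState;

/* Node-specific hook: find our shared state once and remember it. */
static void
ToyPartialSeqScanInitializeWorker(ToyPlanState *node, const ToyToc *toc)
{
	node->pscan = toy_toc_lookup(toc, node->plan_node_id);
}

/* Walker run in the worker after executor startup, before execution. */
static void
ToyParallelInitializeWorker(ToyPlanState *node, const ToyToc *toc)
{
	if (node == NULL)
		return;
	if (node->is_partial_seqscan)
		ToyPartialSeqScanInitializeWorker(node, toc);
	ToyParallelInitializeWorker(node->left, toc);
	ToyParallelInitializeWorker(node->right, toc);
}
```

After the walk, the scan node no longer needs the toc at all; it only uses the pointer it stored.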

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#391Robert Haas
robertmhaas@gmail.com
In reply to: Noah Misch (#388)
Re: Parallel Seq Scan

On Tue, Oct 13, 2015 at 9:08 PM, Noah Misch <noah@leadboat.com> wrote:

On Mon, Oct 12, 2015 at 11:46:08AM -0400, Robert Haas wrote:

plpgsql_param_fetch() assumes that it can detect whether it's being
called from copyParamList() by checking whether params !=
estate->paramLI. I don't know why this works, but I do know that this

It works because PL/pgSQL creates an unshared list whenever copyParamList() is
forthcoming. (This in turn relies on intimate knowledge of how the rest of
the system processes param lists.) The comments at setup_param_list() and
setup_unshared_param_list() are most pertinent.

Thanks for the pointer.

test fails to detect the case where it's being called from
SerializeParamList(), which causes failures in exec_eval_datum() as
predicted. Calls from SerializeParamList() need the same treatment as
calls from copyParamList() because it, too, will try to evaluate every
parameter in the list. Here, I've taken the approach of making that
check unconditional, which seems to work, but I'm not sure if some
other approach would be better, such as adding an additional Boolean
(or enum context?) argument to ParamFetchHook. I *think* that
skipping this check is merely a performance optimization rather than
anything that affects correctness, and bms_is_member() is pretty
cheap, so perhaps the way I've done it is OK.

Like you, I don't expect bms_is_member() to be expensive relative to the task
at hand. However, copyParamList() and SerializeParamList() copy non-dynamic
params without calling plpgsql_param_fetch(). Given the shared param list,
they will copy non-dynamic params the current query doesn't use. That cost is
best avoided, not being well-bounded; consider the case of an unrelated
variable containing a TOAST pointer to a 1-GiB value. One approach is to have
setup_param_list() copy the paramnos pointer to a new ParamListInfoData field:

Bitmapset *paramMask; /* if non-NULL, ignore params lacking a 1-bit */

Test it directly in copyParamList() and SerializeParamList(). As a bonus,
that would allow use of the shared param list for more cases involving
cursors. Furthermore, plpgsql_param_fetch() would never need to test
paramnos. A more-general alternative is to have a distinct "paramIsUsed"
callback, but I don't know how one would exploit the extra generality.

I'm anxious to minimize the number of things that must be fixed in
order for a stable version of parallel query to exist in our master
repository, and I fear that trying to improve ParamListInfo generally
could take me fairly far afield. How about adding a paramListCopyHook
to ParamListInfoData? SerializeParamList() would, if it found a
parameter with !OidIsValid(prm->prmtype) && param->paramFetch != NULL,
call this function, which would return a new ParamListInfo to be
serialized in place of the original? This wouldn't require any
modification to the current plpgsql_param_fetch() at all, but the new
function would steal its bms_is_member() test. Furthermore, no user
of ParamListInfo other than plpgsql needs to care at all -- which,
with your proposals, they would.
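
As a toy illustration of the hook idea (all names and types below are invented, and the condition is simplified to "hook is non-NULL" rather than the prmtype/paramFetch test mentioned above), the serializer would ask the hook for a substitute list and serialize that instead:

```c
#include <stddef.h>

typedef struct ToyParamList ToyParamList;
typedef ToyParamList *(*ToyParamListCopyHook) (ToyParamList *orig);

struct ToyParamList
{
	ToyParamListCopyHook paramListCopyHook; /* NULL: serialize as-is */
	int			numParams;		/* assumed to be at most 4 in this toy */
	int			values[4];
};

/*
 * Write numParams followed by each value into buf.  If a copy hook is
 * set, serialize the list it returns in place of the original.
 * Returns the number of ints written.
 */
static int
toy_serialize_param_list(ToyParamList *list, int *buf, int buflen)
{
	int			n = 0;

	if (list->paramListCopyHook != NULL)
		list = list->paramListCopyHook(list);

	if (n < buflen)
		buf[n++] = list->numParams;
	for (int i = 0; i < list->numParams && n < buflen; i++)
		buf[n++] = list->values[i];
	return n;
}

/*
 * Example hook: pretend only parameter 0 is used by the query and blank
 * out the rest, keeping the numbering intact.  A real hook would consult
 * something like the bms_is_member() test discussed above.
 */
static ToyParamList *
toy_blank_unused_params(ToyParamList *orig)
{
	static ToyParamList substitute; /* toy: static storage, not reentrant */

	substitute.paramListCopyHook = NULL;
	substitute.numParams = orig->numParams;
	for (int i = 0; i < orig->numParams; i++)
		substitute.values[i] = (i == 0) ? orig->values[i] : 0;
	return &substitute;
}
```

A list whose hook is NULL is serialized exactly as before, so only the caller that installs a hook pays for it.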

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#392Noah Misch
noah@leadboat.com
In reply to: Robert Haas (#391)
Re: Parallel Seq Scan

On Wed, Oct 14, 2015 at 07:52:15PM -0400, Robert Haas wrote:

On Tue, Oct 13, 2015 at 9:08 PM, Noah Misch <noah@leadboat.com> wrote:

On Mon, Oct 12, 2015 at 11:46:08AM -0400, Robert Haas wrote:

Calls from SerializeParamList() need the same treatment as
calls from copyParamList() because it, too, will try to evaluate every
parameter in the list.

Like you, I don't expect bms_is_member() to be expensive relative to the task
at hand. However, copyParamList() and SerializeParamList() copy non-dynamic
params without calling plpgsql_param_fetch(). Given the shared param list,
they will copy non-dynamic params the current query doesn't use. That cost is
best avoided, not being well-bounded; consider the case of an unrelated
variable containing a TOAST pointer to a 1-GiB value. One approach is to have
setup_param_list() copy the paramnos pointer to a new ParamListInfoData field:

Bitmapset *paramMask; /* if non-NULL, ignore params lacking a 1-bit */

Test it directly in copyParamList() and SerializeParamList(). As a bonus,
that would allow use of the shared param list for more cases involving
cursors. Furthermore, plpgsql_param_fetch() would never need to test
paramnos. A more-general alternative is to have a distinct "paramIsUsed"
callback, but I don't know how one would exploit the extra generality.

I'm anxious to minimize the number of things that must be fixed in
order for a stable version of parallel query to exist in our master
repository, and I fear that trying to improve ParamListInfo generally
could take me fairly far afield. How about adding a paramListCopyHook
to ParamListInfoData? SerializeParamList() would, if it found a
parameter with !OidIsValid(prm->prmtype) && param->paramFetch != NULL,
call this function, which would return a new ParamListInfo to be
serialized in place of the original?

Tests of prm->prmtype and paramLI->paramFetch appear superfluous. Given that
the paramListCopyHook callback would return a complete substitute
ParamListInfo, I wouldn't expect SerializeParamList() to examine the
original paramLI->params at all. If that's correct, the paramListCopyHook
design sounds fine. However, its implementation will be more complex than
paramMask would have been.

This wouldn't require any
modification to the current plpgsql_param_fetch() at all, but the new
function would steal its bms_is_member() test. Furthermore, no user
of ParamListInfo other than plpgsql needs to care at all -- which,
with your proposals, they would.

To my knowledge, none of these approaches would compel existing users to care.
They would leave paramMask or paramListCopyHook NULL and get today's behavior.

#393Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#387)
1 attachment(s)
Re: Parallel Seq Scan

On Wed, Oct 14, 2015 at 3:29 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Oct 13, 2015 at 2:45 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Attached is rebased patch for partial seqscan support.

Review comments:

- If you're going to pgindent execParallel.c, you need to add some
entries to typedefs.list so it doesn't mangle the formatting.
ExecParallelEstimate's parameter list is misformatted, for example.
Also, I think if we're going to do this we had better extract the
pgindent changes and commit those first. It's pretty distracting the
way you have it.

Agreed, we can do those separately. So, I have reverted those changes.

- Instead of inlining the work needed by each parallel mode in
ExecParallelEstimate(), I think you should mimic the style of
ExecProcNode and call a node-type specific function that is part of
that node's public interface - here, ExecPartialSeqScanEstimate,
perhaps. Similarly for ExecParallelInitializeDSM. Perhaps
ExecPartialSeqScanInitializeDSM.

- I continue to think GetParallelShmToc is the wrong approach.
Instead, each time ExecParallelEstimate or
ExecParallelInitializeDSM calls a nodetype-specific initialization
function (as described in the previous point), have it pass d->pcxt as
an argument. The node can get the toc from there if it needs it. I
suppose it could store a pointer to the toc in its scanstate if it
needs it, but it really shouldn't. Instead, it should store a pointer
to, say, the ParallelHeapScanDesc in the scanstate. It really should
only care about its own shared data, so once it finds that, the toc
shouldn't be needed any more. Then ExecPartialSeqScan doesn't need to
look up pscan; it's already recorded in the scanstate.

- ExecParallelInitializeDSMContext's new pscan_len member is 100%
wrong. Individual scan nodes don't get to add stuff to that context
object. They should store details like this in their own ScanState as
needed.

Changed the patch as per above suggestions.

- The positioning of the new nodes in various lists doesn't seem to
be entirely consistent. nodes.h adds them after SampleScan which isn't
terrible, though maybe immediately after SeqScan would be better, but
_outNode has it right after BitmapOr and the switch in _copyObject has
it somewhere else again.

I think this got messed up while rebasing on top of Gather node
changes, but nonetheless, I have changed it such that PartialSeqScan
node handling is after SeqScan.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

parallel_seqscan_partialseqscan_v21.patch (application/octet-stream)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 7fb8a14..d03fbde 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -729,6 +729,7 @@ ExplainPreScanNode(PlanState *planstate, Bitmapset **rels_used)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
 		case T_SampleScan:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
@@ -850,6 +851,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_SeqScan:
 			pname = sname = "Seq Scan";
 			break;
+		case T_PartialSeqScan:
+			pname = sname = "Partial Seq Scan";
+			break;
 		case T_SampleScan:
 			pname = sname = "Sample Scan";
 			break;
@@ -1005,6 +1009,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
 		case T_SampleScan:
 		case T_BitmapHeapScan:
 		case T_TidScan:
@@ -1270,6 +1275,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 							 planstate, ancestors, es);
 			/* FALL THRU to print additional fields the same as SeqScan */
 		case T_SeqScan:
+		case T_PartialSeqScan:
 		case T_ValuesScan:
 		case T_CteScan:
 		case T_WorkTableScan:
@@ -2353,6 +2359,7 @@ ExplainTargetRel(Plan *plan, Index rti, ExplainState *es)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
 		case T_SampleScan:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 51edd4c..38a92fe 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -21,8 +21,8 @@ OBJS = execAmi.o execCurrent.o execGrouping.o execIndexing.o execJunk.o \
        nodeHash.o nodeHashjoin.o nodeIndexscan.o nodeIndexonlyscan.o \
        nodeLimit.o nodeLockRows.o \
        nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
-       nodeNestloop.o nodeFunctionscan.o nodeRecursiveunion.o nodeResult.o \
-       nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
+       nodeNestloop.o nodeFunctionscan.o nodePartialSeqscan.o nodeRecursiveunion.o \
+       nodeResult.o nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
        nodeValuesscan.o nodeCtescan.o nodeWorktablescan.o \
        nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
        nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 163650c..b3d041c 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -38,6 +38,7 @@
 #include "executor/nodeMergejoin.h"
 #include "executor/nodeModifyTable.h"
 #include "executor/nodeNestloop.h"
+#include "executor/nodePartialSeqscan.h"
 #include "executor/nodeRecursiveunion.h"
 #include "executor/nodeResult.h"
 #include "executor/nodeSamplescan.h"
@@ -157,6 +158,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSeqScan((SeqScanState *) node);
 			break;
 
+		case T_PartialSeqScanState:
+			ExecReScanPartialSeqScan((PartialSeqScanState *) node);
+			break;
+
 		case T_SampleScanState:
 			ExecReScanSampleScan((SampleScanState *) node);
 			break;
@@ -468,6 +473,9 @@ ExecSupportsBackwardScan(Plan *node)
 		case T_CteScan:
 			return TargetListSupportsBackwardScan(node->targetlist);
 
+		case T_PartialSeqScan:
+			return false;
+
 		case T_SampleScan:
 			/* Simplify life for tablesample methods by disallowing this */
 			return false;
diff --git a/src/backend/executor/execCurrent.c b/src/backend/executor/execCurrent.c
index bcd287f..6e05598 100644
--- a/src/backend/executor/execCurrent.c
+++ b/src/backend/executor/execCurrent.c
@@ -261,6 +261,7 @@ search_plan_tree(PlanState *node, Oid table_oid)
 			 * Relation scan nodes can all be treated alike
 			 */
 		case T_SeqScanState:
+		case T_PartialSeqScanState:
 		case T_SampleScanState:
 		case T_IndexScanState:
 		case T_IndexOnlyScanState:
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index e6930c1..b0a7ce4 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -17,6 +17,7 @@
 
 #include "executor/execParallel.h"
 #include "executor/executor.h"
+#include "executor/nodePartialSeqscan.h"
 #include "executor/tqueue.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/planmain.h"
@@ -158,10 +159,19 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 	/* Count this node. */
 	e->nnodes++;
 
-	/*
-	 * XXX. Call estimators for parallel-aware nodes here, when we have
-	 * some.
-	 */
+	/* Call estimators for parallel-aware nodes. */
+	switch (nodeTag(planstate))
+	{
+		case T_PartialSeqScanState:
+			{
+				ExecPartialSeqScanEstimate((PartialSeqScanState *) planstate,
+										   e->pcxt);
+				return true;
+			}
+			break;
+		default:
+			break;
+	}
 
 	return planstate_tree_walker(planstate, ExecParallelEstimate, e);
 }
@@ -196,10 +206,19 @@ ExecParallelInitializeDSM(PlanState *planstate,
 	/* Count this node. */
 	d->nnodes++;
 
-	/*
-	 * XXX. Call initializers for parallel-aware plan nodes, when we have
-	 * some.
-	 */
+	/* Call initializers for parallel-aware plan nodes. */
+	switch (nodeTag(planstate))
+	{
+		case T_PartialSeqScanState:
+			{
+				ExecPartialSeqScanInitializeDSM((PartialSeqScanState *) planstate,
+												d->pcxt);
+				return true;
+			}
+			break;
+		default:
+			break;
+	}
 
 	return planstate_tree_walker(planstate, ExecParallelInitializeDSM, d);
 }
@@ -531,6 +550,35 @@ ExecParallelReportInstrumentation(PlanState *planstate,
 }
 
 /*
+ * Initialize the PlanState and its descendants with the information
+ * retrieved from shared memory.  This has to be done once the PlanState
+ * is allocated and initialized by the executor for each node, i.e. after
+ * ExecutorStart().
+ */
+static bool
+ExecParallelInitializeWorker(PlanState *planstate, shm_toc *toc)
+{
+	if (planstate == NULL)
+		return false;
+
+	/* Call initializers for parallel-aware plan nodes. */
+	switch (nodeTag(planstate))
+	{
+		case T_PartialSeqScanState:
+			{
+				ExecPartialSeqScanInitParallelScanDesc((PartialSeqScanState *) planstate,
+													   toc);
+				return true;
+			}
+			break;
+		default:
+			break;
+	}
+
+	return planstate_tree_walker(planstate, ExecParallelInitializeWorker, toc);
+}
+
+/*
  * Main entrypoint for parallel query worker processes.
  *
  * We reach this function from ParallelMain, so the setup necessary to create
@@ -566,6 +614,7 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
 
 	/* Start up the executor, have it run the plan, and then shut it down. */
 	ExecutorStart(queryDesc, 0);
+	ExecParallelInitializeWorker(queryDesc->planstate, toc);
 	ExecutorRun(queryDesc, ForwardScanDirection, 0L);
 	ExecutorFinish(queryDesc);
 
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 5bc1d48..4591268 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -100,6 +100,7 @@
 #include "executor/nodeMergejoin.h"
 #include "executor/nodeModifyTable.h"
 #include "executor/nodeNestloop.h"
+#include "executor/nodePartialSeqscan.h"
 #include "executor/nodeGather.h"
 #include "executor/nodeRecursiveunion.h"
 #include "executor/nodeResult.h"
@@ -193,6 +194,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												   estate, eflags);
 			break;
 
+		case T_PartialSeqScan:
+			result = (PlanState *) ExecInitPartialSeqScan((PartialSeqScan *) node,
+														  estate, eflags);
+			break;
+
 		case T_SampleScan:
 			result = (PlanState *) ExecInitSampleScan((SampleScan *) node,
 													  estate, eflags);
@@ -419,6 +425,10 @@ ExecProcNode(PlanState *node)
 			result = ExecSeqScan((SeqScanState *) node);
 			break;
 
+		case T_PartialSeqScanState:
+			result = ExecPartialSeqScan((PartialSeqScanState *) node);
+			break;
+
 		case T_SampleScanState:
 			result = ExecSampleScan((SampleScanState *) node);
 			break;
@@ -665,6 +675,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSeqScan((SeqScanState *) node);
 			break;
 
+		case T_PartialSeqScanState:
+			ExecEndPartialSeqScan((PartialSeqScanState *) node);
+			break;
+
 		case T_SampleScanState:
 			ExecEndSampleScan((SampleScanState *) node);
 			break;
diff --git a/src/backend/executor/nodeGather.c b/src/backend/executor/nodeGather.c
index c689a4d..8edc2fd 100644
--- a/src/backend/executor/nodeGather.c
+++ b/src/backend/executor/nodeGather.c
@@ -247,7 +247,7 @@ gather_getnext(GatherState *gatherstate)
 void
 ExecShutdownGather(GatherState *node)
 {
-	Gather *gather;
+	Gather	   *gather;
 
 	if (node->pei == NULL || node->pei->pcxt == NULL)
 		return;
@@ -295,5 +295,15 @@ ExecReScanGather(GatherState *node)
 	 */
 	ExecShutdownGather(node);
 
+	/*
+	 * Free the parallel executor information so that the parallel context
+	 * and workers can be initialized again during the next execution.
+	 */
+	if (node->pei)
+	{
+		pfree(node->pei);
+		node->pei = NULL;
+	}
+
 	ExecReScan(node->ps.lefttree);
 }
diff --git a/src/backend/executor/nodePartialSeqscan.c b/src/backend/executor/nodePartialSeqscan.c
new file mode 100644
index 0000000..3168f0c
--- /dev/null
+++ b/src/backend/executor/nodePartialSeqscan.c
@@ -0,0 +1,348 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodePartialSeqscan.c
+ *	  Support routines for partial sequential scans of relations.
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodePartialSeqscan.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * INTERFACE ROUTINES
+ *		ExecPartialSeqScan				scans a relation partially.
+ *		PartialSeqNext					retrieve next tuple from heap.
+ *		ExecInitPartialSeqScan			creates and initializes a partial seqscan node.
+ *		ExecEndPartialSeqScan			releases any storage allocated.
+ */
+#include "postgres.h"
+
+#include "access/relscan.h"
+#include "executor/execdebug.h"
+#include "executor/execParallel.h"
+#include "executor/nodePartialSeqscan.h"
+#include "utils/rel.h"
+
+
+
+/* ----------------------------------------------------------------
+ *						Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		PartialSeqNext
+ *
+ *		This is a workhorse for ExecPartialSeqScan
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+PartialSeqNext(PartialSeqScanState *node)
+{
+	HeapTuple	tuple;
+	HeapScanDesc scandesc;
+	EState	   *estate;
+	ScanDirection direction;
+	TupleTableSlot *slot;
+
+	/*
+	 * get information from the estate and scan state
+	 */
+	scandesc = node->ss.ss_currentScanDesc;
+	estate = node->ss.ps.state;
+	direction = estate->es_direction;
+	slot = node->ss.ss_ScanTupleSlot;
+
+	/*
+	 * get the next tuple from the table
+	 */
+	tuple = heap_getnext(scandesc, direction);
+
+	/*
+	 * save the tuple and the buffer returned to us by the access methods in
+	 * our scan tuple slot and return the slot.  Note: we pass 'false' because
+	 * tuples returned by heap_getnext() are pointers onto disk pages and were
+	 * not created with palloc() and so should not be pfree()'d.  Note also
+	 * that ExecStoreTuple will increment the refcount of the buffer; the
+	 * refcount will not be dropped until the tuple table slot is cleared.
+	 */
+	if (tuple)
+		ExecStoreTuple(tuple,	/* tuple to store */
+					   slot,	/* slot to store in */
+					   scandesc->rs_cbuf,		/* buffer associated with this
+												 * tuple */
+					   false);	/* don't pfree this pointer */
+	else
+		ExecClearTuple(slot);
+
+	return slot;
+}
+
+/*
+ * PartialSeqRecheck -- access method routine to recheck a tuple in EvalPlanQual
+ */
+static bool
+PartialSeqRecheck(PartialSeqScanState *node, TupleTableSlot *slot)
+{
+	/*
+	 * Note that unlike IndexScan, PartialSeqScan never uses keys in
+	 * heap_beginscan (and this is very bad), so here we do not check
+	 * whether the keys are ok.
+	 */
+	return true;
+}
+
+/* ----------------------------------------------------------------
+ *		InitPartialScanRelation
+ *
+ *		Set up to access the scan relation.
+ * ----------------------------------------------------------------
+ */
+static void
+InitPartialScanRelation(PartialSeqScanState *node, EState *estate, int eflags)
+{
+	Relation	currentRelation;
+
+	/*
+	 * get the relation object id from the relid'th entry in the range table,
+	 * open that relation and acquire appropriate lock on it.
+	 */
+	currentRelation = ExecOpenScanRelation(estate,
+									  ((Scan *) node->ss.ps.plan)->scanrelid,
+										   eflags);
+
+	node->ss.ss_currentRelation = currentRelation;
+
+	/* and report the scan tuple slot's rowtype */
+	ExecAssignScanType(&node->ss, RelationGetDescr(currentRelation));
+}
+
+/* ----------------------------------------------------------------
+ *		ExecPartialSeqScanEstimate
+ *
+ *		estimates the space required to serialize partial seqscan node.
+ * ----------------------------------------------------------------
+ */
+void
+ExecPartialSeqScanEstimate(PartialSeqScanState *node,
+						   ParallelContext *pcxt)
+{
+	EState	   *estate = node->ss.ps.state;
+
+	node->pscan_len = heap_parallelscan_estimate(estate->es_snapshot);
+	shm_toc_estimate_chunk(&pcxt->estimator, node->pscan_len);
+
+	/* key for partial scan information. */
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecPartialSeqScanInitializeDSM
+ *
+ *		Initialize the DSM with the contents required to perform
+ *		partial seqscan.
+ * ----------------------------------------------------------------
+ */
+void
+ExecPartialSeqScanInitializeDSM(PartialSeqScanState *node,
+								ParallelContext *pcxt)
+{
+	EState	   *estate = node->ss.ps.state;
+
+	/*
+	 * Store parallel heap scan descriptor in dynamic shared memory.
+	 */
+	node->pscan = shm_toc_allocate(pcxt->toc, node->pscan_len);
+	heap_parallelscan_initialize(node->pscan,
+								 node->ss.ss_currentRelation,
+								 estate->es_snapshot,
+								 true);
+	shm_toc_insert(pcxt->toc,
+				   node->ss.ps.plan->plan_node_id,
+				   node->pscan);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecPartialSeqScanInitParallelScanDesc
+ *
+ *		Retrieve the contents from DSM related to partial seq scan node
+ *		and initialize the partial seqscan node.
+ * ----------------------------------------------------------------
+ */
+void
+ExecPartialSeqScanInitParallelScanDesc(PartialSeqScanState *node,
+									   shm_toc *toc)
+{
+	node->pscan = shm_toc_lookup(toc, node->ss.ps.plan->plan_node_id);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitPartialSeqScan
+ * ----------------------------------------------------------------
+ */
+PartialSeqScanState *
+ExecInitPartialSeqScan(PartialSeqScan *node, EState *estate, int eflags)
+{
+	PartialSeqScanState *scanstate;
+
+	/*
+	 * Once upon a time it was possible to have an outerPlan of a SeqScan, but
+	 * not any more.
+	 */
+	Assert(outerPlan(node) == NULL);
+	Assert(innerPlan(node) == NULL);
+
+	/*
+	 * create state structure
+	 */
+	scanstate = makeNode(PartialSeqScanState);
+	scanstate->ss.ps.plan = (Plan *) node;
+	scanstate->ss.ps.state = estate;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * create expression context for node
+	 */
+	ExecAssignExprContext(estate, &scanstate->ss.ps);
+
+	/*
+	 * initialize child expressions
+	 */
+	scanstate->ss.ps.targetlist = (List *)
+		ExecInitExpr((Expr *) node->plan.targetlist,
+					 (PlanState *) scanstate);
+	scanstate->ss.ps.qual = (List *)
+		ExecInitExpr((Expr *) node->plan.qual,
+					 (PlanState *) scanstate);
+
+	/*
+	 * tuple table initialization
+	 */
+	ExecInitResultTupleSlot(estate, &scanstate->ss.ps);
+	ExecInitScanTupleSlot(estate, &scanstate->ss);
+
+	/*
+	 * initialize scan relation
+	 */
+	InitPartialScanRelation(scanstate, estate, eflags);
+
+	scanstate->ss.ps.ps_TupFromTlist = false;
+
+	/*
+	 * Initialize result tuple type and projection info.
+	 */
+	ExecAssignResultTypeFromTL(&scanstate->ss.ps);
+	ExecAssignScanProjectionInfo(&scanstate->ss);
+
+	scanstate->scan_initialized = false;
+
+	return scanstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecPartialSeqScan(node)
+ *
+ *		Scans the relation and returns the next qualifying tuple.
+ *		We call the ExecScan() routine and pass it the appropriate
+ *		access method functions.
+ * ----------------------------------------------------------------
+ */
+TupleTableSlot *
+ExecPartialSeqScan(PartialSeqScanState *node)
+{
+	/*
+	 * Initialize the scan on first execution.  Normally we initialize it
+	 * during the ExecutorStart phase; however, this node needs the
+	 * ParallelHeapScanDesc to initialize the scan, and that is set up by
+	 * the Gather node during the ExecutorRun phase.
+	 */
+	if (!node->scan_initialized)
+	{
+		/*
+		 * If a scan descriptor already exists at this point, this is a
+		 * rescan, so re-initialize the existing descriptor.
+		 */
+
+		if (!node->ss.ss_currentScanDesc)
+		{
+			node->ss.ss_currentScanDesc =
+				heap_beginscan_parallel(node->ss.ss_currentRelation, node->pscan);
+		}
+		else
+		{
+			heap_parallel_rescan(node->pscan, node->ss.ss_currentScanDesc);
+		}
+
+		node->scan_initialized = true;
+	}
+
+	return ExecScan((ScanState *) node,
+					(ExecScanAccessMtd) PartialSeqNext,
+					(ExecScanRecheckMtd) PartialSeqRecheck);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndPartialSeqScan
+ *
+ *		frees any storage allocated through C routines.
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndPartialSeqScan(PartialSeqScanState *node)
+{
+	Relation	relation;
+	HeapScanDesc scanDesc;
+
+	/*
+	 * get information from node
+	 */
+	relation = node->ss.ss_currentRelation;
+	scanDesc = node->ss.ss_currentScanDesc;
+
+	/*
+	 * Free the exprcontext
+	 */
+	ExecFreeExprContext(&node->ss.ps);
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+
+	/*
+	 * close heap scan
+	 */
+	if (scanDesc)
+		heap_endscan(scanDesc);
+
+	/*
+	 * close the heap relation.
+	 */
+	ExecCloseScanRelation(relation);
+}
+
+/* ----------------------------------------------------------------
+ *						Join Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecReScanPartialSeqScan
+ *
+ *		Rescans the relation.
+ * ----------------------------------------------------------------
+ */
+void
+ExecReScanPartialSeqScan(PartialSeqScanState *node)
+{
+	if (node->scan_initialized)
+		node->scan_initialized = false;
+
+	ExecScanReScan((ScanState *) node);
+}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 0b4ab23..1261737 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -384,6 +384,22 @@ _copySeqScan(const SeqScan *from)
 }
 
 /*
+ * _copyPartialSeqScan
+ */
+static PartialSeqScan *
+_copyPartialSeqScan(const PartialSeqScan *from)
+{
+	PartialSeqScan    *newnode = makeNode(PartialSeqScan);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopyScanFields((const Scan *) from, (Scan *) newnode);
+
+	return newnode;
+}
+
+/*
  * _copySampleScan
  */
 static SampleScan *
@@ -4263,6 +4279,9 @@ copyObject(const void *from)
 		case T_SeqScan:
 			retval = _copySeqScan(from);
 			break;
+		case T_PartialSeqScan:
+			retval = _copyPartialSeqScan(from);
+			break;
 		case T_SampleScan:
 			retval = _copySampleScan(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index df7f6e1..25e780f 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -460,6 +460,14 @@ _outSeqScan(StringInfo str, const SeqScan *node)
 }
 
 static void
+_outPartialSeqScan(StringInfo str, const PartialSeqScan *node)
+{
+	WRITE_NODE_TYPE("PARTIALSEQSCAN");
+
+	_outScanInfo(str, (const Scan *) node);
+}
+
+static void
 _outSampleScan(StringInfo str, const SampleScan *node)
 {
 	WRITE_NODE_TYPE("SAMPLESCAN");
@@ -3019,6 +3027,9 @@ _outNode(StringInfo str, const void *obj)
 			case T_SeqScan:
 				_outSeqScan(str, obj);
 				break;
+			case T_PartialSeqScan:
+				_outPartialSeqScan(str, obj);
+				break;
 			case T_SampleScan:
 				_outSampleScan(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 5802a73..4405f38 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1606,6 +1606,19 @@ _readSeqScan(void)
 }
 
 /*
+ * _readPartialSeqScan
+ */
+static PartialSeqScan *
+_readPartialSeqScan(void)
+{
+	READ_LOCALS_NO_FIELDS(PartialSeqScan);
+
+	ReadCommonScan(local_node);
+
+	READ_DONE();
+}
+
+/*
  * _readSampleScan
  */
 static SampleScan *
@@ -2336,6 +2349,8 @@ parseNodeString(void)
 		return_value = _readScan();
 	else if (MATCH("SEQSCAN", 7))
 		return_value = _readSeqScan();
+	else if (MATCH("PARTIALSEQSCAN", 14))
+		return_value = _readPartialSeqScan();
 	else if (MATCH("SAMPLESCAN", 10))
 		return_value = _readSampleScan();
 	else if (MATCH("INDEXSCAN", 9))
diff --git a/src/backend/optimizer/path/Makefile b/src/backend/optimizer/path/Makefile
index 6864a62..6e462b1 100644
--- a/src/backend/optimizer/path/Makefile
+++ b/src/backend/optimizer/path/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = allpaths.o clausesel.o costsize.o equivclass.o indxpath.o \
-       joinpath.o joinrels.o pathkeys.o tidpath.o
+       joinpath.o joinrels.o pathkeys.o parallelpath.o tidpath.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 8fc1cfd..c2ae95d 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -477,6 +477,9 @@ set_plain_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
 	/* Consider sequential scan */
 	add_path(rel, create_seqscan_path(root, rel, required_outer));
 
+	/* Consider parallel scans */
+	create_parallelscan_paths(root, rel, required_outer);
+
 	/* Consider index scans */
 	create_index_paths(root, rel);
 
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 1b61fd9..3239cec 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -227,6 +227,49 @@ cost_seqscan(Path *path, PlannerInfo *root,
 }
 
 /*
+ * cost_partialseqscan
+ *	  Determines and returns the cost of scanning a relation partially.
+ *
+ * 'baserel' is the relation to be scanned
+ * 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
+ * 'nworkers' is the number of workers among which the work will be
+ *			distributed
+ */
+void
+cost_partialseqscan(Path *path, PlannerInfo *root,
+					RelOptInfo *baserel, ParamPathInfo *param_info,
+					int nworkers)
+{
+	Cost		startup_cost = 0;
+	Cost		run_cost = 0;
+
+	cost_seqscan(path, root, baserel, param_info);
+
+	startup_cost = path->startup_cost;
+
+	run_cost = path->total_cost - startup_cost;
+
+	/*
+	 * Account for small cost for communication related to scan via the
+	 * ParallelHeapScanDesc.
+	 */
+	run_cost += 0.01;
+
+	/*
+	 * Runtime cost will be equally shared by all workers. Here the assumption
+	 * is that disk access cost will also be equally shared between workers,
+	 * which is generally true unless there are too many workers working on a
+	 * relatively small number of blocks.  If we come across any such case,
+	 * then we can think of changing the current cost model for partial
+	 * sequential scan.
+	 */
+	run_cost = run_cost / (nworkers + 1);
+
+	path->startup_cost = startup_cost;
+	path->total_cost = startup_cost + run_cost;
+}
+
+/*
  * cost_samplescan
  *	  Determines and returns the cost of scanning a relation using sampling.
  *
diff --git a/src/backend/optimizer/path/parallelpath.c b/src/backend/optimizer/path/parallelpath.c
new file mode 100644
index 0000000..02d2392
--- /dev/null
+++ b/src/backend/optimizer/path/parallelpath.c
@@ -0,0 +1,132 @@
+/*-------------------------------------------------------------------------
+ *
+ * parallelpath.c
+ *	  Routines to determine parallel paths for scanning a given relation.
+ *
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/optimizer/path/parallelpath.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heapam.h"
+#include "optimizer/clauses.h"
+#include "optimizer/cost.h"
+#include "optimizer/pathnode.h"
+#include "optimizer/paths.h"
+#include "parser/parsetree.h"
+#include "utils/rel.h"
+
+
+/*
+ * expr_is_parallel_safe
+ *	  is a particular expression parallel safe
+ *
+ * Conditions checked here:
+ *
+ * 1. The expression must not contain any parallel unsafe or parallel
+ * restricted functions.
+ *
+ * 2. The expression must not contain any initplan or subplan.  We can
+ * probably remove this restriction once we have support of infrastructure
+ * for execution of initplans and subplans at parallel (Gather) nodes.
+ */
+bool
+expr_is_parallel_safe(Node *node)
+{
+	if (check_parallel_safety(node, false))
+		return false;
+
+	if (contain_subplans_or_initplans(node))
+		return false;
+
+	return true;
+}
+
+/*
+ * create_parallelscan_paths
+ *	  Create paths corresponding to parallel scans of the given rel.
+ *	  Currently we only support partial sequential scan.
+ *
+ *	  Candidate paths are added to the rel's pathlist (using add_path).
+ */
+void
+create_parallelscan_paths(PlannerInfo *root, RelOptInfo *rel,
+						  Relids required_outer)
+{
+	int			num_parallel_workers = 0;
+	int			estimated_parallel_workers = 0;
+	Oid			reloid;
+	Relation	relation;
+	Path	   *subpath;
+	ListCell   *l;
+
+	/*
+	 * Parallel scan is possible only if the user has set max_parallel_degree
+	 * to a value greater than 0 and the query is parallel-safe.
+	 */
+	if (max_parallel_degree <= 0 || !root->glob->parallelModeOK)
+		return;
+
+	/*
+	 * There should be at least a thousand pages to scan for each worker. This
+	 * number is somewhat arbitrary, however we don't want to spawn workers
+	 * to scan smaller relations as that will be costly.
+	 */
+	estimated_parallel_workers = rel->pages / 1000;
+
+	if (estimated_parallel_workers <= 0)
+		return;
+
+	reloid = planner_rt_fetch(rel->relid, root)->relid;
+
+	relation = heap_open(reloid, NoLock);
+
+	/*
+	 * Temporary relations can't be scanned by parallel workers as they are
+	 * visible only to local sessions.
+	 */
+	if (RelationUsesLocalBuffers(relation))
+	{
+		heap_close(relation, NoLock);
+		return;
+	}
+
+	heap_close(relation, NoLock);
+
+	/*
+	 * Allow parallel paths only if all the clauses for relation are parallel
+	 * safe.  We can allow execution of parallel restricted clauses in master
+	 * backend, but for that planner should have infrastructure to pull all
+	 * the parallel restricted clauses from below nodes to the Gather node
+	 * which will then execute such clauses in master backend.
+	 */
+	foreach(l, rel->baserestrictinfo)
+	{
+		RestrictInfo *rinfo = (RestrictInfo *) lfirst(l);
+
+		if (!expr_is_parallel_safe((Node *) rinfo->clause))
+			return;
+	}
+
+	num_parallel_workers = Min(max_parallel_degree,
+							   estimated_parallel_workers);
+
+	/*
+	 * Create the partial scan path which each worker backend needs to
+	 * execute.
+	 */
+	subpath = create_partialseqscan_path(root, rel, required_outer,
+										 num_parallel_workers);
+
+	/* Create the gather path which the master backend needs to execute. */
+	add_path(rel, (Path *) create_gather_path(root, rel, subpath,
+											  required_outer,
+											  num_parallel_workers));
+}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 0ee7392..d06a826 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -58,6 +58,8 @@ static Material *create_material_plan(PlannerInfo *root, MaterialPath *best_path
 static Plan *create_unique_plan(PlannerInfo *root, UniquePath *best_path);
 static SeqScan *create_seqscan_plan(PlannerInfo *root, Path *best_path,
 					List *tlist, List *scan_clauses);
+static Scan *create_partialseqscan_plan(PlannerInfo *root, Path *best_path,
+						   List *tlist, List *scan_clauses);
 static SampleScan *create_samplescan_plan(PlannerInfo *root, Path *best_path,
 					   List *tlist, List *scan_clauses);
 static Gather *create_gather_plan(PlannerInfo *root,
@@ -104,6 +106,8 @@ static List *order_qual_clauses(PlannerInfo *root, List *clauses);
 static void copy_path_costsize(Plan *dest, Path *src);
 static void copy_plan_costsize(Plan *dest, Plan *src);
 static SeqScan *make_seqscan(List *qptlist, List *qpqual, Index scanrelid);
+static PartialSeqScan *make_partialseqscan(List *qptlist, List *qpqual,
+					Index scanrelid);
 static SampleScan *make_samplescan(List *qptlist, List *qpqual, Index scanrelid,
 				TableSampleClause *tsc);
 static Gather *make_gather(List *qptlist, List *qpqual,
@@ -237,6 +241,7 @@ create_plan_recurse(PlannerInfo *root, Path *best_path)
 	switch (best_path->pathtype)
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
 		case T_SampleScan:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
@@ -357,6 +362,13 @@ create_scan_plan(PlannerInfo *root, Path *best_path)
 												scan_clauses);
 			break;
 
+		case T_PartialSeqScan:
+			plan = (Plan *) create_partialseqscan_plan(root,
+													   best_path,
+													   tlist,
+													   scan_clauses);
+			break;
+
 		case T_SampleScan:
 			plan = (Plan *) create_samplescan_plan(root,
 												   best_path,
@@ -567,6 +579,7 @@ disuse_physical_tlist(PlannerInfo *root, Plan *plan, Path *path)
 	switch (path->pathtype)
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
 		case T_SampleScan:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
@@ -1184,6 +1197,46 @@ create_seqscan_plan(PlannerInfo *root, Path *best_path,
 }
 
 /*
+ * create_partialseqscan_plan
+ *
+ * Returns a partial seqscan plan for the base relation scanned by
+ * 'best_path' with restriction clauses 'scan_clauses' and targetlist
+ * 'tlist'.
+ */
+static Scan *
+create_partialseqscan_plan(PlannerInfo *root, Path *best_path,
+						   List *tlist, List *scan_clauses)
+{
+	Scan	   *scan_plan;
+	Index		scan_relid = best_path->parent->relid;
+
+	/* it should be a base rel... */
+	Assert(scan_relid > 0);
+	Assert(best_path->parent->rtekind == RTE_RELATION);
+
+	/* Sort clauses into best execution order */
+	scan_clauses = order_qual_clauses(root, scan_clauses);
+
+	/* Reduce RestrictInfo list to bare expressions; ignore pseudoconstants */
+	scan_clauses = extract_actual_clauses(scan_clauses, false);
+
+	/* Replace any outer-relation variables with nestloop params */
+	if (best_path->param_info)
+	{
+		scan_clauses = (List *)
+			replace_nestloop_params(root, (Node *) scan_clauses);
+	}
+
+	scan_plan = (Scan *) make_partialseqscan(tlist,
+											 scan_clauses,
+											 scan_relid);
+
+	copy_path_costsize(&scan_plan->plan, best_path);
+
+	return scan_plan;
+}
+
+/*
  * create_samplescan_plan
  *	 Returns a samplescan plan for the base relation scanned by 'best_path'
  *	 with restriction clauses 'scan_clauses' and targetlist 'tlist'.
@@ -3478,6 +3531,24 @@ make_seqscan(List *qptlist,
 	return node;
 }
 
+static PartialSeqScan *
+make_partialseqscan(List *qptlist,
+					List *qpqual,
+					Index scanrelid)
+{
+	PartialSeqScan *node = makeNode(PartialSeqScan);
+	Plan	   *plan = &node->plan;
+
+	/* cost should be inserted by caller */
+	plan->targetlist = qptlist;
+	plan->qual = qpqual;
+	plan->lefttree = NULL;
+	plan->righttree = NULL;
+	node->scanrelid = scanrelid;
+
+	return node;
+}
+
 static SampleScan *
 make_samplescan(List *qptlist,
 				List *qpqual,
@@ -5169,6 +5240,7 @@ is_projection_capable_plan(Plan *plan)
 		case T_Append:
 		case T_MergeAppend:
 		case T_RecursiveUnion:
+		case T_Gather:
 			return false;
 		default:
 			break;
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index e1ee67c..4ebbabf 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -201,13 +201,13 @@ standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
 	glob->hasRowSecurity = false;
 
 	/*
-	 * Assess whether it's feasible to use parallel mode for this query.
-	 * We can't do this in a standalone backend, or if the command will
-	 * try to modify any data, or if this is a cursor operation, or if any
+	 * Assess whether it's feasible to use parallel mode for this query. We
+	 * can't do this in a standalone backend, or if the command will try to
+	 * modify any data, or if this is a cursor operation, or if any
 	 * parallel-unsafe functions are present in the query tree.
 	 *
-	 * For now, we don't try to use parallel mode if we're running inside
-	 * a parallel worker.  We might eventually be able to relax this
+	 * For now, we don't try to use parallel mode if we're running inside a
+	 * parallel worker.  We might eventually be able to relax this
 	 * restriction, but for now it seems best not to have parallel workers
 	 * trying to create their own parallel workers.
 	 */
@@ -215,7 +215,7 @@ standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
 		IsUnderPostmaster && dynamic_shared_memory_type != DSM_IMPL_NONE &&
 		parse->commandType == CMD_SELECT && !parse->hasModifyingCTE &&
 		parse->utilityStmt == NULL && !IsParallelWorker() &&
-		!contain_parallel_unsafe((Node *) parse);
+		!check_parallel_safety((Node *) parse, true);
 
 	/*
 	 * glob->parallelModeOK should tell us whether it's necessary to impose
@@ -228,9 +228,9 @@ standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
 	 *
 	 * (It's been suggested that we should always impose these restrictions
 	 * whenever glob->parallelModeOK is true, so that it's easier to notice
-	 * incorrectly-labeled functions sooner.  That might be the right thing
-	 * to do, but for now I've taken this approach.  We could also control
-	 * this with a GUC.)
+	 * incorrectly-labeled functions sooner.  That might be the right thing to
+	 * do, but for now I've taken this approach.  We could also control this
+	 * with a GUC.)
 	 *
 	 * FIXME: It's assumed that code further down will set parallelModeNeeded
 	 * to true if a parallel path is actually chosen.  Since the core
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 9392d61..aff78fc 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -447,6 +447,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
 			{
 				SeqScan    *splan = (SeqScan *) plan;
 
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 6b32f85..b85e4f6 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2234,6 +2234,7 @@ finalize_plan(PlannerInfo *root, Plan *plan, Bitmapset *valid_params,
 			break;
 
 		case T_SeqScan:
+		case T_PartialSeqScan:
 			context.paramids = bms_add_members(context.paramids, scan_params);
 			break;
 
diff --git a/src/backend/optimizer/util/clauses.c b/src/backend/optimizer/util/clauses.c
index f2c8551..2355cc6 100644
--- a/src/backend/optimizer/util/clauses.c
+++ b/src/backend/optimizer/util/clauses.c
@@ -87,16 +87,25 @@ typedef struct
 	char	   *prosrc;
 } inline_error_callback_arg;
 
+typedef struct
+{
+	bool		allow_restricted;
+}	check_parallel_safety_arg;
+
 static bool contain_agg_clause_walker(Node *node, void *context);
 static bool count_agg_clauses_walker(Node *node,
 						 count_agg_clauses_context *context);
 static bool find_window_functions_walker(Node *node, WindowFuncLists *lists);
 static bool expression_returns_set_rows_walker(Node *node, double *count);
 static bool contain_subplans_walker(Node *node, void *context);
+static bool contain_subplans_or_initplans_walker(Node *node, void *context);
 static bool contain_mutable_functions_walker(Node *node, void *context);
 static bool contain_volatile_functions_walker(Node *node, void *context);
 static bool contain_volatile_functions_not_nextval_walker(Node *node, void *context);
-static bool contain_parallel_unsafe_walker(Node *node, void *context);
+static bool check_parallel_safety_walker(Node *node,
+							 check_parallel_safety_arg * context);
+static bool parallel_too_dangerous(char proparallel,
+					   check_parallel_safety_arg * context);
 static bool contain_nonstrict_functions_walker(Node *node, void *context);
 static bool contain_leaked_vars_walker(Node *node, void *context);
 static Relids find_nonnullable_rels_walker(Node *node, bool top_level);
@@ -1204,13 +1213,16 @@ contain_volatile_functions_not_nextval_walker(Node *node, void *context)
  *****************************************************************************/
 
 bool
-contain_parallel_unsafe(Node *node)
+check_parallel_safety(Node *node, bool allow_restricted)
 {
-	return contain_parallel_unsafe_walker(node, NULL);
+	check_parallel_safety_arg context;
+
+	context.allow_restricted = allow_restricted;
+	return check_parallel_safety_walker(node, &context);
 }
 
 static bool
-contain_parallel_unsafe_walker(Node *node, void *context)
+check_parallel_safety_walker(Node *node, check_parallel_safety_arg * context)
 {
 	if (node == NULL)
 		return false;
@@ -1218,7 +1230,7 @@ contain_parallel_unsafe_walker(Node *node, void *context)
 	{
 		FuncExpr   *expr = (FuncExpr *) node;
 
-		if (func_parallel(expr->funcid) == PROPARALLEL_UNSAFE)
+		if (parallel_too_dangerous(func_parallel(expr->funcid), context))
 			return true;
 		/* else fall through to check args */
 	}
@@ -1227,7 +1239,7 @@ contain_parallel_unsafe_walker(Node *node, void *context)
 		OpExpr	   *expr = (OpExpr *) node;
 
 		set_opfuncid(expr);
-		if (func_parallel(expr->opfuncid) == PROPARALLEL_UNSAFE)
+		if (parallel_too_dangerous(func_parallel(expr->opfuncid), context))
 			return true;
 		/* else fall through to check args */
 	}
@@ -1236,7 +1248,7 @@ contain_parallel_unsafe_walker(Node *node, void *context)
 		DistinctExpr *expr = (DistinctExpr *) node;
 
 		set_opfuncid((OpExpr *) expr);	/* rely on struct equivalence */
-		if (func_parallel(expr->opfuncid) == PROPARALLEL_UNSAFE)
+		if (parallel_too_dangerous(func_parallel(expr->opfuncid), context))
 			return true;
 		/* else fall through to check args */
 	}
@@ -1245,7 +1257,7 @@ contain_parallel_unsafe_walker(Node *node, void *context)
 		NullIfExpr *expr = (NullIfExpr *) node;
 
 		set_opfuncid((OpExpr *) expr);	/* rely on struct equivalence */
-		if (func_parallel(expr->opfuncid) == PROPARALLEL_UNSAFE)
+		if (parallel_too_dangerous(func_parallel(expr->opfuncid), context))
 			return true;
 		/* else fall through to check args */
 	}
@@ -1254,7 +1266,7 @@ contain_parallel_unsafe_walker(Node *node, void *context)
 		ScalarArrayOpExpr *expr = (ScalarArrayOpExpr *) node;
 
 		set_sa_opfuncid(expr);
-		if (func_parallel(expr->opfuncid) == PROPARALLEL_UNSAFE)
+		if (parallel_too_dangerous(func_parallel(expr->opfuncid), context))
 			return true;
 		/* else fall through to check args */
 	}
@@ -1268,12 +1280,12 @@ contain_parallel_unsafe_walker(Node *node, void *context)
 		/* check the result type's input function */
 		getTypeInputInfo(expr->resulttype,
 						 &iofunc, &typioparam);
-		if (func_parallel(iofunc) == PROPARALLEL_UNSAFE)
+		if (parallel_too_dangerous(func_parallel(iofunc), context))
 			return true;
 		/* check the input type's output function */
 		getTypeOutputInfo(exprType((Node *) expr->arg),
 						  &iofunc, &typisvarlena);
-		if (func_parallel(iofunc) == PROPARALLEL_UNSAFE)
+		if (parallel_too_dangerous(func_parallel(iofunc), context))
 			return true;
 		/* else fall through to check args */
 	}
@@ -1282,7 +1294,7 @@ contain_parallel_unsafe_walker(Node *node, void *context)
 		ArrayCoerceExpr *expr = (ArrayCoerceExpr *) node;
 
 		if (OidIsValid(expr->elemfuncid) &&
-			func_parallel(expr->elemfuncid) == PROPARALLEL_UNSAFE)
+			parallel_too_dangerous(func_parallel(expr->elemfuncid), context))
 			return true;
 		/* else fall through to check args */
 	}
@@ -1294,28 +1306,77 @@ contain_parallel_unsafe_walker(Node *node, void *context)
 
 		foreach(opid, rcexpr->opnos)
 		{
-			if (op_volatile(lfirst_oid(opid)) == PROPARALLEL_UNSAFE)
+			if (parallel_too_dangerous(op_volatile(lfirst_oid(opid)), context))
 				return true;
 		}
 		/* else fall through to check args */
 	}
 	else if (IsA(node, Query))
 	{
-		Query *query = (Query *) node;
+		Query	   *query = (Query *) node;
 
 		if (query->rowMarks != NULL)
 			return true;
 
 		/* Recurse into subselects */
 		return query_tree_walker(query,
-								 contain_parallel_unsafe_walker,
+								 check_parallel_safety_walker,
 								 context, 0);
 	}
 	return expression_tree_walker(node,
-								  contain_parallel_unsafe_walker,
+								  check_parallel_safety_walker,
 								  context);
 }
 
+static bool
+parallel_too_dangerous(char proparallel, check_parallel_safety_arg * context)
+{
+	if (context->allow_restricted)
+		return proparallel == PROPARALLEL_UNSAFE;
+	else
+		return proparallel != PROPARALLEL_SAFE;
+}
+
+/*
+ * contain_subplans_or_initplans
+ *	  Recursively search for initplan or subplan nodes within a clause.
+ *
+ * A special purpose function for prohibiting subplan or initplan clauses
+ * in parallel query constructs.
+ *
+ * If we see any form of SubPlan node, we will return TRUE.  For InitPlan's,
+ * we return true when we see the Param node, apart from that InitPlan
+ * can contain a simple NULL constant for MULTIEXPR subquery (see comments
+ * in make_subplan), however it is okay not to care about the same as that
+ * is only possible for Update statement which is anyway prohibited.
+ *
+ * Returns true if any subplan or initplan is found.
+ */
+bool
+contain_subplans_or_initplans(Node *clause)
+{
+	return contain_subplans_or_initplans_walker(clause, NULL);
+}
+
+static bool
+contain_subplans_or_initplans_walker(Node *node, void *context)
+{
+	if (node == NULL)
+		return false;
+	if (IsA(node, SubPlan) ||
+		IsA(node, AlternativeSubPlan) ||
+		IsA(node, SubLink))
+		return true;			/* abort the tree traversal and return true */
+	else if (IsA(node, Param))
+	{
+		Param	   *paramval = (Param *) node;
+
+		if (paramval->paramkind == PARAM_EXEC)
+			return true;
+	}
+	return expression_tree_walker(node, contain_subplans_or_initplans_walker, context);
+}
+
 /*****************************************************************************
  *		Check clauses for nonstrict functions
  *****************************************************************************/
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 1895a68..2fd7ae5 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -712,6 +712,28 @@ create_seqscan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer)
 }
 
 /*
+ * create_partialseqscan_path
+ *	  Creates a path corresponding to a partial sequential scan, returning the
+ *	  pathnode.
+ */
+Path *
+create_partialseqscan_path(PlannerInfo *root, RelOptInfo *rel,
+						   Relids required_outer, int nworkers)
+{
+	Path	   *pathnode = makeNode(Path);
+
+	pathnode->pathtype = T_PartialSeqScan;
+	pathnode->parent = rel;
+	pathnode->param_info = get_baserel_parampathinfo(root, rel,
+													 required_outer);
+	pathnode->pathkeys = NIL;	/* partialseqscan has unordered result */
+
+	cost_partialseqscan(pathnode, root, rel, pathnode->param_info, nworkers);
+
+	return pathnode;
+}
+
+/*
  * create_samplescan_path
  *	  Creates a path node for a sampled table scan.
  */
diff --git a/src/include/executor/nodePartialSeqscan.h b/src/include/executor/nodePartialSeqscan.h
new file mode 100644
index 0000000..77e5311
--- /dev/null
+++ b/src/include/executor/nodePartialSeqscan.h
@@ -0,0 +1,31 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodePartialSeqscan.h
+ *		prototypes for nodePartialSeqscan.c
+ *
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodePartialSeqscan.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEPARTIALSEQSCAN_H
+#define NODEPARTIALSEQSCAN_H
+
+#include "nodes/execnodes.h"
+
+extern void ExecPartialSeqScanEstimate(PartialSeqScanState *node,
+						   ParallelContext *pcxt);
+extern void ExecPartialSeqScanInitializeDSM(PartialSeqScanState *node,
+								ParallelContext *pcxt);
+extern void ExecPartialSeqScanInitParallelScanDesc(PartialSeqScanState *node,
+									   shm_toc *toc);
+extern PartialSeqScanState *ExecInitPartialSeqScan(PartialSeqScan *node,
+					   EState *estate, int eflags);
+extern TupleTableSlot *ExecPartialSeqScan(PartialSeqScanState *node);
+extern void ExecEndPartialSeqScan(PartialSeqScanState *node);
+extern void ExecReScanPartialSeqScan(PartialSeqScanState *node);
+
+#endif   /* NODEPARTIALSEQSCAN_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index b6895f9..05ca049 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -16,6 +16,7 @@
 
 #include "access/genam.h"
 #include "access/heapam.h"
+#include "access/parallel.h"
 #include "executor/instrument.h"
 #include "lib/pairingheap.h"
 #include "nodes/params.h"
@@ -1254,6 +1255,20 @@ typedef struct ScanState
  */
 typedef ScanState SeqScanState;
 
+/*
+ * PartialSeqScanState extends ScanState by storing additional information
+ * related to scan.
+ */
+typedef struct PartialSeqScanState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	bool		scan_initialized;		/* used to determine if the scan is
+										 * initialized */
+	ParallelHeapScanDesc	pscan;	/* parallel heap scan descriptor
+									 * for partial scan */
+	Size		pscan_len;		/* size of parallel heap scan descriptor */
+} PartialSeqScanState;
+
 /* ----------------
  *	 SampleScanState information
  * ----------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 94bdb7c..71496b9 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -51,6 +51,7 @@ typedef enum NodeTag
 	T_BitmapOr,
 	T_Scan,
 	T_SeqScan,
+	T_PartialSeqScan,
 	T_SampleScan,
 	T_IndexScan,
 	T_IndexOnlyScan,
@@ -99,6 +100,7 @@ typedef enum NodeTag
 	T_BitmapOrState,
 	T_ScanState,
 	T_SeqScanState,
+	T_PartialSeqScanState,
 	T_SampleScanState,
 	T_IndexScanState,
 	T_IndexOnlyScanState,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 1f9213c..a65cd22 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -72,7 +72,7 @@ typedef struct PlannedStmt
 
 	bool		hasRowSecurity; /* row security applied? */
 
-	bool		parallelModeNeeded; /* parallel mode required to execute? */
+	bool		parallelModeNeeded;		/* parallel mode required to execute? */
 } PlannedStmt;
 
 /* macro for fetching the Plan associated with a SubPlan node */
@@ -287,6 +287,12 @@ typedef struct Scan
 typedef Scan SeqScan;
 
 /* ----------------
+ *		partial sequential scan node
+ * ----------------
+ */
+typedef SeqScan PartialSeqScan;
+
+/* ----------------
  *		table sample scan node
  * ----------------
  */
diff --git a/src/include/optimizer/clauses.h b/src/include/optimizer/clauses.h
index 5ac79b1..747b05b 100644
--- a/src/include/optimizer/clauses.h
+++ b/src/include/optimizer/clauses.h
@@ -62,7 +62,8 @@ extern bool contain_subplans(Node *clause);
 extern bool contain_mutable_functions(Node *clause);
 extern bool contain_volatile_functions(Node *clause);
 extern bool contain_volatile_functions_not_nextval(Node *clause);
-extern bool contain_parallel_unsafe(Node *node);
+extern bool check_parallel_safety(Node *node, bool allow_restricted);
+extern bool contain_subplans_or_initplans(Node *clause);
 extern bool contain_nonstrict_functions(Node *clause);
 extern bool contain_leaked_vars(Node *clause);
 
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 25a7303..8640567 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -73,6 +73,9 @@ extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,
 					double index_pages, PlannerInfo *root);
 extern void cost_seqscan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 			 ParamPathInfo *param_info);
+extern void cost_partialseqscan(Path *path, PlannerInfo *root,
+					RelOptInfo *baserel, ParamPathInfo *param_info,
+					int nworkers);
 extern void cost_samplescan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 				ParamPathInfo *param_info);
 extern void cost_index(IndexPath *path, PlannerInfo *root,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 7a4940c..3b97b73 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -32,6 +32,8 @@ extern bool add_path_precheck(RelOptInfo *parent_rel,
 
 extern Path *create_seqscan_path(PlannerInfo *root, RelOptInfo *rel,
 					Relids required_outer);
+extern Path *create_partialseqscan_path(PlannerInfo *root, RelOptInfo *rel,
+						   Relids required_outer, int nworkers);
 extern Path *create_samplescan_path(PlannerInfo *root, RelOptInfo *rel,
 					   Relids required_outer);
 extern IndexPath *create_index_path(PlannerInfo *root,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 87123a5..6cd4479 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -55,6 +55,15 @@ extern void debug_print_rel(PlannerInfo *root, RelOptInfo *rel);
 #endif
 
 /*
+ * parallelpath.c
+ *	  routines to generate parallel scan paths
+ */
+
+extern void create_parallelscan_paths(PlannerInfo *root, RelOptInfo *rel,
+						  Relids required_outer);
+extern bool expr_is_parallel_safe(Node *node);
+
+/*
  * indxpath.c
  *	  routines to generate index paths
  */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index feb821b..5492ba0 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1199,6 +1199,8 @@ OverrideStackEntry
 PACE_HEADER
 PACL
 ParallelExecutorInfo
+PartialSeqScan
+PartialSeqScanState
 PATH
 PBOOL
 PCtxtHandle
#394Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#385)
Re: Parallel Seq Scan

On Mon, Oct 12, 2015 at 9:16 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Sun, Oct 11, 2015 at 7:56 PM, Noah Misch <noah@leadboat.com> wrote:

I see no mention in this thread of varatt_indirect, but I anticipated
datumSerialize() reacting to it the same way datumCopy() reacts. If
datumSerialize() can get away without doing so, why is that?

Good point. I don't think it can. Attached is a patch to fix that.
This patch also includes some somewhat-related changes to
plpgsql_param_fetch() upon which I would appreciate any input you can
provide.

plpgsql_param_fetch() assumes that it can detect whether it's being
called from copyParamList() by checking whether params !=
estate->paramLI. I don't know why this works, but I do know that this
test fails to detect the case where it's being called from
SerializeParamList(), which causes failures in exec_eval_datum() as
predicted. Calls from SerializeParamList() need the same treatment as
calls from copyParamList() because it, too, will try to evaluate every
parameter in the list.

From what I understood by looking at the code in this area, I think the
check params != estate->paramLI and the code under it are required for
parameters that are set up by setup_unshared_param_list(). Now, unshared
params are only created for cursors and for expressions that pass a R/W
object pointer. For cursors we explicitly prohibit parallel plan
generation, and I am not sure it makes sense to generate parallel plans
for expressions involving a R/W object pointer. If we don't generate
parallel plans where expressions involve such parameters, then
SerializeParamList() should not be affected by the check you mentioned.
Is this by any chance happening because you are testing by forcing a
Gather node on top of all kinds of plans?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#395Haribabu Kommi
kommi.haribabu@gmail.com
In reply to: Amit Kapila (#393)
Re: Parallel Seq Scan

On Thu, Oct 15, 2015 at 6:32 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Oct 14, 2015 at 3:29 AM, Robert Haas <robertmhaas@gmail.com> wrote:

I think this got messed up while rebasing on top of Gather node
changes, but nonetheless, I have changed it such that PartialSeqScan
node handling is after SeqScan.

Currently, EXPLAIN ANALYZE of a parallel seq scan plan does not show the
number of allocated workers along with the planned workers. I feel this
information helps users understand the performance difference that comes
with parallel seq scan. It may have been missed in the recent patch
series. It was discussed in [1].

Currently there is no qualification evaluation at the Result and Gather
nodes. Because of this, any query that contains a parallel-restricted
function is not chosen for parallel scan, so at present there is no
difference between parallel-restricted and parallel-unsafe functions.
Is that fine for the first version?

[1]: /messages/by-id/CA+TgmobhQ0_+YObMLbJexvt4QEf6XbLfUdaX1OwL-ivgaN5qxw@mail.gmail.com

Regards,
Hari Babu
Fujitsu Australia

#396Amit Kapila
amit.kapila16@gmail.com
In reply to: Haribabu Kommi (#395)
Re: Parallel Seq Scan

On Thu, Oct 15, 2015 at 5:39 PM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:

On Thu, Oct 15, 2015 at 6:32 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Oct 14, 2015 at 3:29 AM, Robert Haas <robertmhaas@gmail.com> wrote:

I think this got messed up while rebasing on top of Gather node
changes, but nonetheless, I have changed it such that PartialSeqScan
node handling is after SeqScan.

Currently, the explain analyze of parallel seq scan plan is not showing
the allocated number of workers including the planned workers. I feel this
information is good for users in understanding the performance difference
that is coming with parallel seq scan. It may be missed in recent patch
series. It was discussed in [1].

I am aware of that and purposely kept it for a subsequent patch.
There are other things as well which I have left out of this patch,
and those are:
a. Early stop of executor for Rescan purpose
b. Support of pushdown for plans containing InitPlan and SubPlans

Then there is more related work like
a. Support for prepared statements

Basically, I think these could be done as add-on patches once we are
done with the basic patch.

Currently there is no qualification evaluation at Result and Gather
nodes, because of this reason, if any query that contains any parallel
restricted functions is not chosen for parallel scan. Because of this
reason, there is no difference between parallel restricted and parallel
unsafe functions currently.

This requires new infrastructure to pull restricted functions from
lower nodes to Gather node, so I have left it for another day.
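
To make the restricted-versus-unsafe distinction concrete, here is a minimal standalone sketch of the decision that parallel_too_dangerous() makes in the patch. The function name too_dangerous() and the plain bool argument (instead of the patch's context struct) are simplifications for illustration; the three label characters are assumed to match pg_proc.proparallel.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed to mirror pg_proc.proparallel: safe, restricted, unsafe. */
#define PROPARALLEL_SAFE		's'
#define PROPARALLEL_RESTRICTED	'r'
#define PROPARALLEL_UNSAFE		'u'

/*
 * Simplified stand-in for the patch's parallel_too_dangerous().
 *
 * allow_restricted = true:  only unsafe functions block parallelism
 *                           (the whole-query check in standard_planner).
 * allow_restricted = false: anything not marked safe blocks it
 *                           (a check for expressions run inside a worker).
 */
static bool
too_dangerous(char proparallel, bool allow_restricted)
{
	assert(proparallel == PROPARALLEL_SAFE ||
		   proparallel == PROPARALLEL_RESTRICTED ||
		   proparallel == PROPARALLEL_UNSAFE);

	if (allow_restricted)
		return proparallel == PROPARALLEL_UNSAFE;
	return proparallel != PROPARALLEL_SAFE;
}
```

With allow_restricted set to false, a restricted function is rejected exactly like an unsafe one, which is the behavior Haribabu describes for anything that would have to run below the Gather node.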

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#397Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#394)
Re: Parallel Seq Scan

On Thu, Oct 15, 2015 at 7:00 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Oct 12, 2015 at 9:16 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Sun, Oct 11, 2015 at 7:56 PM, Noah Misch <noah@leadboat.com> wrote:

I see no mention in this thread of varatt_indirect, but I anticipated
datumSerialize() reacting to it the same way datumCopy() reacts. If
datumSerialize() can get away without doing so, why is that?

Good point. I don't think it can. Attached is a patch to fix that.
This patch also includes some somewhat-related changes to
plpgsql_param_fetch() upon which I would appreciate any input you can
provide.

plpgsql_param_fetch() assumes that it can detect whether it's being
called from copyParamList() by checking whether params !=
estate->paramLI. I don't know why this works, but I do know that this
test fails to detect the case where it's being called from
SerializeParamList(), which causes failures in exec_eval_datum() as
predicted. Calls from SerializeParamList() need the same treatment as
calls from copyParamList() because it, too, will try to evaluate every
parameter in the list.

From what I understood by looking at the code in this area, I think the
check params != estate->paramLI and the code under it are required for
parameters that are set up by setup_unshared_param_list(). Now, unshared
params are only created for cursors and for expressions that pass a R/W
object pointer. For cursors we explicitly prohibit parallel plan
generation, and I am not sure it makes sense to generate parallel plans
for expressions involving a R/W object pointer. If we don't generate
parallel plans where expressions involve such parameters, then
SerializeParamList() should not be affected by the check you mentioned.
Is this by any chance happening because you are testing by forcing a
Gather node on top of all kinds of plans?

Yeah, but I think the scenario is legitimate. When a query gets run
from within PL/pgsql, parallelism is an option, at least as we have
the code today. So if a Gather were present, and the query used a
parameter, then you could have this issue. For example:

SELECT * FROM bigtable WHERE unindexed_column = some_plpgsql_variable;

So this can happen, I think, even with parallel sequential scan only,
even if Gather node is not otherwise used.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#398Robert Haas
robertmhaas@gmail.com
In reply to: Noah Misch (#392)
Re: Parallel Seq Scan

On Thu, Oct 15, 2015 at 1:51 AM, Noah Misch <noah@leadboat.com> wrote:

Tests of prm->prmtype and paramLI->paramFetch appear superfluous. Given that
the paramListCopyHook callback would return a complete substitute
ParamListInfo, I wouldn't expect SerializeParamList() to examine the
original paramLI->params at all. If that's correct, the paramListCopyHook
design sounds fine. However, its implementation will be more complex than
paramMask would have been.

Well, I think there are two use cases we care about. If the
ParamListInfo came from Bind parameters sent via a protocol message,
then it will neither have a copy method nor require one. If it came
from some source that plays fancy games, like PL/pgsql, then it needs
a safe way to copy the list.

This wouldn't require any
modification to the current plpgsql_param_fetch() at all, but the new
function would steal its bms_is_member() test. Furthermore, no user
of ParamListInfo other than plpgsql needs to care at all -- which,
with your proposals, they would.

To my knowledge, none of these approaches would compel existing users to care.
They would leave paramMask or paramListCopyHook NULL and get today's behavior.

Well, looking at this proposal:

Bitmapset *paramMask; /* if non-NULL, ignore params lacking a 1-bit */

I read that to imply that every consumer of ParamListInfo objects
would need to account for the possibility of getting one with a
non-NULL paramMask. Would it work to define this as "if non-NULL,
params lacking a 1-bit may be safely ignored"? Or some other tweak
that basically says that you don't need to care about this, but you
can if you want to.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#399Noah Misch
noah@leadboat.com
In reply to: Robert Haas (#398)
Re: Parallel Seq Scan

On Thu, Oct 15, 2015 at 12:05:53PM -0400, Robert Haas wrote:

On Thu, Oct 15, 2015 at 1:51 AM, Noah Misch <noah@leadboat.com> wrote:

This wouldn't require any
modification to the current plpgsql_param_fetch() at all, but the new
function would steal its bms_is_member() test. Furthermore, no user
of ParamListInfo other than plpgsql needs to care at all -- which,
with your proposals, they would.

To my knowledge, none of these approaches would compel existing users to care.
They would leave paramMask or paramListCopyHook NULL and get today's behavior.

Well, looking at this proposal:

Bitmapset *paramMask; /* if non-NULL, ignore params lacking a 1-bit */

I read that to imply that every consumer of ParamListInfo objects
would need to account for the possibility of getting one with a
non-NULL paramMask.

Agreed. More specifically, I had in mind for copyParamList() to check the
mask while e.g. ExecEvalParamExtern() would either check nothing or merely
assert that any mask included the requested parameter. It would be tricky to
verify that as safe, so ...

Would it work to define this as "if non-NULL,
params lacking a 1-bit may be safely ignored"? Or some other tweak
that basically says that you don't need to care about this, but you
can if you want to.

... this is a better specification.
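
The "may be safely ignored" wording can be illustrated with a small sketch. Everything here is a hypothetical stand-in: ToyParam, ToyParamList, the single-word uint32_t mask, and both functions are made up for the example and are not PostgreSQL's ParamListInfo, Bitmapset, or copyParamList(). The point is only that a copier may consult the mask, while a list with a NULL mask behaves as it does today.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_PARAMS 8

/* Hypothetical stand-ins, not PostgreSQL's ParamExternData/ParamListInfo. */
typedef struct ToyParam
{
	int			value;
	bool		isnull;
} ToyParam;

typedef struct ToyParamList
{
	int			numParams;
	/* if non-NULL, params lacking a 1-bit may be safely ignored */
	const uint32_t *paramMask;
	ToyParam	params[TOY_MAX_PARAMS];
} ToyParamList;

/* A consumer that chooses to honor the mask asks this first. */
static bool
toy_param_is_needed(const ToyParamList *list, int i)
{
	assert(i >= 0 && i < list->numParams);
	if (list->paramMask == NULL)
		return true;			/* no mask: today's behavior */
	return ((*list->paramMask >> i) & 1u) != 0;
}

/*
 * Copy src into dst, skipping masked-out params (stored as NULL instead).
 * Returns the number of params actually copied.
 */
static int
toy_copy_param_list(const ToyParamList *src, ToyParamList *dst)
{
	int			copied = 0;

	assert(src->numParams <= TOY_MAX_PARAMS);
	dst->numParams = src->numParams;
	dst->paramMask = NULL;		/* the copy holds only what was needed */
	for (int i = 0; i < src->numParams; i++)
	{
		if (toy_param_is_needed(src, i))
		{
			dst->params[i] = src->params[i];
			copied++;
		}
		else
		{
			dst->params[i].value = 0;
			dst->params[i].isnull = true;
		}
	}
	return copied;
}
```

A consumer that never looks at paramMask still sees every slot of the original list, so under this reading existing users would not need to change.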


#400Haribabu Kommi
kommi.haribabu@gmail.com
In reply to: Amit Kapila (#396)
1 attachment(s)
Re: Parallel Seq Scan

On Thu, Oct 15, 2015 at 11:45 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Oct 15, 2015 at 5:39 PM, Haribabu Kommi <kommi.haribabu@gmail.com>
wrote:

On Thu, Oct 15, 2015 at 6:32 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Wed, Oct 14, 2015 at 3:29 AM, Robert Haas <robertmhaas@gmail.com>
wrote:
I think this got messed up while rebasing on top of Gather node
changes, but nonetheless, I have changed it such that PartialSeqScan
node handling is after SeqScan.

Currently, the explain analyze of parallel seq scan plan is not showing
the allocated number of workers including the planned workers. I feel this
information is good for users in understanding the performance difference
that is coming with parallel seq scan. It may be missed in recent patch
series. It was discussed in [1].

I am aware of that and purposefully kept it for a consecutive patch.
There are other things as well which I have left out from this patch
and those are:
a. Early stop of executor for Rescan purpose
b. Support of pushdown for plans containing InitPlan and SubPlans

Then there is more related work like
a. Support for prepared statements

OK.

While testing the latest patch, I found a deadlock between a worker and
the backend on a relation lock. To minimize the test scenario, I set the
number of pages required to start one worker to 1 and all parallel cost
parameters to zero.

The backend is waiting for tuples from the workers, and the workers are
waiting on the relation lock. Attached is the SQL script that reproduces
this issue.

Regards,
Hari Babu
Fujitsu Australia

Attachments:

parallel_hang_with_cluster.sql (application/octet-stream)
#401Haribabu Kommi
kommi.haribabu@gmail.com
In reply to: Haribabu Kommi (#400)
2 attachment(s)
Re: Parallel Seq Scan

On Fri, Oct 16, 2015 at 2:10 PM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:

On Thu, Oct 15, 2015 at 11:45 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Oct 15, 2015 at 5:39 PM, Haribabu Kommi <kommi.haribabu@gmail.com>
wrote:

On Thu, Oct 15, 2015 at 6:32 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Wed, Oct 14, 2015 at 3:29 AM, Robert Haas <robertmhaas@gmail.com>
wrote:
I think this got messed up while rebasing on top of Gather node
changes, but nonetheless, I have changed it such that PartialSeqScan
node handling is after SeqScan.

Currently, the explain analyze of parallel seq scan plan is not showing
the allocated number of workers including the planned workers. I feel this
information is good for users in understanding the performance difference
that is coming with parallel seq scan. It may be missed in recent patch
series. It was discussed in [1].

I am aware of that and purposefully kept it for a consecutive patch.
There are other things as well which I have left out from this patch
and those are:
a. Early stop of executor for Rescan purpose
b. Support of pushdown for plans containing InitPlan and SubPlans

Then there is more related work like
a. Support for prepared statements

OK.

During the test with latest patch, I found a dead lock between worker
and backend on relation lock. To minimize the test scenario, I changed
the number of pages required to start one worker to 1 and all parallel
cost parameters as zero.

Backend is waiting for the tuples from workers, workers are waiting on
lock of relation. Attached is the sql script that can reproduce this issue.

Some more tests failed with similar configuration settings:
1. A table created inside a BEGIN block is not visible in the worker.
2. A permission problem on the worker side after a SET ROLE command.

Regards,
Hari Babu
Fujitsu Australia

Attachments:

parallel_table_doesn't_exist_problem.sql (application/octet-stream)
parallel_set_role_permission_problem.sql (application/octet-stream)
#402Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#381)
3 attachment(s)
Re: Parallel Seq Scan

On Mon, Oct 5, 2015 at 8:20 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

[ new patch for heapam.c changes ]

I went over this in a fair amount of detail and reworked it somewhat.
The result is attached as parallel-heapscan-revised.patch. I think
the way I did this is a bit cleaner than what you had, although it's
basically the same thing. There are fewer changes to initscan, and we
don't need one function to initialize the starting block that must be
called in each worker and then another one to get the next block, and
generally the changes are a bit more localized. I also went over the
comments and, I think, improved them. I tweaked the logic for
reporting the starting scan position as the last position report; I
think the way you had it the last report would be for one block
earlier. I'm pretty happy with this version and hope to commit it
soon.
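
As a rough illustration of the "get the next block" logic described above, here is a standalone sketch. It is not code from the patch: SharedScan, shared_scan_init and shared_scan_nextpage are hypothetical names, a C11 atomic_flag stands in for the PostgreSQL spinlock, and the syncscan start-position lookup and location reporting are left out. Every participant calls the same function, and the shared counter wraps around until it returns to the starting block.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define INVALID_BLOCK UINT32_MAX	/* stand-in for InvalidBlockNumber */

/* Hypothetical, simplified stand-in for the shared parallel scan state. */
typedef struct SharedScan
{
	atomic_flag	mutex;			/* toy spinlock protecting cblock */
	uint32_t	nblocks;		/* relation size when the scan started */
	uint32_t	startblock;		/* block where the scan began */
	uint32_t	cblock;			/* next block to hand out, or INVALID_BLOCK */
} SharedScan;

static void
shared_scan_init(SharedScan *s, uint32_t nblocks, uint32_t startblock)
{
	assert(nblocks == 0 || startblock < nblocks);
	atomic_flag_clear(&s->mutex);
	s->nblocks = nblocks;
	s->startblock = startblock;
	/* An empty relation has nothing to hand out. */
	s->cblock = (nblocks > 0) ? startblock : INVALID_BLOCK;
}

/* Return the next block to scan, or INVALID_BLOCK once all are handed out. */
static uint32_t
shared_scan_nextpage(SharedScan *s)
{
	uint32_t	page;

	/* Spin until the flag is ours. */
	while (atomic_flag_test_and_set_explicit(&s->mutex, memory_order_acquire))
		;

	page = s->cblock;
	if (page != INVALID_BLOCK)
	{
		s->cblock++;
		if (s->cblock >= s->nblocks)
			s->cblock = 0;				/* wrap to the start of the relation */
		if (s->cblock == s->startblock)
			s->cblock = INVALID_BLOCK;	/* back where we began: scan is done */
	}

	atomic_flag_clear_explicit(&s->mutex, memory_order_release);
	return page;
}
```

With four blocks and a starting block of 2, successive calls in this sketch hand out 2, 3, 0, 1 and then INVALID_BLOCK to every later caller, whichever process happens to make the call.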

There's a second patch attached here as well, parallel-relaunch.patch,
which makes it possible to relaunch workers with the same parallel
context. Currently, after you WaitForParallelWorkersToFinish(), you
must proceed without fail to DestroyParallelContext(). With this
rather simple patch, you have the option to instead go back and again
LaunchParallelWorkers(), which is nice because it avoids the overhead
of setting up a new DSM and filling it with all of your transaction
state a second time. I'd like to commit this as well, and I think we
should revise execParallel.c to use it.
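
The call ordering this allows can be modelled with a toy state machine. The sketch below is only an illustration under simplifying assumptions: ToyContext and the toy_* functions are hypothetical and do not touch the real parallel.c API. It just shows that launch and wait may alternate any number of times while the expensive setup happens once, and that a relaunch is refused while the previous workers are still running.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum
{
	TOY_IDLE,					/* no workers running; launch is allowed */
	TOY_RUNNING,				/* workers launched and not yet waited for */
	TOY_DESTROYED
} ToyState;

/* Hypothetical stand-in for a parallel context. */
typedef struct ToyContext
{
	ToyState	state;
	int			setups;			/* times the shared state was built */
	int			launches;		/* times workers were launched */
} ToyContext;

/* Models creating the context and setting up shared memory once. */
static void
toy_create(ToyContext *c)
{
	c->state = TOY_IDLE;
	c->setups = 1;
	c->launches = 0;
}

/* Launch is only legal when no earlier workers are still running. */
static bool
toy_launch(ToyContext *c)
{
	if (c->state != TOY_IDLE)
		return false;
	c->state = TOY_RUNNING;
	c->launches++;
	return true;
}

/* Waiting returns the context to a state where it can be relaunched. */
static bool
toy_wait(ToyContext *c)
{
	if (c->state != TOY_RUNNING)
		return false;
	c->state = TOY_IDLE;
	return true;
}

static void
toy_destroy(ToyContext *c)
{
	assert(c->state != TOY_DESTROYED);
	c->state = TOY_DESTROYED;
}
```

In this model a create, launch, wait, launch, wait, destroy sequence ends with two launches and a single setup, which is the saving described above.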

Finally, I've attached some test code in parallel-dummy.patch. This
demonstrates how the code in 0001 and 0002 can be used. It scans a
relation, counts the tuples, and then gratuitously rescans it and
counts the tuples again. This shows that rescanning works and that
the syncscan position gets left in the right place.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

parallel-dummy.patch (application/x-patch)
From d083f8a9e5a45ea5d13358e2d70ef9405caee62f Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Thu, 15 Oct 2015 19:21:38 -0400
Subject: [PATCH 3/3] parallel_dummy

---
 contrib/parallel_dummy/Makefile                |  19 ++++
 contrib/parallel_dummy/parallel_dummy--1.0.sql |   7 ++
 contrib/parallel_dummy/parallel_dummy.c        | 151 +++++++++++++++++++++++++
 contrib/parallel_dummy/parallel_dummy.control  |   4 +
 4 files changed, 181 insertions(+)
 create mode 100644 contrib/parallel_dummy/Makefile
 create mode 100644 contrib/parallel_dummy/parallel_dummy--1.0.sql
 create mode 100644 contrib/parallel_dummy/parallel_dummy.c
 create mode 100644 contrib/parallel_dummy/parallel_dummy.control

diff --git a/contrib/parallel_dummy/Makefile b/contrib/parallel_dummy/Makefile
new file mode 100644
index 0000000..de00f50
--- /dev/null
+++ b/contrib/parallel_dummy/Makefile
@@ -0,0 +1,19 @@
+MODULE_big = parallel_dummy
+OBJS = parallel_dummy.o $(WIN32RES)
+PGFILEDESC = "parallel_dummy - dummy use of parallel infrastructure"
+
+EXTENSION = parallel_dummy
+DATA = parallel_dummy--1.0.sql
+
+REGRESS = parallel_dummy
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/parallel_dummy
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/parallel_dummy/parallel_dummy--1.0.sql b/contrib/parallel_dummy/parallel_dummy--1.0.sql
new file mode 100644
index 0000000..d49bd0f
--- /dev/null
+++ b/contrib/parallel_dummy/parallel_dummy--1.0.sql
@@ -0,0 +1,7 @@
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION parallel_dummy" to load this file. \quit
+
+CREATE FUNCTION parallel_count(rel pg_catalog.regclass,
+							  nworkers pg_catalog.int4)
+    RETURNS pg_catalog.int8 STRICT
+	AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/contrib/parallel_dummy/parallel_dummy.c b/contrib/parallel_dummy/parallel_dummy.c
new file mode 100644
index 0000000..a87a573
--- /dev/null
+++ b/contrib/parallel_dummy/parallel_dummy.c
@@ -0,0 +1,151 @@
+/*--------------------------------------------------------------------------
+ *
+ * parallel_dummy.c
+ *		Test harness code for parallel mode code.
+ *
+ * Copyright (C) 2013-2014, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		contrib/parallel_dummy/parallel_dummy.c
+ *
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/heapam.h"
+#include "access/parallel.h"
+#include "access/relscan.h"
+#include "access/xact.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "storage/spin.h"
+#include "utils/builtins.h"
+#include "utils/guc.h"
+#include "utils/snapmgr.h"
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(parallel_count);
+
+#define		TOC_SCAN_KEY			1
+#define		TOC_RESULT_KEY			2
+
+void		_PG_init(void);
+void		count_worker_main(dsm_segment *seg, shm_toc *toc);
+
+static void count_helper(HeapScanDesc pscan, int64 *resultp);
+
+Datum
+parallel_count(PG_FUNCTION_ARGS)
+{
+	Oid			relid = PG_GETARG_OID(0);
+	int32		nworkers = PG_GETARG_INT32(1);
+	int64	   *resultp;
+	int64		result;
+	ParallelContext *pcxt;
+	ParallelHeapScanDesc pscan;
+	HeapScanDesc	scan;
+	Relation	rel;
+	Size		sz;
+
+	if (nworkers < 0)
+		ereport(ERROR,
+				(errmsg("number of parallel workers must be non-negative")));
+
+	rel = relation_open(relid, AccessShareLock);
+
+	EnterParallelMode();
+
+	pcxt = CreateParallelContextForExternalFunction("parallel_dummy",
+											 "count_worker_main",
+											 nworkers);
+	sz = heap_parallelscan_estimate(GetActiveSnapshot());
+	shm_toc_estimate_chunk(&pcxt->estimator, sz);
+	shm_toc_estimate_chunk(&pcxt->estimator, sizeof(int64));
+	shm_toc_estimate_keys(&pcxt->estimator, 2);
+	InitializeParallelDSM(pcxt);
+	pscan = shm_toc_allocate(pcxt->toc, sz);
+	heap_parallelscan_initialize(pscan, rel, GetActiveSnapshot());
+	shm_toc_insert(pcxt->toc, TOC_SCAN_KEY, pscan);
+	resultp = shm_toc_allocate(pcxt->toc, sizeof(int64));
+	shm_toc_insert(pcxt->toc, TOC_RESULT_KEY, resultp);
+
+	LaunchParallelWorkers(pcxt);
+
+	scan = heap_beginscan_parallel(rel, pscan);
+
+	/* here's where we do the "real work" ... */
+	count_helper(scan, resultp);
+
+	WaitForParallelWorkersToFinish(pcxt);
+
+	heap_rescan(scan, NULL);
+	*resultp = 0;
+
+	LaunchParallelWorkers(pcxt);
+
+	result = *resultp;
+
+	/* and here's where we do the real work again */
+	count_helper(scan, resultp);
+
+	WaitForParallelWorkersToFinish(pcxt);
+
+	result = *resultp;
+
+	heap_endscan(scan);
+
+	DestroyParallelContext(pcxt);
+
+	relation_close(rel, AccessShareLock);
+
+	ExitParallelMode();
+
+	PG_RETURN_INT64(result);
+}
+
+void
+count_worker_main(dsm_segment *seg, shm_toc *toc)
+{
+	ParallelHeapScanDesc	pscan;
+	Relation	rel;
+	HeapScanDesc	scan;
+	int64	   *resultp;
+
+	pscan = shm_toc_lookup(toc, TOC_SCAN_KEY);
+	resultp = shm_toc_lookup(toc, TOC_RESULT_KEY);
+	Assert(pscan != NULL && resultp != NULL);
+
+	rel = relation_open(pscan->phs_relid, AccessShareLock);
+	scan = heap_beginscan_parallel(rel, pscan);
+	count_helper(scan, resultp);
+	heap_endscan(scan);
+	relation_close(rel, AccessShareLock);
+}
+
+static void
+count_helper(HeapScanDesc scan, int64 *resultp)
+{
+	int64		mytuples = 0;
+	BlockNumber	firstblock = InvalidBlockNumber;
+
+	for (;;)
+	{
+		HeapTuple	tup = heap_getnext(scan, ForwardScanDirection);
+
+		if (!HeapTupleIsValid(tup))
+			break;
+		if (firstblock == InvalidBlockNumber)
+			firstblock = scan->rs_cblock;
+
+		++mytuples;
+	}
+
+	SpinLockAcquire(&scan->rs_parallel->phs_mutex); /* dirty hack */
+	*resultp += mytuples;
+	SpinLockRelease(&scan->rs_parallel->phs_mutex);
+
+	elog(NOTICE, "PID %d counted " INT64_FORMAT " tuples starting at block %u",
+		MyProcPid, mytuples, firstblock);
+}
diff --git a/contrib/parallel_dummy/parallel_dummy.control b/contrib/parallel_dummy/parallel_dummy.control
new file mode 100644
index 0000000..90bae3f
--- /dev/null
+++ b/contrib/parallel_dummy/parallel_dummy.control
@@ -0,0 +1,4 @@
+comment = 'Dummy parallel code'
+default_version = '1.0'
+module_pathname = '$libdir/parallel_dummy'
+relocatable = true
-- 
2.3.8 (Apple Git-58)

parallel-heapscan-revised.patch (application/x-patch)
From a826e52d7c2c0213866007e05b20a0c3b3f1eb77 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Thu, 15 Oct 2015 19:21:22 -0400
Subject: [PATCH 2/3] Parallel heapscan stuff.

---
 src/backend/access/heap/heapam.c | 244 +++++++++++++++++++++++++++++++++++++--
 src/include/access/heapam.h      |   8 +-
 src/include/access/relscan.h     |  20 ++++
 3 files changed, 261 insertions(+), 11 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index bcf9871..66deb1f 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -63,6 +63,7 @@
 #include "storage/predicate.h"
 #include "storage/procarray.h"
 #include "storage/smgr.h"
+#include "storage/spin.h"
 #include "storage/standby.h"
 #include "utils/datum.h"
 #include "utils/inval.h"
@@ -80,12 +81,14 @@ bool		synchronize_seqscans = true;
 static HeapScanDesc heap_beginscan_internal(Relation relation,
 						Snapshot snapshot,
 						int nkeys, ScanKey key,
+						ParallelHeapScanDesc parallel_scan,
 						bool allow_strat,
 						bool allow_sync,
 						bool allow_pagemode,
 						bool is_bitmapscan,
 						bool is_samplescan,
 						bool temp_snap);
+static BlockNumber heap_parallelscan_nextpage(HeapScanDesc scan);
 static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
 					TransactionId xid, CommandId cid, int options);
 static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
@@ -226,7 +229,10 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
 	 * results for a non-MVCC snapshot, the caller must hold some higher-level
 	 * lock that ensures the interesting tuple(s) won't change.)
 	 */
-	scan->rs_nblocks = RelationGetNumberOfBlocks(scan->rs_rd);
+	if (scan->rs_parallel != NULL)
+		scan->rs_nblocks = scan->rs_parallel->phs_nblocks;
+	else
+		scan->rs_nblocks = RelationGetNumberOfBlocks(scan->rs_rd);
 
 	/*
 	 * If the table is large relative to NBuffers, use a bulk-read access
@@ -237,7 +243,8 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
 	 * behaviors, independently of the size of the table; also there is a GUC
 	 * variable that can disable synchronized scanning.)
 	 *
-	 * During a rescan, don't make a new strategy object if we don't have to.
+	 * Note that heap_parallelscan_initialize has a very similar test; if you
+	 * change this, consider changing that one, too.
 	 */
 	if (!RelationUsesLocalBuffers(scan->rs_rd) &&
 		scan->rs_nblocks > NBuffers / 4)
@@ -250,6 +257,7 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
 
 	if (allow_strat)
 	{
+		/* During a rescan, keep the previous strategy object. */
 		if (scan->rs_strategy == NULL)
 			scan->rs_strategy = GetAccessStrategy(BAS_BULKREAD);
 	}
@@ -260,7 +268,12 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
 		scan->rs_strategy = NULL;
 	}
 
-	if (keep_startblock)
+	if (scan->rs_parallel != NULL)
+	{
+		/* For parallel scan, believe whatever ParallelHeapScanDesc says. */
+		scan->rs_syncscan = scan->rs_parallel->phs_syncscan;
+	}
+	else if (keep_startblock)
 	{
 		/*
 		 * When rescanning, we want to keep the previous startblock setting,
@@ -496,7 +509,20 @@ heapgettup(HeapScanDesc scan,
 				tuple->t_data = NULL;
 				return;
 			}
-			page = scan->rs_startblock; /* first page */
+			if (scan->rs_parallel != NULL)
+			{
+				page = heap_parallelscan_nextpage(scan);
+
+				/* Other processes might have already finished the scan. */
+				if (page == InvalidBlockNumber)
+				{
+					Assert(!BufferIsValid(scan->rs_cbuf));
+					tuple->t_data = NULL;
+					return;
+				}
+			}
+			else
+				page = scan->rs_startblock;		/* first page */
 			heapgetpage(scan, page);
 			lineoff = FirstOffsetNumber;		/* first offnum */
 			scan->rs_inited = true;
@@ -519,6 +545,9 @@ heapgettup(HeapScanDesc scan,
 	}
 	else if (backward)
 	{
+		/* backward parallel scan not supported */
+		Assert(scan->rs_parallel == NULL);
+
 		if (!scan->rs_inited)
 		{
 			/*
@@ -669,6 +698,11 @@ heapgettup(HeapScanDesc scan,
 				page = scan->rs_nblocks;
 			page--;
 		}
+		else if (scan->rs_parallel != NULL)
+		{
+			page = heap_parallelscan_nextpage(scan);
+			finished = (page == InvalidBlockNumber);
+		}
 		else
 		{
 			page++;
@@ -773,7 +807,20 @@ heapgettup_pagemode(HeapScanDesc scan,
 				tuple->t_data = NULL;
 				return;
 			}
-			page = scan->rs_startblock; /* first page */
+			if (scan->rs_parallel != NULL)
+			{
+				page = heap_parallelscan_nextpage(scan);
+
+				/* Other processes might have already finished the scan. */
+				if (page == InvalidBlockNumber)
+				{
+					Assert(!BufferIsValid(scan->rs_cbuf));
+					tuple->t_data = NULL;
+					return;
+				}
+			}
+			else
+				page = scan->rs_startblock;		/* first page */
 			heapgetpage(scan, page);
 			lineindex = 0;
 			scan->rs_inited = true;
@@ -793,6 +840,9 @@ heapgettup_pagemode(HeapScanDesc scan,
 	}
 	else if (backward)
 	{
+		/* backward parallel scan not supported */
+		Assert(scan->rs_parallel == NULL);
+
 		if (!scan->rs_inited)
 		{
 			/*
@@ -932,6 +982,11 @@ heapgettup_pagemode(HeapScanDesc scan,
 				page = scan->rs_nblocks;
 			page--;
 		}
+		else if (scan->rs_parallel != NULL)
+		{
+			page = heap_parallelscan_nextpage(scan);
+			finished = (page == InvalidBlockNumber);
+		}
 		else
 		{
 			page++;
@@ -1341,7 +1396,7 @@ HeapScanDesc
 heap_beginscan(Relation relation, Snapshot snapshot,
 			   int nkeys, ScanKey key)
 {
-	return heap_beginscan_internal(relation, snapshot, nkeys, key,
+	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
 								   true, true, true, false, false, false);
 }
 
@@ -1351,7 +1406,7 @@ heap_beginscan_catalog(Relation relation, int nkeys, ScanKey key)
 	Oid			relid = RelationGetRelid(relation);
 	Snapshot	snapshot = RegisterSnapshot(GetCatalogSnapshot(relid));
 
-	return heap_beginscan_internal(relation, snapshot, nkeys, key,
+	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
 								   true, true, true, false, false, true);
 }
 
@@ -1360,7 +1415,7 @@ heap_beginscan_strat(Relation relation, Snapshot snapshot,
 					 int nkeys, ScanKey key,
 					 bool allow_strat, bool allow_sync)
 {
-	return heap_beginscan_internal(relation, snapshot, nkeys, key,
+	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
 								   allow_strat, allow_sync, true,
 								   false, false, false);
 }
@@ -1369,7 +1424,7 @@ HeapScanDesc
 heap_beginscan_bm(Relation relation, Snapshot snapshot,
 				  int nkeys, ScanKey key)
 {
-	return heap_beginscan_internal(relation, snapshot, nkeys, key,
+	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
 								   false, false, true, true, false, false);
 }
 
@@ -1378,7 +1433,7 @@ heap_beginscan_sampling(Relation relation, Snapshot snapshot,
 						int nkeys, ScanKey key,
 					  bool allow_strat, bool allow_sync, bool allow_pagemode)
 {
-	return heap_beginscan_internal(relation, snapshot, nkeys, key,
+	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
 								   allow_strat, allow_sync, allow_pagemode,
 								   false, true, false);
 }
@@ -1386,6 +1441,7 @@ heap_beginscan_sampling(Relation relation, Snapshot snapshot,
 static HeapScanDesc
 heap_beginscan_internal(Relation relation, Snapshot snapshot,
 						int nkeys, ScanKey key,
+						ParallelHeapScanDesc parallel_scan,
 						bool allow_strat,
 						bool allow_sync,
 						bool allow_pagemode,
@@ -1418,6 +1474,7 @@ heap_beginscan_internal(Relation relation, Snapshot snapshot,
 	scan->rs_allow_strat = allow_strat;
 	scan->rs_allow_sync = allow_sync;
 	scan->rs_temp_snap = temp_snap;
+	scan->rs_parallel = parallel_scan;
 
 	/*
 	 * we can use page-at-a-time mode if it's an MVCC-safe snapshot
@@ -1473,6 +1530,25 @@ heap_rescan(HeapScanDesc scan,
 	 * reinitialize scan descriptor
 	 */
 	initscan(scan, key, true);
+
+	/*
+	 * reset parallel scan, if present
+	 */
+	if (scan->rs_parallel != NULL)
+	{
+		ParallelHeapScanDesc parallel_scan;
+
+		/*
+		 * Caller is responsible for making sure that all workers have
+		 * finished the scan before calling this, so it really shouldn't be
+		 * necessary to acquire the mutex at all.  We acquire it anyway, just
+		 * to be tidy.
+		 */
+		parallel_scan = scan->rs_parallel;
+		SpinLockAcquire(&parallel_scan->phs_mutex);
+		parallel_scan->phs_cblock = parallel_scan->phs_startblock;
+		SpinLockRelease(&parallel_scan->phs_mutex);
+	}
 }
 
 /* ----------------
@@ -1532,6 +1608,154 @@ heap_endscan(HeapScanDesc scan)
 }
 
 /* ----------------
+ *		heap_parallelscan_estimate - estimate storage for ParallelHeapScanDesc
+ *
+ *		Sadly, this doesn't reduce to a constant, because the size required
+ *		to serialize the snapshot can vary.
+ * ----------------
+ */
+Size
+heap_parallelscan_estimate(Snapshot snapshot)
+{
+	return add_size(offsetof(ParallelHeapScanDescData, phs_snapshot_data),
+					EstimateSnapshotSpace(snapshot));
+}
+
+/* ----------------
+ *		heap_parallelscan_initialize - initialize ParallelHeapScanDesc
+ *
+ *		Must allow as many bytes of shared memory as returned by
+ *		heap_parallelscan_estimate.  Call this just once in the leader
+ *		process; then, individual workers attach via heap_beginscan_parallel.
+ * ----------------
+ */
+void
+heap_parallelscan_initialize(ParallelHeapScanDesc target, Relation relation,
+							 Snapshot snapshot)
+{
+	target->phs_relid = RelationGetRelid(relation);
+	target->phs_nblocks = RelationGetNumberOfBlocks(relation);
+	/* compare phs_syncscan initialization to similar logic in initscan */
+	target->phs_syncscan = synchronize_seqscans &&
+		!RelationUsesLocalBuffers(relation) &&
+		target->phs_nblocks > NBuffers / 4;
+	SpinLockInit(&target->phs_mutex);
+	target->phs_cblock = InvalidBlockNumber;
+	target->phs_startblock = InvalidBlockNumber;
+	SerializeSnapshot(snapshot, target->phs_snapshot_data);
+}
+
+/* ----------------
+ *		heap_beginscan_parallel - join a parallel scan
+ *
+ *		Caller must hold a suitable lock on the correct relation.
+ * ----------------
+ */
+HeapScanDesc
+heap_beginscan_parallel(Relation relation, ParallelHeapScanDesc parallel_scan)
+{
+	Snapshot	snapshot;
+
+	Assert(RelationGetRelid(relation) == parallel_scan->phs_relid);
+	snapshot = RestoreSnapshot(parallel_scan->phs_snapshot_data);
+	RegisterSnapshot(snapshot);
+
+	return heap_beginscan_internal(relation, snapshot, 0, NULL, parallel_scan,
+								   true, true, true, false, false, true);
+}
+
+/* ----------------
+ *		heap_parallelscan_nextpage - get the next page to scan
+ *
+ *		Get the next page to scan.  Even if there are no pages left to scan,
+ *		another backend could have grabbed a page to scan and not yet finished
+ *		looking at it, so it doesn't follow that the scan is done when the
+ *		first backend gets an InvalidBlockNumber return.
+ * ----------------
+ */
+static BlockNumber
+heap_parallelscan_nextpage(HeapScanDesc scan)
+{
+	BlockNumber page = InvalidBlockNumber;
+	BlockNumber sync_startpage = InvalidBlockNumber;
+	BlockNumber	report_page = InvalidBlockNumber;
+	ParallelHeapScanDesc parallel_scan;
+
+	Assert(scan->rs_parallel);
+	parallel_scan = scan->rs_parallel;
+
+retry:
+	/* Grab the spinlock. */
+	SpinLockAcquire(&parallel_scan->phs_mutex);
+
+	/*
+	 * If the scan's startblock has not yet been initialized, we must do so
+	 * now.  If this is not a synchronized scan, we just start at block 0, but
+	 * if it is a synchronized scan, we must get the starting position from
+	 * the synchronized scan machinery.  We can't hold the spinlock while
+	 * doing that, though, so release the spinlock, get the information we
+	 * need, and retry.  If nobody else has initialized the scan in the
+	 * meantime, we'll fill in the value we fetched on the second time
+	 * through.
+	 */
+	if (parallel_scan->phs_startblock == InvalidBlockNumber)
+	{
+		if (!parallel_scan->phs_syncscan)
+			parallel_scan->phs_startblock = 0;
+		else if (sync_startpage != InvalidBlockNumber)
+			parallel_scan->phs_startblock = sync_startpage;
+		else
+		{
+			SpinLockRelease(&parallel_scan->phs_mutex);
+			sync_startpage = ss_get_location(scan->rs_rd, scan->rs_nblocks);
+			goto retry;
+		}
+		parallel_scan->phs_cblock = parallel_scan->phs_startblock;
+	}
+
+	/*
+	 * The current block number is the next one that needs to be scanned,
+	 * unless it's InvalidBlockNumber already, in which case there are no more
+	 * blocks to scan.  After remembering the current value, we must advance
+	 * it so that the next call to this function returns the next block to be
+	 * scanned.
+	 */
+	page = parallel_scan->phs_cblock;
+	if (page != InvalidBlockNumber)
+	{
+		parallel_scan->phs_cblock++;
+		if (parallel_scan->phs_cblock >= scan->rs_nblocks)
+			parallel_scan->phs_cblock = 0;
+		if (parallel_scan->phs_cblock == parallel_scan->phs_startblock)
+		{
+			parallel_scan->phs_cblock = InvalidBlockNumber;
+			report_page = parallel_scan->phs_startblock;
+		}
+	}
+
+	/* Release the lock. */
+	SpinLockRelease(&parallel_scan->phs_mutex);
+
+	/*
+	 * Report scan location.  Normally, we report the current page number.
+	 * When we reach the end of the scan, though, we report the starting page,
+	 * not the ending page, just so the starting positions for later scans
+	 * don't slew backwards.  We only report the position at the end of the
+	 * scan once, though: subsequent callers will report nothing, since
+	 * they will have page == InvalidBlockNumber.
+	 */
+	if (scan->rs_syncscan)
+	{
+		if (report_page == InvalidBlockNumber)
+			report_page = page;
+		if (report_page != InvalidBlockNumber)
+			ss_report_location(scan->rs_rd, report_page);
+	}
+
+	return page;
+}
+
+/* ----------------
  *		heap_getnext	- retrieve next tuple in scan
  *
  *		Fix to work with index relations.
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 75e6b72..98eeadd 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -96,8 +96,9 @@ extern Relation heap_openrv_extended(const RangeVar *relation,
 
 #define heap_close(r,l)  relation_close(r,l)
 
-/* struct definition appears in relscan.h */
+/* struct definitions appear in relscan.h */
 typedef struct HeapScanDescData *HeapScanDesc;
+typedef struct ParallelHeapScanDescData *ParallelHeapScanDesc;
 
 /*
  * HeapScanIsValid
@@ -126,6 +127,11 @@ extern void heap_rescan_set_params(HeapScanDesc scan, ScanKey key,
 extern void heap_endscan(HeapScanDesc scan);
 extern HeapTuple heap_getnext(HeapScanDesc scan, ScanDirection direction);
 
+extern Size heap_parallelscan_estimate(Snapshot snapshot);
+extern void heap_parallelscan_initialize(ParallelHeapScanDesc target,
+							 Relation relation, Snapshot snapshot);
+extern HeapScanDesc heap_beginscan_parallel(Relation, ParallelHeapScanDesc);
+
 extern bool heap_fetch(Relation relation, Snapshot snapshot,
 		   HeapTuple tuple, Buffer *userbuf, bool keep_buf,
 		   Relation stats_relation);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 6e62319..356c7e6 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -20,6 +20,25 @@
 #include "access/itup.h"
 #include "access/tupdesc.h"
 
+/*
+ * Shared state for parallel heap scan.
+ *
+ * Each backend participating in a parallel heap scan has its own
+ * HeapScanDesc in backend-private memory, and those objects all contain
+ * a pointer to this structure.  The information here must be sufficient
+ * to properly initialize each new HeapScanDesc as workers join the scan,
+ * and it must act as a font of block numbers for those workers.
+ */
+typedef struct ParallelHeapScanDescData
+{
+	Oid			phs_relid;		/* OID of relation to scan */
+	bool		phs_syncscan;	/* report location to syncscan logic? */
+	BlockNumber phs_nblocks;	/* # blocks in relation at start of scan */
+	slock_t		phs_mutex;		/* mutual exclusion for block number fields */
+	BlockNumber phs_startblock; /* starting block number */
+	BlockNumber phs_cblock;		/* current block number */
+	char		phs_snapshot_data[FLEXIBLE_ARRAY_MEMBER];
+}	ParallelHeapScanDescData;
 
 typedef struct HeapScanDescData
 {
@@ -49,6 +68,7 @@ typedef struct HeapScanDescData
 	BlockNumber rs_cblock;		/* current block # in scan, if any */
 	Buffer		rs_cbuf;		/* current buffer in scan, if any */
 	/* NB: if rs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
+	ParallelHeapScanDesc rs_parallel;	/* parallel scan information */
 
 	/* these fields only used in page-at-a-time mode and for bitmap scans */
 	int			rs_cindex;		/* current tuple's index in vistuples */
-- 
2.3.8 (Apple Git-58)

parallel-relaunch.patch (application/x-patch)
From 2b1069b3ad1febc7ec76615607087d1484a8dcba Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Thu, 15 Oct 2015 19:19:40 -0400
Subject: [PATCH 1/3] Make it possible to relaunch workers for an existing
 parallel context.

---
 src/backend/access/transam/README.parallel |  5 ++++
 src/backend/access/transam/parallel.c      | 46 ++++++++++++++++++++++++++++++
 src/include/access/parallel.h              |  1 +
 3 files changed, 52 insertions(+)

diff --git a/src/backend/access/transam/README.parallel b/src/backend/access/transam/README.parallel
index 1005186..dfcbafa 100644
--- a/src/backend/access/transam/README.parallel
+++ b/src/backend/access/transam/README.parallel
@@ -221,3 +221,8 @@ pattern looks like this:
 	DestroyParallelContext(pcxt);
 
 	ExitParallelMode();
+
+If desired, after WaitForParallelWorkersToFinish() has been called, another
+call to LaunchParallelWorkers() can be made using the same parallel context.
+Calls to these two functions can be alternated any number of times before
+destroying the parallel context.
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 29d6ed5..8363c0e 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -392,6 +392,49 @@ LaunchParallelWorkers(ParallelContext *pcxt)
 	/* We might be running in a short-lived memory context. */
 	oldcontext = MemoryContextSwitchTo(TopTransactionContext);
 
+	/*
+	 * This function can be called for a parallel context for which it has
+	 * already been called previously, but only if all of the old workers
+	 * have already exited.  When this case arises, we need to do some extra
+	 * reinitialization.
+	 */
+	if (pcxt->nworkers_launched > 0)
+	{
+		FixedParallelState *fps;
+		char	   *error_queue_space;
+
+		/* Clean out old worker handles. */
+		for (i = 0; i < pcxt->nworkers; ++i)
+		{
+			if (pcxt->worker[i].error_mqh != NULL)
+				elog(ERROR, "previously launched worker still alive");
+			if (pcxt->worker[i].bgwhandle != NULL)
+			{
+				pfree(pcxt->worker[i].bgwhandle);
+				pcxt->worker[i].bgwhandle = NULL;
+			}
+		}
+
+		/* Reset a few bits of fixed parallel state to a clean state. */
+		fps = shm_toc_lookup(pcxt->toc, PARALLEL_KEY_FIXED);
+		fps->workers_attached = 0;
+		fps->last_xlog_end = 0;
+
+		/* Recreate error queues. */
+		error_queue_space =
+			shm_toc_lookup(pcxt->toc, PARALLEL_KEY_ERROR_QUEUE);
+		for (i = 0; i < pcxt->nworkers; ++i)
+		{
+			char	   *start;
+			shm_mq	   *mq;
+
+			start = error_queue_space + i * PARALLEL_ERROR_QUEUE_SIZE;
+			mq = shm_mq_create(start, PARALLEL_ERROR_QUEUE_SIZE);
+			shm_mq_set_receiver(mq, MyProc);
+			pcxt->worker[i].error_mqh = shm_mq_attach(mq, pcxt->seg, NULL);
+		}
+	}
+
 	/* Configure a worker. */
 	snprintf(worker.bgw_name, BGW_MAXLEN, "parallel worker for PID %d",
 			 MyProcPid);
@@ -416,8 +459,11 @@ LaunchParallelWorkers(ParallelContext *pcxt)
 		if (!any_registrations_failed &&
 			RegisterDynamicBackgroundWorker(&worker,
 											&pcxt->worker[i].bgwhandle))
+		{
 			shm_mq_set_handle(pcxt->worker[i].error_mqh,
 							  pcxt->worker[i].bgwhandle);
+			pcxt->nworkers_launched++;
+		}
 		else
 		{
 			/*
diff --git a/src/include/access/parallel.h b/src/include/access/parallel.h
index b029c1e..57635c8 100644
--- a/src/include/access/parallel.h
+++ b/src/include/access/parallel.h
@@ -35,6 +35,7 @@ typedef struct ParallelContext
 	dlist_node	node;
 	SubTransactionId subid;
 	int			nworkers;
+	int			nworkers_launched;
 	parallel_worker_main_type entrypoint;
 	char	   *library_name;
 	char	   *function_name;
-- 
2.3.8 (Apple Git-58)

#403Amit Kapila
amit.kapila16@gmail.com
In reply to: Haribabu Kommi (#400)
Re: Parallel Seq Scan

On Fri, Oct 16, 2015 at 8:40 AM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:

On Thu, Oct 15, 2015 at 11:45 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Oct 15, 2015 at 5:39 PM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:

I am aware of that and purposefully kept it for a subsequent patch.
There are other things as well which I have left out from this patch
and those are:
a. Early stop of executor for Rescan purpose
b. Support of pushdown for plans containing InitPlan and SubPlans

Then there is more related work like
a. Support for prepared statements

OK.

During the test with the latest patch, I found a deadlock between worker
and backend on the relation lock. To minimize the test scenario, I changed
the number of pages required to start one worker to 1 and set all parallel
cost parameters to zero.

Backend is waiting for the tuples from workers, workers are waiting on
lock of relation.

The main reason is that we need some heavyweight lock handling so that
workers don't wait for locks already acquired by the master backend, either
through group locking or something else; that work is already in progress.

You can refer to 0004-Partial-group-locking-implementation.patch in mail [1],
sent by Robert to fix a few parallelism-related issues. For this kind of
issue, that patch will be sufficient, but we still need more work in the
deadlock area, so you might still see a few issues there.
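
To make the idea concrete, here is a toy model of a lock-conflict check with and without group locking. It is only a sketch under stated assumptions, not the logic of the referenced patch: ToyHolder, toy_modes_conflict and toy_must_wait are hypothetical names, and only two lock modes are modelled. The point it illustrates is that a worker asking for a lock its own leader already holds in a conflicting mode has to wait without group locking (while the leader waits for the worker's tuples), but not with it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum
{
	TOY_SHARE,
	TOY_EXCLUSIVE
} ToyLockMode;

/* Hypothetical record of one process holding a lock on a relation. */
typedef struct ToyHolder
{
	int			pid;
	int			leader_pid;		/* lock group leader; equals pid when alone */
	ToyLockMode mode;
} ToyHolder;

/* Two modes conflict in this toy model if either one is exclusive. */
static bool
toy_modes_conflict(ToyLockMode a, ToyLockMode b)
{
	return a == TOY_EXCLUSIVE || b == TOY_EXCLUSIVE;
}

/*
 * Would a requester belonging to the group led by leader_pid have to wait?
 * With group_locking enabled, conflicting locks held by members of the
 * requester's own group are ignored.
 */
static bool
toy_must_wait(const ToyHolder *holders, size_t nholders,
			  int leader_pid, ToyLockMode mode, bool group_locking)
{
	size_t		i;

	assert(holders != NULL || nholders == 0);

	for (i = 0; i < nholders; i++)
	{
		if (!toy_modes_conflict(holders[i].mode, mode))
			continue;
		if (group_locking && holders[i].leader_pid == leader_pid)
			continue;			/* same group: treated as compatible */
		return true;
	}
	return false;
}
```

In this model, a worker whose leader (PID 100, say) holds an exclusive lock must wait for a share lock when group_locking is false and is let through when it is true; a process from an unrelated group still waits either way.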

Thanks for your continued effort on this patch, both in reviewing and in
testing it.

[1]: /messages/by-id/CA+TgmoapgKdy_Z0W9mHqZcGSo2t_t-4_V36DXaKim+X_fYp0oQ@mail.gmail.com

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#404Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#397)
Re: Parallel Seq Scan

On Thu, Oct 15, 2015 at 6:52 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Oct 15, 2015 at 7:00 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

From what I understood by looking at code in this area, I think the check
params != estate->paramLI and code under it is required for parameters
that are set up by setup_unshared_param_list(). Now unshared params
are only created for Cursors and expressions that are passing a R/W
object pointer; for cursors we explicitly prohibit the parallel plan
generation and I am not sure if it makes sense to generate parallel plans
for expressions involving R/W object pointer. If we don't generate parallel
plans where expressions involve such parameters, then SerializeParamList()
should not be affected by the check mentioned by you. Is this, by any
chance, happening because you are testing by forcing a gather node on top
of all kinds of plans?

Yeah, but I think the scenario is legitimate. When a query gets run
from within PL/pgsql, parallelism is an option, at least as we have
the code today. So if a Gather were present, and the query used a
parameter, then you could have this issue. For example:

SELECT * FROM bigtable WHERE unindexed_column = some_plpgsql_variable;

I don't think the control flow will set up an unshared param list for such
statements. I have tried a couple of such statements [1] and found that
such parameters are always set up by setup_param_list(). I think there
are only two possibilities which could lead to setting up of unshared
params:

1. Usage of cursors - This is already prohibited for parallel-mode.
2. Usage of read-write-param - This only happens for expressions like
x := array_append(x, foo) (Refer exec_check_rw_parameter()). Read-write
params are not used for SQL statements. So this also won't be used for
parallel-mode

There is a chance that I might be missing some case where unshared
params will be required for parallel-mode (as of today), but if not then
I think we can live without current changes.

[1]:
1.
create or replace function parallel_func_params() returns integer
as $$
declare
param_val int;
ret_val int;
begin
param_val := 1000;
Select c1 into ret_val from t1 where c1 = param_val;
return ret_val;
end;
$$ language plpgsql;

For such a query, it will go in setup_param_list()

2.
create or replace function parallel_func_params_1() returns integer
as $$
declare
param_val int;
ret_val int;
begin
param_val := 1000;
Execute 'Select count(c1) from t1 where c1 = $1' Into ret_val Using
param_val;
return ret_val;
end;
$$ language plpgsql;

3.
create or replace function parallel_func_params_2() returns integer
as $$
declare
param_val int;
ret_val int;
row_var t1%ROWTYPE;
begin
param_val := 1000;
Select * into row_var from t1 where c1 = param_val;
return ret_val;
end;
$$ language plpgsql;

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#405Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#402)
Re: Parallel Seq Scan

On Fri, Oct 16, 2015 at 9:51 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Oct 5, 2015 at 8:20 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

[ new patch for heapam.c changes ]

I went over this in a fair amount of detail and reworked it somewhat.
The result is attached as parallel-heapscan-revised.patch. I think
the way I did this is a bit cleaner than what you had, although it's
basically the same thing. There are fewer changes to initscan, and we
don't need one function to initialize the starting block that must be
called in each worker and then another one to get the next block, and
generally the changes are a bit more localized. I also went over the
comments and, I think, improved them. I tweaked the logic for
reporting the starting scan position as the last position report; I
think the way you had it the last report would be for one block
earlier. I'm pretty happy with this version and hope to commit it
soon.

static BlockNumber
+heap_parallelscan_nextpage(HeapScanDesc scan)
+{
+ BlockNumber page = InvalidBlockNumber;
+ BlockNumber sync_startpage = InvalidBlockNumber;
+ BlockNumber report_page = InvalidBlockNumber;
..
..
+ * Report scan location.  Normally, we report the current page number.
+ * When we reach the end of the scan, though, we report the starting page,
+ * not the ending page, just so the starting positions for later scans
+ * doesn't slew backwards.  We only report the position at the end of the
+ * scan once, though: subsequent callers will have report nothing, since
+ * they will have page == InvalidBlockNumber.
+ */
+ if (scan->rs_syncscan)
+ {
+ if (report_page == InvalidBlockNumber)
+ report_page = page;
+ if (report_page != InvalidBlockNumber)
+ ss_report_location(scan->rs_rd, report_page);
+ }
..
}

I think due to above changes it will report sync location on each page
scan, don't we want to report it once at end of scan?

Other than that, patch looks okay to me.
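
As a rough illustration of the reporting logic quoted above, here is a single-process sketch in C. It is a simplified model and not the patch itself: ToyParallelScan, toy_scan_init, toy_next_page and TOY_INVALID_BLOCK are hypothetical names, there is no spinlock or shared memory, the wraparound step is an assumption about the part of the function elided with "..", and the reported location is returned through an out parameter instead of being passed to ss_report_location().

```c
#include <assert.h>
#include <stdint.h>

#define TOY_INVALID_BLOCK UINT32_MAX	/* stand-in for InvalidBlockNumber */

/* Hypothetical, simplified stand-in for the shared parallel scan state. */
typedef struct ToyParallelScan
{
	uint32_t	nblocks;		/* number of blocks in the relation */
	uint32_t	startblock;		/* block at which the scan started */
	uint32_t	cblock;			/* next block to hand out, or invalid */
} ToyParallelScan;

/* Assumes startblock < nblocks whenever nblocks > 0. */
static void
toy_scan_init(ToyParallelScan *scan, uint32_t nblocks, uint32_t startblock)
{
	scan->nblocks = nblocks;
	scan->startblock = startblock;
	scan->cblock = (nblocks > 0) ? startblock : TOY_INVALID_BLOCK;
}

/*
 * Hand out the next block and compute what would be reported as the
 * sync-scan location.  *reported is TOY_INVALID_BLOCK when nothing would
 * be reported.
 */
static uint32_t
toy_next_page(ToyParallelScan *scan, uint32_t *reported)
{
	uint32_t	page = scan->cblock;
	uint32_t	report_page = TOY_INVALID_BLOCK;

	if (page != TOY_INVALID_BLOCK)
	{
		/* Advance with wraparound; stop once we are back at the start. */
		scan->cblock++;
		if (scan->cblock >= scan->nblocks)
			scan->cblock = 0;
		if (scan->cblock == scan->startblock)
		{
			scan->cblock = TOY_INVALID_BLOCK;
			report_page = scan->startblock;	/* end of scan: report start */
		}
	}

	/* Normally report the current page; callers past the end report nothing. */
	if (report_page == TOY_INVALID_BLOCK)
		report_page = page;
	*reported = report_page;

	return page;
}
```

With four blocks and a start at block 2, this model hands out blocks 2, 3, 0, 1 and reports 2, 3, 0, 2: a location is reported for every page, and the final report is the starting block rather than block 1.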

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#406Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#405)
Re: Parallel Seq Scan

On Fri, Oct 16, 2015 at 7:42 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think due to above changes it will report sync location on each page
scan, don't we want to report it once at end of scan?

I think reporting for each page is correct. Isn't that what the
non-parallel case does?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#407Robert Haas
robertmhaas@gmail.com
In reply to: Haribabu Kommi (#401)
Re: Parallel Seq Scan

On Thu, Oct 15, 2015 at 11:38 PM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:

Some more tests that failed in similar configuration settings.
1. Table that is created under a begin statement is not visible in the worker.
2. permission problem in worker side for set role command.

The second problem, too, I have already posted a bug fix for, on a
thread which also contains a whole bunch of other bug fixes. I'll get
those committed today so you can avoid wasting time finding bugs I've
already found and fixed. Thanks for testing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#408Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#406)
Re: Parallel Seq Scan

On Fri, Oct 16, 2015 at 5:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Oct 16, 2015 at 7:42 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think due to above changes it will report sync location on each page
scan, don't we want to report it once at end of scan?

I think reporting for each page is correct. Isn't that what the
non-parallel case does?

Yes, sorry I got confused.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#409Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#404)
Re: Parallel Seq Scan

On Fri, Oct 16, 2015 at 2:29 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Yeah, but I think the scenario is legitimate. When a query gets run
from within PL/pgsql, parallelism is an option, at least as we have
the code today. So if a Gather were present, and the query used a
parameter, then you could have this issue. For example:

SELECT * FROM bigtable WHERE unindexed_column = some_plpgsql_variable;

I don't think for such statements the control flow will set up an unshared
param list. I have tried couple of such statements [1] and found that
always such parameters are set up by setup_param_list(). I think there
are only two possibilities which could lead to setting up of unshared
params:

1. Usage of cursors - This is already prohibited for parallel-mode.
2. Usage of read-write-param - This only happens for expressions like
x := array_append(x, foo) (Refer exec_check_rw_parameter()). Read-write
params are not used for SQL statements. So this also won't be used for
parallel-mode

There is a chance that I might be missing some case where unshared
params will be required for parallel-mode (as of today), but if not then
I think we can live without current changes.

*shrug*

The gather-test stuff isn't failing for no reason. Either PL/pgsql
shouldn't be passing CURSOR_OPT_PARALLEL_OK, or having a parallel plan
get generated there should work. There's not a third option.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#410Noah Misch
noah@leadboat.com
In reply to: Amit Kapila (#394)
Re: Parallel Seq Scan

On Thu, Oct 15, 2015 at 04:30:01PM +0530, Amit Kapila wrote:

On Mon, Oct 12, 2015 at 9:16 PM, Robert Haas <robertmhaas@gmail.com> wrote:

plpgsql_param_fetch() assumes that it can detect whether it's being
called from copyParamList() by checking whether params !=
estate->paramLI. I don't know why this works, but I do know that this
test fails to detect the case where it's being called from
SerializeParamList(), which causes failures in exec_eval_datum() as
predicted. Calls from SerializeParamList() need the same treatment as
calls from copyParamList() because it, too, will try to evaluate every
parameter in the list.

From what I understood by looking at code in this area, I think the check
params != estate->paramLI and code under it is required for parameters
that are setup by setup_unshared_param_list(). Now unshared params
are only created for Cursors and expressions that are passing a R/W
object pointer; for cursors we explicitly prohibit the parallel
plan generation
and I am not sure if it makes sense to generate parallel plans for
expressions
involving R/W object pointer, if we don't generate parallel plan where
expressions involve such parameters, then SerializeParamList() should not
be affected by the check mentioned by you.

The trouble comes from the opposite direction. A setup_unshared_param_list()
list is fine under today's code, but a shared param list needs more help. To
say it another way, parallel queries that use the shared estate->paramLI need,
among other help, the logic now guarded by "params != estate->paramLI".

#411Amit Kapila
amit.kapila16@gmail.com
In reply to: Noah Misch (#410)
Re: Parallel Seq Scan

On Sat, Oct 17, 2015 at 6:15 AM, Noah Misch <noah@leadboat.com> wrote:

On Thu, Oct 15, 2015 at 04:30:01PM +0530, Amit Kapila wrote:

On Mon, Oct 12, 2015 at 9:16 PM, Robert Haas <robertmhaas@gmail.com> wrote:

plpgsql_param_fetch() assumes that it can detect whether it's being
called from copyParamList() by checking whether params !=
estate->paramLI. I don't know why this works, but I do know that this
test fails to detect the case where it's being called from
SerializeParamList(), which causes failures in exec_eval_datum() as
predicted. Calls from SerializeParamList() need the same treatment as
calls from copyParamList() because it, too, will try to evaluate every
parameter in the list.

From what I understood by looking at code in this area, I think the check
params != estate->paramLI and code under it is required for parameters
that are setup by setup_unshared_param_list(). Now unshared params
are only created for Cursors and expressions that are passing a R/W
object pointer; for cursors we explicitly prohibit the parallel
plan generation and I am not sure if it makes sense to generate parallel
plans for expressions involving R/W object pointer, if we don't generate
parallel plan where expressions involve such parameters, then
SerializeParamList() should not be affected by the check mentioned by you.

The trouble comes from the opposite direction. A setup_unshared_param_list()
list is fine under today's code, but a shared param list needs more help. To
say it another way, parallel queries that use the shared estate->paramLI need,
among other help, the logic now guarded by "params != estate->paramLI".

Why would a parallel query need such logic? That logic is needed mainly
for cursors with params, and we don't want to parallelize such cases.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#412Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#409)
Re: Parallel Seq Scan

On Sat, Oct 17, 2015 at 2:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Oct 16, 2015 at 2:29 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Yeah, but I think the scenario is legitimate. When a query gets run
from within PL/pgsql, parallelism is an option, at least as we have
the code today. So if a Gather were present, and the query used a
parameter, then you could have this issue. For example:

SELECT * FROM bigtable WHERE unindexed_column = some_plpgsql_variable;

I don't think for such statements the control flow will set up an unshared
param list. I have tried couple of such statements [1] and found that
always such parameters are set up by setup_param_list(). I think there
are only two possibilities which could lead to setting up of unshared
params:

1. Usage of cursors - This is already prohibited for parallel-mode.
2. Usage of read-write-param - This only happens for expressions like
x := array_append(x, foo) (Refer exec_check_rw_parameter()). Read-write
params are not used for SQL statements. So this also won't be used for
parallel-mode

There is a chance that I might be missing some case where unshared
params will be required for parallel-mode (as of today), but if not then
I think we can live without current changes.

*shrug*

The gather-test stuff isn't failing for no reason. Either PL/pgsql
shouldn't be passing CURSOR_OPT_PARALLEL_OK, or having a parallel plan
get generated there should work. There's not a third option.

Agreed and on looking at code, I think in below code, if we pass
parallelOK as true for the cases where Portal is non-null, such a
problem could happen.

static int
exec_run_select(PLpgSQL_execstate *estate,
				PLpgSQL_expr *expr, long maxtuples, Portal *portalP,
				bool parallelOK)
{
	ParamListInfo paramLI;
	int			rc;

	/*
	 * On the first call for this expression generate the plan
	 */
	if (expr->plan == NULL)
		exec_prepare_plan(estate, expr, parallelOK ?
						  CURSOR_OPT_PARALLEL_OK : 0);

	/*
	 * If a portal was requested, put the query into the portal
	 */
	if (portalP != NULL)
	{
		/*
		 * Set up short-lived ParamListInfo
		 */
		paramLI = setup_unshared_param_list(estate, expr);

		*portalP = SPI_cursor_open_with_paramlist(NULL, expr->plan,
												  paramLI,
												  estate->readonly_func);

and one such case is

exec_stmt_return_query()
{
..
	if (stmt->query != NULL)
	{
		/* static query */
		exec_run_select(estate, stmt->query, 0, &portal, true);
..
}

In this function we are using a controlled fetch mechanism (count as 50) to
fetch the tuples, which we initially thought of not supporting for parallelism,
as such a mechanism is not built for parallel workers, and that is the
reason we want to prohibit cases wherever a cursor is getting created.

Do we want to support parallelism for this case on the basis that this API
will eventually fetch all the tuples by using controlled fetch mechanism?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#413Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#412)
Re: Parallel Seq Scan

On Sat, Oct 17, 2015 at 11:45 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Agreed and on looking at code, I think in below code, if we pass
parallelOK as true for the cases where Portal is non-null, such a
problem could happen.

static int
exec_run_select(PLpgSQL_execstate *estate,
				PLpgSQL_expr *expr, long maxtuples, Portal *portalP,
				bool parallelOK)
{
	ParamListInfo paramLI;
	int			rc;

	/*
	 * On the first call for this expression generate the plan
	 */
	if (expr->plan == NULL)
		exec_prepare_plan(estate, expr, parallelOK ?
						  CURSOR_OPT_PARALLEL_OK : 0);

	/*
	 * If a portal was requested, put the query into the portal
	 */
	if (portalP != NULL)
	{
		/*
		 * Set up short-lived ParamListInfo
		 */
		paramLI = setup_unshared_param_list(estate, expr);

		*portalP = SPI_cursor_open_with_paramlist(NULL, expr->plan,
												  paramLI,
												  estate->readonly_func);

and one such case is

exec_stmt_return_query()
{
..
	if (stmt->query != NULL)
	{
		/* static query */
		exec_run_select(estate, stmt->query, 0, &portal, true);
..
}

In this function we are using a controlled fetch mechanism (count as 50) to
fetch the tuples, which we initially thought of not supporting for parallelism,
as such a mechanism is not built for parallel workers, and that is the
reason we want to prohibit cases wherever a cursor is getting created.

Here, I wanted to say that this is one of the reasons for prohibiting cursors
for parallelism; certainly there are others, such as backward scans.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#414Noah Misch
noah@leadboat.com
In reply to: Amit Kapila (#411)
Re: Parallel Seq Scan

On Sat, Oct 17, 2015 at 11:00:57AM +0530, Amit Kapila wrote:

On Sat, Oct 17, 2015 at 6:15 AM, Noah Misch <noah@leadboat.com> wrote:

On Thu, Oct 15, 2015 at 04:30:01PM +0530, Amit Kapila wrote:

On Mon, Oct 12, 2015 at 9:16 PM, Robert Haas <robertmhaas@gmail.com> wrote:

plpgsql_param_fetch() assumes that it can detect whether it's being
called from copyParamList() by checking whether params !=
estate->paramLI. I don't know why this works, but I do know that this
test fails to detect the case where it's being called from
SerializeParamList(), which causes failures in exec_eval_datum() as
predicted. Calls from SerializeParamList() need the same treatment as
calls from copyParamList() because it, too, will try to evaluate every
parameter in the list.

From what I understood by looking at code in this area, I think the check
params != estate->paramLI and code under it is required for parameters
that are setup by setup_unshared_param_list(). Now unshared params
are only created for Cursors and expressions that are passing a R/W
object pointer; for cursors we explicitly prohibit the parallel
plan generation and I am not sure if it makes sense to generate parallel
plans for expressions involving R/W object pointer, if we don't generate
parallel plan where expressions involve such parameters, then
SerializeParamList() should not be affected by the check mentioned by you.

The trouble comes from the opposite direction. A setup_unshared_param_list()
list is fine under today's code, but a shared param list needs more help. To
say it another way, parallel queries that use the shared estate->paramLI need,
among other help, the logic now guarded by "params != estate->paramLI".

Why would a parallel query need such logic? That logic is needed mainly
for cursors with params, and we don't want to parallelize such cases.

This is not about mixing cursors with parallelism. Cursors get special
treatment because each cursor copies its param list. Parallel query also
copies (more precisely, serializes) its param list. You need certain logic
for every param list subject to being copied. If PostgreSQL had no concept of
cursors, we'd be writing that same logic from scratch for parallel query.

#415Amit Kapila
amit.kapila16@gmail.com
In reply to: Noah Misch (#414)
Re: Parallel Seq Scan

On Sat, Oct 17, 2015 at 12:06 PM, Noah Misch <noah@leadboat.com> wrote:

On Sat, Oct 17, 2015 at 11:00:57AM +0530, Amit Kapila wrote:

On Sat, Oct 17, 2015 at 6:15 AM, Noah Misch <noah@leadboat.com> wrote:

On Thu, Oct 15, 2015 at 04:30:01PM +0530, Amit Kapila wrote:

On Mon, Oct 12, 2015 at 9:16 PM, Robert Haas <robertmhaas@gmail.com> wrote:

plpgsql_param_fetch() assumes that it can detect whether it's being
called from copyParamList() by checking whether params !=
estate->paramLI. I don't know why this works, but I do know that this
test fails to detect the case where it's being called from
SerializeParamList(), which causes failures in exec_eval_datum() as
predicted. Calls from SerializeParamList() need the same treatment as
calls from copyParamList() because it, too, will try to evaluate every
parameter in the list.

From what I understood by looking at code in this area, I think the check
params != estate->paramLI and code under it is required for parameters
that are setup by setup_unshared_param_list(). Now unshared params
are only created for Cursors and expressions that are passing a R/W
object pointer; for cursors we explicitly prohibit the parallel
plan generation and I am not sure if it makes sense to generate parallel
plans for expressions involving R/W object pointer, if we don't generate
parallel plan where expressions involve such parameters, then
SerializeParamList() should not be affected by the check mentioned by you.

The trouble comes from the opposite direction. A setup_unshared_param_list()
list is fine under today's code, but a shared param list needs more help. To
say it another way, parallel queries that use the shared estate->paramLI need,
among other help, the logic now guarded by "params != estate->paramLI".

Why would a parallel query need such logic? That logic is needed mainly
for cursors with params, and we don't want to parallelize such cases.

This is not about mixing cursors with parallelism. Cursors get special
treatment because each cursor copies its param list. Parallel query also
copies (more precisely, serializes) its param list. You need certain logic
for every param list subject to being copied.

I am not denying that fact; the point I wanted to convey here is that
the logic guarded by "params != estate->paramLI" in plpgsql_param_fetch
is only needed if cursors are in use; otherwise we won't need it even
for parallel query.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#416Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#415)
Re: Parallel Seq Scan

On Sat, Oct 17, 2015 at 2:44 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I am not denying that fact; the point I wanted to convey here is that
the logic guarded by "params != estate->paramLI" in plpgsql_param_fetch
is only needed if cursors are in use; otherwise we won't need it even
for parallel query.

Well, I think what Noah and I are trying to explain is that this is not
true. The problem is that, even if there are no cursors anywhere in
the picture, there might be some variable in the param list that is
not used by the parallel query but which, if evaluated, leads to an
error.
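
A toy model in C may make this failure mode concrete. It is only a sketch under stated assumptions: ToyParam and toy_copy_params are hypothetical names, not PostgreSQL's ParamListInfo machinery, and the "skip unreferenced parameters" behaviour is an assumed reading of what the logic guarded by "params != estate->paramLI" does. The model shows a copy or serialization pass that touches every slot failing on a variable the query never uses, while a pass limited to referenced variables succeeds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy parameter; this is not PostgreSQL's ParamExternData. */
typedef struct ToyParam
{
	bool		referenced;		/* does the query mention this variable? */
	bool		evaluable;		/* false models a variable whose evaluation
								 * raises an error */
	int			value;
} ToyParam;

/*
 * Copy parameters into out[].  Returns the number of values copied, or -1
 * if evaluating some parameter "raised an error".  When only_referenced is
 * false, every slot is evaluated, as a naive copy or serialization would.
 */
static int
toy_copy_params(const ToyParam *params, size_t nparams,
				bool only_referenced, int *out)
{
	int			ncopied = 0;

	for (size_t i = 0; i < nparams; i++)
	{
		if (only_referenced && !params[i].referenced)
			continue;			/* skip variables the query never uses */
		if (!params[i].evaluable)
			return -1;			/* models the failure during evaluation */
		out[ncopied++] = params[i].value;
	}
	return ncopied;
}
```

For a list holding one referenced, evaluable variable and one unreferenced variable that cannot be evaluated, the evaluate-everything pass returns -1 even though the query itself would never have touched the second variable.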

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#417Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#412)
Re: Parallel Seq Scan

On Sat, Oct 17, 2015 at 2:15 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Agreed and on looking at code, I think in below code, if we pass
parallelOK as true for the cases where Portal is non-null, such a
problem could happen.

and one such case is

exec_stmt_return_query()
{
..
	if (stmt->query != NULL)
	{
		/* static query */
		exec_run_select(estate, stmt->query, 0, &portal, true);
..
}

In this function we are using a controlled fetch mechanism (count as 50) to
fetch the tuples, which we initially thought of not supporting for parallelism,
as such a mechanism is not built for parallel workers, and that is the
reason we want to prohibit cases wherever a cursor is getting created.

Do we want to support parallelism for this case on the basis that this API
will eventually fetch all the tuples by using controlled fetch mechanism?

That was my idea when I made that change, but I think it's not going
to work out well given the way the rest of the code works. Possibly
that should be reverted for now, but maybe only after testing it.

It's worth noting that, as of commit
bfc78d7196eb28cd4e3d6c24f7e607bacecf1129, the consequences of invoking
the executor with a fetch count have been greatly reduced.
Previously, the assumption was that doing that was broken, and if you
did it you got to keep both pieces. But that commit rejiggered things
so that your parallel plan just gets run serially in that case. That
might not be great from a performance perspective, but it beats
undefined behavior by a wide margin. So I suspect that there are some
decisions about where to pass CURSOR_OPT_PARALLEL_OK that need to be
revisited in the light of that change. I haven't had time to do that
yet, but we should do it as soon as we get time.
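
The behaviour described here can be summarized in a few lines of C. This is a sketch only: toy_use_parallel_mode is a hypothetical function, and the exact condition used by the commit is not shown in this message, so the rule below (any nonzero fetch count forces serial execution) is an assumption drawn from the description above.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: decide whether a plan runs in parallel mode.
 * fetch_count == 0 models "run to completion"; any other value models a
 * caller that may stop early, such as a fetch of 50 rows at a time.
 */
static bool
toy_use_parallel_mode(bool plan_is_parallel, long fetch_count)
{
	/* A caller that may stop early gets the plan executed serially. */
	if (fetch_count != 0)
		return false;
	return plan_is_parallel;
}
```

Under this rule a parallel plan fetched 50 rows at a time still returns correct results; it simply runs without workers.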

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#418Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#416)
Re: Parallel Seq Scan

On Sat, Oct 17, 2015 at 4:54 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Sat, Oct 17, 2015 at 2:44 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I am not denying that fact; the point I wanted to convey here is that
the logic guarded by "params != estate->paramLI" in plpgsql_param_fetch
is only needed if cursors are in use; otherwise we won't need it even
for parallel query.

Well, I think what Noah and I are trying to explain is that this is not
true. The problem is that, even if there are no cursors anywhere in
the picture, there might be some variable in the param list that is
not used by the parallel query but which, if evaluated, leads to an
error.

I understand what Noah wants to say; it's just that I am not able to see
how that is possible by looking at the current code. The code is not
straightforward in this area, so there is a good chance that I might be
missing something.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#419Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#402)
1 attachment(s)
Re: Parallel Seq Scan

On Fri, Oct 16, 2015 at 9:51 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Oct 5, 2015 at 8:20 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

[ new patch for heapam.c changes ]

There's a second patch attached here as well, parallel-relaunch.patch,
which makes it possible to relaunch workers with the same parallel
context. Currently, after you WaitForParallelWorkersToFinish(), you
must proceed without fail to DestroyParallelContext(). With this
rather simple patch, you have the option to instead go back and again
LaunchParallelWorkers(), which is nice because it avoids the overhead
of setting up a new DSM and filling it with all of your transaction
state a second time. I'd like to commit this as well, and I think we
should revise execParallel.c to use it.

I have rebased the partial seq scan patch on top of the above committed
patches. For rescans it now reuses the DSM; to achieve that, we need to
ensure that the workers have completely shut down and then reinitialize
the DSM. The existing function WaitForParallelWorkersToFinish is not
sufficient to ensure complete shutdown, as it only waits to receive the
last message from each worker backend, so I have written a new function,
WaitForParallelWorkersToDie. Also, on receiving an 'X' message,
HandleParallelMessage just frees the worker handle without checking
whether the worker has died, which later makes it difficult to even find
out whether the worker has died or not, so I have removed that code from
HandleParallelMessage. Another change is that after receiving the last
tuple, the Gather node just shuts down the workers without destroying
the DSM.
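
The distinction between "finished" and "dead" can be sketched with a small model in C. Everything here is hypothetical (ToyWorker, toy_all_finished, toy_all_dead, toy_reinitialize); it does not use the real ParallelContext or background worker APIs. One deliberate simplification: the patch's ReInitializeParallelDSM waits for the workers to die, whereas this single-process model cannot block, so it refuses and returns false instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-worker state; not PostgreSQL's ParallelWorkerInfo. */
typedef struct ToyWorker
{
	bool		launched;		/* was this worker started? */
	bool		sent_last_msg;	/* leader has received its final message */
	bool		exited;			/* worker process is actually gone */
} ToyWorker;

/* True once every launched worker has delivered its last message. */
static bool
toy_all_finished(const ToyWorker *w, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (w[i].launched && !w[i].sent_last_msg)
			return false;
	return true;
}

/* True once every launched worker process has exited. */
static bool
toy_all_dead(const ToyWorker *w, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (w[i].launched && !w[i].exited)
			return false;
	return true;
}

/*
 * Reset the worker slots so they can be launched again.  Refuses (returns
 * false) while any worker is still alive, since reusing shared state under
 * a live worker would be unsafe.
 */
static bool
toy_reinitialize(ToyWorker *w, size_t n)
{
	if (!toy_all_dead(w, n))
		return false;
	for (size_t i = 0; i < n; i++)
		w[i] = (ToyWorker) {false, false, false};
	return true;
}
```

A worker that has sent its last message but has not yet exited satisfies toy_all_finished but not toy_all_dead, which is the window that motivates the new wait function.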

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

parallel_seqscan_partialseqscan_v22.patch (application/octet-stream)
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 0085987..69d7c27 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -110,6 +110,7 @@ static void HandleParallelMessage(ParallelContext *, int, StringInfo msg);
 static void ParallelErrorContext(void *arg);
 static void ParallelExtensionTrampoline(dsm_segment *seg, shm_toc *toc);
 static void ParallelWorkerMain(Datum main_arg);
+static void WaitForParallelWorkersToDie(ParallelContext *pcxt);
 
 /*
  * Establish a new parallel context.  This should be done after entering
@@ -384,36 +385,20 @@ InitializeParallelDSM(ParallelContext *pcxt)
 }
 
 /*
- * Launch parallel workers.
+ * Reinitialize the dynamic shared memory segment for a parallel context such
+ * that it could be reused by launching the workers again.
  */
 void
-LaunchParallelWorkers(ParallelContext *pcxt)
+ReInitializeParallelDSM(ParallelContext *pcxt)
 {
-	MemoryContext oldcontext;
-	BackgroundWorker worker;
-	int			i;
-	bool		any_registrations_failed = false;
+	WaitForParallelWorkersToDie(pcxt);
 
-	/* Skip this if we have no workers. */
-	if (pcxt->nworkers == 0)
-		return;
-
-	/* If we do have workers, we'd better have a DSM segment. */
-	Assert(pcxt->seg != NULL);
-
-	/* We might be running in a short-lived memory context. */
-	oldcontext = MemoryContextSwitchTo(TopTransactionContext);
-
-	/*
-	 * This function can be called for a parallel context for which it has
-	 * already been called previously, but only if all of the old workers
-	 * have already exited.  When this case arises, we need to do some extra
-	 * reinitialization.
-	 */
+	/* Reinitialize parallel context */
 	if (pcxt->nworkers_launched > 0)
 	{
 		FixedParallelState *fps;
 		char	   *error_queue_space;
+		int			i;
 
 		/* Clean out old worker handles. */
 		for (i = 0; i < pcxt->nworkers; ++i)
@@ -449,6 +434,28 @@ LaunchParallelWorkers(ParallelContext *pcxt)
 		/* Reset number of workers launched. */
 		pcxt->nworkers_launched = 0;
 	}
+}
+
+/*
+ * Launch parallel workers.
+ */
+void
+LaunchParallelWorkers(ParallelContext *pcxt)
+{
+	MemoryContext oldcontext;
+	BackgroundWorker worker;
+	int			i;
+	bool		any_registrations_failed = false;
+
+	/* Skip this if we have no workers. */
+	if (pcxt->nworkers == 0)
+		return;
+
+	/* If we do have workers, we'd better have a DSM segment. */
+	Assert(pcxt->seg != NULL);
+
+	/* We might be running in a short-lived memory context. */
+	oldcontext = MemoryContextSwitchTo(TopTransactionContext);
 
 	/* Configure a worker. */
 	snprintf(worker.bgw_name, BGW_MAXLEN, "parallel worker for PID %d",
@@ -553,6 +560,53 @@ WaitForParallelWorkersToFinish(ParallelContext *pcxt)
 }
 
 /*
+ * Wait for all workers to die.
+ *
+ * This function ensures that workers have been completely shutdown.  The
+ * difference between WaitForParallelWorkersToFinish and this function is
+ * that former just ensures that last message sent by worker backend is
+ * received by master backend whereas this ensures the complete shutdown.
+ */
+static void
+WaitForParallelWorkersToDie(ParallelContext *pcxt)
+{
+	int			i;
+
+	/* Wait until the workers actually die. */
+	for (i = 0; i < pcxt->nworkers; ++i)
+	{
+		BgwHandleStatus status;
+
+		if (pcxt->worker == NULL || pcxt->worker[i].bgwhandle == NULL)
+			continue;
+
+		/*
+		 * We can't finish transaction commit or abort until all of the
+		 * workers are dead.  This means, in particular, that we can't respond
+		 * to interrupts at this stage.
+		 */
+		HOLD_INTERRUPTS();
+		status = WaitForBackgroundWorkerShutdown(pcxt->worker[i].bgwhandle);
+		RESUME_INTERRUPTS();
+
+		/*
+		 * If the postmaster kicked the bucket, we have no chance of cleaning
+		 * up safely -- we won't be able to tell when our workers are actually
+		 * dead.  This doesn't necessitate a PANIC since they will all abort
+		 * eventually, but we can't safely continue this session.
+		 */
+		if (status == BGWH_POSTMASTER_DIED)
+			ereport(FATAL,
+					(errcode(ERRCODE_ADMIN_SHUTDOWN),
+				 errmsg("postmaster exited during a parallel transaction")));
+
+		/* Release memory. */
+		pfree(pcxt->worker[i].bgwhandle);
+		pcxt->worker[i].bgwhandle = NULL;
+	}
+}
+
+/*
  * Destroy a parallel context.
  *
  * If expecting a clean exit, you should use WaitForParallelWorkersToFinish()
@@ -609,38 +663,7 @@ DestroyParallelContext(ParallelContext *pcxt)
 		pcxt->private_memory = NULL;
 	}
 
-	/* Wait until the workers actually die. */
-	for (i = 0; i < pcxt->nworkers; ++i)
-	{
-		BgwHandleStatus status;
-
-		if (pcxt->worker == NULL || pcxt->worker[i].bgwhandle == NULL)
-			continue;
-
-		/*
-		 * We can't finish transaction commit or abort until all of the
-		 * workers are dead.  This means, in particular, that we can't respond
-		 * to interrupts at this stage.
-		 */
-		HOLD_INTERRUPTS();
-		status = WaitForBackgroundWorkerShutdown(pcxt->worker[i].bgwhandle);
-		RESUME_INTERRUPTS();
-
-		/*
-		 * If the postmaster kicked the bucket, we have no chance of cleaning
-		 * up safely -- we won't be able to tell when our workers are actually
-		 * dead.  This doesn't necessitate a PANIC since they will all abort
-		 * eventually, but we can't safely continue this session.
-		 */
-		if (status == BGWH_POSTMASTER_DIED)
-			ereport(FATAL,
-					(errcode(ERRCODE_ADMIN_SHUTDOWN),
-				 errmsg("postmaster exited during a parallel transaction")));
-
-		/* Release memory. */
-		pfree(pcxt->worker[i].bgwhandle);
-		pcxt->worker[i].bgwhandle = NULL;
-	}
+	WaitForParallelWorkersToDie(pcxt);
 
 	/* Free the worker array itself. */
 	if (pcxt->worker != NULL)
@@ -799,9 +822,7 @@ HandleParallelMessage(ParallelContext *pcxt, int i, StringInfo msg)
 
 		case 'X':				/* Terminate, indicating clean exit */
 			{
-				pfree(pcxt->worker[i].bgwhandle);
 				pfree(pcxt->worker[i].error_mqh);
-				pcxt->worker[i].bgwhandle = NULL;
 				pcxt->worker[i].error_mqh = NULL;
 				break;
 			}
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 7fb8a14..d03fbde 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -729,6 +729,7 @@ ExplainPreScanNode(PlanState *planstate, Bitmapset **rels_used)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
 		case T_SampleScan:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
@@ -850,6 +851,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_SeqScan:
 			pname = sname = "Seq Scan";
 			break;
+		case T_PartialSeqScan:
+			pname = sname = "Partial Seq Scan";
+			break;
 		case T_SampleScan:
 			pname = sname = "Sample Scan";
 			break;
@@ -1005,6 +1009,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
 		case T_SampleScan:
 		case T_BitmapHeapScan:
 		case T_TidScan:
@@ -1270,6 +1275,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 							 planstate, ancestors, es);
 			/* FALL THRU to print additional fields the same as SeqScan */
 		case T_SeqScan:
+		case T_PartialSeqScan:
 		case T_ValuesScan:
 		case T_CteScan:
 		case T_WorkTableScan:
@@ -2353,6 +2359,7 @@ ExplainTargetRel(Plan *plan, Index rti, ExplainState *es)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
 		case T_SampleScan:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 51edd4c..38a92fe 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -21,8 +21,8 @@ OBJS = execAmi.o execCurrent.o execGrouping.o execIndexing.o execJunk.o \
        nodeHash.o nodeHashjoin.o nodeIndexscan.o nodeIndexonlyscan.o \
        nodeLimit.o nodeLockRows.o \
        nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
-       nodeNestloop.o nodeFunctionscan.o nodeRecursiveunion.o nodeResult.o \
-       nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
+       nodeNestloop.o nodeFunctionscan.o nodePartialSeqscan.o nodeRecursiveunion.o \
+       nodeResult.o nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
        nodeValuesscan.o nodeCtescan.o nodeWorktablescan.o \
        nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
        nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 163650c..b3d041c 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -38,6 +38,7 @@
 #include "executor/nodeMergejoin.h"
 #include "executor/nodeModifyTable.h"
 #include "executor/nodeNestloop.h"
+#include "executor/nodePartialSeqscan.h"
 #include "executor/nodeRecursiveunion.h"
 #include "executor/nodeResult.h"
 #include "executor/nodeSamplescan.h"
@@ -157,6 +158,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSeqScan((SeqScanState *) node);
 			break;
 
+		case T_PartialSeqScanState:
+			ExecReScanPartialSeqScan((PartialSeqScanState *) node);
+			break;
+
 		case T_SampleScanState:
 			ExecReScanSampleScan((SampleScanState *) node);
 			break;
@@ -468,6 +473,9 @@ ExecSupportsBackwardScan(Plan *node)
 		case T_CteScan:
 			return TargetListSupportsBackwardScan(node->targetlist);
 
+		case T_PartialSeqScan:
+			return false;
+
 		case T_SampleScan:
 			/* Simplify life for tablesample methods by disallowing this */
 			return false;
diff --git a/src/backend/executor/execCurrent.c b/src/backend/executor/execCurrent.c
index bcd287f..6e05598 100644
--- a/src/backend/executor/execCurrent.c
+++ b/src/backend/executor/execCurrent.c
@@ -261,6 +261,7 @@ search_plan_tree(PlanState *node, Oid table_oid)
 			 * Relation scan nodes can all be treated alike
 			 */
 		case T_SeqScanState:
+		case T_PartialSeqScanState:
 		case T_SampleScanState:
 		case T_IndexScanState:
 		case T_IndexOnlyScanState:
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 3bb8206..de83dbc 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -17,6 +17,7 @@
 
 #include "executor/execParallel.h"
 #include "executor/executor.h"
+#include "executor/nodePartialSeqscan.h"
 #include "executor/tqueue.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/planmain.h"
@@ -158,10 +159,16 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 	/* Count this node. */
 	e->nnodes++;
 
-	/*
-	 * XXX. Call estimators for parallel-aware nodes here, when we have
-	 * some.
-	 */
+	/* Call estimators for parallel-aware nodes. */
+	switch (nodeTag(planstate))
+	{
+		case T_PartialSeqScanState:
+			ExecPartialSeqScanEstimate((PartialSeqScanState *) planstate,
+									   e->pcxt);
+			break;
+		default:
+			break;
+	}
 
 	return planstate_tree_walker(planstate, ExecParallelEstimate, e);
 }
@@ -196,10 +203,16 @@ ExecParallelInitializeDSM(PlanState *planstate,
 	/* Count this node. */
 	d->nnodes++;
 
-	/*
-	 * XXX. Call initializers for parallel-aware plan nodes, when we have
-	 * some.
-	 */
+	/* Call initializers for parallel-aware plan nodes. */
+	switch (nodeTag(planstate))
+	{
+		case T_PartialSeqScanState:
+			ExecPartialSeqScanInitializeDSM((PartialSeqScanState *) planstate,
+											d->pcxt);
+			break;
+		default:
+			break;
+	}
 
 	return planstate_tree_walker(planstate, ExecParallelInitializeDSM, d);
 }
@@ -247,6 +260,43 @@ ExecParallelSetupTupleQueues(ParallelContext *pcxt)
 }
 
 /*
+ * Re-initialize the response queues that backend workers use to return
+ * tuples to the main backend, ahead of the workers being started again.
+ */
+shm_mq_handle **
+ExecParallelReInitializeTupleQueues(ParallelContext *pcxt)
+{
+	shm_mq_handle **responseq;
+	char	   *tqueuespace;
+	int			i;
+
+	/* Skip this if no workers. */
+	if (pcxt->nworkers == 0)
+		return NULL;
+
+	/* Allocate memory for shared memory queue handles. */
+	responseq = (shm_mq_handle **)
+		palloc(pcxt->nworkers * sizeof(shm_mq_handle *));
+
+	tqueuespace = shm_toc_lookup(pcxt->toc, PARALLEL_KEY_TUPLE_QUEUE);
+
+	/* Create the queues, and become the receiver for each. */
+	for (i = 0; i < pcxt->nworkers; ++i)
+	{
+		shm_mq	   *mq;
+
+		mq = shm_mq_create(tqueuespace + i * PARALLEL_TUPLE_QUEUE_SIZE,
+						   (Size) PARALLEL_TUPLE_QUEUE_SIZE);
+
+		shm_mq_set_receiver(mq, MyProc);
+		responseq[i] = shm_mq_attach(mq, pcxt->seg, NULL);
+	}
+
+	/* Return array of handles. */
+	return responseq;
+}
+
+/*
  * Sets up the required infrastructure for backend workers to perform
  * execution and return results to the main backend.
  */
@@ -548,6 +598,32 @@ ExecParallelReportInstrumentation(PlanState *planstate,
 }
 
 /*
+ * Initialize the PlanState and its descendants with the information
+ * retrieved from shared memory.  This has to be done once the PlanState
+ * is allocated and initialized by the executor for each node, i.e. after
+ * ExecutorStart().
+ */
+static bool
+ExecParallelInitializeWorker(PlanState *planstate, shm_toc *toc)
+{
+	if (planstate == NULL)
+		return false;
+
+	/* Call initializers for parallel-aware plan nodes. */
+	switch (nodeTag(planstate))
+	{
+		case T_PartialSeqScanState:
+			ExecPartialSeqScanInitParallelScanDesc((PartialSeqScanState *) planstate,
+												   toc);
+			break;
+		default:
+			break;
+	}
+
+	return planstate_tree_walker(planstate, ExecParallelInitializeWorker, toc);
+}
+
+/*
  * Main entrypoint for parallel query worker processes.
  *
  * We reach this function from ParallelMain, so the setup necessary to create
@@ -583,6 +659,7 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
 
 	/* Start up the executor, have it run the plan, and then shut it down. */
 	ExecutorStart(queryDesc, 0);
+	ExecParallelInitializeWorker(queryDesc->planstate, toc);
 	ExecutorRun(queryDesc, ForwardScanDirection, 0L);
 	ExecutorFinish(queryDesc);
 
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 5bc1d48..1b929c2 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -100,6 +100,7 @@
 #include "executor/nodeMergejoin.h"
 #include "executor/nodeModifyTable.h"
 #include "executor/nodeNestloop.h"
+#include "executor/nodePartialSeqscan.h"
 #include "executor/nodeGather.h"
 #include "executor/nodeRecursiveunion.h"
 #include "executor/nodeResult.h"
@@ -193,6 +194,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												   estate, eflags);
 			break;
 
+		case T_PartialSeqScan:
+			result = (PlanState *) ExecInitPartialSeqScan((PartialSeqScan *) node,
+														  estate, eflags);
+			break;
+
 		case T_SampleScan:
 			result = (PlanState *) ExecInitSampleScan((SampleScan *) node,
 													  estate, eflags);
@@ -419,6 +425,10 @@ ExecProcNode(PlanState *node)
 			result = ExecSeqScan((SeqScanState *) node);
 			break;
 
+		case T_PartialSeqScanState:
+			result = ExecPartialSeqScan((PartialSeqScanState *) node);
+			break;
+
 		case T_SampleScanState:
 			result = ExecSampleScan((SampleScanState *) node);
 			break;
@@ -665,6 +675,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSeqScan((SeqScanState *) node);
 			break;
 
+		case T_PartialSeqScanState:
+			ExecEndPartialSeqScan((PartialSeqScanState *) node);
+			break;
+
 		case T_SampleScanState:
 			ExecEndSampleScan((SampleScanState *) node);
 			break;
@@ -804,10 +818,7 @@ ExecShutdownNode(PlanState *node)
 	switch (nodeTag(node))
 	{
 		case T_GatherState:
-			{
-				ExecShutdownGather((GatherState *) node);
-				return true;
-			}
+			ExecShutdownGather((GatherState *) node);
 			break;
 		default:
 			break;
diff --git a/src/backend/executor/nodeGather.c b/src/backend/executor/nodeGather.c
index 7e2272f..e669c4a 100644
--- a/src/backend/executor/nodeGather.c
+++ b/src/backend/executor/nodeGather.c
@@ -26,6 +26,7 @@
 
 
 static TupleTableSlot *gather_getnext(GatherState *gatherstate);
+static void ExecShutdownGatherWorkers(GatherState *node);
 
 
 /* ----------------------------------------------------------------
@@ -120,9 +121,10 @@ ExecGather(GatherState *node)
 			bool	got_any_worker = false;
 
 			/* Initialize the workers required to execute Gather node. */
-			node->pei = ExecInitParallelPlan(node->ps.lefttree,
-											 estate,
-											 gather->num_workers);
+			if (!node->pei)
+				node->pei = ExecInitParallelPlan(node->ps.lefttree,
+												 estate,
+												 gather->num_workers);
 
 			/*
 			 * Register backend workers. We might not get as many as we
@@ -210,7 +212,7 @@ gather_getnext(GatherState *gatherstate)
 									   gatherstate->need_to_scan_locally,
 									   &done);
 			if (done)
-				ExecShutdownGather(gatherstate);
+				ExecShutdownGatherWorkers(gatherstate);
 
 			if (HeapTupleIsValid(tup))
 			{
@@ -241,11 +243,34 @@ gather_getnext(GatherState *gatherstate)
 }
 
 /* ----------------------------------------------------------------
+ *		ExecShutdownGatherWorkers
+ *
+ *		Destroy the parallel workers.  Collect all the stats after
+ *		workers are stopped, else some work done by workers won't be
+ *		accounted for.
+ * ----------------------------------------------------------------
+ */
+static void
+ExecShutdownGatherWorkers(GatherState *node)
+{
+	/* Shut down tuple queue funnel before shutting down workers. */
+	if (node->funnel != NULL)
+	{
+		DestroyTupleQueueFunnel(node->funnel);
+		node->funnel = NULL;
+	}
+
+	/* Now shut down the workers. */
+	if (node->pei != NULL)
+		ExecParallelFinish(node->pei);
+}
+
+/* ----------------------------------------------------------------
  *		ExecShutdownGather
  *
- *		Destroy the setup for parallel workers.  Collect all the
- *		stats after workers are stopped, else some work done by
- *		workers won't be accounted.
+ *		Destroy the setup for parallel workers including parallel context.
+ *		Collect all the stats after workers are stopped, else some work
+ *		done by workers won't be accounted for.
  * ----------------------------------------------------------------
  */
 void
@@ -282,14 +307,21 @@ void
 ExecReScanGather(GatherState *node)
 {
 	/*
-	 * Re-initialize the parallel context and workers to perform rescan of
-	 * relation.  We want to gracefully shutdown all the workers so that they
+	 * Re-initialize the parallel workers to perform rescan of relation.
+	 * We want to gracefully shutdown all the workers so that they
 	 * should be able to propagate any error or other information to master
-	 * backend before dying.
+	 * backend before dying.  Parallel context will be reused for rescan.
 	 */
-	ExecShutdownGather(node);
+	ExecShutdownGatherWorkers(node);
 
 	node->initialized = false;
 
+	if (node->pei)
+	{
+		ReInitializeParallelDSM(node->pei->pcxt);
+		node->pei->tqueue =
+				ExecParallelReInitializeTupleQueues(node->pei->pcxt);
+	}
+
 	ExecReScan(node->ps.lefttree);
 }
diff --git a/src/backend/executor/nodePartialSeqscan.c b/src/backend/executor/nodePartialSeqscan.c
new file mode 100644
index 0000000..bc37b9b
--- /dev/null
+++ b/src/backend/executor/nodePartialSeqscan.c
@@ -0,0 +1,336 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodePartialSeqscan.c
+ *	  Support routines for partial sequential scans of relations.
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodePartialSeqscan.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * INTERFACE ROUTINES
+ *		ExecPartialSeqScan				scans a relation partially.
+ *		PartialSeqNext					retrieve next tuple from heap.
+ *		ExecInitPartialSeqScan			creates and initializes a partial seqscan node.
+ *		ExecEndPartialSeqScan			releases any storage allocated.
+ */
+#include "postgres.h"
+
+#include "access/relscan.h"
+#include "executor/execdebug.h"
+#include "executor/execParallel.h"
+#include "executor/nodePartialSeqscan.h"
+#include "utils/rel.h"
+
+
+
+/* ----------------------------------------------------------------
+ *						Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		PartialSeqNext
+ *
+ *		This is a workhorse for ExecPartialSeqScan
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+PartialSeqNext(PartialSeqScanState *node)
+{
+	HeapTuple	tuple;
+	HeapScanDesc scandesc;
+	EState	   *estate;
+	ScanDirection direction;
+	TupleTableSlot *slot;
+
+	/*
+	 * get information from the estate and scan state
+	 */
+	scandesc = node->ss.ss_currentScanDesc;
+	estate = node->ss.ps.state;
+	direction = estate->es_direction;
+	slot = node->ss.ss_ScanTupleSlot;
+
+	/*
+	 * get the next tuple from the table
+	 */
+	tuple = heap_getnext(scandesc, direction);
+
+	/*
+	 * save the tuple and the buffer returned to us by the access methods in
+	 * our scan tuple slot and return the slot.  Note: we pass 'false' because
+	 * tuples returned by heap_getnext() are pointers onto disk pages and were
+	 * not created with palloc() and so should not be pfree()'d.  Note also
+	 * that ExecStoreTuple will increment the refcount of the buffer; the
+	 * refcount will not be dropped until the tuple table slot is cleared.
+	 */
+	if (tuple)
+		ExecStoreTuple(tuple,	/* tuple to store */
+					   slot,	/* slot to store in */
+					   scandesc->rs_cbuf,		/* buffer associated with this
+												 * tuple */
+					   false);	/* don't pfree this pointer */
+	else
+		ExecClearTuple(slot);
+
+	return slot;
+}
+
+/*
+ * PartialSeqRecheck -- access method routine to recheck a tuple in EvalPlanQual
+ */
+static bool
+PartialSeqRecheck(PartialSeqScanState *node, TupleTableSlot *slot)
+{
+	/*
+	 * Note that unlike IndexScan, PartialSeqScan never uses keys in
+	 * heap_beginscan (and this is very bad), so here we do not check
+	 * whether the keys are OK.
+	 */
+	return true;
+}
+
+/* ----------------------------------------------------------------
+ *		InitPartialScanRelation
+ *
+ *		Set up to access the scan relation.
+ * ----------------------------------------------------------------
+ */
+static void
+InitPartialScanRelation(PartialSeqScanState *node, EState *estate, int eflags)
+{
+	Relation	currentRelation;
+
+	/*
+	 * get the relation object id from the relid'th entry in the range table,
+	 * open that relation and acquire appropriate lock on it.
+	 */
+	currentRelation = ExecOpenScanRelation(estate,
+									  ((Scan *) node->ss.ps.plan)->scanrelid,
+										   eflags);
+
+	node->ss.ss_currentRelation = currentRelation;
+
+	/* and report the scan tuple slot's rowtype */
+	ExecAssignScanType(&node->ss, RelationGetDescr(currentRelation));
+}
+
+/* ----------------------------------------------------------------
+ *		ExecPartialSeqScanEstimate
+ *
+ *		estimates the space required to serialize partial seqscan node.
+ * ----------------------------------------------------------------
+ */
+void
+ExecPartialSeqScanEstimate(PartialSeqScanState *node,
+						   ParallelContext *pcxt)
+{
+	EState	   *estate = node->ss.ps.state;
+
+	node->pscan_len = heap_parallelscan_estimate(estate->es_snapshot);
+	shm_toc_estimate_chunk(&pcxt->estimator, node->pscan_len);
+
+	/* key for partial scan information. */
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecPartialSeqScanInitializeDSM
+ *
+ *		Initialize the DSM with the contents required to perform
+ *		partial seqscan.
+ * ----------------------------------------------------------------
+ */
+void
+ExecPartialSeqScanInitializeDSM(PartialSeqScanState *node,
+								ParallelContext *pcxt)
+{
+	EState	   *estate = node->ss.ps.state;
+
+	/*
+	 * Store parallel heap scan descriptor in dynamic shared memory.
+	 */
+	node->pscan = shm_toc_allocate(pcxt->toc, node->pscan_len);
+	heap_parallelscan_initialize(node->pscan,
+								 node->ss.ss_currentRelation,
+								 estate->es_snapshot);
+	shm_toc_insert(pcxt->toc,
+				   node->ss.ps.plan->plan_node_id,
+				   node->pscan);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecPartialSeqScanInitParallelScanDesc
+ *
+ *		Retrieve the contents from DSM related to partial seq scan node
+ *		and initialize the partial seqscan node.
+ * ----------------------------------------------------------------
+ */
+void
+ExecPartialSeqScanInitParallelScanDesc(PartialSeqScanState *node,
+									   shm_toc *toc)
+{
+	node->pscan = shm_toc_lookup(toc, node->ss.ps.plan->plan_node_id);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitPartialSeqScan
+ * ----------------------------------------------------------------
+ */
+PartialSeqScanState *
+ExecInitPartialSeqScan(PartialSeqScan *node, EState *estate, int eflags)
+{
+	PartialSeqScanState *scanstate;
+
+	/*
+	 * Once upon a time it was possible to have an outerPlan of a SeqScan, but
+	 * not any more.
+	 */
+	Assert(outerPlan(node) == NULL);
+	Assert(innerPlan(node) == NULL);
+
+	/*
+	 * create state structure
+	 */
+	scanstate = makeNode(PartialSeqScanState);
+	scanstate->ss.ps.plan = (Plan *) node;
+	scanstate->ss.ps.state = estate;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * create expression context for node
+	 */
+	ExecAssignExprContext(estate, &scanstate->ss.ps);
+
+	/*
+	 * initialize child expressions
+	 */
+	scanstate->ss.ps.targetlist = (List *)
+		ExecInitExpr((Expr *) node->plan.targetlist,
+					 (PlanState *) scanstate);
+	scanstate->ss.ps.qual = (List *)
+		ExecInitExpr((Expr *) node->plan.qual,
+					 (PlanState *) scanstate);
+
+	/*
+	 * tuple table initialization
+	 */
+	ExecInitResultTupleSlot(estate, &scanstate->ss.ps);
+	ExecInitScanTupleSlot(estate, &scanstate->ss);
+
+	/*
+	 * initialize scan relation
+	 */
+	InitPartialScanRelation(scanstate, estate, eflags);
+
+	scanstate->ss.ps.ps_TupFromTlist = false;
+
+	/*
+	 * Initialize result tuple type and projection info.
+	 */
+	ExecAssignResultTypeFromTL(&scanstate->ss.ps);
+	ExecAssignScanProjectionInfo(&scanstate->ss);
+
+	return scanstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecPartialSeqScan(node)
+ *
+ *		Scans the relation and returns the next qualifying tuple.
+ *		We call the ExecScan() routine and pass it the appropriate
+ *		access method functions.
+ * ----------------------------------------------------------------
+ */
+TupleTableSlot *
+ExecPartialSeqScan(PartialSeqScanState *node)
+{
+	/*
+	 * Initialize the scan on first execution.  Normally we would do this
+	 * during the ExecutorStart phase, but this node needs a
+	 * ParallelHeapScanDesc to begin the scan, and that is only set up by
+	 * the Gather node during the ExecutorRun phase.
+	 */
+	if (!node->ss.ss_currentScanDesc)
+	{
+		node->ss.ss_currentScanDesc =
+			heap_beginscan_parallel(node->ss.ss_currentRelation, node->pscan);
+	}
+
+	return ExecScan((ScanState *) node,
+					(ExecScanAccessMtd) PartialSeqNext,
+					(ExecScanRecheckMtd) PartialSeqRecheck);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndPartialSeqScan
+ *
+ *		frees any storage allocated through C routines.
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndPartialSeqScan(PartialSeqScanState *node)
+{
+	Relation	relation;
+	HeapScanDesc scanDesc;
+
+	/*
+	 * get information from node
+	 */
+	relation = node->ss.ss_currentRelation;
+	scanDesc = node->ss.ss_currentScanDesc;
+
+	/*
+	 * Free the exprcontext
+	 */
+	ExecFreeExprContext(&node->ss.ps);
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+
+	/*
+	 * close heap scan
+	 */
+	if (scanDesc)
+		heap_endscan(scanDesc);
+
+	/*
+	 * close the heap relation.
+	 */
+	ExecCloseScanRelation(relation);
+}
+
+/* ----------------------------------------------------------------
+ *						Join Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecReScanPartialSeqScan
+ *
+ *		Rescans the relation.
+ * ----------------------------------------------------------------
+ */
+void
+ExecReScanPartialSeqScan(PartialSeqScanState *node)
+{
+	HeapScanDesc scan;
+
+	scan = node->ss.ss_currentScanDesc;
+
+	if (scan)
+		heap_rescan(scan,		/* scan desc */
+					NULL);		/* new scan keys */
+
+	ExecScanReScan((ScanState *) node);
+}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index c176ff9..fdcccef 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -384,6 +384,22 @@ _copySeqScan(const SeqScan *from)
 }
 
 /*
+ * _copyPartialSeqScan
+ */
+static PartialSeqScan *
+_copyPartialSeqScan(const PartialSeqScan *from)
+{
+	PartialSeqScan    *newnode = makeNode(PartialSeqScan);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopyScanFields((const Scan *) from, (Scan *) newnode);
+
+	return newnode;
+}
+
+/*
  * _copySampleScan
  */
 static SampleScan *
@@ -4264,6 +4280,9 @@ copyObject(const void *from)
 		case T_SeqScan:
 			retval = _copySeqScan(from);
 			break;
+		case T_PartialSeqScan:
+			retval = _copyPartialSeqScan(from);
+			break;
 		case T_SampleScan:
 			retval = _copySampleScan(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 3e75cd1..5ce45e2 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -460,6 +460,14 @@ _outSeqScan(StringInfo str, const SeqScan *node)
 }
 
 static void
+_outPartialSeqScan(StringInfo str, const PartialSeqScan *node)
+{
+	WRITE_NODE_TYPE("PARTIALSEQSCAN");
+
+	_outScanInfo(str, (const Scan *) node);
+}
+
+static void
 _outSampleScan(StringInfo str, const SampleScan *node)
 {
 	WRITE_NODE_TYPE("SAMPLESCAN");
@@ -3020,6 +3028,9 @@ _outNode(StringInfo str, const void *obj)
 			case T_SeqScan:
 				_outSeqScan(str, obj);
 				break;
+			case T_PartialSeqScan:
+				_outPartialSeqScan(str, obj);
+				break;
 			case T_SampleScan:
 				_outSampleScan(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 94ba6dc..3d3448d 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1606,6 +1606,19 @@ _readSeqScan(void)
 }
 
 /*
+ * _readPartialSeqScan
+ */
+static PartialSeqScan *
+_readPartialSeqScan(void)
+{
+	READ_LOCALS_NO_FIELDS(PartialSeqScan);
+
+	ReadCommonScan(local_node);
+
+	READ_DONE();
+}
+
+/*
  * _readSampleScan
  */
 static SampleScan *
@@ -2337,6 +2350,8 @@ parseNodeString(void)
 		return_value = _readScan();
 	else if (MATCH("SEQSCAN", 7))
 		return_value = _readSeqScan();
+	else if (MATCH("PARTIALSEQSCAN", 14))
+		return_value = _readPartialSeqScan();
 	else if (MATCH("SAMPLESCAN", 10))
 		return_value = _readSampleScan();
 	else if (MATCH("INDEXSCAN", 9))
diff --git a/src/backend/optimizer/path/Makefile b/src/backend/optimizer/path/Makefile
index 6864a62..6e462b1 100644
--- a/src/backend/optimizer/path/Makefile
+++ b/src/backend/optimizer/path/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = allpaths.o clausesel.o costsize.o equivclass.o indxpath.o \
-       joinpath.o joinrels.o pathkeys.o tidpath.o
+       joinpath.o joinrels.o pathkeys.o parallelpath.o tidpath.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 8fc1cfd..c2ae95d 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -477,6 +477,9 @@ set_plain_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
 	/* Consider sequential scan */
 	add_path(rel, create_seqscan_path(root, rel, required_outer));
 
+	/* Consider parallel scans */
+	create_parallelscan_paths(root, rel, required_outer);
+
 	/* Consider index scans */
 	create_index_paths(root, rel);
 
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 1b61fd9..3239cec 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -227,6 +227,49 @@ cost_seqscan(Path *path, PlannerInfo *root,
 }
 
 /*
+ * cost_partialseqscan
+ *	  Determines and returns the cost of scanning a relation partially.
+ *
+ * 'baserel' is the relation to be scanned
+ * 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
+ * 'nworkers' is the number of workers among which the work will be
+ *			distributed
+ */
+void
+cost_partialseqscan(Path *path, PlannerInfo *root,
+					RelOptInfo *baserel, ParamPathInfo *param_info,
+					int nworkers)
+{
+	Cost		startup_cost = 0;
+	Cost		run_cost = 0;
+
+	cost_seqscan(path, root, baserel, param_info);
+
+	startup_cost = path->startup_cost;
+
+	run_cost = path->total_cost - startup_cost;
+
+	/*
+	 * Account for the small cost of the communication needed to
+	 * coordinate the scan via the ParallelHeapScanDesc.
+	 */
+	run_cost += 0.01;
+
+	/*
+	 * Runtime cost will be equally shared by all workers.  The assumption
+	 * here is that disk access cost will also be equally shared between
+	 * workers, which is generally true unless there are too many workers
+	 * working on a relatively small number of blocks.  If we come across
+	 * any such case, we can think of changing the current cost model for
+	 * partial sequential scan.
+	 */
+	run_cost = run_cost / (nworkers + 1);
+
+	path->startup_cost = startup_cost;
+	path->total_cost = startup_cost + run_cost;
+}
+
+/*
  * cost_samplescan
  *	  Determines and returns the cost of scanning a relation using sampling.
  *
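
For readers following the cost model, the arithmetic in cost_partialseqscan above can be restated as a standalone sketch. It is not part of the patch: SketchCost and sketch_partial_seqscan_cost are hypothetical names used only for illustration, and the input is assumed to be the startup and total cost that cost_seqscan would have produced.

```c
#include <assert.h>

/* Hypothetical stand-in for the startup/total cost fields of a Path. */
typedef struct SketchCost
{
	double		startup_cost;
	double		total_cost;
} SketchCost;

/*
 * Restates the arithmetic of cost_partialseqscan: take the plain seqscan
 * cost, add the small 0.01 communication charge to the run cost, and split
 * the run cost across nworkers + 1 participants (the divisor used in the
 * patch).  The startup cost is left untouched.
 */
static SketchCost
sketch_partial_seqscan_cost(SketchCost seqscan, int nworkers)
{
	SketchCost	result;
	double		run_cost;

	assert(nworkers >= 0);

	run_cost = seqscan.total_cost - seqscan.startup_cost;
	run_cost += 0.01;			/* communication via ParallelHeapScanDesc */
	run_cost = run_cost / (nworkers + 1);

	result.startup_cost = seqscan.startup_cost;
	result.total_cost = seqscan.startup_cost + run_cost;
	return result;
}
```

With a seqscan run cost of 99.99 and three workers, the run cost under this model comes out at roughly a quarter of the serial figure, since the divisor is nworkers + 1 rather than nworkers.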
diff --git a/src/backend/optimizer/path/parallelpath.c b/src/backend/optimizer/path/parallelpath.c
new file mode 100644
index 0000000..ce25cbf
--- /dev/null
+++ b/src/backend/optimizer/path/parallelpath.c
@@ -0,0 +1,132 @@
+/*-------------------------------------------------------------------------
+ *
+ * parallelpath.c
+ *	  Routines to determine parallel paths for scanning a given relation.
+ *
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/optimizer/path/parallelpath.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heapam.h"
+#include "optimizer/clauses.h"
+#include "optimizer/cost.h"
+#include "optimizer/pathnode.h"
+#include "optimizer/paths.h"
+#include "parser/parsetree.h"
+#include "utils/rel.h"
+
+
+/*
+ * expr_is_parallel_safe
+ *	  Is a particular expression parallel safe?
+ *
+ * Conditions checked here:
+ *
+ * 1. The expression must not contain any parallel unsafe or parallel
+ * restricted functions.
+ *
+ * 2. The expression must not contain any initplan or subplan.  We can
+ * probably remove this restriction once we have infrastructure support
+ * for executing initplans and subplans at parallel (Gather) nodes.
+ */
+bool
+expr_is_parallel_safe(Node *node)
+{
+	if (check_parallel_safety(node, false))
+		return false;
+
+	if (contain_subplans_or_initplans(node))
+		return false;
+
+	return true;
+}
+
+/*
+ * create_parallelscan_paths
+ *	  Create paths corresponding to parallel scans of the given rel.
+ *	  Currently we only support partial sequential scan.
+ *
+ *	  Candidate paths are added to the rel's pathlist (using add_path).
+ */
+void
+create_parallelscan_paths(PlannerInfo *root, RelOptInfo *rel,
+						  Relids required_outer)
+{
+	int			num_parallel_workers = 0;
+	int			estimated_parallel_workers = 0;
+	Oid			reloid;
+	Relation	relation;
+	Path	   *subpath;
+	ListCell   *l;
+
+	/*
+	 * Parallel scan is possible only if the user has set max_parallel_degree
+	 * to a value greater than 0 and the query is parallel-safe.
+	 */
+	if (max_parallel_degree <= 0 || !root->glob->parallelModeOK)
+		return;
+
+	/*
+	 * There should be at least a thousand pages to scan for each worker.
+	 * This number is somewhat arbitrary; however, we don't want to spawn
+	 * workers to scan smaller relations, as that would be costly.
+	 */
+	estimated_parallel_workers = rel->pages / 1000;
+
+	if (estimated_parallel_workers <= 0)
+		return;
+
+	reloid = planner_rt_fetch(rel->relid, root)->relid;
+
+	relation = heap_open(reloid, NoLock);
+
+	/*
+	 * Temporary relations can't be scanned by parallel workers as they are
+	 * visible only to local sessions.
+	 */
+	if (RelationUsesLocalBuffers(relation))
+	{
+		heap_close(relation, NoLock);
+		return;
+	}
+
+	heap_close(relation, NoLock);
+
+	/*
+	 * Allow parallel paths only if all the clauses for the relation are
+	 * parallel safe.  We could allow execution of parallel restricted
+	 * clauses in the master backend, but for that the planner would need
+	 * infrastructure to pull all the parallel restricted clauses up from
+	 * the nodes below to the Gather node, which would then execute them.
+	 */
+	foreach(l, rel->baserestrictinfo)
+	{
+		RestrictInfo *rinfo = (RestrictInfo *) lfirst(l);
+
+		if (!expr_is_parallel_safe((Node *) rinfo->clause))
+			return;
+	}
+
+	num_parallel_workers = Min(max_parallel_degree,
+							   estimated_parallel_workers);
+
+	/*
+	 * Create the partial scan path which each worker backend needs to
+	 * execute.
+	 */
+	subpath = create_partialseqscan_path(root, rel, required_outer,
+										 num_parallel_workers);
+
+	/* Create the gather path which master backend needs to execute. */
+	add_path(rel, (Path *) create_gather_path(root, rel, subpath,
+											  required_outer,
+											  num_parallel_workers));
+}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 791b64e..b142811 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -58,6 +58,8 @@ static Material *create_material_plan(PlannerInfo *root, MaterialPath *best_path
 static Plan *create_unique_plan(PlannerInfo *root, UniquePath *best_path);
 static SeqScan *create_seqscan_plan(PlannerInfo *root, Path *best_path,
 					List *tlist, List *scan_clauses);
+static Scan *create_partialseqscan_plan(PlannerInfo *root, Path *best_path,
+						   List *tlist, List *scan_clauses);
 static SampleScan *create_samplescan_plan(PlannerInfo *root, Path *best_path,
 					   List *tlist, List *scan_clauses);
 static Gather *create_gather_plan(PlannerInfo *root,
@@ -104,6 +106,8 @@ static List *order_qual_clauses(PlannerInfo *root, List *clauses);
 static void copy_path_costsize(Plan *dest, Path *src);
 static void copy_plan_costsize(Plan *dest, Plan *src);
 static SeqScan *make_seqscan(List *qptlist, List *qpqual, Index scanrelid);
+static PartialSeqScan *make_partialseqscan(List *qptlist, List *qpqual,
+					Index scanrelid);
 static SampleScan *make_samplescan(List *qptlist, List *qpqual, Index scanrelid,
 				TableSampleClause *tsc);
 static Gather *make_gather(List *qptlist, List *qpqual,
@@ -237,6 +241,7 @@ create_plan_recurse(PlannerInfo *root, Path *best_path)
 	switch (best_path->pathtype)
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
 		case T_SampleScan:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
@@ -357,6 +362,13 @@ create_scan_plan(PlannerInfo *root, Path *best_path)
 												scan_clauses);
 			break;
 
+		case T_PartialSeqScan:
+			plan = (Plan *) create_partialseqscan_plan(root,
+													   best_path,
+													   tlist,
+													   scan_clauses);
+			break;
+
 		case T_SampleScan:
 			plan = (Plan *) create_samplescan_plan(root,
 												   best_path,
@@ -567,6 +579,7 @@ disuse_physical_tlist(PlannerInfo *root, Plan *plan, Path *path)
 	switch (path->pathtype)
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
 		case T_SampleScan:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
@@ -1184,6 +1197,46 @@ create_seqscan_plan(PlannerInfo *root, Path *best_path,
 }
 
 /*
+ * create_partialseqscan_plan
+ *
+ * Returns a partial seqscan plan for the base relation scanned by
+ * 'best_path' with restriction clauses 'scan_clauses' and targetlist
+ * 'tlist'.
+ */
+static Scan *
+create_partialseqscan_plan(PlannerInfo *root, Path *best_path,
+						   List *tlist, List *scan_clauses)
+{
+	Scan	   *scan_plan;
+	Index		scan_relid = best_path->parent->relid;
+
+	/* it should be a base rel... */
+	Assert(scan_relid > 0);
+	Assert(best_path->parent->rtekind == RTE_RELATION);
+
+	/* Sort clauses into best execution order */
+	scan_clauses = order_qual_clauses(root, scan_clauses);
+
+	/* Reduce RestrictInfo list to bare expressions; ignore pseudoconstants */
+	scan_clauses = extract_actual_clauses(scan_clauses, false);
+
+	/* Replace any outer-relation variables with nestloop params */
+	if (best_path->param_info)
+	{
+		scan_clauses = (List *)
+			replace_nestloop_params(root, (Node *) scan_clauses);
+	}
+
+	scan_plan = (Scan *) make_partialseqscan(tlist,
+											 scan_clauses,
+											 scan_relid);
+
+	copy_path_costsize(&scan_plan->plan, best_path);
+
+	return scan_plan;
+}
+
+/*
  * create_samplescan_plan
  *	 Returns a samplescan plan for the base relation scanned by 'best_path'
  *	 with restriction clauses 'scan_clauses' and targetlist 'tlist'.
@@ -3481,6 +3534,24 @@ make_seqscan(List *qptlist,
 	return node;
 }
 
+static PartialSeqScan *
+make_partialseqscan(List *qptlist,
+					List *qpqual,
+					Index scanrelid)
+{
+	PartialSeqScan *node = makeNode(PartialSeqScan);
+	Plan	   *plan = &node->plan;
+
+	/* cost should be inserted by caller */
+	plan->targetlist = qptlist;
+	plan->qual = qpqual;
+	plan->lefttree = NULL;
+	plan->righttree = NULL;
+	node->scanrelid = scanrelid;
+
+	return node;
+}
+
 static SampleScan *
 make_samplescan(List *qptlist,
 				List *qpqual,
@@ -5174,6 +5245,7 @@ is_projection_capable_plan(Plan *plan)
 		case T_Append:
 		case T_MergeAppend:
 		case T_RecursiveUnion:
+		case T_Gather:
 			return false;
 		default:
 			break;
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 536b55e..d5329fb 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -202,13 +202,13 @@ standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
 	glob->hasRowSecurity = false;
 
 	/*
-	 * Assess whether it's feasible to use parallel mode for this query.
-	 * We can't do this in a standalone backend, or if the command will
-	 * try to modify any data, or if this is a cursor operation, or if any
+	 * Assess whether it's feasible to use parallel mode for this query. We
+	 * can't do this in a standalone backend, or if the command will try to
+	 * modify any data, or if this is a cursor operation, or if any
 	 * parallel-unsafe functions are present in the query tree.
 	 *
-	 * For now, we don't try to use parallel mode if we're running inside
-	 * a parallel worker.  We might eventually be able to relax this
+	 * For now, we don't try to use parallel mode if we're running inside a
+	 * parallel worker.  We might eventually be able to relax this
 	 * restriction, but for now it seems best not to have parallel workers
 	 * trying to create their own parallel workers.
 	 *
@@ -225,7 +225,7 @@ standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
 		parse->commandType == CMD_SELECT && !parse->hasModifyingCTE &&
 		parse->utilityStmt == NULL && !IsParallelWorker() &&
 		!IsolationIsSerializable() &&
-		!contain_parallel_unsafe((Node *) parse);
+		!check_parallel_safety((Node *) parse, true);
 
 	/*
 	 * glob->parallelModeOK should tell us whether it's necessary to impose
@@ -238,9 +238,9 @@ standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
 	 *
 	 * (It's been suggested that we should always impose these restrictions
 	 * whenever glob->parallelModeOK is true, so that it's easier to notice
-	 * incorrectly-labeled functions sooner.  That might be the right thing
-	 * to do, but for now I've taken this approach.  We could also control
-	 * this with a GUC.)
+	 * incorrectly-labeled functions sooner.  That might be the right thing to
+	 * do, but for now I've taken this approach.  We could also control this
+	 * with a GUC.)
 	 *
 	 * FIXME: It's assumed that code further down will set parallelModeNeeded
 	 * to true if a parallel path is actually chosen.  Since the core
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 8c6c571..36c959c 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -447,6 +447,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
 			{
 				SeqScan    *splan = (SeqScan *) plan;
 
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 82414d4..99dacde 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2234,6 +2234,7 @@ finalize_plan(PlannerInfo *root, Plan *plan, Bitmapset *valid_params,
 			break;
 
 		case T_SeqScan:
+		case T_PartialSeqScan:
 			context.paramids = bms_add_members(context.paramids, scan_params);
 			break;
 
diff --git a/src/backend/optimizer/util/clauses.c b/src/backend/optimizer/util/clauses.c
index f2c8551..2355cc6 100644
--- a/src/backend/optimizer/util/clauses.c
+++ b/src/backend/optimizer/util/clauses.c
@@ -87,16 +87,25 @@ typedef struct
 	char	   *prosrc;
 } inline_error_callback_arg;
 
+typedef struct
+{
+	bool		allow_restricted;
+}	check_parallel_safety_arg;
+
 static bool contain_agg_clause_walker(Node *node, void *context);
 static bool count_agg_clauses_walker(Node *node,
 						 count_agg_clauses_context *context);
 static bool find_window_functions_walker(Node *node, WindowFuncLists *lists);
 static bool expression_returns_set_rows_walker(Node *node, double *count);
 static bool contain_subplans_walker(Node *node, void *context);
+static bool contain_subplans_or_initplans_walker(Node *node, void *context);
 static bool contain_mutable_functions_walker(Node *node, void *context);
 static bool contain_volatile_functions_walker(Node *node, void *context);
 static bool contain_volatile_functions_not_nextval_walker(Node *node, void *context);
-static bool contain_parallel_unsafe_walker(Node *node, void *context);
+static bool check_parallel_safety_walker(Node *node,
+							 check_parallel_safety_arg * context);
+static bool parallel_too_dangerous(char proparallel,
+					   check_parallel_safety_arg * context);
 static bool contain_nonstrict_functions_walker(Node *node, void *context);
 static bool contain_leaked_vars_walker(Node *node, void *context);
 static Relids find_nonnullable_rels_walker(Node *node, bool top_level);
@@ -1204,13 +1213,16 @@ contain_volatile_functions_not_nextval_walker(Node *node, void *context)
  *****************************************************************************/
 
 bool
-contain_parallel_unsafe(Node *node)
+check_parallel_safety(Node *node, bool allow_restricted)
 {
-	return contain_parallel_unsafe_walker(node, NULL);
+	check_parallel_safety_arg context;
+
+	context.allow_restricted = allow_restricted;
+	return check_parallel_safety_walker(node, &context);
 }
 
 static bool
-contain_parallel_unsafe_walker(Node *node, void *context)
+check_parallel_safety_walker(Node *node, check_parallel_safety_arg * context)
 {
 	if (node == NULL)
 		return false;
@@ -1218,7 +1230,7 @@ contain_parallel_unsafe_walker(Node *node, void *context)
 	{
 		FuncExpr   *expr = (FuncExpr *) node;
 
-		if (func_parallel(expr->funcid) == PROPARALLEL_UNSAFE)
+		if (parallel_too_dangerous(func_parallel(expr->funcid), context))
 			return true;
 		/* else fall through to check args */
 	}
@@ -1227,7 +1239,7 @@ contain_parallel_unsafe_walker(Node *node, void *context)
 		OpExpr	   *expr = (OpExpr *) node;
 
 		set_opfuncid(expr);
-		if (func_parallel(expr->opfuncid) == PROPARALLEL_UNSAFE)
+		if (parallel_too_dangerous(func_parallel(expr->opfuncid), context))
 			return true;
 		/* else fall through to check args */
 	}
@@ -1236,7 +1248,7 @@ contain_parallel_unsafe_walker(Node *node, void *context)
 		DistinctExpr *expr = (DistinctExpr *) node;
 
 		set_opfuncid((OpExpr *) expr);	/* rely on struct equivalence */
-		if (func_parallel(expr->opfuncid) == PROPARALLEL_UNSAFE)
+		if (parallel_too_dangerous(func_parallel(expr->opfuncid), context))
 			return true;
 		/* else fall through to check args */
 	}
@@ -1245,7 +1257,7 @@ contain_parallel_unsafe_walker(Node *node, void *context)
 		NullIfExpr *expr = (NullIfExpr *) node;
 
 		set_opfuncid((OpExpr *) expr);	/* rely on struct equivalence */
-		if (func_parallel(expr->opfuncid) == PROPARALLEL_UNSAFE)
+		if (parallel_too_dangerous(func_parallel(expr->opfuncid), context))
 			return true;
 		/* else fall through to check args */
 	}
@@ -1254,7 +1266,7 @@ contain_parallel_unsafe_walker(Node *node, void *context)
 		ScalarArrayOpExpr *expr = (ScalarArrayOpExpr *) node;
 
 		set_sa_opfuncid(expr);
-		if (func_parallel(expr->opfuncid) == PROPARALLEL_UNSAFE)
+		if (parallel_too_dangerous(func_parallel(expr->opfuncid), context))
 			return true;
 		/* else fall through to check args */
 	}
@@ -1268,12 +1280,12 @@ contain_parallel_unsafe_walker(Node *node, void *context)
 		/* check the result type's input function */
 		getTypeInputInfo(expr->resulttype,
 						 &iofunc, &typioparam);
-		if (func_parallel(iofunc) == PROPARALLEL_UNSAFE)
+		if (parallel_too_dangerous(func_parallel(iofunc), context))
 			return true;
 		/* check the input type's output function */
 		getTypeOutputInfo(exprType((Node *) expr->arg),
 						  &iofunc, &typisvarlena);
-		if (func_parallel(iofunc) == PROPARALLEL_UNSAFE)
+		if (parallel_too_dangerous(func_parallel(iofunc), context))
 			return true;
 		/* else fall through to check args */
 	}
@@ -1282,7 +1294,7 @@ contain_parallel_unsafe_walker(Node *node, void *context)
 		ArrayCoerceExpr *expr = (ArrayCoerceExpr *) node;
 
 		if (OidIsValid(expr->elemfuncid) &&
-			func_parallel(expr->elemfuncid) == PROPARALLEL_UNSAFE)
+			parallel_too_dangerous(func_parallel(expr->elemfuncid), context))
 			return true;
 		/* else fall through to check args */
 	}
@@ -1294,28 +1306,77 @@ contain_parallel_unsafe_walker(Node *node, void *context)
 
 		foreach(opid, rcexpr->opnos)
 		{
-			if (op_volatile(lfirst_oid(opid)) == PROPARALLEL_UNSAFE)
+			if (parallel_too_dangerous(func_parallel(get_opcode(lfirst_oid(opid))), context))
 				return true;
 		}
 		/* else fall through to check args */
 	}
 	else if (IsA(node, Query))
 	{
-		Query *query = (Query *) node;
+		Query	   *query = (Query *) node;
 
 		if (query->rowMarks != NULL)
 			return true;
 
 		/* Recurse into subselects */
 		return query_tree_walker(query,
-								 contain_parallel_unsafe_walker,
+								 check_parallel_safety_walker,
 								 context, 0);
 	}
 	return expression_tree_walker(node,
-								  contain_parallel_unsafe_walker,
+								  check_parallel_safety_walker,
 								  context);
 }
 
+static bool
+parallel_too_dangerous(char proparallel, check_parallel_safety_arg * context)
+{
+	if (context->allow_restricted)
+		return proparallel == PROPARALLEL_UNSAFE;
+	else
+		return proparallel != PROPARALLEL_SAFE;
+}
+
+/*
+ * contain_subplans_or_initplans
+ *	  Recursively search for initplan or subplan nodes within a clause.
+ *
+ * A special purpose function for prohibiting subplan or initplan clauses
+ * in parallel query constructs.
+ *
+ * If we see any form of SubPlan node, we return true.  For InitPlans, we
+ * return true when we see a PARAM_EXEC Param node.  An InitPlan can also
+ * appear as a simple NULL constant for a MULTIEXPR subquery (see comments
+ * in make_subplan); we need not handle that case because it arises only
+ * in UPDATE statements, which are prohibited in parallel mode anyway.
+ *
+ * Returns true if any subplan or initplan is found.
+ */
+bool
+contain_subplans_or_initplans(Node *clause)
+{
+	return contain_subplans_or_initplans_walker(clause, NULL);
+}
+
+static bool
+contain_subplans_or_initplans_walker(Node *node, void *context)
+{
+	if (node == NULL)
+		return false;
+	if (IsA(node, SubPlan) ||
+		IsA(node, AlternativeSubPlan) ||
+		IsA(node, SubLink))
+		return true;			/* abort the tree traversal and return true */
+	else if (IsA(node, Param))
+	{
+		Param	   *paramval = (Param *) node;
+
+		if (paramval->paramkind == PARAM_EXEC)
+			return true;
+	}
+	return expression_tree_walker(node, contain_subplans_or_initplans_walker, context);
+}
+
 /*****************************************************************************
  *		Check clauses for nonstrict functions
  *****************************************************************************/
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 1895a68..2fd7ae5 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -712,6 +712,28 @@ create_seqscan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer)
 }
 
 /*
+ * create_partialseqscan_path
+ *	  Creates a path corresponding to a partial sequential scan, returning the
+ *	  pathnode.
+ */
+Path *
+create_partialseqscan_path(PlannerInfo *root, RelOptInfo *rel,
+						   Relids required_outer, int nworkers)
+{
+	Path	   *pathnode = makeNode(Path);
+
+	pathnode->pathtype = T_PartialSeqScan;
+	pathnode->parent = rel;
+	pathnode->param_info = get_baserel_parampathinfo(root, rel,
+													 required_outer);
+	pathnode->pathkeys = NIL;	/* partialseqscan has unordered result */
+
+	cost_partialseqscan(pathnode, root, rel, pathnode->param_info, nworkers);
+
+	return pathnode;
+}
+
+/*
  * create_samplescan_path
  *	  Creates a path node for a sampled table scan.
  */
diff --git a/src/include/access/parallel.h b/src/include/access/parallel.h
index d4b7c5d..6da06f6 100644
--- a/src/include/access/parallel.h
+++ b/src/include/access/parallel.h
@@ -56,6 +56,7 @@ extern bool InitializingParallelWorker;
 extern ParallelContext *CreateParallelContext(parallel_worker_main_type entrypoint, int nworkers);
 extern ParallelContext *CreateParallelContextForExternalFunction(char *library_name, char *function_name, int nworkers);
 extern void InitializeParallelDSM(ParallelContext *);
+extern void ReInitializeParallelDSM(ParallelContext *pcxt);
 extern void LaunchParallelWorkers(ParallelContext *);
 extern void WaitForParallelWorkersToFinish(ParallelContext *);
 extern void DestroyParallelContext(ParallelContext *);
diff --git a/src/include/executor/execParallel.h b/src/include/executor/execParallel.h
index 505500e..4808fc1 100644
--- a/src/include/executor/execParallel.h
+++ b/src/include/executor/execParallel.h
@@ -33,5 +33,6 @@ extern ParallelExecutorInfo *ExecInitParallelPlan(PlanState *planstate,
 					 EState *estate, int nworkers);
 extern void ExecParallelFinish(ParallelExecutorInfo *pei);
 extern void ExecParallelCleanup(ParallelExecutorInfo *pei);
+extern shm_mq_handle **ExecParallelReInitializeTupleQueues(ParallelContext *pcxt);
 
 #endif   /* EXECPARALLEL_H */
diff --git a/src/include/executor/nodePartialSeqscan.h b/src/include/executor/nodePartialSeqscan.h
new file mode 100644
index 0000000..77e5311
--- /dev/null
+++ b/src/include/executor/nodePartialSeqscan.h
@@ -0,0 +1,31 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodePartialSeqscan.h
+ *		prototypes for nodePartialSeqscan.c
+ *
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodePartialSeqscan.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEPARTIALSEQSCAN_H
+#define NODEPARTIALSEQSCAN_H
+
+#include "nodes/execnodes.h"
+
+extern void ExecPartialSeqScanEstimate(PartialSeqScanState *node,
+						   ParallelContext *pcxt);
+extern void ExecPartialSeqScanInitializeDSM(PartialSeqScanState *node,
+								ParallelContext *pcxt);
+extern void ExecPartialSeqScanInitParallelScanDesc(PartialSeqScanState *node,
+									   shm_toc *toc);
+extern PartialSeqScanState *ExecInitPartialSeqScan(PartialSeqScan *node,
+					   EState *estate, int eflags);
+extern TupleTableSlot *ExecPartialSeqScan(PartialSeqScanState *node);
+extern void ExecEndPartialSeqScan(PartialSeqScanState *node);
+extern void ExecReScanPartialSeqScan(PartialSeqScanState *node);
+
+#endif   /* NODEPARTIALSEQSCAN_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 4fcdcc4..d71892e 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -16,6 +16,7 @@
 
 #include "access/genam.h"
 #include "access/heapam.h"
+#include "access/parallel.h"
 #include "executor/instrument.h"
 #include "lib/pairingheap.h"
 #include "nodes/params.h"
@@ -1254,6 +1255,18 @@ typedef struct ScanState
  */
 typedef ScanState SeqScanState;
 
+/*
+ * PartialSeqScanState extends ScanState by storing additional information
+ * related to the parallel heap scan.
+ */
+typedef struct PartialSeqScanState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	ParallelHeapScanDesc	pscan;	/* parallel heap scan descriptor
+									 * for partial scan */
+	Size		pscan_len;		/* size of parallel heap scan descriptor */
+} PartialSeqScanState;
+
 /* ----------------
  *	 SampleScanState information
  * ----------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 94bdb7c..71496b9 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -51,6 +51,7 @@ typedef enum NodeTag
 	T_BitmapOr,
 	T_Scan,
 	T_SeqScan,
+	T_PartialSeqScan,
 	T_SampleScan,
 	T_IndexScan,
 	T_IndexOnlyScan,
@@ -99,6 +100,7 @@ typedef enum NodeTag
 	T_BitmapOrState,
 	T_ScanState,
 	T_SeqScanState,
+	T_PartialSeqScanState,
 	T_SampleScanState,
 	T_IndexScanState,
 	T_IndexOnlyScanState,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 92fd8e4..9cf5c0c 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -72,7 +72,7 @@ typedef struct PlannedStmt
 
 	bool		hasRowSecurity; /* row security applied? */
 
-	bool		parallelModeNeeded; /* parallel mode required to execute? */
+	bool		parallelModeNeeded;		/* parallel mode required to execute? */
 } PlannedStmt;
 
 /* macro for fetching the Plan associated with a SubPlan node */
@@ -287,6 +287,12 @@ typedef struct Scan
 typedef Scan SeqScan;
 
 /* ----------------
+ *		partial sequential scan node
+ * ----------------
+ */
+typedef SeqScan PartialSeqScan;
+
+/* ----------------
  *		table sample scan node
  * ----------------
  */
diff --git a/src/include/optimizer/clauses.h b/src/include/optimizer/clauses.h
index 5ac79b1..747b05b 100644
--- a/src/include/optimizer/clauses.h
+++ b/src/include/optimizer/clauses.h
@@ -62,7 +62,8 @@ extern bool contain_subplans(Node *clause);
 extern bool contain_mutable_functions(Node *clause);
 extern bool contain_volatile_functions(Node *clause);
 extern bool contain_volatile_functions_not_nextval(Node *clause);
-extern bool contain_parallel_unsafe(Node *node);
+extern bool check_parallel_safety(Node *node, bool allow_restricted);
+extern bool contain_subplans_or_initplans(Node *clause);
 extern bool contain_nonstrict_functions(Node *clause);
 extern bool contain_leaked_vars(Node *clause);
 
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 25a7303..8640567 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -73,6 +73,9 @@ extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,
 					double index_pages, PlannerInfo *root);
 extern void cost_seqscan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 			 ParamPathInfo *param_info);
+extern void cost_partialseqscan(Path *path, PlannerInfo *root,
+					RelOptInfo *baserel, ParamPathInfo *param_info,
+					int nworkers);
 extern void cost_samplescan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 				ParamPathInfo *param_info);
 extern void cost_index(IndexPath *path, PlannerInfo *root,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 7a4940c..3b97b73 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -32,6 +32,8 @@ extern bool add_path_precheck(RelOptInfo *parent_rel,
 
 extern Path *create_seqscan_path(PlannerInfo *root, RelOptInfo *rel,
 					Relids required_outer);
+extern Path *create_partialseqscan_path(PlannerInfo *root, RelOptInfo *rel,
+						   Relids required_outer, int nworkers);
 extern Path *create_samplescan_path(PlannerInfo *root, RelOptInfo *rel,
 					   Relids required_outer);
 extern IndexPath *create_index_path(PlannerInfo *root,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 87123a5..6cd4479 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -55,6 +55,15 @@ extern void debug_print_rel(PlannerInfo *root, RelOptInfo *rel);
 #endif
 
 /*
+ * parallelpath.c
+ *	  routines to generate parallel scan paths
+ */
+
+extern void create_parallelscan_paths(PlannerInfo *root, RelOptInfo *rel,
+						  Relids required_outer);
+extern bool expr_is_parallel_safe(Node *node);
+
+/*
  * indxpath.c
  *	  routines to generate index paths
  */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index feb821b..5492ba0 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1199,6 +1199,8 @@ OverrideStackEntry
 PACE_HEADER
 PACL
 ParallelExecutorInfo
+PartialSeqScan
+PartialSeqScanState
 PATH
 PBOOL
 PCtxtHandle
#420Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#387)
1 attachment(s)
Re: Parallel Seq Scan

On Tue, Oct 13, 2015 at 5:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:

- Although the changes in parallelpaths.c are in a good direction, I'm
pretty sure this is not yet up to scratch. I am less sure exactly
what needs to be fixed, so I'll have to give some more thought to
that.

Please find attached a proposed set of changes that I think are
better. These changes compute a consider_parallel flag for each
RelOptInfo, which is true if it's a non-temporary relation whose
baserestrictinfo references no PARAM_EXEC parameters, sublinks, or
parallel-restricted functions. Actually, I made an effort to set the
flag correctly even for baserels other than plain tables, and for
joinrels, though we don't technically need that stuff until we get to
the point of pushing joins beneath Gather nodes. When we get there,
it will be important - any joinrel for which consider_parallel = false
needn't even try to generate parallel paths, while if
consider_parallel = true then we can consider it, if the costing makes
sense.
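
For readers following along, the baserel rule described above can be sketched in a few lines of C. This is only an illustration, not code from the attached patch: ToyBaseRel, toy_set_consider_parallel, and the TOY_* labels are hypothetical stand-ins for RelOptInfo, the patch's set_rel_consider_parallel, and the proparallel markings, and the qual analysis is assumed to have been summarized into simple fields already.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the proparallel markings on functions. */
enum toy_proparallel { TOY_SAFE, TOY_RESTRICTED, TOY_UNSAFE };

/* Toy summary of a base relation; this is NOT the real RelOptInfo. */
typedef struct ToyBaseRel
{
	bool		is_temp;				/* temporary relation? */
	bool		quals_have_param_exec;	/* baserestrictinfo uses PARAM_EXEC? */
	bool		quals_have_sublink;		/* baserestrictinfo contains a sublink? */
	enum toy_proparallel worst_qual_func;	/* least safe function in quals */
	bool		consider_parallel;		/* the cached result */
} ToyBaseRel;

/*
 * Start from "false" and set the flag only if every check passes, which
 * mirrors the bail-out style used for the real flag.
 */
static void
toy_set_consider_parallel(ToyBaseRel *rel, bool parallel_mode_ok)
{
	assert(rel != NULL);

	rel->consider_parallel = false;

	if (!parallel_mode_ok)
		return;					/* whole query is not parallel-capable */
	if (rel->is_temp)
		return;					/* workers cannot see temporary tables */
	if (rel->quals_have_param_exec || rel->quals_have_sublink)
		return;					/* initplans and subplans are not shipped */
	if (rel->worst_qual_func != TOY_SAFE)
		return;					/* restricted or unsafe function in quals */

	rel->consider_parallel = true;
}
```

The point of the sketch is only that the decision is made once per relation and stored, so later path generation does not need to repeat any of these checks.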

The advantage of this is that the logic is centralized. If we have
parallel seq scan and also, say, parallel bitmap heap scan, your
approach would require that we duplicate the logic to check for
parallel-restricted functions for each path generation function. By
caching it in the RelOptInfo, we don't have to do that. The function
you wrote to generate parallel paths can just check the flag; if it's
false, return without generating any paths. If it's true, then
parallel paths can be considered.
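
A minimal sketch of what "just check the flag" could look like, assuming a toy relation struct rather than the real planner types. The worker-count heuristic (one worker per thousand pages, capped by max_parallel_degree) is taken from the earlier patch in this thread; ToyRel, the toy_* function names, and the bitmap generator are hypothetical and exist only to show two generators sharing one cached decision.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy relation; NOT the real RelOptInfo. */
typedef struct ToyRel
{
	unsigned	pages;				/* relation size in pages */
	bool		consider_parallel;	/* computed once, elsewhere */
	int			npaths;				/* number of paths added so far */
} ToyRel;

/* One worker per thousand pages, capped by the configured degree. */
static int
toy_worker_count(const ToyRel *rel, int max_parallel_degree)
{
	int			estimated = (int) (rel->pages / 1000);

	if (max_parallel_degree <= 0 || estimated <= 0)
		return 0;
	return estimated < max_parallel_degree ? estimated : max_parallel_degree;
}

/* Returns the number of workers planned, or 0 if no path was added. */
static int
toy_create_parallel_seqscan_paths(ToyRel *rel, int max_parallel_degree)
{
	int			nworkers;

	assert(rel != NULL);

	if (!rel->consider_parallel)
		return 0;				/* no per-generator safety analysis needed */

	nworkers = toy_worker_count(rel, max_parallel_degree);
	if (nworkers > 0)
		rel->npaths++;			/* stands in for add_path(Gather -> scan) */
	return nworkers;
}

/* A hypothetical second generator reuses the same cached flag. */
static int
toy_create_parallel_bitmap_paths(ToyRel *rel, int max_parallel_degree)
{
	int			nworkers;

	assert(rel != NULL);

	if (!rel->consider_parallel)
		return 0;

	nworkers = toy_worker_count(rel, max_parallel_degree);
	if (nworkers > 0)
		rel->npaths++;
	return nworkers;
}
```

Neither generator walks the restriction clauses; both rely on the flag having been set before path generation begins.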

Ultimately, I think that each RelOptInfo should have a new List *
member containing a list of partial paths for that relation. For a
baserel, we generate a partial path (e.g. Partial Seq Scan). Then, we
can consider turning each partial path into a complete path by pushing
a Gather path on top of it. For a joinrel, we can consider generating
a partial hash join or partial nest loop path by taking an outer
partial path and an ordinary inner path and putting the appropriate
path on top. In theory it would also be correct to generate merge
join paths this way, but it's difficult to believe that such a plan
would ever be anything but a disaster. These can then be used to
generate a complete path by putting a Gather node on top of them, or
they can bubble up to the next level of the join tree in the same way.
However, I think for the first version of this we can keep it simple:
if the consider_parallel flag is set on a relation, consider Gather ->
Partial Seq Scan. If not, forget it.
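
The partial-path idea can be illustrated with a small toy model. This is a sketch under stated assumptions, not a proposed data structure: ToyPath, ToyPathKind, and the toy_* helpers are hypothetical, and costing is ignored entirely. It shows only the two composition rules described above: a partial join takes a partial outer path and an ordinary inner path, and a Gather turns any partial path into a complete one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum
{
	TOY_SEQSCAN,
	TOY_PARTIAL_SEQSCAN,
	TOY_HASHJOIN,
	TOY_NESTLOOP,
	TOY_GATHER
} ToyPathKind;

/* Toy path node; NOT the real Path struct. */
typedef struct ToyPath
{
	ToyPathKind kind;
	bool		partial;		/* each worker sees only a subset of rows */
	const struct ToyPath *outer;
	const struct ToyPath *inner;
} ToyPath;

static ToyPath
toy_scan(bool partial)
{
	ToyPath		p = {partial ? TOY_PARTIAL_SEQSCAN : TOY_SEQSCAN,
					 partial, NULL, NULL};

	return p;
}

/* Partial join: outer must be partial, inner must be an ordinary path. */
static bool
toy_partial_join(ToyPathKind kind, const ToyPath *outer,
				 const ToyPath *inner, ToyPath *result)
{
	assert(outer != NULL && inner != NULL && result != NULL);

	if (kind != TOY_HASHJOIN && kind != TOY_NESTLOOP)
		return false;
	if (!outer->partial || inner->partial)
		return false;

	result->kind = kind;
	result->partial = true;		/* still partial; can bubble up further */
	result->outer = outer;
	result->inner = inner;
	return true;
}

/* Gather: turns a partial path into a complete one. */
static bool
toy_gather(const ToyPath *subpath, ToyPath *result)
{
	assert(subpath != NULL && result != NULL);

	if (!subpath->partial)
		return false;			/* nothing to gather */

	result->kind = TOY_GATHER;
	result->partial = false;
	result->outer = subpath;
	result->inner = NULL;
	return true;
}
```

In this model the simple first version corresponds to calling toy_gather directly on a partial seq scan, while the join case keeps the result partial so that the Gather can be placed at any higher level.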

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

consider-parallel-v1.patch (application/x-patch)
commit 79eb838ba87b19788617aac611712ed734e0102c
Author: Robert Haas <rhaas@postgresql.org>
Date:   Fri Oct 2 23:57:46 2015 -0400

    Strengthen planner infrastructure for parallelism.
    
    Add a new flag, consider_parallel, to each RelOptInfo, indicating
    whether a plan for that relation could conceivably be run inside of
    a parallel worker.  Right now, we're pretty conservative: for example,
    it might be possible to defer applying a parallel-restricted qual
    in a worker, and later do it in the leader, but right now we just
    don't try to parallelize access to that relation.  That's probably
    the right decision in most cases, anyway.

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 3e75cd1..0030a9b 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -1868,6 +1868,7 @@ _outRelOptInfo(StringInfo str, const RelOptInfo *node)
 	WRITE_INT_FIELD(width);
 	WRITE_BOOL_FIELD(consider_startup);
 	WRITE_BOOL_FIELD(consider_param_startup);
+	WRITE_BOOL_FIELD(consider_parallel);
 	WRITE_NODE_FIELD(reltargetlist);
 	WRITE_NODE_FIELD(pathlist);
 	WRITE_NODE_FIELD(ppilist);
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 8fc1cfd..f582b86 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -21,6 +21,7 @@
 #include "access/tsmapi.h"
 #include "catalog/pg_class.h"
 #include "catalog/pg_operator.h"
+#include "catalog/pg_proc.h"
 #include "foreign/fdwapi.h"
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
@@ -71,6 +72,9 @@ static void set_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 				 Index rti, RangeTblEntry *rte);
 static void set_plain_rel_size(PlannerInfo *root, RelOptInfo *rel,
 				   RangeTblEntry *rte);
+static void set_rel_consider_parallel(PlannerInfo *root, RelOptInfo *rel,
+						  RangeTblEntry *rte);
+static bool function_rte_parallel_ok(RangeTblEntry *rte);
 static void set_plain_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 					   RangeTblEntry *rte);
 static void set_tablesample_rel_size(PlannerInfo *root, RelOptInfo *rel,
@@ -158,7 +162,8 @@ make_one_rel(PlannerInfo *root, List *joinlist)
 	set_base_rel_consider_startup(root);
 
 	/*
-	 * Generate access paths for the base rels.
+	 * Generate access paths for the base rels.  set_base_rel_sizes also
+	 * sets the consider_parallel flag for each baserel, if appropriate.
 	 */
 	set_base_rel_sizes(root);
 	set_base_rel_pathlists(root);
@@ -222,9 +227,12 @@ set_base_rel_consider_startup(PlannerInfo *root)
 /*
  * set_base_rel_sizes
  *	  Set the size estimates (rows and widths) for each base-relation entry.
+ *    Also determine whether to consider parallel paths for base relations.
  *
  * We do this in a separate pass over the base rels so that rowcount
- * estimates are available for parameterized path generation.
+ * estimates are available for parameterized path generation, and also so
+ * that the consider_parallel flag is set correctly before we begin to
+ * generate paths.
  */
 static void
 set_base_rel_sizes(PlannerInfo *root)
@@ -234,6 +242,7 @@ set_base_rel_sizes(PlannerInfo *root)
 	for (rti = 1; rti < root->simple_rel_array_size; rti++)
 	{
 		RelOptInfo *rel = root->simple_rel_array[rti];
+		RangeTblEntry *rte;
 
 		/* there may be empty slots corresponding to non-baserel RTEs */
 		if (rel == NULL)
@@ -245,7 +254,15 @@ set_base_rel_sizes(PlannerInfo *root)
 		if (rel->reloptkind != RELOPT_BASEREL)
 			continue;
 
-		set_rel_size(root, rel, rti, root->simple_rte_array[rti]);
+		rte = root->simple_rte_array[rti];
+		set_rel_size(root, rel, rti, rte);
+
+		/*
+		 * If parallelism is allowable for this query in general, see whether
+		 * it's allowable for this rel in particular.
+		 */
+		if (root->glob->parallelModeOK)
+			set_rel_consider_parallel(root, rel, rte);
 	}
 }
 
@@ -459,6 +476,131 @@ set_plain_rel_size(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
 }
 
 /*
+ * If this relation could possibly be scanned from within a worker, then set
+ * the consider_parallel flag.  The flag has previously been initialized to
+ * false, so we just bail out if it becomes clear that we can't safely set it.
+ */
+static void
+set_rel_consider_parallel(PlannerInfo *root, RelOptInfo *rel,
+						  RangeTblEntry *rte)
+{
+	/* Don't call this if parallelism is disallowed for the entire query. */
+	Assert(root->glob->parallelModeOK);
+
+	/* Don't call this for non-baserels. */
+	Assert(rel->reloptkind == RELOPT_BASEREL);
+
+	/* Assorted checks based on rtekind. */
+	switch (rte->rtekind)
+	{
+		case RTE_RELATION:
+			/*
+			 * Currently, parallel workers can't access the leader's temporary
+			 * tables.  We could possibly relax this if the leader wrote all of its
+			 * local buffers at the start of the query and made no changes
+			 * thereafter (maybe we could allow hint bit changes), and if we
+			 * taught the workers to read them.  Writing a large number of
+			 * temporary buffers could be expensive, though, and we don't have
+			 * the rest of the necessary infrastructure right now anyway.  So
+			 * for now, bail out if we see a temporary table.
+			 */
+			if (get_rel_persistence(rte->relid) == RELPERSISTENCE_TEMP)
+				return;
+
+			/*
+			 * Table sampling can be pushed down to workers if the sample
+			 * function and its arguments are safe.
+			 */
+			if (rte->tablesample != NULL)
+			{
+				Oid	proparallel = func_parallel(rte->tablesample->tsmhandler);
+
+				if (proparallel != PROPARALLEL_SAFE)
+					return;
+				if (has_parallel_hazard((Node *) rte->tablesample->args,
+										false))
+					return;
+				return;
+			}
+			break;
+
+		case RTE_SUBQUERY:
+			/*
+			 * Subplans currently aren't passed to workers.  Even if they
+			 * were, the subplan might be using parallelism internally, and
+			 * we can't support nested Gather nodes at present.  Finally,
+			 * we don't have a good way of knowing whether the subplan
+			 * involves any parallel-restricted operations.  It would be
+			 * nice to relax this restriction some day, but it's going to
+			 * take a fair amount of work.
+			 */
+			return;
+
+		case RTE_JOIN:
+			/* Shouldn't happen; we're only considering baserels here. */
+			Assert(false);
+			return;
+
+		case RTE_FUNCTION:
+			/* Check for parallel-restricted functions. */
+			if (!function_rte_parallel_ok(rte))
+				return;
+			break;
+
+		case RTE_VALUES:
+			/*
+			 * The data for a VALUES clause is stored in the plan tree itself,
+			 * so scanning it in a worker is fine.
+			 */
+			break;
+
+		case RTE_CTE:
+			/*
+			 * CTE tuplestores aren't shared among parallel workers, so we
+			 * force all CTE scans to happen in the leader.  Also, populating
+			 * the CTE would require executing a subplan that's not available
+			 * in the worker, might be parallel-restricted, and must get
+			 * executed only once.
+			 */
+			return;
+	}
+
+	/*
+	 * If there's anything in baserestrictinfo that's parallel-restricted,
+	 * we give up on parallelizing access to this relation.  We could consider
+	 * instead postponing application of the restricted quals until we're
+	 * above all the parallelism in the plan tree, but it's not clear that
+	 * this would be a win in very many cases, and it might be tricky to make
+	 * outer join clauses work correctly.
+	 */
+	if (has_parallel_hazard((Node *) rel->baserestrictinfo, false))
+		return;
+
+	/* We have a winner. */
+	rel->consider_parallel = true;
+}
+
+/*
+ * Check whether a function RTE is scanning something parallel-restricted.
+ */
+static bool
+function_rte_parallel_ok(RangeTblEntry *rte)
+{
+	ListCell   *lc;
+
+	foreach(lc, rte->functions)
+	{
+		RangeTblFunction *rtfunc = (RangeTblFunction *) lfirst(lc);
+
+		Assert(IsA(rtfunc, RangeTblFunction));
+		if (has_parallel_hazard(rtfunc->funcexpr, false))
+			return false;
+	}
+
+	return true;
+}
+
+/*
  * set_plain_rel_pathlist
  *	  Build access paths for a plain relation (no subquery, no inheritance)
  */
diff --git a/src/backend/optimizer/plan/planmain.c b/src/backend/optimizer/plan/planmain.c
index 848df97..d73e7c0 100644
--- a/src/backend/optimizer/plan/planmain.c
+++ b/src/backend/optimizer/plan/planmain.c
@@ -20,6 +20,7 @@
  */
 #include "postgres.h"
 
+#include "optimizer/clauses.h"
 #include "optimizer/orclauses.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
@@ -70,6 +71,17 @@ query_planner(PlannerInfo *root, List *tlist,
 		/* We need a dummy joinrel to describe the empty set of baserels */
 		final_rel = build_empty_join_rel(root);
 
+		/*
+		 * If query allows parallelism in general, check whether the quals
+		 * are parallel-restricted.  There's currently no real benefit to
+		 * setting this flag correctly because we can't yet reference subplans
+		 * from parallel workers.  But that might change someday, so set this
+		 * correctly anyway.
+		 */
+		if (root->glob->parallelModeOK)
+			final_rel->consider_parallel =
+				!has_parallel_hazard(parse->jointree->quals, false);
+
 		/* The only path for it is a trivial Result path */
 		add_path(final_rel, (Path *)
 				 create_result_path((List *) parse->jointree->quals));
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 536b55e..c0db458 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -225,7 +225,7 @@ standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
 		parse->commandType == CMD_SELECT && !parse->hasModifyingCTE &&
 		parse->utilityStmt == NULL && !IsParallelWorker() &&
 		!IsolationIsSerializable() &&
-		!contain_parallel_unsafe((Node *) parse);
+		!has_parallel_hazard((Node *) parse, true);
 
 	/*
 	 * glob->parallelModeOK should tell us whether it's necessary to impose
diff --git a/src/backend/optimizer/util/clauses.c b/src/backend/optimizer/util/clauses.c
index f2c8551..915c8a4 100644
--- a/src/backend/optimizer/util/clauses.c
+++ b/src/backend/optimizer/util/clauses.c
@@ -21,6 +21,7 @@
 
 #include "access/htup_details.h"
 #include "catalog/pg_aggregate.h"
+#include "catalog/pg_class.h"
 #include "catalog/pg_language.h"
 #include "catalog/pg_operator.h"
 #include "catalog/pg_proc.h"
@@ -87,6 +88,11 @@ typedef struct
 	char	   *prosrc;
 } inline_error_callback_arg;
 
+typedef struct
+{
+	bool		allow_restricted;
+} has_parallel_hazard_arg;
+
 static bool contain_agg_clause_walker(Node *node, void *context);
 static bool count_agg_clauses_walker(Node *node,
 						 count_agg_clauses_context *context);
@@ -96,7 +102,11 @@ static bool contain_subplans_walker(Node *node, void *context);
 static bool contain_mutable_functions_walker(Node *node, void *context);
 static bool contain_volatile_functions_walker(Node *node, void *context);
 static bool contain_volatile_functions_not_nextval_walker(Node *node, void *context);
-static bool contain_parallel_unsafe_walker(Node *node, void *context);
+static bool has_parallel_hazard_walker(Node *node,
+				has_parallel_hazard_arg *context);
+static bool parallel_too_dangerous(char proparallel,
+				has_parallel_hazard_arg *context);
+static bool typeid_is_temp(Oid typeid);
 static bool contain_nonstrict_functions_walker(Node *node, void *context);
 static bool contain_leaked_vars_walker(Node *node, void *context);
 static Relids find_nonnullable_rels_walker(Node *node, bool top_level);
@@ -1200,63 +1210,159 @@ contain_volatile_functions_not_nextval_walker(Node *node, void *context)
 }
 
 /*****************************************************************************
- *		Check queries for parallel-unsafe constructs
+ *		Check queries for parallel unsafe and/or restricted constructs
  *****************************************************************************/
 
+/*
+ * Check whether a node tree contains parallel hazards.  This is used both
+ * on the entire query tree, to see whether the query can be parallelized at
+ * all, and also to evaluate whether a particular expression is safe to
+ * run in a parallel worker.  We could separate these concerns into two
+ * different functions, but there's enough overlap that it doesn't seem
+ * worthwhile.
+ */
 bool
-contain_parallel_unsafe(Node *node)
+has_parallel_hazard(Node *node, bool allow_restricted)
 {
-	return contain_parallel_unsafe_walker(node, NULL);
+	has_parallel_hazard_arg	context;
+
+	context.allow_restricted = allow_restricted;
+	return has_parallel_hazard_walker(node, &context);
 }
 
 static bool
-contain_parallel_unsafe_walker(Node *node, void *context)
+has_parallel_hazard_walker(Node *node, has_parallel_hazard_arg *context)
 {
 	if (node == NULL)
 		return false;
+
+	/*
+	 * When we're first invoked on a completely unplanned tree, we must
+	 * recurse through Query objects so as to locate parallel-unsafe
+	 * constructs anywhere in the tree.
+	 *
+	 * Later, we'll be called again for specific quals, possibly after
+	 * some planning has been done, and we may then encounter SubPlan,
+	 * SubLink, or AlternativeSubPlan nodes.  Currently, there's no need
+	 * to recurse through these; they can't be unsafe, since we've already
+	 * cleared the entire query of unsafe operations, and they're
+	 * definitely parallel-restricted.
+	 */
+	if (IsA(node, Query))
+	{
+		Query *query = (Query *) node;
+
+		if (query->rowMarks != NULL)
+			return true;
+
+		/* Recurse into subselects */
+		return query_tree_walker(query,
+								 has_parallel_hazard_walker,
+								 context, 0);
+	}
+	else if (IsA(node, SubPlan) || IsA(node, SubLink) ||
+			 IsA(node, AlternativeSubPlan) || IsA(node, Param))
+	{
+		/*
+		 * Since we don't have the ability to push subplans down to workers
+		 * at present, we treat subplan references as parallel-restricted.
+		 */
+		if (!context->allow_restricted)
+			return true;
+	}
+
+	/* This is just a notational convenience for callers. */
+	if (IsA(node, RestrictInfo))
+	{
+		RestrictInfo *rinfo = (RestrictInfo *) node;
+		return has_parallel_hazard_walker((Node *) rinfo->clause, context);
+	}
+
+	/*
+	 * It is an error for a parallel worker to touch a temporary table in any
+	 * way, so we can't handle nodes whose type is the rowtype of such a table.
+	 */
+	if (!context->allow_restricted)
+	{
+		switch (nodeTag(node))
+		{
+			case T_Var:
+			case T_Const:
+			case T_Param:
+			case T_Aggref:
+			case T_WindowFunc:
+			case T_ArrayRef:
+			case T_FuncExpr:
+			case T_NamedArgExpr:
+			case T_OpExpr:
+			case T_DistinctExpr:
+			case T_NullIfExpr:
+			case T_FieldSelect:
+			case T_FieldStore:
+			case T_RelabelType:
+			case T_CoerceViaIO:
+			case T_ArrayCoerceExpr:
+			case T_ConvertRowtypeExpr:
+			case T_CaseExpr:
+			case T_CaseTestExpr:
+			case T_ArrayExpr:
+			case T_RowExpr:
+			case T_CoalesceExpr:
+			case T_MinMaxExpr:
+			case T_CoerceToDomain:
+			case T_CoerceToDomainValue:
+			case T_SetToDefault:
+				if (typeid_is_temp(exprType(node)))
+					return true;
+				break;
+			default:
+				break;
+		}
+	}
+
+	/*
+	 * For each node that might potentially call a function, we need to
+	 * examine the pg_proc.proparallel marking for that function to see
+	 * whether it's safe enough for the current value of allow_restricted.
+	 */
 	if (IsA(node, FuncExpr))
 	{
 		FuncExpr   *expr = (FuncExpr *) node;
 
-		if (func_parallel(expr->funcid) == PROPARALLEL_UNSAFE)
+		if (parallel_too_dangerous(func_parallel(expr->funcid), context))
 			return true;
-		/* else fall through to check args */
 	}
 	else if (IsA(node, OpExpr))
 	{
 		OpExpr	   *expr = (OpExpr *) node;
 
 		set_opfuncid(expr);
-		if (func_parallel(expr->opfuncid) == PROPARALLEL_UNSAFE)
+		if (parallel_too_dangerous(func_parallel(expr->opfuncid), context))
 			return true;
-		/* else fall through to check args */
 	}
 	else if (IsA(node, DistinctExpr))
 	{
 		DistinctExpr *expr = (DistinctExpr *) node;
 
 		set_opfuncid((OpExpr *) expr);	/* rely on struct equivalence */
-		if (func_parallel(expr->opfuncid) == PROPARALLEL_UNSAFE)
+		if (parallel_too_dangerous(func_parallel(expr->opfuncid), context))
 			return true;
-		/* else fall through to check args */
 	}
 	else if (IsA(node, NullIfExpr))
 	{
 		NullIfExpr *expr = (NullIfExpr *) node;
 
 		set_opfuncid((OpExpr *) expr);	/* rely on struct equivalence */
-		if (func_parallel(expr->opfuncid) == PROPARALLEL_UNSAFE)
+		if (parallel_too_dangerous(func_parallel(expr->opfuncid), context))
 			return true;
-		/* else fall through to check args */
 	}
 	else if (IsA(node, ScalarArrayOpExpr))
 	{
 		ScalarArrayOpExpr *expr = (ScalarArrayOpExpr *) node;
 
 		set_sa_opfuncid(expr);
-		if (func_parallel(expr->opfuncid) == PROPARALLEL_UNSAFE)
+		if (parallel_too_dangerous(func_parallel(expr->opfuncid), context))
 			return true;
-		/* else fall through to check args */
 	}
 	else if (IsA(node, CoerceViaIO))
 	{
@@ -1268,54 +1374,61 @@ contain_parallel_unsafe_walker(Node *node, void *context)
 		/* check the result type's input function */
 		getTypeInputInfo(expr->resulttype,
 						 &iofunc, &typioparam);
-		if (func_parallel(iofunc) == PROPARALLEL_UNSAFE)
+		if (parallel_too_dangerous(func_parallel(iofunc), context))
 			return true;
 		/* check the input type's output function */
 		getTypeOutputInfo(exprType((Node *) expr->arg),
 						  &iofunc, &typisvarlena);
-		if (func_parallel(iofunc) == PROPARALLEL_UNSAFE)
+		if (parallel_too_dangerous(func_parallel(iofunc), context))
 			return true;
-		/* else fall through to check args */
 	}
 	else if (IsA(node, ArrayCoerceExpr))
 	{
 		ArrayCoerceExpr *expr = (ArrayCoerceExpr *) node;
 
 		if (OidIsValid(expr->elemfuncid) &&
-			func_parallel(expr->elemfuncid) == PROPARALLEL_UNSAFE)
+			parallel_too_dangerous(func_parallel(expr->elemfuncid), context))
 			return true;
-		/* else fall through to check args */
 	}
 	else if (IsA(node, RowCompareExpr))
 	{
-		/* RowCompare probably can't have volatile ops, but check anyway */
 		RowCompareExpr *rcexpr = (RowCompareExpr *) node;
 		ListCell   *opid;
 
 		foreach(opid, rcexpr->opnos)
 		{
-			if (op_volatile(lfirst_oid(opid)) == PROPARALLEL_UNSAFE)
+			Oid	opfuncid = get_opcode(lfirst_oid(opid));
+			if (parallel_too_dangerous(func_parallel(opfuncid), context))
 				return true;
 		}
-		/* else fall through to check args */
 	}
-	else if (IsA(node, Query))
-	{
-		Query *query = (Query *) node;
 
-		if (query->rowMarks != NULL)
-			return true;
-
-		/* Recurse into subselects */
-		return query_tree_walker(query,
-								 contain_parallel_unsafe_walker,
-								 context, 0);
-	}
+	/* ... and recurse to check substructure */
 	return expression_tree_walker(node,
-								  contain_parallel_unsafe_walker,
+								  has_parallel_hazard_walker,
 								  context);
 }
 
+static bool
+parallel_too_dangerous(char proparallel, has_parallel_hazard_arg *context)
+{
+	if (context->allow_restricted)
+		return proparallel == PROPARALLEL_UNSAFE;
+	else
+		return proparallel != PROPARALLEL_SAFE;
+}
+
+static bool
+typeid_is_temp(Oid typeid)
+{
+	Oid				relid = get_typ_typrelid(typeid);
+
+	if (!OidIsValid(relid))
+		return false;
+
+	return (get_rel_persistence(relid) == RELPERSISTENCE_TEMP);
+}
+
 /*****************************************************************************
  *		Check clauses for nonstrict functions
  *****************************************************************************/
diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c
index 68a93a1..996b7fe 100644
--- a/src/backend/optimizer/util/relnode.c
+++ b/src/backend/optimizer/util/relnode.c
@@ -14,6 +14,7 @@
  */
 #include "postgres.h"
 
+#include "optimizer/clauses.h"
 #include "optimizer/cost.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
@@ -102,6 +103,7 @@ build_simple_rel(PlannerInfo *root, int relid, RelOptKind reloptkind)
 	/* cheap startup cost is interesting iff not all tuples to be retrieved */
 	rel->consider_startup = (root->tuple_fraction > 0);
 	rel->consider_param_startup = false;		/* might get changed later */
+	rel->consider_parallel = false;				/* might get changed later */
 	rel->reltargetlist = NIL;
 	rel->pathlist = NIL;
 	rel->ppilist = NIL;
@@ -363,6 +365,7 @@ build_join_rel(PlannerInfo *root,
 	/* cheap startup cost is interesting iff not all tuples to be retrieved */
 	joinrel->consider_startup = (root->tuple_fraction > 0);
 	joinrel->consider_param_startup = false;
+	joinrel->consider_parallel = false;
 	joinrel->reltargetlist = NIL;
 	joinrel->pathlist = NIL;
 	joinrel->ppilist = NIL;
@@ -442,6 +445,24 @@ build_join_rel(PlannerInfo *root,
 							   sjinfo, restrictlist);
 
 	/*
+	 * Set the consider_parallel flag if this joinrel could potentially be
+	 * scanned within a parallel worker.  If this flag is false for either
+	 * inner_rel or outer_rel, then it must be false for the joinrel also.
+	 * Even if both are true, there might be parallel-restricted quals at
+	 * our level.
+	 *
+	 * Note that if there are more than two rels in this relation, they
+	 * could be divided between inner_rel and outer_rel in any arbitrary
+	 * way.  We assume this doesn't matter, because we should hit all the
+	 * same baserels and joinclauses while building up to this joinrel no
+	 * matter which we take; therefore, we should make the same decision
+	 * here however we get here.
+	 */
+	if (inner_rel->consider_parallel && outer_rel->consider_parallel &&
+		!has_parallel_hazard((Node *) restrictlist, false))
+		joinrel->consider_parallel = true;
+
+	/*
 	 * Add the joinrel to the query's joinrel list, and store it into the
 	 * auxiliary hashtable if there is one.  NB: GEQO requires us to append
 	 * the new joinrel to the end of the list!
diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c
index 8d1cdf1..093da76 100644
--- a/src/backend/utils/cache/lsyscache.c
+++ b/src/backend/utils/cache/lsyscache.c
@@ -1787,6 +1787,28 @@ get_rel_tablespace(Oid relid)
 		return InvalidOid;
 }
 
+/*
+ * get_rel_persistence
+ *
+ *		Returns the relpersistence associated with a given relation.
+ */
+char
+get_rel_persistence(Oid relid)
+{
+	HeapTuple		tp;
+	Form_pg_class	reltup;
+	char 			result;
+
+	tp = SearchSysCache1(RELOID, ObjectIdGetDatum(relid));
+	if (!HeapTupleIsValid(tp))
+		elog(ERROR, "cache lookup failed for relation %u", relid);
+	reltup = (Form_pg_class) GETSTRUCT(tp);
+	result = reltup->relpersistence;
+	ReleaseSysCache(tp);
+
+	return result;
+}
+
 
 /*				---------- TRANSFORM CACHE ----------						 */
 
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 6cf2e24..41be9b1 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -452,6 +452,7 @@ typedef struct RelOptInfo
 	/* per-relation planner control flags */
 	bool		consider_startup;		/* keep cheap-startup-cost paths? */
 	bool		consider_param_startup; /* ditto, for parameterized paths? */
+	bool		consider_parallel;		/* consider parallel paths? */
 
 	/* materialization information */
 	List	   *reltargetlist;	/* Vars to be output by scan of relation */
diff --git a/src/include/optimizer/clauses.h b/src/include/optimizer/clauses.h
index 5ac79b1..323f093 100644
--- a/src/include/optimizer/clauses.h
+++ b/src/include/optimizer/clauses.h
@@ -62,7 +62,7 @@ extern bool contain_subplans(Node *clause);
 extern bool contain_mutable_functions(Node *clause);
 extern bool contain_volatile_functions(Node *clause);
 extern bool contain_volatile_functions_not_nextval(Node *clause);
-extern bool contain_parallel_unsafe(Node *node);
+extern bool has_parallel_hazard(Node *node, bool allow_restricted);
 extern bool contain_nonstrict_functions(Node *clause);
 extern bool contain_leaked_vars(Node *clause);
 
diff --git a/src/include/utils/lsyscache.h b/src/include/utils/lsyscache.h
index 450d9fe..dcc421f 100644
--- a/src/include/utils/lsyscache.h
+++ b/src/include/utils/lsyscache.h
@@ -103,6 +103,7 @@ extern Oid	get_rel_namespace(Oid relid);
 extern Oid	get_rel_type_id(Oid relid);
 extern char get_rel_relkind(Oid relid);
 extern Oid	get_rel_tablespace(Oid relid);
+extern char get_rel_persistence(Oid relid);
 extern Oid	get_transform_fromsql(Oid typid, Oid langid, List *trftypes);
 extern Oid	get_transform_tosql(Oid typid, Oid langid, List *trftypes);
 extern bool get_typisdefined(Oid typid);
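
The two-level check that has_parallel_hazard() performs in the patch above can be summarized in a small standalone sketch. The first function mirrors the decision made by parallel_too_dangerous(), and the second mirrors the joinrel rule added to build_join_rel(). The enum ParallelMarking and the functions marking_is_hazard() and joinrel_consider_parallel() are hypothetical stand-ins introduced only for illustration; they are not PostgreSQL identifiers, and real code would read pg_proc.proparallel instead.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the pg_proc.proparallel markings. */
typedef enum
{
	MARK_SAFE,			/* may run in the leader or in a worker */
	MARK_RESTRICTED,	/* may run only in the leader */
	MARK_UNSAFE			/* rules out parallel mode for the whole query */
} ParallelMarking;

/*
 * Same decision as parallel_too_dangerous() in the patch: when restricted
 * operations are allowed (whole-query check), only unsafe markings are a
 * hazard; otherwise (per-expression check for worker execution) anything
 * that is not fully safe is a hazard.
 */
bool
marking_is_hazard(ParallelMarking marking, bool allow_restricted)
{
	assert(marking == MARK_SAFE ||
		   marking == MARK_RESTRICTED ||
		   marking == MARK_UNSAFE);

	if (allow_restricted)
		return marking == MARK_UNSAFE;
	return marking != MARK_SAFE;
}

/*
 * Sketch of the joinrel rule: a join can be considered for parallelism
 * only if both inputs can, and the join quals (summarized here by their
 * worst marking) contain nothing a worker is forbidden to run.
 */
bool
joinrel_consider_parallel(bool outer_ok, bool inner_ok,
						  ParallelMarking worst_qual_marking)
{
	return outer_ok && inner_ok &&
		!marking_is_hazard(worst_qual_marking, false);
}
```

Under these assumptions, a parallel-restricted function passes the whole-query check (allow_restricted = true) but blocks the per-relation and per-join checks (allow_restricted = false), which matches how the patch calls has_parallel_hazard() from standard_planner() versus set_rel_consider_parallel() and build_join_rel().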
#421Robert Haas
robertmhaas@gmail.com
In reply to: Noah Misch (#399)
1 attachment(s)
Re: Parallel Seq Scan

On Thu, Oct 15, 2015 at 8:23 PM, Noah Misch <noah@leadboat.com> wrote:

Agreed. More specifically, I had in mind for copyParamList() to check the
mask while e.g. ExecEvalParamExtern() would either check nothing or merely
assert that any mask included the requested parameter. It would be tricky to
verify that as safe, so ...

Would it work to define this as "if non-NULL,
params lacking a 1-bit may be safely ignored"? Or some other tweak
that basically says that you don't need to care about this, but you
can if you want to.

... this is a better specification.

Here's an attempt to implement that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

serialize-paramlistinfo-fixes.patch (application/x-patch)
commit 1ffc4a3a1bac686b46d47dfa40bd0eb3cb8b0be4
Author: Robert Haas <rhaas@postgresql.org>
Date:   Thu Oct 22 23:56:51 2015 -0400

    Fix problems with ParamListInfo serialization mechanism.
    
    Commit d1b7c1ffe72e86932b5395f29e006c3f503bc53d introduced a mechanism
    for serializing a ParamListInfo structure to be passed to a parallel
    worker.  However, this mechanism failed to handle external expanded
    values, as pointed out by Noah Misch.  Moreover, plpgsql_param_fetch
    requires adjustment because the serialization mechanism needs it to skip
    evaluating unused parameters just as we would do when it is called from
    copyParamList, but params == estate->paramLI in that case.  To fix,
    have setup_param_list set a new ParamListInfo field, paramMask, to
    the parameters actually used in the expression, so that we don't try
    to fetch those that are not needed.

diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index fb33d30..0d4aa69 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -392,6 +392,7 @@ EvaluateParams(PreparedStatement *pstmt, List *params,
 	paramLI->parserSetup = NULL;
 	paramLI->parserSetupArg = NULL;
 	paramLI->numParams = num_params;
+	paramLI->paramMask = NULL;
 
 	i = 0;
 	foreach(l, exprstates)
diff --git a/src/backend/executor/functions.c b/src/backend/executor/functions.c
index 812a610..0919c04 100644
--- a/src/backend/executor/functions.c
+++ b/src/backend/executor/functions.c
@@ -910,6 +910,7 @@ postquel_sub_params(SQLFunctionCachePtr fcache,
 			paramLI->parserSetup = NULL;
 			paramLI->parserSetupArg = NULL;
 			paramLI->numParams = nargs;
+			paramLI->paramMask = NULL;
 			fcache->paramLI = paramLI;
 		}
 		else
diff --git a/src/backend/executor/spi.c b/src/backend/executor/spi.c
index 300401e..13ddb8f 100644
--- a/src/backend/executor/spi.c
+++ b/src/backend/executor/spi.c
@@ -2330,6 +2330,7 @@ _SPI_convert_params(int nargs, Oid *argtypes,
 		paramLI->parserSetup = NULL;
 		paramLI->parserSetupArg = NULL;
 		paramLI->numParams = nargs;
+		paramLI->paramMask = NULL;
 
 		for (i = 0; i < nargs; i++)
 		{
diff --git a/src/backend/nodes/params.c b/src/backend/nodes/params.c
index d093263..9f5fd3a 100644
--- a/src/backend/nodes/params.c
+++ b/src/backend/nodes/params.c
@@ -15,6 +15,7 @@
 
 #include "postgres.h"
 
+#include "nodes/bitmapset.h"
 #include "nodes/params.h"
 #include "storage/shmem.h"
 #include "utils/datum.h"
@@ -50,6 +51,7 @@ copyParamList(ParamListInfo from)
 	retval->parserSetup = NULL;
 	retval->parserSetupArg = NULL;
 	retval->numParams = from->numParams;
+	retval->paramMask = bms_copy(from->paramMask);
 
 	for (i = 0; i < from->numParams; i++)
 	{
@@ -58,6 +60,20 @@ copyParamList(ParamListInfo from)
 		int16		typLen;
 		bool		typByVal;
 
+		/*
+		 * Ignore parameters we don't need, to save cycles and space, and
+		 * in case the fetch hook might fail.
+		 */
+		if (retval->paramMask != NULL &&
+			!bms_is_member(i, retval->paramMask))
+		{
+			nprm->value = (Datum) 0;
+			nprm->isnull = true;
+			nprm->pflags = 0;
+			nprm->ptype = InvalidOid;
+			continue;
+		}
+
 		/* give hook a chance in case parameter is dynamic */
 		if (!OidIsValid(oprm->ptype) && from->paramFetch != NULL)
 			(*from->paramFetch) (from, i + 1);
@@ -90,19 +106,31 @@ EstimateParamListSpace(ParamListInfo paramLI)
 	for (i = 0; i < paramLI->numParams; i++)
 	{
 		ParamExternData *prm = &paramLI->params[i];
+		Oid			typeOid;
 		int16		typLen;
 		bool		typByVal;
 
-		/* give hook a chance in case parameter is dynamic */
-		if (!OidIsValid(prm->ptype) && paramLI->paramFetch != NULL)
-			(*paramLI->paramFetch) (paramLI, i + 1);
+		/*
+		 * Ignore parameters we don't need, to save cycles and space, and
+		 * in case the fetch hook might fail.
+		 */
+		if (paramLI->paramMask != NULL &&
+			!bms_is_member(i, paramLI->paramMask))
+			typeOid = InvalidOid;
+		else
+		{
+			/* give hook a chance in case parameter is dynamic */
+			if (!OidIsValid(prm->ptype) && paramLI->paramFetch != NULL)
+				(*paramLI->paramFetch) (paramLI, i + 1);
+			typeOid = prm->ptype;
+		}
 
 		sz = add_size(sz, sizeof(Oid));			/* space for type OID */
 		sz = add_size(sz, sizeof(uint16));		/* space for pflags */
 
 		/* space for datum/isnull */
-		if (OidIsValid(prm->ptype))
-			get_typlenbyval(prm->ptype, &typLen, &typByVal);
+		if (OidIsValid(typeOid))
+			get_typlenbyval(typeOid, &typLen, &typByVal);
 		else
 		{
 			/* If no type OID, assume by-value, like copyParamList does. */
@@ -150,15 +178,27 @@ SerializeParamList(ParamListInfo paramLI, char **start_address)
 	for (i = 0; i < nparams; i++)
 	{
 		ParamExternData *prm = &paramLI->params[i];
+		Oid			typeOid;
 		int16		typLen;
 		bool		typByVal;
 
-		/* give hook a chance in case parameter is dynamic */
-		if (!OidIsValid(prm->ptype) && paramLI->paramFetch != NULL)
-			(*paramLI->paramFetch) (paramLI, i + 1);
+		/*
+		 * Ignore parameters we don't need, to save cycles and space, and
+		 * in case the fetch hook might fail.
+		 */
+		if (paramLI->paramMask != NULL &&
+			!bms_is_member(i, paramLI->paramMask))
+			typeOid = InvalidOid;
+		else
+		{
+			/* give hook a chance in case parameter is dynamic */
+			if (!OidIsValid(prm->ptype) && paramLI->paramFetch != NULL)
+				(*paramLI->paramFetch) (paramLI, i + 1);
+			typeOid = prm->ptype;
+		}
 
 		/* Write type OID. */
-		memcpy(*start_address, &prm->ptype, sizeof(Oid));
+		memcpy(*start_address, &typeOid, sizeof(Oid));
 		*start_address += sizeof(Oid);
 
 		/* Write flags. */
@@ -166,8 +206,8 @@ SerializeParamList(ParamListInfo paramLI, char **start_address)
 		*start_address += sizeof(uint16);
 
 		/* Write datum/isnull. */
-		if (OidIsValid(prm->ptype))
-			get_typlenbyval(prm->ptype, &typLen, &typByVal);
+		if (OidIsValid(typeOid))
+			get_typlenbyval(typeOid, &typLen, &typByVal);
 		else
 		{
 			/* If no type OID, assume by-value, like copyParamList does. */
@@ -209,6 +249,7 @@ RestoreParamList(char **start_address)
 	paramLI->parserSetup = NULL;
 	paramLI->parserSetupArg = NULL;
 	paramLI->numParams = nparams;
+	paramLI->paramMask = NULL;
 
 	for (i = 0; i < nparams; i++)
 	{
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index d30fe35..f11a715 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -1629,6 +1629,7 @@ exec_bind_message(StringInfo input_message)
 		params->parserSetup = NULL;
 		params->parserSetupArg = NULL;
 		params->numParams = numParams;
+		params->paramMask = NULL;
 
 		for (paramno = 0; paramno < numParams; paramno++)
 		{
diff --git a/src/backend/utils/adt/datum.c b/src/backend/utils/adt/datum.c
index 3d9e354..0d61950 100644
--- a/src/backend/utils/adt/datum.c
+++ b/src/backend/utils/adt/datum.c
@@ -264,6 +264,11 @@ datumEstimateSpace(Datum value, bool isnull, bool typByVal, int typLen)
 		/* no need to use add_size, can't overflow */
 		if (typByVal)
 			sz += sizeof(Datum);
+		else if (VARATT_IS_EXTERNAL_EXPANDED(value))
+		{
+			ExpandedObjectHeader *eoh = DatumGetEOHP(value);
+			sz += EOH_get_flat_size(eoh);
+		}
 		else
 			sz += datumGetSize(value, typByVal, typLen);
 	}
@@ -292,6 +297,7 @@ void
 datumSerialize(Datum value, bool isnull, bool typByVal, int typLen,
 			   char **start_address)
 {
+	ExpandedObjectHeader *eoh = NULL;
 	int		header;
 
 	/* Write header word. */
@@ -299,6 +305,11 @@ datumSerialize(Datum value, bool isnull, bool typByVal, int typLen,
 		header = -2;
 	else if (typByVal)
 		header = -1;
+	else if (VARATT_IS_EXTERNAL_EXPANDED(value))
+	{
+		eoh = DatumGetEOHP(value);
+		header = EOH_get_flat_size(eoh);
+	}
 	else
 		header = datumGetSize(value, typByVal, typLen);
 	memcpy(*start_address, &header, sizeof(int));
@@ -312,6 +323,11 @@ datumSerialize(Datum value, bool isnull, bool typByVal, int typLen,
 			memcpy(*start_address, &value, sizeof(Datum));
 			*start_address += sizeof(Datum);
 		}
+		else if (eoh)
+		{
+			EOH_flatten_into(eoh, (void *) *start_address, header);
+			*start_address += header;
+		}
 		else
 		{
 			memcpy(*start_address, DatumGetPointer(value), header);
diff --git a/src/include/nodes/params.h b/src/include/nodes/params.h
index 83bebde..2beae5f 100644
--- a/src/include/nodes/params.h
+++ b/src/include/nodes/params.h
@@ -14,7 +14,8 @@
 #ifndef PARAMS_H
 #define PARAMS_H
 
-/* To avoid including a pile of parser headers, reference ParseState thus: */
+/* Forward declarations, to avoid including other headers */
+struct Bitmapset;
 struct ParseState;
 
 
@@ -71,6 +72,7 @@ typedef struct ParamListInfoData
 	ParserSetupHook parserSetup;	/* parser setup hook */
 	void	   *parserSetupArg;
 	int			numParams;		/* number of ParamExternDatas following */
+	struct Bitmapset *paramMask; /* if non-NULL, can ignore omitted params */
 	ParamExternData params[FLEXIBLE_ARRAY_MEMBER];
 }	ParamListInfoData;
 
diff --git a/src/pl/plpgsql/src/pl_exec.c b/src/pl/plpgsql/src/pl_exec.c
index c73f20b..dc8e746 100644
--- a/src/pl/plpgsql/src/pl_exec.c
+++ b/src/pl/plpgsql/src/pl_exec.c
@@ -3287,6 +3287,7 @@ plpgsql_estate_setup(PLpgSQL_execstate *estate,
 	estate->paramLI->parserSetup = (ParserSetupHook) plpgsql_parser_setup;
 	estate->paramLI->parserSetupArg = NULL;		/* filled during use */
 	estate->paramLI->numParams = estate->ndatums;
+	estate->paramLI->paramMask = NULL;
 	estate->params_dirty = false;
 
 	/* set up for use of appropriate simple-expression EState and cast hash */
@@ -5559,6 +5560,12 @@ setup_param_list(PLpgSQL_execstate *estate, PLpgSQL_expr *expr)
 		paramLI->parserSetupArg = (void *) expr;
 
 		/*
+		 * Allow parameters that aren't needed by this expression to be
+		 * ignored.
+		 */
+		paramLI->paramMask = expr->paramnos;
+
+		/*
 		 * Also make sure this is set before parser hooks need it.  There is
 		 * no need to save and restore, since the value is always correct once
 		 * set.  (Should be set already, but let's be sure.)
@@ -5623,6 +5630,7 @@ setup_unshared_param_list(PLpgSQL_execstate *estate, PLpgSQL_expr *expr)
 		paramLI->parserSetup = (ParserSetupHook) plpgsql_parser_setup;
 		paramLI->parserSetupArg = (void *) expr;
 		paramLI->numParams = estate->ndatums;
+		paramLI->paramMask = NULL;
 
 		/*
 		 * Instantiate values for "safe" parameters of the expression.  We
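
The paramMask rule from this patch ("if non-NULL, params lacking a 1-bit may be safely ignored") can be illustrated with a simplified standalone sketch. It assumes a plain 32-bit mask in place of a Bitmapset and a cut-down parameter struct; DemoParam, demo_param_needed() and demo_copy_params() are hypothetical names used only here, not part of PostgreSQL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical simplified parameter slot; not the real ParamExternData. */
typedef struct
{
	int			value;
	bool		isnull;
} DemoParam;

/*
 * Returns true if parameter i must be copied.  A NULL mask carries no
 * information, so every parameter is treated as needed.
 */
bool
demo_param_needed(const uint32_t *mask, int i)
{
	assert(i >= 0 && i < 32);

	if (mask == NULL)
		return true;
	return (*mask & (UINT32_C(1) << i)) != 0;
}

/*
 * Copy nparams slots from "from" to "to".  Slots that the mask says can be
 * ignored are written as NULL placeholders, in the same spirit as the
 * copyParamList() change in the patch.  Returns the number of parameters
 * that were actually copied.
 */
int
demo_copy_params(const DemoParam *from, DemoParam *to, int nparams,
				 const uint32_t *mask)
{
	int			copied = 0;

	for (int i = 0; i < nparams; i++)
	{
		if (!demo_param_needed(mask, i))
		{
			to[i].value = 0;
			to[i].isnull = true;
			continue;
		}
		to[i] = from[i];
		copied++;
	}
	return copied;
}
```

The point of the NULL-means-everything convention in this sketch is that callers which never set a mask keep their old behavior, while callers such as setup_param_list() can opt in to skipping parameters an expression does not reference.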
#422Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#421)
Re: Parallel Seq Scan

Robert Haas <robertmhaas@gmail.com> writes:

On Thu, Oct 15, 2015 at 8:23 PM, Noah Misch <noah@leadboat.com> wrote:

Would it work to define this as "if non-NULL,
params lacking a 1-bit may be safely ignored"? Or some other tweak
that basically says that you don't need to care about this, but you
can if you want to.

... this is a better specification.

Here's an attempt to implement that.

BTW, my Salesforce colleagues have been bit^H^H^Hgriping for quite some
time about the performance costs associated with translating between
plpgsql's internal PLpgSQL_datum-array format and the ParamListInfo
representation. Maybe it's time to think about some wholesale redesign of
ParamListInfo? Because TBH this patch doesn't seem like much but a kluge.
It's mostly layering still-another bunch of ad-hoc restrictions on
copyParamList, without removing any one of the kluges we had already.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#423Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#419)
Re: Parallel Seq Scan

On Tue, Oct 20, 2015 at 3:04 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I have rebased the partial seq scan patch on top of the above committed
patches. For a rescan it now reuses the dsm; to achieve that, we need to
ensure that the workers have completely shut down and then reinitialize
the dsm. The current function WaitForParallelWorkersToFinish is not
sufficient to ensure complete shutdown, as it only waits for the last
message to be received from the worker backend, so I have written a new
function WaitForParallelWorkersToDie. Also, on receiving an 'X' message,
HandleParallelMessage just frees the worker handle without checking
whether the worker has died, which makes it difficult to find out later
whether the worker has died or not, so I have removed that code from
HandleParallelMessage. Another change is that after receiving the last
tuple in the Gather node, it just shuts down the workers without
destroying the dsm.

+               /*
+                * We can't finish transaction commit or abort until all of the
+                * workers are dead.  This means, in particular, that we can't respond
+                * to interrupts at this stage.
+                */
+               HOLD_INTERRUPTS();
+               status = WaitForBackgroundWorkerShutdown(pcxt->worker[i].bgwhandle);
+               RESUME_INTERRUPTS();

These comments are correct when this code is called from
DestroyParallelContext(), but they're flat wrong when called from
ReinitializeParallelDSM(). I suggest moving the comment back to
DestroyParallelContext and following it with this:

HOLD_INTERRUPTS();
WaitForParallelWorkersToDie();
RESUME_INTERRUPTS();

Then ditch the HOLD/RESUME interrupts in WaitForParallelWorkersToDie() itself.

This hunk is a problem:

         case 'X':               /* Terminate, indicating clean exit */
             {
-                pfree(pcxt->worker[i].bgwhandle);
                 pfree(pcxt->worker[i].error_mqh);
-                pcxt->worker[i].bgwhandle = NULL;
                 pcxt->worker[i].error_mqh = NULL;
                 break;
             }

If you do that on receipt of the 'X' message, then
DestroyParallelContext() might SIGTERM a worker that has supposedly
exited cleanly. That seems bad. I think maybe the solution is to
make DestroyParallelContext() terminate the worker only if
pcxt->worker[i].error_mqh != NULL. So make error_mqh == NULL mean a
clean loss of a worker: either we couldn't register it, or it exited
cleanly. And bgwhandle == NULL would mean it's actually gone.
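
To make the proposed convention concrete, here is a small sketch under stated assumptions: DemoWorker, DemoWorkerState and the two helper functions are hypothetical names, and plain void pointers stand in for the real handle types. It only encodes the meaning suggested above (error_mqh == NULL is a clean loss, bgwhandle == NULL is actually gone); it is not the code in parallel.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-worker bookkeeping with the two pointers discussed above. */
typedef struct DemoWorker
{
	void	   *bgwhandle;		/* NULL: the worker process is known to be gone */
	void	   *error_mqh;		/* NULL: never registered, or exited cleanly */
} DemoWorker;

typedef enum DemoWorkerState
{
	DEMO_WORKER_GONE,			/* nothing left to wait for */
	DEMO_WORKER_CLEAN_EXIT,		/* clean exit reported, death not yet confirmed */
	DEMO_WORKER_RUNNING			/* still attached to its error queue */
} DemoWorkerState;

/* Classify a worker from the two pointers alone. */
static DemoWorkerState
demo_worker_state(const DemoWorker *w)
{
	assert(w != NULL);

	if (w->bgwhandle == NULL)
		return DEMO_WORKER_GONE;
	if (w->error_mqh == NULL)
		return DEMO_WORKER_CLEAN_EXIT;
	return DEMO_WORKER_RUNNING;
}

/* Only a worker that has not reported a clean exit should be signalled. */
static bool
demo_should_terminate(const DemoWorker *w)
{
	assert(w != NULL);
	return w->error_mqh != NULL;
}
```

With this split, the destroy path can skip the SIGTERM for any worker that has already sent 'X', while still waiting on bgwhandle until the process is really gone.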

It makes sense to have ExecShutdownGather and
ExecShutdownGatherWorkers, but couldn't the former call the latter
instead of duplicating the code?

I think ReInitialize should be capitalized as Reinitialize throughout.

ExecParallelReInitializeTupleQueues is almost a cut-and-paste
duplicate of ExecParallelSetupTupleQueues. Please refactor this to
avoid duplication - e.g. change
ExecParallelSetupTupleQueues(ParallelContext *pcxt) to take a second
argument bool reinit. ExecParallelReInitializeTupleQueues can just do
ExecParallelSetupTupleQueues(pcxt, true).
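
The shape of that refactoring could look roughly like the following sketch. Everything here is a simplified stand-in: DemoToc is a hypothetical one-slot replacement for the shm_toc, malloc replaces the DSM allocation, and the demo_ function names are invented. The only point is that the first call allocates and registers the space, while a reinit call looks it up and resets it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_QUEUE_SIZE 64

/* Hypothetical one-slot "table of contents" standing in for a shm_toc. */
typedef struct DemoToc
{
	char	   *queue_space;	/* NULL until the first setup registers it */
} DemoToc;

/*
 * Set up nworkers queues.  On first use, allocate the space and register it
 * in the toc; on reinit, look up the space registered earlier.  Either way,
 * every queue is reset.  Returns the queue space, or NULL if out of memory.
 */
static char *
demo_setup_queues(DemoToc *toc, int nworkers, bool reinit)
{
	char	   *space;

	assert(toc != NULL && nworkers > 0);

	if (!reinit)
	{
		space = malloc((size_t) nworkers * DEMO_QUEUE_SIZE);
		if (space == NULL)
			return NULL;
		toc->queue_space = space;
	}
	else
	{
		space = toc->queue_space;
		assert(space != NULL);
	}

	for (int i = 0; i < nworkers; i++)
		memset(space + (size_t) i * DEMO_QUEUE_SIZE, 0, DEMO_QUEUE_SIZE);

	return space;
}

/* Thin wrapper, so callers doing a rescan need not know about the flag. */
static char *
demo_reinitialize_queues(DemoToc *toc, int nworkers)
{
	return demo_setup_queues(toc, nworkers, true);
}
```

Keeping one function with a flag means the per-queue reset loop exists in exactly one place, which is the duplication being objected to.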

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#424Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#422)
Re: Parallel Seq Scan

On Fri, Oct 23, 2015 at 12:31 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

BTW, my Salesforce colleagues have been bit^H^H^Hgriping for quite some
time about the performance costs associated with translating between
plpgsql's internal PLpgSQL_datum-array format and the ParamListInfo
representation. Maybe it's time to think about some wholesale redesign of
ParamListInfo? Because TBH this patch doesn't seem like much but a kluge.
It's mostly layering still-another bunch of ad-hoc restrictions on
copyParamList, without removing any one of the kluges we had already.

I have no objection to some kind of a redesign there, but (1) I don't
think we're going to be better off doing that before getting Partial
Seq Scan committed and (2) I don't think I'm the best-qualified person
to do the work. With respect to the first point, despite my best
efforts, this feature is going to have bugs, and getting it committed
in November without a ParamListInfo redesign is surely going to be
better for the overall stability of PostgreSQL and the timeliness of
our release schedule than getting it committed in February with such a
redesign -- never mind that this is far from the only redesign into
which I could get sucked. I want to put in place some narrow fix for
this issue so that I can move forward. Three alternatives have been
proposed so far: (1) this, (2) the fix I coded and posted previously,
which made plpgsql_param_fetch's bms_is_member test unconditional, and
(3) not allowing PL/pgsql to run parallel queries. (3) sounds worse
to me than either (1) or (2); I defer to others on which of (1) and
(2) is preferable, or perhaps you have another proposal.

On the second point, I really don't know enough about the problems
with ParamListInfo to know what would be better, so I can't really
help there. If you do and want to redesign it, fine, but I really
need whatever you replace it with to have an easy way of serializing and
restoring it - be it nodeToString() and stringToNode(),
SerializeParamList and RestoreParamList, or whatever. Without that,
parallel query is going to have to be disabled for any query involving
parameters, and that would be, uh, extremely sad. Also, FWIW, in my
opinion, it would be far more useful to PostgreSQL for you to finish
the work on upper planner path-ification ... an awful lot of people
are waiting for that to be completed to start their own work, or are
doing work that may have to be completely redone when that lands.
YMMV, of course.
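
For readers unfamiliar with the serialize/restore requirement mentioned above, here is a toy sketch of the general pattern: estimate the space, flatten into a buffer that could live in shared memory, and rebuild on the other side. It is an assumption-laden illustration, not SerializeParamList or RestoreParamList; the demo_ functions are hypothetical and parameters are reduced to plain 64-bit integers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bytes needed for a 32-bit count followed by that many 64-bit values. */
static size_t
demo_estimate_space(int32_t nparams)
{
	assert(nparams >= 0);
	return sizeof(int32_t) + (size_t) nparams * sizeof(int64_t);
}

/* Flatten the values into buf, which must hold demo_estimate_space() bytes. */
static void
demo_serialize(const int64_t *values, int32_t nparams, char *buf)
{
	assert(buf != NULL && nparams >= 0);

	memcpy(buf, &nparams, sizeof(nparams));
	if (nparams > 0)
		memcpy(buf + sizeof(nparams), values,
			   (size_t) nparams * sizeof(int64_t));
}

/*
 * Rebuild the values from buf into out, which has room for max entries.
 * Returns the number restored, or -1 if the stored count does not fit.
 */
static int32_t
demo_restore(const char *buf, int64_t *out, int32_t max)
{
	int32_t		nparams;

	assert(buf != NULL);

	memcpy(&nparams, buf, sizeof(nparams));
	if (nparams < 0 || nparams > max)
		return -1;
	if (nparams > 0)
		memcpy(out, buf + sizeof(nparams),
			   (size_t) nparams * sizeof(int64_t));
	return nparams;
}
```

The flat layout matters because a worker process cannot follow pointers into the leader's private memory, so whatever representation is chosen has to be copyable byte for byte.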

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#425Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#420)
Re: Parallel Seq Scan

On Fri, Oct 23, 2015 at 5:14 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Oct 13, 2015 at 5:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:

- Although the changes in parallelpaths.c are in a good direction, I'm
pretty sure this is not yet up to scratch. I am less sure exactly
what needs to be fixed, so I'll have to give some more thought to
that.

Please find attached a proposed set of changes that I think are
better. These changes compute a consider_parallel flag for each
RelOptInfo, which is true if it's a non-temporary relation whose
baserestrictinfo references no PARAM_EXEC parameters, sublinks, or
parallel-restricted functions. Actually, I made an effort to set the
flag correctly even for baserels other than plain tables, and for
joinrels, though we don't technically need that stuff until we get to
the point of pushing joins beneath Gather nodes. When we get there,
it will be important - any joinrel for which consider_parallel = false
needn't even try to generate parallel paths, while if
consider_parallel = true then we can consider it, if the costing makes
sense.
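
As a rough illustration of the flag described above, and only that: the sketch below reduces a relation to four boolean properties and derives consider_parallel from them. DemoRel and demo_set_consider_parallel are hypothetical names; the real patch walks baserestrictinfo on a RelOptInfo, which is not modelled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of the properties the description above mentions. */
typedef struct DemoRel
{
	bool		is_temporary;				/* temporary relation? */
	bool		quals_use_exec_params;		/* PARAM_EXEC references in quals */
	bool		quals_have_sublinks;		/* sublinks in quals */
	bool		quals_call_restricted_funcs;	/* parallel-restricted functions */
	bool		consider_parallel;			/* computed once, cached here */
} DemoRel;

/* Compute the flag once so that every path generator can consult it. */
static void
demo_set_consider_parallel(DemoRel *rel)
{
	assert(rel != NULL);

	rel->consider_parallel = !rel->is_temporary &&
		!rel->quals_use_exec_params &&
		!rel->quals_have_sublinks &&
		!rel->quals_call_restricted_funcs;
}
```

Caching the result on the relation is what lets several path generators share one check instead of each repeating it.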

Considering parallelism at the RelOptInfo level, as done in the patch,
won't cover the RelOptInfos of child relations in the case of an Append
node. Refer to build_simple_rel().

Also, when parallelism is not enabled (for example, max_parallel_degree = 0),
the current approach could add the overhead of traversing the
baserestrictinfo needlessly. I think one way to avoid that would be to
check for this while setting the parallelModeOK flag.

Another point is that it will consider parallelism for cases that we really
can't parallelize, for example a foreign table or a sample scan.

One thing to note here is that we already have a precedent for verifying
qual pushdown safety during path generation (during subquery path
generation), so it doesn't seem wrong to do the same for parallel paths,
and it would minimize the cases where we need to evaluate parallelism.

The advantage of this is that the logic is centralized. If we have
parallel seq scan and also, say, parallel bitmap heap scan, your
approach would require that we duplicate the logic to check for
parallel-restricted functions for each path generation function.

Don't we need that anyway, irrespective of caching it in RelOptInfo?
During bitmap path creation, bitmapqual could contain something that
needs to be evaluated for parallel-safety, as it is built from index
paths, which in turn can be based on some join clause. As per the
patch, the join clause's parallel-safety is checked much later than the
generation of the bitmap path.

+ else if (IsA(node, SubPlan) || IsA(node, SubLink) ||
+ IsA(node, AlternativeSubPlan) || IsA(node, Param))
+ {
+ /*
+ * Since we don't have the ability to push subplans down to workers
+ * at present, we treat subplan references as parallel-restricted.
+ */
+ if (!context->allow_restricted)
+ return true;
+ }

I think it is better to do this only for the PARAM_EXEC paramkind, as
those are the cases where it would be a subplan or initplan.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#426Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#423)
1 attachment(s)
Re: Parallel Seq Scan

On Fri, Oct 23, 2015 at 10:33 AM, Robert Haas <robertmhaas@gmail.com> wrote:

+               /*
+                * We can't finish transaction commit or abort until all of the
+                * workers are dead.  This means, in particular, that we can't respond
+                * to interrupts at this stage.
+                */
+               HOLD_INTERRUPTS();
+               status = WaitForBackgroundWorkerShutdown(pcxt->worker[i].bgwhandle);
+               RESUME_INTERRUPTS();

These comments are correct when this code is called from
DestroyParallelContext(), but they're flat wrong when called from
ReinitializeParallelDSM(). I suggest moving the comment back to
DestroyParallelContext and following it with this:

HOLD_INTERRUPTS();
WaitForParallelWorkersToDie();
RESUME_INTERRUPTS();

Then ditch the HOLD/RESUME interrupts in WaitForParallelWorkersToDie() itself.

Changed as per suggestion.

This hunk is a problem:

         case 'X':               /* Terminate, indicating clean exit */
             {
-                pfree(pcxt->worker[i].bgwhandle);
                 pfree(pcxt->worker[i].error_mqh);
-                pcxt->worker[i].bgwhandle = NULL;
                 pcxt->worker[i].error_mqh = NULL;
                 break;
             }

If you do that on receipt of the 'X' message, then
DestroyParallelContext() might SIGTERM a worker that has supposedly
exited cleanly. That seems bad. I think maybe the solution is to
make DestroyParallelContext() terminate the worker only if
pcxt->worker[i].error_mqh != NULL.

Changed as per suggestion.

So make error_mqh == NULL mean a
clean loss of a worker: either we couldn't register it, or it exited
cleanly. And bgwhandle == NULL would mean it's actually gone.

I think even if error_mqh is NULL, it is not guaranteed that the worker
has exited; it only ensures that a clean worker shutdown is either in
progress or done.

It makes sense to have ExecShutdownGather and
ExecShutdownGatherWorkers, but couldn't the former call the latter
instead of duplicating the code?

Makes sense, so changed accordingly.

I think ReInitialize should be capitalized as Reinitialize throughout.

Changed as per suggestion.

ExecParallelReInitializeTupleQueues is almost a cut-and-paste
duplicate of ExecParallelSetupTupleQueues. Please refactor this to
avoid duplication - e.g. change
ExecParallelSetupTupleQueues(ParallelContext *pcxt) to take a second
argument bool reinit. ExecParallelReInitializeTupleQueues can just do
ExecParallelSetupTupleQueues(pcxt, true).

Changed as per suggestion.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

parallel_seqscan_partialseqscan_v23.patch
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 35a873d..2b80cc9 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -110,6 +110,7 @@ static void HandleParallelMessage(ParallelContext *, int, StringInfo msg);
 static void ParallelErrorContext(void *arg);
 static void ParallelExtensionTrampoline(dsm_segment *seg, shm_toc *toc);
 static void ParallelWorkerMain(Datum main_arg);
+static void WaitForParallelWorkersToDie(ParallelContext *pcxt);
 
 /*
  * Establish a new parallel context.  This should be done after entering
@@ -384,36 +385,20 @@ InitializeParallelDSM(ParallelContext *pcxt)
 }
 
 /*
- * Launch parallel workers.
+ * Reinitialize the dynamic shared memory segment for a parallel context such
+ * that it could be reused by launching the workers again.
  */
 void
-LaunchParallelWorkers(ParallelContext *pcxt)
+ReinitializeParallelDSM(ParallelContext *pcxt)
 {
-	MemoryContext oldcontext;
-	BackgroundWorker worker;
-	int			i;
-	bool		any_registrations_failed = false;
-
-	/* Skip this if we have no workers. */
-	if (pcxt->nworkers == 0)
-		return;
-
-	/* If we do have workers, we'd better have a DSM segment. */
-	Assert(pcxt->seg != NULL);
-
-	/* We might be running in a short-lived memory context. */
-	oldcontext = MemoryContextSwitchTo(TopTransactionContext);
+	WaitForParallelWorkersToDie(pcxt);
 
-	/*
-	 * This function can be called for a parallel context for which it has
-	 * already been called previously, but only if all of the old workers
-	 * have already exited.  When this case arises, we need to do some extra
-	 * reinitialization.
-	 */
+	/* Reinitialize parallel context */
 	if (pcxt->nworkers_launched > 0)
 	{
 		FixedParallelState *fps;
 		char	   *error_queue_space;
+		int			i;
 
 		/* Clean out old worker handles. */
 		for (i = 0; i < pcxt->nworkers; ++i)
@@ -449,6 +434,28 @@ LaunchParallelWorkers(ParallelContext *pcxt)
 		/* Reset number of workers launched. */
 		pcxt->nworkers_launched = 0;
 	}
+}
+
+/*
+ * Launch parallel workers.
+ */
+void
+LaunchParallelWorkers(ParallelContext *pcxt)
+{
+	MemoryContext oldcontext;
+	BackgroundWorker worker;
+	int			i;
+	bool		any_registrations_failed = false;
+
+	/* Skip this if we have no workers. */
+	if (pcxt->nworkers == 0)
+		return;
+
+	/* If we do have workers, we'd better have a DSM segment. */
+	Assert(pcxt->seg != NULL);
+
+	/* We might be running in a short-lived memory context. */
+	oldcontext = MemoryContextSwitchTo(TopTransactionContext);
 
 	/* Configure a worker. */
 	snprintf(worker.bgw_name, BGW_MAXLEN, "parallel worker for PID %d",
@@ -553,6 +560,46 @@ WaitForParallelWorkersToFinish(ParallelContext *pcxt)
 }
 
 /*
+ * Wait for all workers to die.
+ *
+ * This function ensures that workers have been completely shut down.  The
+ * difference between WaitForParallelWorkersToFinish and this function is
+ * that the former just ensures that the last message sent by a worker backend
+ * is received by the master backend, whereas this ensures complete shutdown.
+ */
+static void
+WaitForParallelWorkersToDie(ParallelContext *pcxt)
+{
+	int			i;
+
+	/* Wait until the workers actually die. */
+	for (i = 0; i < pcxt->nworkers; ++i)
+	{
+		BgwHandleStatus status;
+
+		if (pcxt->worker == NULL || pcxt->worker[i].bgwhandle == NULL)
+			continue;
+
+		status = WaitForBackgroundWorkerShutdown(pcxt->worker[i].bgwhandle);
+
+		/*
+		 * If the postmaster kicked the bucket, we have no chance of cleaning
+		 * up safely -- we won't be able to tell when our workers are actually
+		 * dead.  This doesn't necessitate a PANIC since they will all abort
+		 * eventually, but we can't safely continue this session.
+		 */
+		if (status == BGWH_POSTMASTER_DIED)
+			ereport(FATAL,
+					(errcode(ERRCODE_ADMIN_SHUTDOWN),
+				 errmsg("postmaster exited during a parallel transaction")));
+
+		/* Release memory. */
+		pfree(pcxt->worker[i].bgwhandle);
+		pcxt->worker[i].bgwhandle = NULL;
+	}
+}
+
+/*
  * Destroy a parallel context.
  *
  * If expecting a clean exit, you should use WaitForParallelWorkersToFinish()
@@ -578,10 +625,10 @@ DestroyParallelContext(ParallelContext *pcxt)
 	{
 		for (i = 0; i < pcxt->nworkers; ++i)
 		{
-			if (pcxt->worker[i].bgwhandle != NULL)
-				TerminateBackgroundWorker(pcxt->worker[i].bgwhandle);
 			if (pcxt->worker[i].error_mqh != NULL)
 			{
+				TerminateBackgroundWorker(pcxt->worker[i].bgwhandle);
+
 				pfree(pcxt->worker[i].error_mqh);
 				pcxt->worker[i].error_mqh = NULL;
 			}
@@ -609,38 +656,14 @@ DestroyParallelContext(ParallelContext *pcxt)
 		pcxt->private_memory = NULL;
 	}
 
-	/* Wait until the workers actually die. */
-	for (i = 0; i < pcxt->nworkers; ++i)
-	{
-		BgwHandleStatus status;
-
-		if (pcxt->worker == NULL || pcxt->worker[i].bgwhandle == NULL)
-			continue;
-
-		/*
-		 * We can't finish transaction commit or abort until all of the
-		 * workers are dead.  This means, in particular, that we can't respond
-		 * to interrupts at this stage.
-		 */
-		HOLD_INTERRUPTS();
-		status = WaitForBackgroundWorkerShutdown(pcxt->worker[i].bgwhandle);
-		RESUME_INTERRUPTS();
-
-		/*
-		 * If the postmaster kicked the bucket, we have no chance of cleaning
-		 * up safely -- we won't be able to tell when our workers are actually
-		 * dead.  This doesn't necessitate a PANIC since they will all abort
-		 * eventually, but we can't safely continue this session.
-		 */
-		if (status == BGWH_POSTMASTER_DIED)
-			ereport(FATAL,
-					(errcode(ERRCODE_ADMIN_SHUTDOWN),
-				 errmsg("postmaster exited during a parallel transaction")));
-
-		/* Release memory. */
-		pfree(pcxt->worker[i].bgwhandle);
-		pcxt->worker[i].bgwhandle = NULL;
-	}
+	/*
+	 * We can't finish transaction commit or abort until all of the
+	 * workers are dead.  This means, in particular, that we can't respond
+	 * to interrupts at this stage.
+	 */
+	HOLD_INTERRUPTS();
+	WaitForParallelWorkersToDie(pcxt);
+	RESUME_INTERRUPTS();
 
 	/* Free the worker array itself. */
 	if (pcxt->worker != NULL)
@@ -799,9 +822,7 @@ HandleParallelMessage(ParallelContext *pcxt, int i, StringInfo msg)
 
 		case 'X':				/* Terminate, indicating clean exit */
 			{
-				pfree(pcxt->worker[i].bgwhandle);
 				pfree(pcxt->worker[i].error_mqh);
-				pcxt->worker[i].bgwhandle = NULL;
 				pcxt->worker[i].error_mqh = NULL;
 				break;
 			}
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 7fb8a14..d03fbde 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -729,6 +729,7 @@ ExplainPreScanNode(PlanState *planstate, Bitmapset **rels_used)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
 		case T_SampleScan:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
@@ -850,6 +851,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_SeqScan:
 			pname = sname = "Seq Scan";
 			break;
+		case T_PartialSeqScan:
+			pname = sname = "Partial Seq Scan";
+			break;
 		case T_SampleScan:
 			pname = sname = "Sample Scan";
 			break;
@@ -1005,6 +1009,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
 		case T_SampleScan:
 		case T_BitmapHeapScan:
 		case T_TidScan:
@@ -1270,6 +1275,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 							 planstate, ancestors, es);
 			/* FALL THRU to print additional fields the same as SeqScan */
 		case T_SeqScan:
+		case T_PartialSeqScan:
 		case T_ValuesScan:
 		case T_CteScan:
 		case T_WorkTableScan:
@@ -2353,6 +2359,7 @@ ExplainTargetRel(Plan *plan, Index rti, ExplainState *es)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
 		case T_SampleScan:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 51edd4c..38a92fe 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -21,8 +21,8 @@ OBJS = execAmi.o execCurrent.o execGrouping.o execIndexing.o execJunk.o \
        nodeHash.o nodeHashjoin.o nodeIndexscan.o nodeIndexonlyscan.o \
        nodeLimit.o nodeLockRows.o \
        nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
-       nodeNestloop.o nodeFunctionscan.o nodeRecursiveunion.o nodeResult.o \
-       nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
+       nodeNestloop.o nodeFunctionscan.o nodePartialSeqscan.o nodeRecursiveunion.o \
+       nodeResult.o nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
        nodeValuesscan.o nodeCtescan.o nodeWorktablescan.o \
        nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
        nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 163650c..b3d041c 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -38,6 +38,7 @@
 #include "executor/nodeMergejoin.h"
 #include "executor/nodeModifyTable.h"
 #include "executor/nodeNestloop.h"
+#include "executor/nodePartialSeqscan.h"
 #include "executor/nodeRecursiveunion.h"
 #include "executor/nodeResult.h"
 #include "executor/nodeSamplescan.h"
@@ -157,6 +158,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSeqScan((SeqScanState *) node);
 			break;
 
+		case T_PartialSeqScanState:
+			ExecReScanPartialSeqScan((PartialSeqScanState *) node);
+			break;
+
 		case T_SampleScanState:
 			ExecReScanSampleScan((SampleScanState *) node);
 			break;
@@ -468,6 +473,9 @@ ExecSupportsBackwardScan(Plan *node)
 		case T_CteScan:
 			return TargetListSupportsBackwardScan(node->targetlist);
 
+		case T_PartialSeqScan:
+			return false;
+
 		case T_SampleScan:
 			/* Simplify life for tablesample methods by disallowing this */
 			return false;
diff --git a/src/backend/executor/execCurrent.c b/src/backend/executor/execCurrent.c
index bcd287f..6e05598 100644
--- a/src/backend/executor/execCurrent.c
+++ b/src/backend/executor/execCurrent.c
@@ -261,6 +261,7 @@ search_plan_tree(PlanState *node, Oid table_oid)
 			 * Relation scan nodes can all be treated alike
 			 */
 		case T_SeqScanState:
+		case T_PartialSeqScanState:
 		case T_SampleScanState:
 		case T_IndexScanState:
 		case T_IndexOnlyScanState:
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index efcbaef..477823e 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -25,6 +25,7 @@
 
 #include "executor/execParallel.h"
 #include "executor/executor.h"
+#include "executor/nodePartialSeqscan.h"
 #include "executor/tqueue.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/planmain.h"
@@ -84,7 +85,7 @@ static bool ExecParallelEstimate(PlanState *node,
 					 ExecParallelEstimateContext *e);
 static bool ExecParallelInitializeDSM(PlanState *node,
 					 ExecParallelInitializeDSMContext *d);
-static shm_mq_handle **ExecParallelSetupTupleQueues(ParallelContext *pcxt);
+static shm_mq_handle **ExecParallelSetupTupleQueues(ParallelContext *pcxt, bool reinit);
 static bool ExecParallelRetrieveInstrumentation(PlanState *planstate,
 						  SharedExecutorInstrumentation *instrumentation);
 
@@ -166,10 +167,16 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 	/* Count this node. */
 	e->nnodes++;
 
-	/*
-	 * XXX. Call estimators for parallel-aware nodes here, when we have
-	 * some.
-	 */
+	/* Call estimators for parallel-aware nodes. */
+	switch (nodeTag(planstate))
+	{
+		case T_PartialSeqScanState:
+			ExecPartialSeqScanEstimate((PartialSeqScanState *) planstate,
+									   e->pcxt);
+			break;
+		default:
+			break;
+	}
 
 	return planstate_tree_walker(planstate, ExecParallelEstimate, e);
 }
@@ -204,10 +211,16 @@ ExecParallelInitializeDSM(PlanState *planstate,
 	/* Count this node. */
 	d->nnodes++;
 
-	/*
-	 * XXX. Call initializers for parallel-aware plan nodes, when we have
-	 * some.
-	 */
+	/* Call initializers for parallel-aware plan nodes. */
+	switch (nodeTag(planstate))
+	{
+		case T_PartialSeqScanState:
+			ExecPartialSeqScanInitializeDSM((PartialSeqScanState *) planstate,
+											d->pcxt);
+			break;
+		default:
+			break;
+	}
 
 	return planstate_tree_walker(planstate, ExecParallelInitializeDSM, d);
 }
@@ -217,7 +230,7 @@ ExecParallelInitializeDSM(PlanState *planstate,
  * to the main backend and start the workers.
  */
 static shm_mq_handle **
-ExecParallelSetupTupleQueues(ParallelContext *pcxt)
+ExecParallelSetupTupleQueues(ParallelContext *pcxt, bool reinit)
 {
 	shm_mq_handle **responseq;
 	char	   *tqueuespace;
@@ -231,9 +244,17 @@ ExecParallelSetupTupleQueues(ParallelContext *pcxt)
 	responseq = (shm_mq_handle **)
 		palloc(pcxt->nworkers * sizeof(shm_mq_handle *));
 
-	/* Allocate space from the DSM for the queues themselves. */
-	tqueuespace = shm_toc_allocate(pcxt->toc,
-								 PARALLEL_TUPLE_QUEUE_SIZE * pcxt->nworkers);
+	/*
+	 * If not reinitializing, allocate space from the DSM for the queues
+	 * themselves, else find the already allocated space.
+	 */
+	if (!reinit)
+	{
+		tqueuespace = shm_toc_allocate(pcxt->toc,
+									 PARALLEL_TUPLE_QUEUE_SIZE * pcxt->nworkers);
+	}
+	else
+		tqueuespace = shm_toc_lookup(pcxt->toc, PARALLEL_KEY_TUPLE_QUEUE);
 
 	/* Create the queues, and become the receiver for each. */
 	for (i = 0; i < pcxt->nworkers; ++i)
@@ -248,13 +269,24 @@ ExecParallelSetupTupleQueues(ParallelContext *pcxt)
 	}
 
 	/* Add array of queues to shm_toc, so others can find it. */
-	shm_toc_insert(pcxt->toc, PARALLEL_KEY_TUPLE_QUEUE, tqueuespace);
+	if (!reinit)
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_TUPLE_QUEUE, tqueuespace);
 
 	/* Return array of handles. */
 	return responseq;
 }
 
 /*
+ * It re-initializes the response queues for backend workers to return tuples
+ * to the main backend and start the workers.
+ */
+shm_mq_handle **
+ExecParallelReinitializeTupleQueues(ParallelContext *pcxt)
+{
+	return ExecParallelSetupTupleQueues(pcxt, true);
+}
+
+/*
  * Sets up the required infrastructure for backend workers to perform
  * execution and return results to the main backend.
  */
@@ -363,7 +395,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate, int nworkers)
 	pei->buffer_usage = bufusage_space;
 
 	/* Set up tuple queues. */
-	pei->tqueue = ExecParallelSetupTupleQueues(pcxt);
+	pei->tqueue = ExecParallelSetupTupleQueues(pcxt, false);
 
 	/*
 	 * If instrumentation options were supplied, allocate space for the
@@ -556,6 +588,32 @@ ExecParallelReportInstrumentation(PlanState *planstate,
 }
 
 /*
+ * Initialize the PlanState and its descendants with the information
+ * retrieved from shared memory.  This has to be done once the PlanState
+ * is allocated and initialized by the executor for each node, i.e. after
+ * ExecutorStart().
+ */
+static bool
+ExecParallelInitializeWorker(PlanState *planstate, shm_toc *toc)
+{
+	if (planstate == NULL)
+		return false;
+
+	/* Call initializers for parallel-aware plan nodes. */
+	switch (nodeTag(planstate))
+	{
+		case T_PartialSeqScanState:
+			ExecPartialSeqScanInitParallelScanDesc((PartialSeqScanState *) planstate,
+												   toc);
+			break;
+		default:
+			break;
+	}
+
+	return planstate_tree_walker(planstate, ExecParallelInitializeWorker, toc);
+}
+
+/*
  * Main entrypoint for parallel query worker processes.
  *
  * We reach this function from ParallelMain, so the setup necessary to create
@@ -591,6 +649,7 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
 
 	/* Start up the executor, have it run the plan, and then shut it down. */
 	ExecutorStart(queryDesc, 0);
+	ExecParallelInitializeWorker(queryDesc->planstate, toc);
 	ExecutorRun(queryDesc, ForwardScanDirection, 0L);
 	ExecutorFinish(queryDesc);
 
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 6f5c554..1b929c2 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -100,6 +100,7 @@
 #include "executor/nodeMergejoin.h"
 #include "executor/nodeModifyTable.h"
 #include "executor/nodeNestloop.h"
+#include "executor/nodePartialSeqscan.h"
 #include "executor/nodeGather.h"
 #include "executor/nodeRecursiveunion.h"
 #include "executor/nodeResult.h"
@@ -193,6 +194,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												   estate, eflags);
 			break;
 
+		case T_PartialSeqScan:
+			result = (PlanState *) ExecInitPartialSeqScan((PartialSeqScan *) node,
+														  estate, eflags);
+			break;
+
 		case T_SampleScan:
 			result = (PlanState *) ExecInitSampleScan((SampleScan *) node,
 													  estate, eflags);
@@ -419,6 +425,10 @@ ExecProcNode(PlanState *node)
 			result = ExecSeqScan((SeqScanState *) node);
 			break;
 
+		case T_PartialSeqScanState:
+			result = ExecPartialSeqScan((PartialSeqScanState *) node);
+			break;
+
 		case T_SampleScanState:
 			result = ExecSampleScan((SampleScanState *) node);
 			break;
@@ -665,6 +675,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSeqScan((SeqScanState *) node);
 			break;
 
+		case T_PartialSeqScanState:
+			ExecEndPartialSeqScan((PartialSeqScanState *) node);
+			break;
+
 		case T_SampleScanState:
 			ExecEndSampleScan((SampleScanState *) node);
 			break;
diff --git a/src/backend/executor/nodeGather.c b/src/backend/executor/nodeGather.c
index ef810a5..0b68f1b 100644
--- a/src/backend/executor/nodeGather.c
+++ b/src/backend/executor/nodeGather.c
@@ -40,6 +40,7 @@
 
 
 static TupleTableSlot *gather_getnext(GatherState *gatherstate);
+static void ExecShutdownGatherWorkers(GatherState *node);
 
 
 /* ----------------------------------------------------------------
@@ -134,9 +135,10 @@ ExecGather(GatherState *node)
 			bool	got_any_worker = false;
 
 			/* Initialize the workers required to execute Gather node. */
-			node->pei = ExecInitParallelPlan(node->ps.lefttree,
-											 estate,
-											 gather->num_workers);
+			if (!node->pei)
+				node->pei = ExecInitParallelPlan(node->ps.lefttree,
+												 estate,
+												 gather->num_workers);
 
 			/*
 			 * Register backend workers. We might not get as many as we
@@ -224,7 +226,7 @@ gather_getnext(GatherState *gatherstate)
 									   gatherstate->need_to_scan_locally,
 									   &done);
 			if (done)
-				ExecShutdownGather(gatherstate);
+				ExecShutdownGatherWorkers(gatherstate);
 
 			if (HeapTupleIsValid(tup))
 			{
@@ -255,15 +257,15 @@ gather_getnext(GatherState *gatherstate)
 }
 
 /* ----------------------------------------------------------------
- *		ExecShutdownGather
+ *		ExecShutdownGatherWorkers
  *
- *		Destroy the setup for parallel workers.  Collect all the
- *		stats after workers are stopped, else some work done by
- *		workers won't be accounted.
+ *		Destroy the parallel workers.  Collect all the stats after
+ *		workers are stopped, else some work done by workers won't be
+ *		accounted.
  * ----------------------------------------------------------------
  */
 void
-ExecShutdownGather(GatherState *node)
+ExecShutdownGatherWorkers(GatherState *node)
 {
 	/* Shut down tuple queue funnel before shutting down workers. */
 	if (node->funnel != NULL)
@@ -274,8 +276,25 @@ ExecShutdownGather(GatherState *node)
 
 	/* Now shut down the workers. */
 	if (node->pei != NULL)
-	{
 		ExecParallelFinish(node->pei);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecShutdownGather
+ *
+ *		Destroy the setup for parallel workers including parallel context.
+ *		Collect all the stats after workers are stopped, else some work
+ *		done by workers won't be accounted.
+ * ----------------------------------------------------------------
+ */
+void
+ExecShutdownGather(GatherState *node)
+{
+	ExecShutdownGatherWorkers(node);
+
+	/* Now destroy the parallel context. */
+	if (node->pei != NULL)
+	{
 		ExecParallelCleanup(node->pei);
 		node->pei = NULL;
 	}
@@ -296,14 +315,21 @@ void
 ExecReScanGather(GatherState *node)
 {
 	/*
-	 * Re-initialize the parallel context and workers to perform rescan of
-	 * relation.  We want to gracefully shutdown all the workers so that they
+	 * Re-initialize the parallel workers to perform rescan of relation.
+	 * We want to gracefully shutdown all the workers so that they
 	 * should be able to propagate any error or other information to master
-	 * backend before dying.
+	 * backend before dying.  Parallel context will be reused for rescan.
 	 */
-	ExecShutdownGather(node);
+	ExecShutdownGatherWorkers(node);
 
 	node->initialized = false;
 
+	if (node->pei)
+	{
+		ReinitializeParallelDSM(node->pei->pcxt);
+		node->pei->tqueue =
+				ExecParallelReinitializeTupleQueues(node->pei->pcxt);
+	}
+
 	ExecReScan(node->ps.lefttree);
 }
diff --git a/src/backend/executor/nodePartialSeqscan.c b/src/backend/executor/nodePartialSeqscan.c
new file mode 100644
index 0000000..bc37b9b
--- /dev/null
+++ b/src/backend/executor/nodePartialSeqscan.c
@@ -0,0 +1,336 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodePartialSeqscan.c
+ *	  Support routines for partial sequential scans of relations.
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodePartialSeqscan.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * INTERFACE ROUTINES
+ *		ExecPartialSeqScan				scans a relation partially.
+ *		PartialSeqNext					retrieve next tuple from heap.
+ *		ExecInitPartialSeqScan			creates and initializes a partial seqscan node.
+ *		ExecEndPartialSeqScan			releases any storage allocated.
+ */
+#include "postgres.h"
+
+#include "access/relscan.h"
+#include "executor/execdebug.h"
+#include "executor/execParallel.h"
+#include "executor/nodePartialSeqscan.h"
+#include "utils/rel.h"
+
+
+
+/* ----------------------------------------------------------------
+ *						Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		PartialSeqNext
+ *
+ *		This is a workhorse for ExecPartialSeqScan
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+PartialSeqNext(PartialSeqScanState *node)
+{
+	HeapTuple	tuple;
+	HeapScanDesc scandesc;
+	EState	   *estate;
+	ScanDirection direction;
+	TupleTableSlot *slot;
+
+	/*
+	 * get information from the estate and scan state
+	 */
+	scandesc = node->ss.ss_currentScanDesc;
+	estate = node->ss.ps.state;
+	direction = estate->es_direction;
+	slot = node->ss.ss_ScanTupleSlot;
+
+	/*
+	 * get the next tuple from the table
+	 */
+	tuple = heap_getnext(scandesc, direction);
+
+	/*
+	 * save the tuple and the buffer returned to us by the access methods in
+	 * our scan tuple slot and return the slot.  Note: we pass 'false' because
+	 * tuples returned by heap_getnext() are pointers onto disk pages and were
+	 * not created with palloc() and so should not be pfree()'d.  Note also
+	 * that ExecStoreTuple will increment the refcount of the buffer; the
+	 * refcount will not be dropped until the tuple table slot is cleared.
+	 */
+	if (tuple)
+		ExecStoreTuple(tuple,	/* tuple to store */
+					   slot,	/* slot to store in */
+					   scandesc->rs_cbuf,		/* buffer associated with this
+												 * tuple */
+					   false);	/* don't pfree this pointer */
+	else
+		ExecClearTuple(slot);
+
+	return slot;
+}
+
+/*
+ * PartialSeqRecheck -- access method routine to recheck a tuple in EvalPlanQual
+ */
+static bool
+PartialSeqRecheck(PartialSeqScanState *node, TupleTableSlot *slot)
+{
+	/*
+	 * Note that unlike IndexScan, PartialSeqScan never uses keys in
+	 * heap_beginscan (and this is very bad), so here we do not check
+	 * whether the keys are ok.
+	 */
+	return true;
+}
+
+/* ----------------------------------------------------------------
+ *		InitPartialScanRelation
+ *
+ *		Set up to access the scan relation.
+ * ----------------------------------------------------------------
+ */
+static void
+InitPartialScanRelation(PartialSeqScanState *node, EState *estate, int eflags)
+{
+	Relation	currentRelation;
+
+	/*
+	 * get the relation object id from the relid'th entry in the range table,
+	 * open that relation and acquire appropriate lock on it.
+	 */
+	currentRelation = ExecOpenScanRelation(estate,
+									  ((Scan *) node->ss.ps.plan)->scanrelid,
+										   eflags);
+
+	node->ss.ss_currentRelation = currentRelation;
+
+	/* and report the scan tuple slot's rowtype */
+	ExecAssignScanType(&node->ss, RelationGetDescr(currentRelation));
+}
+
+/* ----------------------------------------------------------------
+ *		ExecPartialSeqScanEstimate
+ *
+ *		estimates the space required to serialize partial seqscan node.
+ * ----------------------------------------------------------------
+ */
+void
+ExecPartialSeqScanEstimate(PartialSeqScanState *node,
+						   ParallelContext *pcxt)
+{
+	EState	   *estate = node->ss.ps.state;
+
+	node->pscan_len = heap_parallelscan_estimate(estate->es_snapshot);
+	shm_toc_estimate_chunk(&pcxt->estimator, node->pscan_len);
+
+	/* key for partial scan information. */
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecPartialSeqScanInitializeDSM
+ *
+ *		Initialize the DSM with the contents required to perform
+ *		partial seqscan.
+ * ----------------------------------------------------------------
+ */
+void
+ExecPartialSeqScanInitializeDSM(PartialSeqScanState *node,
+								ParallelContext *pcxt)
+{
+	EState	   *estate = node->ss.ps.state;
+
+	/*
+	 * Store parallel heap scan descriptor in dynamic shared memory.
+	 */
+	node->pscan = shm_toc_allocate(pcxt->toc, node->pscan_len);
+	heap_parallelscan_initialize(node->pscan,
+								 node->ss.ss_currentRelation,
+								 estate->es_snapshot);
+	shm_toc_insert(pcxt->toc,
+				   node->ss.ps.plan->plan_node_id,
+				   node->pscan);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecPartialSeqScanInitParallelScanDesc
+ *
+ *		Retrieve the contents from DSM related to partial seq scan node
+ *		and initialize the partial seqscan node.
+ * ----------------------------------------------------------------
+ */
+void
+ExecPartialSeqScanInitParallelScanDesc(PartialSeqScanState *node,
+									   shm_toc *toc)
+{
+	node->pscan = shm_toc_lookup(toc, node->ss.ps.plan->plan_node_id);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitPartialSeqScan
+ * ----------------------------------------------------------------
+ */
+PartialSeqScanState *
+ExecInitPartialSeqScan(PartialSeqScan *node, EState *estate, int eflags)
+{
+	PartialSeqScanState *scanstate;
+
+	/*
+	 * Once upon a time it was possible to have an outerPlan of a SeqScan, but
+	 * not any more.
+	 */
+	Assert(outerPlan(node) == NULL);
+	Assert(innerPlan(node) == NULL);
+
+	/*
+	 * create state structure
+	 */
+	scanstate = makeNode(PartialSeqScanState);
+	scanstate->ss.ps.plan = (Plan *) node;
+	scanstate->ss.ps.state = estate;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * create expression context for node
+	 */
+	ExecAssignExprContext(estate, &scanstate->ss.ps);
+
+	/*
+	 * initialize child expressions
+	 */
+	scanstate->ss.ps.targetlist = (List *)
+		ExecInitExpr((Expr *) node->plan.targetlist,
+					 (PlanState *) scanstate);
+	scanstate->ss.ps.qual = (List *)
+		ExecInitExpr((Expr *) node->plan.qual,
+					 (PlanState *) scanstate);
+
+	/*
+	 * tuple table initialization
+	 */
+	ExecInitResultTupleSlot(estate, &scanstate->ss.ps);
+	ExecInitScanTupleSlot(estate, &scanstate->ss);
+
+	/*
+	 * initialize scan relation
+	 */
+	InitPartialScanRelation(scanstate, estate, eflags);
+
+	scanstate->ss.ps.ps_TupFromTlist = false;
+
+	/*
+	 * Initialize result tuple type and projection info.
+	 */
+	ExecAssignResultTypeFromTL(&scanstate->ss.ps);
+	ExecAssignScanProjectionInfo(&scanstate->ss);
+
+	return scanstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecPartialSeqScan(node)
+ *
+ *		Scans the relation and returns the next qualifying tuple.
+ *		We call the ExecScan() routine and pass it the appropriate
+ *		access method functions.
+ * ----------------------------------------------------------------
+ */
+TupleTableSlot *
+ExecPartialSeqScan(PartialSeqScanState *node)
+{
+	/*
+	 * Initialize the scan on first execution.  Normally we initialize it
+	 * during the ExecutorStart phase; however, this node needs the
+	 * ParallelHeapScanDesc to initialize the scan, and that is set up by
+	 * the Gather node during the ExecutorRun phase.
+	 */
+	if (!node->ss.ss_currentScanDesc)
+	{
+		node->ss.ss_currentScanDesc =
+			heap_beginscan_parallel(node->ss.ss_currentRelation, node->pscan);
+	}
+
+	return ExecScan((ScanState *) node,
+					(ExecScanAccessMtd) PartialSeqNext,
+					(ExecScanRecheckMtd) PartialSeqRecheck);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndPartialSeqScan
+ *
+ *		frees any storage allocated through C routines.
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndPartialSeqScan(PartialSeqScanState *node)
+{
+	Relation	relation;
+	HeapScanDesc scanDesc;
+
+	/*
+	 * get information from node
+	 */
+	relation = node->ss.ss_currentRelation;
+	scanDesc = node->ss.ss_currentScanDesc;
+
+	/*
+	 * Free the exprcontext
+	 */
+	ExecFreeExprContext(&node->ss.ps);
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+
+	/*
+	 * close heap scan
+	 */
+	if (scanDesc)
+		heap_endscan(scanDesc);
+
+	/*
+	 * close the heap relation.
+	 */
+	ExecCloseScanRelation(relation);
+}
+
+/* ----------------------------------------------------------------
+ *						Join Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecReScanPartialSeqScan
+ *
+ *		Rescans the relation.
+ * ----------------------------------------------------------------
+ */
+void
+ExecReScanPartialSeqScan(PartialSeqScanState *node)
+{
+	HeapScanDesc scan;
+
+	scan = node->ss.ss_currentScanDesc;
+
+	if (scan)
+		heap_rescan(scan,		/* scan desc */
+					NULL);		/* new scan keys */
+
+	ExecScanReScan((ScanState *) node);
+}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index c176ff9..fdcccef 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -384,6 +384,22 @@ _copySeqScan(const SeqScan *from)
 }
 
 /*
+ * _copyPartialSeqScan
+ */
+static PartialSeqScan *
+_copyPartialSeqScan(const PartialSeqScan *from)
+{
+	PartialSeqScan    *newnode = makeNode(PartialSeqScan);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopyScanFields((const Scan *) from, (Scan *) newnode);
+
+	return newnode;
+}
+
+/*
  * _copySampleScan
  */
 static SampleScan *
@@ -4264,6 +4280,9 @@ copyObject(const void *from)
 		case T_SeqScan:
 			retval = _copySeqScan(from);
 			break;
+		case T_PartialSeqScan:
+			retval = _copyPartialSeqScan(from);
+			break;
 		case T_SampleScan:
 			retval = _copySampleScan(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 3e75cd1..5ce45e2 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -460,6 +460,14 @@ _outSeqScan(StringInfo str, const SeqScan *node)
 }
 
 static void
+_outPartialSeqScan(StringInfo str, const PartialSeqScan *node)
+{
+	WRITE_NODE_TYPE("PARTIALSEQSCAN");
+
+	_outScanInfo(str, (const Scan *) node);
+}
+
+static void
 _outSampleScan(StringInfo str, const SampleScan *node)
 {
 	WRITE_NODE_TYPE("SAMPLESCAN");
@@ -3020,6 +3028,9 @@ _outNode(StringInfo str, const void *obj)
 			case T_SeqScan:
 				_outSeqScan(str, obj);
 				break;
+			case T_PartialSeqScan:
+				_outPartialSeqScan(str, obj);
+				break;
 			case T_SampleScan:
 				_outSampleScan(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 94ba6dc..3d3448d 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1606,6 +1606,19 @@ _readSeqScan(void)
 }
 
 /*
+ * _readPartialSeqScan
+ */
+static PartialSeqScan *
+_readPartialSeqScan(void)
+{
+	READ_LOCALS_NO_FIELDS(PartialSeqScan);
+
+	ReadCommonScan(local_node);
+
+	READ_DONE();
+}
+
+/*
  * _readSampleScan
  */
 static SampleScan *
@@ -2337,6 +2350,8 @@ parseNodeString(void)
 		return_value = _readScan();
 	else if (MATCH("SEQSCAN", 7))
 		return_value = _readSeqScan();
+	else if (MATCH("PARTIALSEQSCAN", 14))
+		return_value = _readPartialSeqScan();
 	else if (MATCH("SAMPLESCAN", 10))
 		return_value = _readSampleScan();
 	else if (MATCH("INDEXSCAN", 9))
diff --git a/src/backend/optimizer/path/Makefile b/src/backend/optimizer/path/Makefile
index 6864a62..6e462b1 100644
--- a/src/backend/optimizer/path/Makefile
+++ b/src/backend/optimizer/path/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = allpaths.o clausesel.o costsize.o equivclass.o indxpath.o \
-       joinpath.o joinrels.o pathkeys.o tidpath.o
+       joinpath.o joinrels.o pathkeys.o parallelpath.o tidpath.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 8fc1cfd..c2ae95d 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -477,6 +477,9 @@ set_plain_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
 	/* Consider sequential scan */
 	add_path(rel, create_seqscan_path(root, rel, required_outer));
 
+	/* Consider parallel scans */
+	create_parallelscan_paths(root, rel, required_outer);
+
 	/* Consider index scans */
 	create_index_paths(root, rel);
 
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 1b61fd9..3239cec 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -227,6 +227,49 @@ cost_seqscan(Path *path, PlannerInfo *root,
 }
 
 /*
+ * cost_partialseqscan
+ *	  Determines and returns the cost of scanning a relation partially.
+ *
+ * 'baserel' is the relation to be scanned
+ * 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
+ * 'nworkers' is the number of workers among which the work will be
+ *			distributed
+ */
+void
+cost_partialseqscan(Path *path, PlannerInfo *root,
+					RelOptInfo *baserel, ParamPathInfo *param_info,
+					int nworkers)
+{
+	Cost		startup_cost = 0;
+	Cost		run_cost = 0;
+
+	cost_seqscan(path, root, baserel, param_info);
+
+	startup_cost = path->startup_cost;
+
+	run_cost = path->total_cost - startup_cost;
+
+	/*
+	 * Account for small cost for communication related to scan via the
+	 * ParallelHeapScanDesc.
+	 */
+	run_cost += 0.01;
+
+	/*
+	 * Runtime cost will be equally shared by all workers. The assumption
+	 * here is that disk access cost will also be equally shared between
+	 * workers, which is generally true unless there are too many workers
+	 * working on a relatively small number of blocks.  If we come across
+	 * any such case, then we can think of changing the current cost model
+	 * for partial sequential scan.
+	 */
+	run_cost = run_cost / (nworkers + 1);
+
+	path->startup_cost = startup_cost;
+	path->total_cost = startup_cost + run_cost;
+}
+
+/*
  * cost_samplescan
  *	  Determines and returns the cost of scanning a relation using sampling.
  *
diff --git a/src/backend/optimizer/path/parallelpath.c b/src/backend/optimizer/path/parallelpath.c
new file mode 100644
index 0000000..ce25cbf
--- /dev/null
+++ b/src/backend/optimizer/path/parallelpath.c
@@ -0,0 +1,132 @@
+/*-------------------------------------------------------------------------
+ *
+ * parallelpath.c
+ *	  Routines to determine parallel paths for scanning a given relation.
+ *
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/optimizer/path/parallelpath.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heapam.h"
+#include "optimizer/clauses.h"
+#include "optimizer/cost.h"
+#include "optimizer/pathnode.h"
+#include "optimizer/paths.h"
+#include "parser/parsetree.h"
+#include "utils/rel.h"
+
+
+/*
+ * expr_is_parallel_safe
+ *	  is a particular expression parallel safe?
+ *
+ * Conditions checked here:
+ *
+ * 1. The expression must not contain any parallel unsafe or parallel
+ * restricted functions.
+ *
+ * 2. The expression must not contain any initplan or subplan.  We can
+ * probably remove this restriction once we have support of infrastructure
+ * for execution of initplans and subplans at parallel (Gather) nodes.
+ */
+bool
+expr_is_parallel_safe(Node *node)
+{
+	if (check_parallel_safety(node, false))
+		return false;
+
+	if (contain_subplans_or_initplans(node))
+		return false;
+
+	return true;
+}
+
+/*
+ * create_parallelscan_paths
+ *	  Create paths corresponding to parallel scans of the given rel.
+ *	  Currently we only support partial sequential scan.
+ *
+ *	  Candidate paths are added to the rel's pathlist (using add_path).
+ */
+void
+create_parallelscan_paths(PlannerInfo *root, RelOptInfo *rel,
+						  Relids required_outer)
+{
+	int			num_parallel_workers = 0;
+	int			estimated_parallel_workers = 0;
+	Oid			reloid;
+	Relation	relation;
+	Path	   *subpath;
+	ListCell   *l;
+
+	/*
+	 * parallel scan is possible only if user has set max_parallel_degree
+	 * to a value greater than 0 and the query is parallel-safe.
+	 */
+	if (max_parallel_degree <= 0 || !root->glob->parallelModeOK)
+		return;
+
+	/*
+	 * There should be at least a thousand pages to scan for each worker. This
+	 * number is somewhat arbitrary; however, we don't want to spawn workers
+	 * to scan smaller relations as that will be costly.
+	 */
+	estimated_parallel_workers = rel->pages / 1000;
+
+	if (estimated_parallel_workers <= 0)
+		return;
+
+	reloid = planner_rt_fetch(rel->relid, root)->relid;
+
+	relation = heap_open(reloid, NoLock);
+
+	/*
+	 * Temporary relations can't be scanned by parallel workers as they are
+	 * visible only to local sessions.
+	 */
+	if (RelationUsesLocalBuffers(relation))
+	{
+		heap_close(relation, NoLock);
+		return;
+	}
+
+	heap_close(relation, NoLock);
+
+	/*
+	 * Allow parallel paths only if all the clauses for relation are parallel
+	 * safe.  We can allow execution of parallel restricted clauses in master
+	 * backend, but for that planner should have infrastructure to pull all
+	 * the parallel restricted clauses from below nodes to the Gather node
+	 * which will then execute such clauses in master backend.
+	 */
+	foreach(l, rel->baserestrictinfo)
+	{
+		RestrictInfo *rinfo = (RestrictInfo *) lfirst(l);
+
+		if (!expr_is_parallel_safe((Node *) rinfo->clause))
+			return;
+	}
+
+	num_parallel_workers = Min(max_parallel_degree,
+							   estimated_parallel_workers);
+
+	/*
+	 * Create the partial scan path which each worker backend needs to
+	 * execute.
+	 */
+	subpath = create_partialseqscan_path(root, rel, required_outer,
+										 num_parallel_workers);
+
+	/* Create the gather path which master backend needs to execute. */
+	add_path(rel, (Path *) create_gather_path(root, rel, subpath,
+											  required_outer,
+											  num_parallel_workers));
+}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 791b64e..b142811 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -58,6 +58,8 @@ static Material *create_material_plan(PlannerInfo *root, MaterialPath *best_path
 static Plan *create_unique_plan(PlannerInfo *root, UniquePath *best_path);
 static SeqScan *create_seqscan_plan(PlannerInfo *root, Path *best_path,
 					List *tlist, List *scan_clauses);
+static Scan *create_partialseqscan_plan(PlannerInfo *root, Path *best_path,
+						   List *tlist, List *scan_clauses);
 static SampleScan *create_samplescan_plan(PlannerInfo *root, Path *best_path,
 					   List *tlist, List *scan_clauses);
 static Gather *create_gather_plan(PlannerInfo *root,
@@ -104,6 +106,8 @@ static List *order_qual_clauses(PlannerInfo *root, List *clauses);
 static void copy_path_costsize(Plan *dest, Path *src);
 static void copy_plan_costsize(Plan *dest, Plan *src);
 static SeqScan *make_seqscan(List *qptlist, List *qpqual, Index scanrelid);
+static PartialSeqScan *make_partialseqscan(List *qptlist, List *qpqual,
+					Index scanrelid);
 static SampleScan *make_samplescan(List *qptlist, List *qpqual, Index scanrelid,
 				TableSampleClause *tsc);
 static Gather *make_gather(List *qptlist, List *qpqual,
@@ -237,6 +241,7 @@ create_plan_recurse(PlannerInfo *root, Path *best_path)
 	switch (best_path->pathtype)
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
 		case T_SampleScan:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
@@ -357,6 +362,13 @@ create_scan_plan(PlannerInfo *root, Path *best_path)
 												scan_clauses);
 			break;
 
+		case T_PartialSeqScan:
+			plan = (Plan *) create_partialseqscan_plan(root,
+													   best_path,
+													   tlist,
+													   scan_clauses);
+			break;
+
 		case T_SampleScan:
 			plan = (Plan *) create_samplescan_plan(root,
 												   best_path,
@@ -567,6 +579,7 @@ disuse_physical_tlist(PlannerInfo *root, Plan *plan, Path *path)
 	switch (path->pathtype)
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
 		case T_SampleScan:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
@@ -1184,6 +1197,46 @@ create_seqscan_plan(PlannerInfo *root, Path *best_path,
 }
 
 /*
+ * create_partialseqscan_plan
+ *
+ * Returns a partial seqscan plan for the base relation scanned by
+ * 'best_path' with restriction clauses 'scan_clauses' and targetlist
+ * 'tlist'.
+ */
+static Scan *
+create_partialseqscan_plan(PlannerInfo *root, Path *best_path,
+						   List *tlist, List *scan_clauses)
+{
+	Scan	   *scan_plan;
+	Index		scan_relid = best_path->parent->relid;
+
+	/* it should be a base rel... */
+	Assert(scan_relid > 0);
+	Assert(best_path->parent->rtekind == RTE_RELATION);
+
+	/* Sort clauses into best execution order */
+	scan_clauses = order_qual_clauses(root, scan_clauses);
+
+	/* Reduce RestrictInfo list to bare expressions; ignore pseudoconstants */
+	scan_clauses = extract_actual_clauses(scan_clauses, false);
+
+	/* Replace any outer-relation variables with nestloop params */
+	if (best_path->param_info)
+	{
+		scan_clauses = (List *)
+			replace_nestloop_params(root, (Node *) scan_clauses);
+	}
+
+	scan_plan = (Scan *) make_partialseqscan(tlist,
+											 scan_clauses,
+											 scan_relid);
+
+	copy_path_costsize(&scan_plan->plan, best_path);
+
+	return scan_plan;
+}
+
+/*
  * create_samplescan_plan
  *	 Returns a samplescan plan for the base relation scanned by 'best_path'
  *	 with restriction clauses 'scan_clauses' and targetlist 'tlist'.
@@ -3481,6 +3534,24 @@ make_seqscan(List *qptlist,
 	return node;
 }
 
+static PartialSeqScan *
+make_partialseqscan(List *qptlist,
+					List *qpqual,
+					Index scanrelid)
+{
+	PartialSeqScan *node = makeNode(PartialSeqScan);
+	Plan	   *plan = &node->plan;
+
+	/* cost should be inserted by caller */
+	plan->targetlist = qptlist;
+	plan->qual = qpqual;
+	plan->lefttree = NULL;
+	plan->righttree = NULL;
+	node->scanrelid = scanrelid;
+
+	return node;
+}
+
 static SampleScan *
 make_samplescan(List *qptlist,
 				List *qpqual,
@@ -5174,6 +5245,7 @@ is_projection_capable_plan(Plan *plan)
 		case T_Append:
 		case T_MergeAppend:
 		case T_RecursiveUnion:
+		case T_Gather:
 			return false;
 		default:
 			break;
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 536b55e..d5329fb 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -202,13 +202,13 @@ standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
 	glob->hasRowSecurity = false;
 
 	/*
-	 * Assess whether it's feasible to use parallel mode for this query.
-	 * We can't do this in a standalone backend, or if the command will
-	 * try to modify any data, or if this is a cursor operation, or if any
+	 * Assess whether it's feasible to use parallel mode for this query. We
+	 * can't do this in a standalone backend, or if the command will try to
+	 * modify any data, or if this is a cursor operation, or if any
 	 * parallel-unsafe functions are present in the query tree.
 	 *
-	 * For now, we don't try to use parallel mode if we're running inside
-	 * a parallel worker.  We might eventually be able to relax this
+	 * For now, we don't try to use parallel mode if we're running inside a
+	 * parallel worker.  We might eventually be able to relax this
 	 * restriction, but for now it seems best not to have parallel workers
 	 * trying to create their own parallel workers.
 	 *
@@ -225,7 +225,7 @@ standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
 		parse->commandType == CMD_SELECT && !parse->hasModifyingCTE &&
 		parse->utilityStmt == NULL && !IsParallelWorker() &&
 		!IsolationIsSerializable() &&
-		!contain_parallel_unsafe((Node *) parse);
+		!check_parallel_safety((Node *) parse, true);
 
 	/*
 	 * glob->parallelModeOK should tell us whether it's necessary to impose
@@ -238,9 +238,9 @@ standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
 	 *
 	 * (It's been suggested that we should always impose these restrictions
 	 * whenever glob->parallelModeOK is true, so that it's easier to notice
-	 * incorrectly-labeled functions sooner.  That might be the right thing
-	 * to do, but for now I've taken this approach.  We could also control
-	 * this with a GUC.)
+	 * incorrectly-labeled functions sooner.  That might be the right thing to
+	 * do, but for now I've taken this approach.  We could also control this
+	 * with a GUC.)
 	 *
 	 * FIXME: It's assumed that code further down will set parallelModeNeeded
 	 * to true if a parallel path is actually chosen.  Since the core
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 8c6c571..36c959c 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -447,6 +447,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
 			{
 				SeqScan    *splan = (SeqScan *) plan;
 
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 82414d4..99dacde 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2234,6 +2234,7 @@ finalize_plan(PlannerInfo *root, Plan *plan, Bitmapset *valid_params,
 			break;
 
 		case T_SeqScan:
+		case T_PartialSeqScan:
 			context.paramids = bms_add_members(context.paramids, scan_params);
 			break;
 
diff --git a/src/backend/optimizer/util/clauses.c b/src/backend/optimizer/util/clauses.c
index f2c8551..2355cc6 100644
--- a/src/backend/optimizer/util/clauses.c
+++ b/src/backend/optimizer/util/clauses.c
@@ -87,16 +87,25 @@ typedef struct
 	char	   *prosrc;
 } inline_error_callback_arg;
 
+typedef struct
+{
+	bool		allow_restricted;
+}	check_parallel_safety_arg;
+
 static bool contain_agg_clause_walker(Node *node, void *context);
 static bool count_agg_clauses_walker(Node *node,
 						 count_agg_clauses_context *context);
 static bool find_window_functions_walker(Node *node, WindowFuncLists *lists);
 static bool expression_returns_set_rows_walker(Node *node, double *count);
 static bool contain_subplans_walker(Node *node, void *context);
+static bool contain_subplans_or_initplans_walker(Node *node, void *context);
 static bool contain_mutable_functions_walker(Node *node, void *context);
 static bool contain_volatile_functions_walker(Node *node, void *context);
 static bool contain_volatile_functions_not_nextval_walker(Node *node, void *context);
-static bool contain_parallel_unsafe_walker(Node *node, void *context);
+static bool check_parallel_safety_walker(Node *node,
+							 check_parallel_safety_arg * context);
+static bool parallel_too_dangerous(char proparallel,
+					   check_parallel_safety_arg * context);
 static bool contain_nonstrict_functions_walker(Node *node, void *context);
 static bool contain_leaked_vars_walker(Node *node, void *context);
 static Relids find_nonnullable_rels_walker(Node *node, bool top_level);
@@ -1204,13 +1213,16 @@ contain_volatile_functions_not_nextval_walker(Node *node, void *context)
  *****************************************************************************/
 
 bool
-contain_parallel_unsafe(Node *node)
+check_parallel_safety(Node *node, bool allow_restricted)
 {
-	return contain_parallel_unsafe_walker(node, NULL);
+	check_parallel_safety_arg context;
+
+	context.allow_restricted = allow_restricted;
+	return check_parallel_safety_walker(node, &context);
 }
 
 static bool
-contain_parallel_unsafe_walker(Node *node, void *context)
+check_parallel_safety_walker(Node *node, check_parallel_safety_arg * context)
 {
 	if (node == NULL)
 		return false;
@@ -1218,7 +1230,7 @@ contain_parallel_unsafe_walker(Node *node, void *context)
 	{
 		FuncExpr   *expr = (FuncExpr *) node;
 
-		if (func_parallel(expr->funcid) == PROPARALLEL_UNSAFE)
+		if (parallel_too_dangerous(func_parallel(expr->funcid), context))
 			return true;
 		/* else fall through to check args */
 	}
@@ -1227,7 +1239,7 @@ contain_parallel_unsafe_walker(Node *node, void *context)
 		OpExpr	   *expr = (OpExpr *) node;
 
 		set_opfuncid(expr);
-		if (func_parallel(expr->opfuncid) == PROPARALLEL_UNSAFE)
+		if (parallel_too_dangerous(func_parallel(expr->opfuncid), context))
 			return true;
 		/* else fall through to check args */
 	}
@@ -1236,7 +1248,7 @@ contain_parallel_unsafe_walker(Node *node, void *context)
 		DistinctExpr *expr = (DistinctExpr *) node;
 
 		set_opfuncid((OpExpr *) expr);	/* rely on struct equivalence */
-		if (func_parallel(expr->opfuncid) == PROPARALLEL_UNSAFE)
+		if (parallel_too_dangerous(func_parallel(expr->opfuncid), context))
 			return true;
 		/* else fall through to check args */
 	}
@@ -1245,7 +1257,7 @@ contain_parallel_unsafe_walker(Node *node, void *context)
 		NullIfExpr *expr = (NullIfExpr *) node;
 
 		set_opfuncid((OpExpr *) expr);	/* rely on struct equivalence */
-		if (func_parallel(expr->opfuncid) == PROPARALLEL_UNSAFE)
+		if (parallel_too_dangerous(func_parallel(expr->opfuncid), context))
 			return true;
 		/* else fall through to check args */
 	}
@@ -1254,7 +1266,7 @@ contain_parallel_unsafe_walker(Node *node, void *context)
 		ScalarArrayOpExpr *expr = (ScalarArrayOpExpr *) node;
 
 		set_sa_opfuncid(expr);
-		if (func_parallel(expr->opfuncid) == PROPARALLEL_UNSAFE)
+		if (parallel_too_dangerous(func_parallel(expr->opfuncid), context))
 			return true;
 		/* else fall through to check args */
 	}
@@ -1268,12 +1280,12 @@ contain_parallel_unsafe_walker(Node *node, void *context)
 		/* check the result type's input function */
 		getTypeInputInfo(expr->resulttype,
 						 &iofunc, &typioparam);
-		if (func_parallel(iofunc) == PROPARALLEL_UNSAFE)
+		if (parallel_too_dangerous(func_parallel(iofunc), context))
 			return true;
 		/* check the input type's output function */
 		getTypeOutputInfo(exprType((Node *) expr->arg),
 						  &iofunc, &typisvarlena);
-		if (func_parallel(iofunc) == PROPARALLEL_UNSAFE)
+		if (parallel_too_dangerous(func_parallel(iofunc), context))
 			return true;
 		/* else fall through to check args */
 	}
@@ -1282,7 +1294,7 @@ contain_parallel_unsafe_walker(Node *node, void *context)
 		ArrayCoerceExpr *expr = (ArrayCoerceExpr *) node;
 
 		if (OidIsValid(expr->elemfuncid) &&
-			func_parallel(expr->elemfuncid) == PROPARALLEL_UNSAFE)
+			parallel_too_dangerous(func_parallel(expr->elemfuncid), context))
 			return true;
 		/* else fall through to check args */
 	}
@@ -1294,28 +1306,77 @@ contain_parallel_unsafe_walker(Node *node, void *context)
 
 		foreach(opid, rcexpr->opnos)
 		{
-			if (op_volatile(lfirst_oid(opid)) == PROPARALLEL_UNSAFE)
+			if (parallel_too_dangerous(op_volatile(lfirst_oid(opid)), context))
 				return true;
 		}
 		/* else fall through to check args */
 	}
 	else if (IsA(node, Query))
 	{
-		Query *query = (Query *) node;
+		Query	   *query = (Query *) node;
 
 		if (query->rowMarks != NULL)
 			return true;
 
 		/* Recurse into subselects */
 		return query_tree_walker(query,
-								 contain_parallel_unsafe_walker,
+								 check_parallel_safety_walker,
 								 context, 0);
 	}
 	return expression_tree_walker(node,
-								  contain_parallel_unsafe_walker,
+								  check_parallel_safety_walker,
 								  context);
 }
 
+static bool
+parallel_too_dangerous(char proparallel, check_parallel_safety_arg * context)
+{
+	if (context->allow_restricted)
+		return proparallel == PROPARALLEL_UNSAFE;
+	else
+		return proparallel != PROPARALLEL_SAFE;
+}
+
+/*
+ * contain_subplans_or_initplans
+ *	  Recursively search for initplan or subplan nodes within a clause.
+ *
+ * A special purpose function for prohibiting subplan or initplan clauses
+ * in parallel query constructs.
+ *
+ * If we see any form of SubPlan node, we will return TRUE.  For InitPlan's,
+ * we return true when we see the Param node, apart from that InitPlan
+ * can contain a simple NULL constant for MULTIEXPR subquery (see comments
+ * in make_subplan), however it is okay not to care about the same as that
+ * is only possible for Update statement which is anyway prohibited.
+ *
+ * Returns true if any subplan or initplan is found.
+ */
+bool
+contain_subplans_or_initplans(Node *clause)
+{
+	return contain_subplans_or_initplans_walker(clause, NULL);
+}
+
+static bool
+contain_subplans_or_initplans_walker(Node *node, void *context)
+{
+	if (node == NULL)
+		return false;
+	if (IsA(node, SubPlan) ||
+		IsA(node, AlternativeSubPlan) ||
+		IsA(node, SubLink))
+		return true;			/* abort the tree traversal and return true */
+	else if (IsA(node, Param))
+	{
+		Param	   *paramval = (Param *) node;
+
+		if (paramval->paramkind == PARAM_EXEC)
+			return true;
+	}
+	return expression_tree_walker(node, contain_subplans_or_initplans_walker, context);
+}
+
 /*****************************************************************************
  *		Check clauses for nonstrict functions
  *****************************************************************************/
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 1895a68..2fd7ae5 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -712,6 +712,28 @@ create_seqscan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer)
 }
 
 /*
+ * create_partialseqscan_path
+ *	  Creates a path corresponding to a partial sequential scan, returning the
+ *	  pathnode.
+ */
+Path *
+create_partialseqscan_path(PlannerInfo *root, RelOptInfo *rel,
+						   Relids required_outer, int nworkers)
+{
+	Path	   *pathnode = makeNode(Path);
+
+	pathnode->pathtype = T_PartialSeqScan;
+	pathnode->parent = rel;
+	pathnode->param_info = get_baserel_parampathinfo(root, rel,
+													 required_outer);
+	pathnode->pathkeys = NIL;	/* partialseqscan has unordered result */
+
+	cost_partialseqscan(pathnode, root, rel, pathnode->param_info, nworkers);
+
+	return pathnode;
+}
+
+/*
  * create_samplescan_path
  *	  Creates a path node for a sampled table scan.
  */
diff --git a/src/include/access/parallel.h b/src/include/access/parallel.h
index d4b7c5d..411db79 100644
--- a/src/include/access/parallel.h
+++ b/src/include/access/parallel.h
@@ -56,6 +56,7 @@ extern bool InitializingParallelWorker;
 extern ParallelContext *CreateParallelContext(parallel_worker_main_type entrypoint, int nworkers);
 extern ParallelContext *CreateParallelContextForExternalFunction(char *library_name, char *function_name, int nworkers);
 extern void InitializeParallelDSM(ParallelContext *);
+extern void ReinitializeParallelDSM(ParallelContext *pcxt);
 extern void LaunchParallelWorkers(ParallelContext *);
 extern void WaitForParallelWorkersToFinish(ParallelContext *);
 extern void DestroyParallelContext(ParallelContext *);
diff --git a/src/include/executor/execParallel.h b/src/include/executor/execParallel.h
index 505500e..23c29eb 100644
--- a/src/include/executor/execParallel.h
+++ b/src/include/executor/execParallel.h
@@ -33,5 +33,6 @@ extern ParallelExecutorInfo *ExecInitParallelPlan(PlanState *planstate,
 					 EState *estate, int nworkers);
 extern void ExecParallelFinish(ParallelExecutorInfo *pei);
 extern void ExecParallelCleanup(ParallelExecutorInfo *pei);
+extern shm_mq_handle **ExecParallelReinitializeTupleQueues(ParallelContext *pcxt);
 
 #endif   /* EXECPARALLEL_H */
diff --git a/src/include/executor/nodePartialSeqscan.h b/src/include/executor/nodePartialSeqscan.h
new file mode 100644
index 0000000..77e5311
--- /dev/null
+++ b/src/include/executor/nodePartialSeqscan.h
@@ -0,0 +1,31 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodePartialSeqscan.h
+ *		prototypes for nodePartialSeqscan.c
+ *
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodePartialSeqscan.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEPARTIALSEQSCAN_H
+#define NODEPARTIALSEQSCAN_H
+
+#include "nodes/execnodes.h"
+
+extern void ExecPartialSeqScanEstimate(PartialSeqScanState *node,
+						   ParallelContext *pcxt);
+extern void ExecPartialSeqScanInitializeDSM(PartialSeqScanState *node,
+								ParallelContext *pcxt);
+extern void ExecPartialSeqScanInitParallelScanDesc(PartialSeqScanState *node,
+									   shm_toc *toc);
+extern PartialSeqScanState *ExecInitPartialSeqScan(PartialSeqScan *node,
+					   EState *estate, int eflags);
+extern TupleTableSlot *ExecPartialSeqScan(PartialSeqScanState *node);
+extern void ExecEndPartialSeqScan(PartialSeqScanState *node);
+extern void ExecReScanPartialSeqScan(PartialSeqScanState *node);
+
+#endif   /* NODEPARTIALSEQSCAN_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 4fcdcc4..d71892e 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -16,6 +16,7 @@
 
 #include "access/genam.h"
 #include "access/heapam.h"
+#include "access/parallel.h"
 #include "executor/instrument.h"
 #include "lib/pairingheap.h"
 #include "nodes/params.h"
@@ -1254,6 +1255,18 @@ typedef struct ScanState
  */
 typedef ScanState SeqScanState;
 
+/*
+ * PartialSeqScanState extends ScanState by storing additional information
+ * related to scan.
+ */
+typedef struct PartialSeqScanState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	ParallelHeapScanDesc	pscan;	/* parallel heap scan descriptor
+									 * for partial scan */
+	Size		pscan_len;		/* size of parallel heap scan descriptor */
+} PartialSeqScanState;
+
 /* ----------------
  *	 SampleScanState information
  * ----------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 94bdb7c..71496b9 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -51,6 +51,7 @@ typedef enum NodeTag
 	T_BitmapOr,
 	T_Scan,
 	T_SeqScan,
+	T_PartialSeqScan,
 	T_SampleScan,
 	T_IndexScan,
 	T_IndexOnlyScan,
@@ -99,6 +100,7 @@ typedef enum NodeTag
 	T_BitmapOrState,
 	T_ScanState,
 	T_SeqScanState,
+	T_PartialSeqScanState,
 	T_SampleScanState,
 	T_IndexScanState,
 	T_IndexOnlyScanState,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 6b28c8e..bb41b8e 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -72,7 +72,7 @@ typedef struct PlannedStmt
 
 	bool		hasRowSecurity; /* row security applied? */
 
-	bool		parallelModeNeeded; /* parallel mode required to execute? */
+	bool		parallelModeNeeded;		/* parallel mode required to execute? */
 } PlannedStmt;
 
 /* macro for fetching the Plan associated with a SubPlan node */
@@ -287,6 +287,12 @@ typedef struct Scan
 typedef Scan SeqScan;
 
 /* ----------------
+ *		partial sequential scan node
+ * ----------------
+ */
+typedef SeqScan PartialSeqScan;
+
+/* ----------------
  *		table sample scan node
  * ----------------
  */
diff --git a/src/include/optimizer/clauses.h b/src/include/optimizer/clauses.h
index 5ac79b1..747b05b 100644
--- a/src/include/optimizer/clauses.h
+++ b/src/include/optimizer/clauses.h
@@ -62,7 +62,8 @@ extern bool contain_subplans(Node *clause);
 extern bool contain_mutable_functions(Node *clause);
 extern bool contain_volatile_functions(Node *clause);
 extern bool contain_volatile_functions_not_nextval(Node *clause);
-extern bool contain_parallel_unsafe(Node *node);
+extern bool check_parallel_safety(Node *node, bool allow_restricted);
+extern bool contain_subplans_or_initplans(Node *clause);
 extern bool contain_nonstrict_functions(Node *clause);
 extern bool contain_leaked_vars(Node *clause);
 
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 25a7303..8640567 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -73,6 +73,9 @@ extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,
 					double index_pages, PlannerInfo *root);
 extern void cost_seqscan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 			 ParamPathInfo *param_info);
+extern void cost_partialseqscan(Path *path, PlannerInfo *root,
+					RelOptInfo *baserel, ParamPathInfo *param_info,
+					int nworkers);
 extern void cost_samplescan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 				ParamPathInfo *param_info);
 extern void cost_index(IndexPath *path, PlannerInfo *root,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 7a4940c..3b97b73 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -32,6 +32,8 @@ extern bool add_path_precheck(RelOptInfo *parent_rel,
 
 extern Path *create_seqscan_path(PlannerInfo *root, RelOptInfo *rel,
 					Relids required_outer);
+extern Path *create_partialseqscan_path(PlannerInfo *root, RelOptInfo *rel,
+						   Relids required_outer, int nworkers);
 extern Path *create_samplescan_path(PlannerInfo *root, RelOptInfo *rel,
 					   Relids required_outer);
 extern IndexPath *create_index_path(PlannerInfo *root,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 87123a5..6cd4479 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -55,6 +55,15 @@ extern void debug_print_rel(PlannerInfo *root, RelOptInfo *rel);
 #endif
 
 /*
+ * parallelpath.c
+ *	  routines to generate parallel scan paths
+ */
+
+extern void create_parallelscan_paths(PlannerInfo *root, RelOptInfo *rel,
+						  Relids required_outer);
+extern bool expr_is_parallel_safe(Node *node);
+
+/*
  * indxpath.c
  *	  routines to generate index paths
  */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index feb821b..5492ba0 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1199,6 +1199,8 @@ OverrideStackEntry
 PACE_HEADER
 PACL
 ParallelExecutorInfo
+PartialSeqScan
+PartialSeqScanState
 PATH
 PBOOL
 PCtxtHandle
#427Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#425)
Re: Parallel Seq Scan

On Fri, Oct 23, 2015 at 3:35 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Considering parallelism at the RelOptInfo level, the way the patch does it,
won't cover the RelOptInfos for child relations in the case of an Append node.
Refer to build_simple_rel().

Hmm, true, but what can go wrong there? The same quals apply to both,
and either both are temp or neither is.

Also, when parallelism is not enabled (e.g. max_parallel_degree = 0),
the current approach adds the overhead of traversing baserestrictinfo
needlessly. I think one way to avoid that would be to check for it
while setting the parallelModeOK flag.

Good idea.

Another point is that it will consider parallelism for cases where we really
can't parallelize, for example foreign tables and sample scans.

As soon as we add the ability to push joins below Gather nodes, we
will be able to parallelize that stuff if it is joined to something we
can parallelize. That's why this flag is so handy.

One thing to note here is that we already have a precedent for verifying qual
pushdown safety during path generation (for subquery paths),
so it doesn't seem wrong to do the same for parallel paths, and it would
minimize the cases where we need to evaluate parallelism.

Mmm, yeah.

The advantage of this is that the logic is centralized. If we have
parallel seq scan and also, say, parallel bitmap heap scan, your
approach would require that we duplicate the logic to check for
parallel-restricted functions for each path generation function.

Don't we need that anyway, irrespective of caching it in RelOptInfo?
During bitmappath creation, bitmapqual could contain something
which needs to be evaluated for parallel-safety, as it is built based
on index paths, which in turn can be based on some join clause. As per
the patch, the join clause's parallel-safety is checked much later than
bitmappath generation.

Yes, it's possible there could be some additional checks needed here
for parameterized paths. But we're not quite there yet, so I think we
can solve that problem when we get there. I have it in mind that in
the future we may want a parallel_safe flag on each path, which would
normally match the consider_parallel flag on the RelOptInfo but could
instead be false if the path internally uses parallelism (since,
currently, Gather nodes cannot be nested) or if it's got
parallel-restricted parameterized quals. However, that seems like
future work.
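
To make that idea concrete, here is a rough sketch of how such a per-path flag could be derived from the rel-level flag. Nothing like this exists in the patch; the struct and field names (SketchRel, SketchPath, uses_gather, has_restricted_param_quals) are invented purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for RelOptInfo: only the flag we care about. */
typedef struct
{
	bool		consider_parallel;	/* rel-level result, computed once */
} SketchRel;

/* Hypothetical stand-in for Path. */
typedef struct
{
	const SketchRel *parent;
	bool		uses_gather;	/* path already contains a Gather node */
	bool		has_restricted_param_quals; /* parameterized by restricted quals */
} SketchPath;

/*
 * The path-level flag normally matches the rel-level one, but is forced
 * to false when the path uses parallelism internally (Gather nodes cannot
 * be nested) or carries parallel-restricted parameterized quals.
 */
static bool
path_parallel_safe(const SketchPath *path)
{
	assert(path != NULL && path->parent != NULL);

	if (!path->parent->consider_parallel)
		return false;
	if (path->uses_gather)
		return false;
	return !path->has_restricted_param_quals;
}
```

The point of the split is that the rel-level check stays centralized, while each path only records the few reasons it might differ.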

+ else if (IsA(node, SubPlan) || IsA(node, SubLink) ||
+ IsA(node, AlternativeSubPlan) || IsA(node, Param))
+ {
+ /*
+ * Since we don't have the ability to push subplans down to workers
+ * at present, we treat subplan references as parallel-restricted.
+ */
+ if (!context->allow_restricted)
+ return true;
+ }

I think it is better to do this for PARAM_EXEC paramkind, as those are
the cases where it would be subplan or initplan.

Right, OK.
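
A minimal sketch of the rule being agreed on here, written as a standalone C fragment rather than as the patch's actual tree walker. The marker characters are the ones pg_proc uses for proparallel; the ParamKind enum is a trimmed copy, and param_too_dangerous is a hypothetical helper name used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Same characters pg_proc uses for proparallel. */
#define PROPARALLEL_SAFE		's'
#define PROPARALLEL_RESTRICTED	'r'
#define PROPARALLEL_UNSAFE		'u'

/* Trimmed stand-in for the planner's ParamKind enum. */
typedef enum
{
	PARAM_EXTERN,
	PARAM_EXEC
} ParamKind;

typedef struct
{
	bool		allow_restricted;	/* may the caller accept restricted items? */
} check_parallel_safety_arg;

/* Unsafe is always rejected; restricted only if the caller forbids it. */
static bool
parallel_too_dangerous(char proparallel,
					   const check_parallel_safety_arg *context)
{
	assert(context != NULL);

	if (context->allow_restricted)
		return proparallel == PROPARALLEL_UNSAFE;
	return proparallel != PROPARALLEL_SAFE;
}

/*
 * Hypothetical helper: only PARAM_EXEC params (initplan/subplan outputs)
 * are treated as parallel-restricted; other kinds are left alone here.
 */
static bool
param_too_dangerous(ParamKind kind, const check_parallel_safety_arg *context)
{
	if (kind != PARAM_EXEC)
		return false;
	return parallel_too_dangerous(PROPARALLEL_RESTRICTED, context);
}
```

With allow_restricted set, a PARAM_EXEC reference passes; without it, the expression is rejected, while PARAM_EXTERN never trips this particular check.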

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#428Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#427)
Re: Parallel Seq Scan

On Fri, Oct 23, 2015 at 5:45 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Oct 23, 2015 at 3:35 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Considering parallelism at the RelOptInfo level, the way the patch does it,
won't cover the RelOptInfos for child relations in the case of an Append node.
Refer to build_simple_rel().

Hmm, true, but what can go wrong there? The same quals apply to both,
and either both are temp or neither is.

The base rel's consider_parallel flag won't be propagated to the child rels,
so even if we mark the base rel as parallel-capable, it won't be considered
while generating the paths. I think we need to find a way to pass that
information on if we want to follow this approach.

The advantage of this is that the logic is centralized. If we have
parallel seq scan and also, say, parallel bitmap heap scan, your
approach would require that we duplicate the logic to check for
parallel-restricted functions for each path generation function.

Don't we need that anyway, irrespective of caching it in RelOptInfo?
During bitmappath creation, bitmapqual could contain something
which needs to be evaluated for parallel-safety, as it is built based
on index paths, which in turn can be based on some join clause. As per
the patch, the join clause's parallel-safety is checked much later than
bitmappath generation.

Yes, it's possible there could be some additional checks needed here
for parameterized paths. But we're not quite there yet, so I think we
can solve that problem when we get there. I have it in mind that in
the future we may want a parallel_safe flag on each path, which would
normally match the consider_parallel flag on the RelOptInfo but could
instead be false if the path internally uses parallelism (since,
currently, Gather nodes cannot be nested) or if it's got
parallel-restricted parameterized quals. However, that seems like
future work.

True, we can do it that way. What I was trying to convey above is
that we need checks during path creation anyway, at least in some
of the cases, so why not do all the checks at that time, since I
think all the information will be available then.

I think storing parallelism-related info in RelOptInfo can also
be made to work, but my only worry with that approach is that we
need checks at two levels: one at RelOptInfo formation time
and another at Path formation time.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#429Noah Misch
noah@leadboat.com
In reply to: Robert Haas (#421)
Re: Parallel Seq Scan

On Thu, Oct 22, 2015 at 11:59:58PM -0400, Robert Haas wrote:

On Thu, Oct 15, 2015 at 8:23 PM, Noah Misch <noah@leadboat.com> wrote:

Agreed. More specifically, I had in mind for copyParamList() to check the
mask while e.g. ExecEvalParamExtern() would either check nothing or merely
assert that any mask included the requested parameter. It would be tricky to
verify that as safe, so ...

Would it work to define this as "if non-NULL,
params lacking a 1-bit may be safely ignored"? Or some other tweak
that basically says that you don't need to care about this, but you
can if you want to.

... this is a better specification.

Here's an attempt to implement that.

Since that specification permits ParamListInfo consumers to ignore paramMask,
the plpgsql_param_fetch() change from copy-paramlistinfo-fixes.patch is still
formally required.

@@ -50,6 +51,7 @@ copyParamList(ParamListInfo from)
retval->parserSetup = NULL;
retval->parserSetupArg = NULL;
retval->numParams = from->numParams;
+	retval->paramMask = bms_copy(from->paramMask);

Considering that this function squashes the masked params, I wonder if it
should just store NULL here.

for (i = 0; i < from->numParams; i++)
{
@@ -58,6 +60,20 @@ copyParamList(ParamListInfo from)
int16 typLen;
bool typByVal;

+		/*
+		 * Ignore parameters we don't need, to save cycles and space, and
+		 * in case the fetch hook might fail.
+		 */
+		if (retval->paramMask != NULL &&
+			!bms_is_member(i, retval->paramMask))

The "and in case the fetch hook might fail" in this comment and its clones is
contrary to the above specification. Under that specification, it would be a
bug in the ParamListInfo producer to rely on consumers checking paramMask.
Saving cycles/space would be the spec-approved paramMask use.

Consider adding an XXX comment to the effect that cursors ought to stop using
unshared param lists. The leading comment at setup_unshared_param_list() is a
good home for such an addition.
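
As a rough illustration of the "if non-NULL, params lacking a 1-bit may be safely ignored" specification, here is a standalone sketch of a copy routine that honors such a mask. The types are simplified stand-ins and not the PostgreSQL ones: a pointer to a uint64_t plays the role of the Bitmapset (NULL meaning "no mask"), and SketchParam, param_is_needed and copy_params are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified parameter slot; the real ParamExternData has more fields. */
typedef struct
{
	long		value;
	bool		isnull;
} SketchParam;

/* mask == NULL means "no mask": every parameter must be honored. */
static bool
param_is_needed(const uint64_t *mask, int i)
{
	return mask == NULL || ((*mask >> i) & 1) != 0;
}

/*
 * Copy n (at most 64) params, consulting the source list's mask.
 * Masked-out entries become NULL placeholders, so the copy itself no
 * longer needs a mask.  Returns the number of parameters really copied.
 */
static int
copy_params(const SketchParam *from, SketchParam *to, int n,
			const uint64_t *mask)
{
	int			copied = 0;

	assert(n >= 0 && n <= 64);

	for (int i = 0; i < n; i++)
	{
		if (!param_is_needed(mask, i))
		{
			to[i].value = 0;
			to[i].isnull = true;
			continue;
		}
		to[i] = from[i];
		copied++;
	}
	return copied;
}
```

A consumer that ignores the mask entirely would still be correct under this specification; it would only do more work than necessary.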


#430Robert Haas
robertmhaas@gmail.com
In reply to: Noah Misch (#429)
Re: Parallel Seq Scan

On Fri, Oct 23, 2015 at 9:38 PM, Noah Misch <noah@leadboat.com> wrote:

Since that specification permits ParamListInfo consumers to ignore paramMask,
the plpgsql_param_fetch() change from copy-paramlistinfo-fixes.patch is still
formally required.

So why am I not just doing that, then? Seems a lot more surgical.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#431Noah Misch
noah@leadboat.com
In reply to: Robert Haas (#430)
Re: Parallel Seq Scan

On Sat, Oct 24, 2015 at 07:49:07AM -0400, Robert Haas wrote:

On Fri, Oct 23, 2015 at 9:38 PM, Noah Misch <noah@leadboat.com> wrote:

Since that specification permits ParamListInfo consumers to ignore paramMask,
the plpgsql_param_fetch() change from copy-paramlistinfo-fixes.patch is still
formally required.

So why am I not just doing that, then? Seems a lot more surgical.

do $$
declare
param_unused text := repeat('a', 100 * 1024 * 1024);
param_used oid := 403;
begin
perform count(*) from pg_am where oid = param_used;
end
$$;

I expect that if you were to inspect the EstimateParamListSpace() return
values when executing that, you would find that it serializes the irrelevant
100 MiB datum. No possible logic in plpgsql_param_fetch() could stop that
from happening, because copyParamList() and SerializeParamList() call the
paramFetch hook only for dynamic parameters. Cursors faced the same problem,
which is the raison d'être for setup_unshared_param_list().
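
The effect can be sketched with a toy space estimator. This is only an approximation of the idea, not EstimateParamListSpace() itself: SketchParam, SKETCH_HEADER_BYTES and estimate_space are hypothetical names, and a pointer to a uint64_t stands in for a mask of used parameters (NULL meaning no mask is available).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy parameter: we only track how many bytes its datum would occupy. */
typedef struct
{
	size_t		len;
} SketchParam;

/* Fixed per-parameter overhead, loosely modelled on type OID + flags. */
#define SKETCH_HEADER_BYTES (sizeof(uint32_t) + sizeof(uint16_t))

/*
 * Sum the bytes needed to serialize n (at most 64) params.  Without a
 * mask every datum is counted; with one, unused params cost only their
 * fixed header.
 */
static size_t
estimate_space(const SketchParam *params, int n, const uint64_t *mask)
{
	size_t		sz = 0;

	assert(n >= 0 && n <= 64);

	for (int i = 0; i < n; i++)
	{
		bool		needed = (mask == NULL) || ((*mask >> i) & 1) != 0;

		sz += SKETCH_HEADER_BYTES;
		if (needed)
			sz += params[i].len;
	}
	return sz;
}
```

Modelling the DO block above as two params, a 100 MiB text at index 0 and a 4-byte oid at index 1, the unmasked estimate includes the full 100 MiB, while a mask with only bit 1 set leaves just the two headers plus 4 bytes.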


#432Robert Haas
robertmhaas@gmail.com
In reply to: Noah Misch (#431)
1 attachment(s)
Re: Parallel Seq Scan

On Sat, Oct 24, 2015 at 6:31 PM, Noah Misch <noah@leadboat.com> wrote:

On Sat, Oct 24, 2015 at 07:49:07AM -0400, Robert Haas wrote:

On Fri, Oct 23, 2015 at 9:38 PM, Noah Misch <noah@leadboat.com> wrote:

Since that specification permits ParamListInfo consumers to ignore paramMask,
the plpgsql_param_fetch() change from copy-paramlistinfo-fixes.patch is still
formally required.

So why am I not just doing that, then? Seems a lot more surgical.

do $$
declare
param_unused text := repeat('a', 100 * 1024 * 1024);
param_used oid := 403;
begin
perform count(*) from pg_am where oid = param_used;
end
$$;

I expect that if you were to inspect the EstimateParamListSpace() return
values when executing that, you would find that it serializes the irrelevant
100 MiB datum. No possible logic in plpgsql_param_fetch() could stop that
from happening, because copyParamList() and SerializeParamList() call the
paramFetch hook only for dynamic parameters. Cursors faced the same problem,
which is the raison d'être for setup_unshared_param_list().

Well, OK. That's not strictly a correctness issue, but here's an
updated patch along the lines you suggested.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

serialize-paramlistinfo-fixes-v2.patch (application/x-patch)
From 50895be5cdbb0fda41535be23700e5112585e1e3 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Thu, 22 Oct 2015 23:56:51 -0400
Subject: [PATCH 6/6] Fix problems with ParamListInfo serialization mechanism.

Commit d1b7c1ffe72e86932b5395f29e006c3f503bc53d introduced a mechanism
for serializing a ParamListInfo structure to be passed to a parallel
worker.  However, this mechanism failed to handle external expanded
values, as pointed out by Noah Misch.  Repair.

Moreover, plpgsql_param_fetch requires adjustment because the
serialization mechanism needs it to skip evaluating unused parameters
just as we would do when it is called from copyParamList, but params
== estate->paramLI in that case.  To fix, make the bms_is_member test
in that function unconditional.

Finally, have setup_param_list set a new ParamListInfo field,
paramMask, to the parameters actually used in the expression, so that
we don't try to fetch those that are not needed when serializing a
parameter list.  This isn't necessary for correctness, but it makes
the performance of the parallel executor code comparable to what we
do for cases involving cursors.
---
 src/backend/commands/prepare.c   |  1 +
 src/backend/executor/functions.c |  1 +
 src/backend/executor/spi.c       |  1 +
 src/backend/nodes/params.c       | 54 ++++++++++++++++++++++++++++++++--------
 src/backend/tcop/postgres.c      |  1 +
 src/backend/utils/adt/datum.c    | 16 ++++++++++++
 src/include/nodes/params.h       |  4 ++-
 src/pl/plpgsql/src/pl_exec.c     | 40 ++++++++++++++++-------------
 8 files changed, 89 insertions(+), 29 deletions(-)

diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index fb33d30..0d4aa69 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -392,6 +392,7 @@ EvaluateParams(PreparedStatement *pstmt, List *params,
 	paramLI->parserSetup = NULL;
 	paramLI->parserSetupArg = NULL;
 	paramLI->numParams = num_params;
+	paramLI->paramMask = NULL;
 
 	i = 0;
 	foreach(l, exprstates)
diff --git a/src/backend/executor/functions.c b/src/backend/executor/functions.c
index 812a610..0919c04 100644
--- a/src/backend/executor/functions.c
+++ b/src/backend/executor/functions.c
@@ -910,6 +910,7 @@ postquel_sub_params(SQLFunctionCachePtr fcache,
 			paramLI->parserSetup = NULL;
 			paramLI->parserSetupArg = NULL;
 			paramLI->numParams = nargs;
+			paramLI->paramMask = NULL;
 			fcache->paramLI = paramLI;
 		}
 		else
diff --git a/src/backend/executor/spi.c b/src/backend/executor/spi.c
index 300401e..13ddb8f 100644
--- a/src/backend/executor/spi.c
+++ b/src/backend/executor/spi.c
@@ -2330,6 +2330,7 @@ _SPI_convert_params(int nargs, Oid *argtypes,
 		paramLI->parserSetup = NULL;
 		paramLI->parserSetupArg = NULL;
 		paramLI->numParams = nargs;
+		paramLI->paramMask = NULL;
 
 		for (i = 0; i < nargs; i++)
 		{
diff --git a/src/backend/nodes/params.c b/src/backend/nodes/params.c
index d093263..0351774 100644
--- a/src/backend/nodes/params.c
+++ b/src/backend/nodes/params.c
@@ -15,6 +15,7 @@
 
 #include "postgres.h"
 
+#include "nodes/bitmapset.h"
 #include "nodes/params.h"
 #include "storage/shmem.h"
 #include "utils/datum.h"
@@ -50,6 +51,7 @@ copyParamList(ParamListInfo from)
 	retval->parserSetup = NULL;
 	retval->parserSetupArg = NULL;
 	retval->numParams = from->numParams;
+	retval->paramMask = NULL;
 
 	for (i = 0; i < from->numParams; i++)
 	{
@@ -58,6 +60,17 @@ copyParamList(ParamListInfo from)
 		int16		typLen;
 		bool		typByVal;
 
+		/* Ignore parameters we don't need, to save cycles and space. */
+		if (retval->paramMask != NULL &&
+			!bms_is_member(i, retval->paramMask))
+		{
+			nprm->value = (Datum) 0;
+			nprm->isnull = true;
+			nprm->pflags = 0;
+			nprm->ptype = InvalidOid;
+			continue;
+		}
+
 		/* give hook a chance in case parameter is dynamic */
 		if (!OidIsValid(oprm->ptype) && from->paramFetch != NULL)
 			(*from->paramFetch) (from, i + 1);
@@ -90,19 +103,28 @@ EstimateParamListSpace(ParamListInfo paramLI)
 	for (i = 0; i < paramLI->numParams; i++)
 	{
 		ParamExternData *prm = &paramLI->params[i];
+		Oid			typeOid;
 		int16		typLen;
 		bool		typByVal;
 
-		/* give hook a chance in case parameter is dynamic */
-		if (!OidIsValid(prm->ptype) && paramLI->paramFetch != NULL)
-			(*paramLI->paramFetch) (paramLI, i + 1);
+		/* Ignore parameters we don't need, to save cycles and space. */
+		if (paramLI->paramMask != NULL &&
+			!bms_is_member(i, paramLI->paramMask))
+			typeOid = InvalidOid;
+		else
+		{
+			/* give hook a chance in case parameter is dynamic */
+			if (!OidIsValid(prm->ptype) && paramLI->paramFetch != NULL)
+				(*paramLI->paramFetch) (paramLI, i + 1);
+			typeOid = prm->ptype;
+		}
 
 		sz = add_size(sz, sizeof(Oid));			/* space for type OID */
 		sz = add_size(sz, sizeof(uint16));		/* space for pflags */
 
 		/* space for datum/isnull */
-		if (OidIsValid(prm->ptype))
-			get_typlenbyval(prm->ptype, &typLen, &typByVal);
+		if (OidIsValid(typeOid))
+			get_typlenbyval(typeOid, &typLen, &typByVal);
 		else
 		{
 			/* If no type OID, assume by-value, like copyParamList does. */
@@ -150,15 +172,24 @@ SerializeParamList(ParamListInfo paramLI, char **start_address)
 	for (i = 0; i < nparams; i++)
 	{
 		ParamExternData *prm = &paramLI->params[i];
+		Oid			typeOid;
 		int16		typLen;
 		bool		typByVal;
 
-		/* give hook a chance in case parameter is dynamic */
-		if (!OidIsValid(prm->ptype) && paramLI->paramFetch != NULL)
-			(*paramLI->paramFetch) (paramLI, i + 1);
+		/* Ignore parameters we don't need, to save cycles and space. */
+		if (paramLI->paramMask != NULL &&
+			!bms_is_member(i, paramLI->paramMask))
+			typeOid = InvalidOid;
+		else
+		{
+			/* give hook a chance in case parameter is dynamic */
+			if (!OidIsValid(prm->ptype) && paramLI->paramFetch != NULL)
+				(*paramLI->paramFetch) (paramLI, i + 1);
+			typeOid = prm->ptype;
+		}
 
 		/* Write type OID. */
-		memcpy(*start_address, &prm->ptype, sizeof(Oid));
+		memcpy(*start_address, &typeOid, sizeof(Oid));
 		*start_address += sizeof(Oid);
 
 		/* Write flags. */
@@ -166,8 +197,8 @@ SerializeParamList(ParamListInfo paramLI, char **start_address)
 		*start_address += sizeof(uint16);
 
 		/* Write datum/isnull. */
-		if (OidIsValid(prm->ptype))
-			get_typlenbyval(prm->ptype, &typLen, &typByVal);
+		if (OidIsValid(typeOid))
+			get_typlenbyval(typeOid, &typLen, &typByVal);
 		else
 		{
 			/* If no type OID, assume by-value, like copyParamList does. */
@@ -209,6 +240,7 @@ RestoreParamList(char **start_address)
 	paramLI->parserSetup = NULL;
 	paramLI->parserSetupArg = NULL;
 	paramLI->numParams = nparams;
+	paramLI->paramMask = NULL;
 
 	for (i = 0; i < nparams; i++)
 	{
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index d30fe35..f11a715 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -1629,6 +1629,7 @@ exec_bind_message(StringInfo input_message)
 		params->parserSetup = NULL;
 		params->parserSetupArg = NULL;
 		params->numParams = numParams;
+		params->paramMask = NULL;
 
 		for (paramno = 0; paramno < numParams; paramno++)
 		{
diff --git a/src/backend/utils/adt/datum.c b/src/backend/utils/adt/datum.c
index 3d9e354..0d61950 100644
--- a/src/backend/utils/adt/datum.c
+++ b/src/backend/utils/adt/datum.c
@@ -264,6 +264,11 @@ datumEstimateSpace(Datum value, bool isnull, bool typByVal, int typLen)
 		/* no need to use add_size, can't overflow */
 		if (typByVal)
 			sz += sizeof(Datum);
+		else if (VARATT_IS_EXTERNAL_EXPANDED(value))
+		{
+			ExpandedObjectHeader *eoh = DatumGetEOHP(value);
+			sz += EOH_get_flat_size(eoh);
+		}
 		else
 			sz += datumGetSize(value, typByVal, typLen);
 	}
@@ -292,6 +297,7 @@ void
 datumSerialize(Datum value, bool isnull, bool typByVal, int typLen,
 			   char **start_address)
 {
+	ExpandedObjectHeader *eoh = NULL;
 	int		header;
 
 	/* Write header word. */
@@ -299,6 +305,11 @@ datumSerialize(Datum value, bool isnull, bool typByVal, int typLen,
 		header = -2;
 	else if (typByVal)
 		header = -1;
+	else if (VARATT_IS_EXTERNAL_EXPANDED(value))
+	{
+		eoh = DatumGetEOHP(value);
+		header = EOH_get_flat_size(eoh);
+	}
 	else
 		header = datumGetSize(value, typByVal, typLen);
 	memcpy(*start_address, &header, sizeof(int));
@@ -312,6 +323,11 @@ datumSerialize(Datum value, bool isnull, bool typByVal, int typLen,
 			memcpy(*start_address, &value, sizeof(Datum));
 			*start_address += sizeof(Datum);
 		}
+		else if (eoh)
+		{
+			EOH_flatten_into(eoh, (void *) *start_address, header);
+			*start_address += header;
+		}
 		else
 		{
 			memcpy(*start_address, DatumGetPointer(value), header);
diff --git a/src/include/nodes/params.h b/src/include/nodes/params.h
index 83bebde..2beae5f 100644
--- a/src/include/nodes/params.h
+++ b/src/include/nodes/params.h
@@ -14,7 +14,8 @@
 #ifndef PARAMS_H
 #define PARAMS_H
 
-/* To avoid including a pile of parser headers, reference ParseState thus: */
+/* Forward declarations, to avoid including other headers */
+struct Bitmapset;
 struct ParseState;
 
 
@@ -71,6 +72,7 @@ typedef struct ParamListInfoData
 	ParserSetupHook parserSetup;	/* parser setup hook */
 	void	   *parserSetupArg;
 	int			numParams;		/* number of ParamExternDatas following */
+	struct Bitmapset *paramMask; /* if non-NULL, can ignore omitted params */
 	ParamExternData params[FLEXIBLE_ARRAY_MEMBER];
 }	ParamListInfoData;
 
diff --git a/src/pl/plpgsql/src/pl_exec.c b/src/pl/plpgsql/src/pl_exec.c
index c73f20b..0b82e65 100644
--- a/src/pl/plpgsql/src/pl_exec.c
+++ b/src/pl/plpgsql/src/pl_exec.c
@@ -3287,6 +3287,7 @@ plpgsql_estate_setup(PLpgSQL_execstate *estate,
 	estate->paramLI->parserSetup = (ParserSetupHook) plpgsql_parser_setup;
 	estate->paramLI->parserSetupArg = NULL;		/* filled during use */
 	estate->paramLI->numParams = estate->ndatums;
+	estate->paramLI->paramMask = NULL;
 	estate->params_dirty = false;
 
 	/* set up for use of appropriate simple-expression EState and cast hash */
@@ -5559,6 +5560,12 @@ setup_param_list(PLpgSQL_execstate *estate, PLpgSQL_expr *expr)
 		paramLI->parserSetupArg = (void *) expr;
 
 		/*
+		 * Allow parameters that aren't needed by this expression to be
+		 * ignored.
+		 */
+		paramLI->paramMask = expr->paramnos;
+
+		/*
 		 * Also make sure this is set before parser hooks need it.  There is
 		 * no need to save and restore, since the value is always correct once
 		 * set.  (Should be set already, but let's be sure.)
@@ -5592,6 +5599,9 @@ setup_param_list(PLpgSQL_execstate *estate, PLpgSQL_expr *expr)
  * shared param list, where it could get passed to some less-trusted function.
  *
  * Caller should pfree the result after use, if it's not NULL.
+ *
+ * XXX. Could we use ParamListInfo's new paramMask to avoid creating unshared
+ * parameter lists?
  */
 static ParamListInfo
 setup_unshared_param_list(PLpgSQL_execstate *estate, PLpgSQL_expr *expr)
@@ -5623,6 +5633,7 @@ setup_unshared_param_list(PLpgSQL_execstate *estate, PLpgSQL_expr *expr)
 		paramLI->parserSetup = (ParserSetupHook) plpgsql_parser_setup;
 		paramLI->parserSetupArg = (void *) expr;
 		paramLI->numParams = estate->ndatums;
+		paramLI->paramMask = NULL;
 
 		/*
 		 * Instantiate values for "safe" parameters of the expression.  We
@@ -5696,25 +5707,20 @@ plpgsql_param_fetch(ParamListInfo params, int paramid)
 	/* now we can access the target datum */
 	datum = estate->datums[dno];
 
-	/* need to behave slightly differently for shared and unshared arrays */
-	if (params != estate->paramLI)
-	{
-		/*
-		 * We're being called, presumably from copyParamList(), for cursor
-		 * parameters.  Since copyParamList() will try to materialize every
-		 * single parameter slot, it's important to do nothing when asked for
-		 * a datum that's not supposed to be used by this SQL expression.
-		 * Otherwise we risk failures in exec_eval_datum(), not to mention
-		 * possibly copying a lot more data than the cursor actually uses.
-		 */
-		if (!bms_is_member(dno, expr->paramnos))
-			return;
-	}
-	else
+	/*
+	 * Since copyParamList() or SerializeParamList() will try to materialize
+	 * every single parameter slot, it's important to do nothing when asked
+	 * for a datum that's not supposed to be used by this SQL expression.
+	 * Otherwise we risk failures in exec_eval_datum(), or copying a lot more
+	 * data than necessary.
+	 */
+	if (!bms_is_member(dno, expr->paramnos))
+		return;
+
+	if (params == estate->paramLI)
 	{
 		/*
-		 * Normal evaluation cases.  We don't need to sanity-check dno, but we
-		 * do need to mark the shared params array dirty if we're about to
+		 * We need to mark the shared params array dirty if we're about to
 		 * evaluate a resettable datum.
 		 */
 		switch (datum->dtype)
-- 
2.3.8 (Apple Git-58)
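
For readers following the datum.c hunks above: the serialized form is a header word followed by the payload, where the header is -2 for a NULL, -1 for a pass-by-value datum, and otherwise the payload length in bytes. The sketch below is a standalone illustration of that layout only. `DemoValue` and the `demo_*` functions are hypothetical names, not PostgreSQL APIs, and the expanded-object case (where the patch uses the flattened size as the length and flattens into the buffer) is left out for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a value to serialize; not PostgreSQL's Datum. */
typedef struct DemoValue
{
	int			isnull;		/* nonzero if the value is NULL */
	int			byval;		/* nonzero if the value is passed by value */
	long		word;		/* payload when byval is set */
	const char *bytes;		/* payload when passed by reference */
	int			len;		/* length of bytes[] */
} DemoValue;

/* Header word: -2 for NULL, -1 for by-value, else the payload length. */
static int
demo_header(const DemoValue *v)
{
	if (v->isnull)
		return -2;
	if (v->byval)
		return -1;
	return v->len;
}

/* Space needed: the header word plus the payload (none for NULL). */
static size_t
demo_estimate(const DemoValue *v)
{
	size_t		sz = sizeof(int);

	if (!v->isnull)
		sz += v->byval ? sizeof(long) : (size_t) v->len;
	return sz;
}

/* Write header and payload at *pos, advancing *pos past the bytes written. */
static void
demo_serialize(const DemoValue *v, char **pos)
{
	int			header = demo_header(v);

	memcpy(*pos, &header, sizeof(int));
	*pos += sizeof(int);

	if (v->isnull)
		return;					/* NULL carries no payload */

	if (v->byval)
	{
		memcpy(*pos, &v->word, sizeof(long));
		*pos += sizeof(long);
	}
	else
	{
		assert(v->len >= 0);
		memcpy(*pos, v->bytes, (size_t) header);
		*pos += header;
	}
}
```

The property worth checking in any such pair of routines is that the estimate and the serializer agree on the byte count. That is the reason the patch teaches both datumEstimateSpace and datumSerialize about expanded objects in the same commit.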

#433Noah Misch
noah@leadboat.com
In reply to: Robert Haas (#432)
Re: Parallel Seq Scan

On Wed, Oct 28, 2015 at 01:04:12AM +0100, Robert Haas wrote:

Well, OK. That's not strictly a correctness issue, but here's an
updated patch along the lines you suggested.

Finally, have setup_param_list set a new ParamListInfo field,
paramMask, to the parameters actually used in the expression, so that
we don't try to fetch those that are not needed when serializing a
parameter list. This isn't necessary for performance, but it makes

s/performance/correctness/

the performance of the parallel executor code comparable to what we
do for cases involving cursors.

With that, the patch is ready.
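
To make the paramMask idea discussed above concrete, here is a minimal standalone sketch, not the committed code. It assumes a 32-bit mask in place of a Bitmapset, and all `Demo*`/`demo_*` names are hypothetical. The serializer only invokes the fetch hook for parameters the mask marks as needed, and the absence of a mask means every parameter is needed.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_MAX_PARAMS 32

/* Hypothetical, simplified parameter list; not ParamListInfoData. */
typedef struct DemoParamList
{
	int			numParams;
	int			hasMask;	/* 0 means no mask: every parameter is needed */
	uint32_t	mask;		/* bit i set => parameter i is needed */
	int			fetched[DEMO_MAX_PARAMS];	/* fetch calls per slot */
} DemoParamList;

/* A parameter is needed if there is no mask or its bit is set. */
static int
demo_param_needed(const DemoParamList *pl, int i)
{
	return !pl->hasMask || ((pl->mask >> i) & 1u);
}

/* Stand-in for a fetch hook; it only records that it was called. */
static void
demo_fetch(DemoParamList *pl, int i)
{
	pl->fetched[i]++;
}

/* Walk the list as a serializer would; return how many slots were fetched. */
static int
demo_serialize_params(DemoParamList *pl)
{
	int			i;
	int			nfetched = 0;

	assert(pl->numParams >= 0 && pl->numParams <= DEMO_MAX_PARAMS);

	for (i = 0; i < pl->numParams; i++)
	{
		if (!demo_param_needed(pl, i))
			continue;			/* omitted parameter: never touched */
		demo_fetch(pl, i);
		nfetched++;
	}
	return nfetched;
}
```

With four parameters and a mask of 0x5, only slots 0 and 2 are fetched, so a hook that would fail on an unused slot is never called for it.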


#434Robert Haas
robertmhaas@gmail.com
In reply to: Noah Misch (#433)
Re: Parallel Seq Scan

On Fri, Oct 30, 2015 at 11:12 PM, Noah Misch <noah@leadboat.com> wrote:

On Wed, Oct 28, 2015 at 01:04:12AM +0100, Robert Haas wrote:

Well, OK. That's not strictly a correctness issue, but here's an
updated patch along the lines you suggested.

Finally, have setup_param_list set a new ParamListInfo field,
paramMask, to the parameters actually used in the expression, so that
we don't try to fetch those that are not needed when serializing a
parameter list. This isn't necessary for performance, but it makes

s/performance/correctness/

the performance of the parallel executor code comparable to what we
do for cases involving cursors.

With that, the patch is ready.

Thanks, committed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#435Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#426)
1 attachment(s)
Re: Parallel Seq Scan

On Fri, Oct 23, 2015 at 4:41 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Oct 23, 2015 at 10:33 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Please find the rebased partial seq scan patch attached to this
mail.

Robert suggested to me off-list that we should see whether we can use
the Seq Scan node instead of introducing a new Partial Seq Scan node.
I analyzed whether we can use the SeqScan node (carrying a parallel
flag) instead, and found that we would primarily need to change most of
the functions in nodeSeqscan.c to check the parallel flag and do
something special for the partial case; apart from that, we need special
handling in ExecSupportsBackwardScan(). In general, I think we can make
the SeqScan node parallel-aware by adding a few special paths without
introducing much complexity, and that would save us code duplication
between nodeSeqscan.c and nodePartialSeqscan.c. One thing that makes
me slightly uncomfortable with this approach is that for a partial seq
scan, the plan currently looks like:

                                QUERY PLAN
--------------------------------------------------------------------------
 Gather  (cost=0.00..2588194.25 rows=9990667 width=4)
   Number of Workers: 1
   ->  Partial Seq Scan on t1  (cost=0.00..89527.51 rows=9990667 width=4)
         Filter: (c1 > 10000)
(4 rows)

Now, if we just display Seq Scan instead of Partial Seq Scan, it might
confuse users, so if we want to go this route it would be better to add
something indicating that the node is parallel.
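
As a rough illustration of the EXPLAIN concern above, one possible approach is to derive the displayed node name from a flag on the plan node. This is a sketch only: the `parallel_aware` field, the "Parallel " prefix, and all `Demo*`/`demo_*` names are assumptions for the sake of the example, not part of the attached patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical node tags; the real ones live in nodes.h. */
typedef enum DemoNodeTag
{
	DEMO_SEQSCAN,
	DEMO_SAMPLESCAN
} DemoNodeTag;

/* Hypothetical plan node carrying a parallel flag. */
typedef struct DemoPlan
{
	DemoNodeTag tag;
	int			parallel_aware; /* assumed flag: node takes part in a parallel scan */
} DemoPlan;

/* Build the label EXPLAIN would print, prefixing parallel-aware nodes. */
static void
demo_node_label(const DemoPlan *plan, char *buf, size_t buflen)
{
	const char *name = (plan->tag == DEMO_SEQSCAN) ? "Seq Scan" : "Sample Scan";

	assert(buflen > 0);
	snprintf(buf, buflen, "%s%s", plan->parallel_aware ? "Parallel " : "", name);
}
```

With the flag set, a sequential scan would be labelled "Parallel Seq Scan", so a user can still tell the two cases apart even though both share one node type.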

Thoughts?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

parallel_seqscan_partialseqscan_v24.patch (application/octet-stream)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 7fb8a14..d03fbde 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -729,6 +729,7 @@ ExplainPreScanNode(PlanState *planstate, Bitmapset **rels_used)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
 		case T_SampleScan:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
@@ -850,6 +851,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_SeqScan:
 			pname = sname = "Seq Scan";
 			break;
+		case T_PartialSeqScan:
+			pname = sname = "Partial Seq Scan";
+			break;
 		case T_SampleScan:
 			pname = sname = "Sample Scan";
 			break;
@@ -1005,6 +1009,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
 		case T_SampleScan:
 		case T_BitmapHeapScan:
 		case T_TidScan:
@@ -1270,6 +1275,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 							 planstate, ancestors, es);
 			/* FALL THRU to print additional fields the same as SeqScan */
 		case T_SeqScan:
+		case T_PartialSeqScan:
 		case T_ValuesScan:
 		case T_CteScan:
 		case T_WorkTableScan:
@@ -2353,6 +2359,7 @@ ExplainTargetRel(Plan *plan, Index rti, ExplainState *es)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
 		case T_SampleScan:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 51edd4c..38a92fe 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -21,8 +21,8 @@ OBJS = execAmi.o execCurrent.o execGrouping.o execIndexing.o execJunk.o \
        nodeHash.o nodeHashjoin.o nodeIndexscan.o nodeIndexonlyscan.o \
        nodeLimit.o nodeLockRows.o \
        nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
-       nodeNestloop.o nodeFunctionscan.o nodeRecursiveunion.o nodeResult.o \
-       nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
+       nodeNestloop.o nodeFunctionscan.o nodePartialSeqscan.o nodeRecursiveunion.o \
+       nodeResult.o nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
        nodeValuesscan.o nodeCtescan.o nodeWorktablescan.o \
        nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
        nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 163650c..b3d041c 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -38,6 +38,7 @@
 #include "executor/nodeMergejoin.h"
 #include "executor/nodeModifyTable.h"
 #include "executor/nodeNestloop.h"
+#include "executor/nodePartialSeqscan.h"
 #include "executor/nodeRecursiveunion.h"
 #include "executor/nodeResult.h"
 #include "executor/nodeSamplescan.h"
@@ -157,6 +158,10 @@ ExecReScan(PlanState *node)
 			ExecReScanSeqScan((SeqScanState *) node);
 			break;
 
+		case T_PartialSeqScanState:
+			ExecReScanPartialSeqScan((PartialSeqScanState *) node);
+			break;
+
 		case T_SampleScanState:
 			ExecReScanSampleScan((SampleScanState *) node);
 			break;
@@ -468,6 +473,9 @@ ExecSupportsBackwardScan(Plan *node)
 		case T_CteScan:
 			return TargetListSupportsBackwardScan(node->targetlist);
 
+		case T_PartialSeqScan:
+			return false;
+
 		case T_SampleScan:
 			/* Simplify life for tablesample methods by disallowing this */
 			return false;
diff --git a/src/backend/executor/execCurrent.c b/src/backend/executor/execCurrent.c
index bcd287f..6e05598 100644
--- a/src/backend/executor/execCurrent.c
+++ b/src/backend/executor/execCurrent.c
@@ -261,6 +261,7 @@ search_plan_tree(PlanState *node, Oid table_oid)
 			 * Relation scan nodes can all be treated alike
 			 */
 		case T_SeqScanState:
+		case T_PartialSeqScanState:
 		case T_SampleScanState:
 		case T_IndexScanState:
 		case T_IndexOnlyScanState:
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 99a9de3..6bb3ab2 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -25,6 +25,7 @@
 
 #include "executor/execParallel.h"
 #include "executor/executor.h"
+#include "executor/nodePartialSeqscan.h"
 #include "executor/tqueue.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/planmain.h"
@@ -167,10 +168,16 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 	/* Count this node. */
 	e->nnodes++;
 
-	/*
-	 * XXX. Call estimators for parallel-aware nodes here, when we have
-	 * some.
-	 */
+	/* Call estimators for parallel-aware nodes. */
+	switch (nodeTag(planstate))
+	{
+		case T_PartialSeqScanState:
+			ExecPartialSeqScanEstimate((PartialSeqScanState *) planstate,
+									   e->pcxt);
+			break;
+		default:
+			break;
+	}
 
 	return planstate_tree_walker(planstate, ExecParallelEstimate, e);
 }
@@ -205,10 +212,16 @@ ExecParallelInitializeDSM(PlanState *planstate,
 	/* Count this node. */
 	d->nnodes++;
 
-	/*
-	 * XXX. Call initializers for parallel-aware plan nodes, when we have
-	 * some.
-	 */
+	/* Call initializers for parallel-aware plan nodes. */
+	switch (nodeTag(planstate))
+	{
+		case T_PartialSeqScanState:
+			ExecPartialSeqScanInitializeDSM((PartialSeqScanState *) planstate,
+											d->pcxt);
+			break;
+		default:
+			break;
+	}
 
 	return planstate_tree_walker(planstate, ExecParallelInitializeDSM, d);
 }
@@ -575,6 +588,32 @@ ExecParallelReportInstrumentation(PlanState *planstate,
 }
 
 /*
+ * Initialize the PlanState and its descendants with the information
+ * retrieved from shared memory.  This has to be done once the PlanState
+ * is allocated and initialized by the executor for each node, i.e. after
+ * ExecutorStart().
+ */
+static bool
+ExecParallelInitializeWorker(PlanState *planstate, shm_toc *toc)
+{
+	if (planstate == NULL)
+		return false;
+
+	/* Call initializers for parallel-aware plan nodes. */
+	switch (nodeTag(planstate))
+	{
+		case T_PartialSeqScanState:
+			ExecPartialSeqScanInitParallelScanDesc((PartialSeqScanState *) planstate,
+												   toc);
+			break;
+		default:
+			break;
+	}
+
+	return planstate_tree_walker(planstate, ExecParallelInitializeWorker, toc);
+}
+
+/*
  * Main entrypoint for parallel query worker processes.
  *
  * We reach this function from ParallelMain, so the setup necessary to create
@@ -610,6 +649,7 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
 
 	/* Start up the executor, have it run the plan, and then shut it down. */
 	ExecutorStart(queryDesc, 0);
+	ExecParallelInitializeWorker(queryDesc->planstate, toc);
 	ExecutorRun(queryDesc, ForwardScanDirection, 0L);
 	ExecutorFinish(queryDesc);
 
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 6f5c554..1b929c2 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -100,6 +100,7 @@
 #include "executor/nodeMergejoin.h"
 #include "executor/nodeModifyTable.h"
 #include "executor/nodeNestloop.h"
+#include "executor/nodePartialSeqscan.h"
 #include "executor/nodeGather.h"
 #include "executor/nodeRecursiveunion.h"
 #include "executor/nodeResult.h"
@@ -193,6 +194,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 												   estate, eflags);
 			break;
 
+		case T_PartialSeqScan:
+			result = (PlanState *) ExecInitPartialSeqScan((PartialSeqScan *) node,
+														  estate, eflags);
+			break;
+
 		case T_SampleScan:
 			result = (PlanState *) ExecInitSampleScan((SampleScan *) node,
 													  estate, eflags);
@@ -419,6 +425,10 @@ ExecProcNode(PlanState *node)
 			result = ExecSeqScan((SeqScanState *) node);
 			break;
 
+		case T_PartialSeqScanState:
+			result = ExecPartialSeqScan((PartialSeqScanState *) node);
+			break;
+
 		case T_SampleScanState:
 			result = ExecSampleScan((SampleScanState *) node);
 			break;
@@ -665,6 +675,10 @@ ExecEndNode(PlanState *node)
 			ExecEndSeqScan((SeqScanState *) node);
 			break;
 
+		case T_PartialSeqScanState:
+			ExecEndPartialSeqScan((PartialSeqScanState *) node);
+			break;
+
 		case T_SampleScanState:
 			ExecEndSampleScan((SampleScanState *) node);
 			break;
diff --git a/src/backend/executor/nodePartialSeqscan.c b/src/backend/executor/nodePartialSeqscan.c
new file mode 100644
index 0000000..bc37b9b
--- /dev/null
+++ b/src/backend/executor/nodePartialSeqscan.c
@@ -0,0 +1,336 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodePartialSeqscan.c
+ *	  Support routines for partial sequential scans of relations.
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/nodePartialSeqscan.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * INTERFACE ROUTINES
+ *		ExecPartialSeqScan				scans a relation partially.
+ *		PartialSeqNext					retrieve next tuple from heap.
+ *		ExecInitPartialSeqScan			creates and initializes a partial seqscan node.
+ *		ExecEndPartialSeqScan			releases any storage allocated.
+ */
+#include "postgres.h"
+
+#include "access/relscan.h"
+#include "executor/execdebug.h"
+#include "executor/execParallel.h"
+#include "executor/nodePartialSeqscan.h"
+#include "utils/rel.h"
+
+
+
+/* ----------------------------------------------------------------
+ *						Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		PartialSeqNext
+ *
+ *		This is a workhorse for ExecPartialSeqScan
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+PartialSeqNext(PartialSeqScanState *node)
+{
+	HeapTuple	tuple;
+	HeapScanDesc scandesc;
+	EState	   *estate;
+	ScanDirection direction;
+	TupleTableSlot *slot;
+
+	/*
+	 * get information from the estate and scan state
+	 */
+	scandesc = node->ss.ss_currentScanDesc;
+	estate = node->ss.ps.state;
+	direction = estate->es_direction;
+	slot = node->ss.ss_ScanTupleSlot;
+
+	/*
+	 * get the next tuple from the table
+	 */
+	tuple = heap_getnext(scandesc, direction);
+
+	/*
+	 * save the tuple and the buffer returned to us by the access methods in
+	 * our scan tuple slot and return the slot.  Note: we pass 'false' because
+	 * tuples returned by heap_getnext() are pointers onto disk pages and were
+	 * not created with palloc() and so should not be pfree()'d.  Note also
+	 * that ExecStoreTuple will increment the refcount of the buffer; the
+	 * refcount will not be dropped until the tuple table slot is cleared.
+	 */
+	if (tuple)
+		ExecStoreTuple(tuple,	/* tuple to store */
+					   slot,	/* slot to store in */
+					   scandesc->rs_cbuf,		/* buffer associated with this
+												 * tuple */
+					   false);	/* don't pfree this pointer */
+	else
+		ExecClearTuple(slot);
+
+	return slot;
+}
+
+/*
+ * PartialSeqRecheck -- access method routine to recheck a tuple in EvalPlanQual
+ */
+static bool
+PartialSeqRecheck(PartialSeqScanState *node, TupleTableSlot *slot)
+{
+	/*
+	 * Note that unlike IndexScan, PartialSeqScan never uses keys in
+	 * heap_beginscan (and this is very bad), so here we do not check
+	 * whether the keys are OK.
+	 */
+	return true;
+}
+
+/* ----------------------------------------------------------------
+ *		InitPartialScanRelation
+ *
+ *		Set up to access the scan relation.
+ * ----------------------------------------------------------------
+ */
+static void
+InitPartialScanRelation(PartialSeqScanState *node, EState *estate, int eflags)
+{
+	Relation	currentRelation;
+
+	/*
+	 * get the relation object id from the relid'th entry in the range table,
+	 * open that relation and acquire appropriate lock on it.
+	 */
+	currentRelation = ExecOpenScanRelation(estate,
+									  ((Scan *) node->ss.ps.plan)->scanrelid,
+										   eflags);
+
+	node->ss.ss_currentRelation = currentRelation;
+
+	/* and report the scan tuple slot's rowtype */
+	ExecAssignScanType(&node->ss, RelationGetDescr(currentRelation));
+}
+
+/* ----------------------------------------------------------------
+ *		ExecPartialSeqScanEstimate
+ *
+ *		estimates the space required to serialize partial seqscan node.
+ * ----------------------------------------------------------------
+ */
+void
+ExecPartialSeqScanEstimate(PartialSeqScanState *node,
+						   ParallelContext *pcxt)
+{
+	EState	   *estate = node->ss.ps.state;
+
+	node->pscan_len = heap_parallelscan_estimate(estate->es_snapshot);
+	shm_toc_estimate_chunk(&pcxt->estimator, node->pscan_len);
+
+	/* key for partial scan information. */
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecPartialSeqScanInitializeDSM
+ *
+ *		Initialize the DSM with the contents required to perform
+ *		partial seqscan.
+ * ----------------------------------------------------------------
+ */
+void
+ExecPartialSeqScanInitializeDSM(PartialSeqScanState *node,
+								ParallelContext *pcxt)
+{
+	EState	   *estate = node->ss.ps.state;
+
+	/*
+	 * Store parallel heap scan descriptor in dynamic shared memory.
+	 */
+	node->pscan = shm_toc_allocate(pcxt->toc, node->pscan_len);
+	heap_parallelscan_initialize(node->pscan,
+								 node->ss.ss_currentRelation,
+								 estate->es_snapshot);
+	shm_toc_insert(pcxt->toc,
+				   node->ss.ps.plan->plan_node_id,
+				   node->pscan);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecPartialSeqScanInitParallelScanDesc
+ *
+ *		Retrieve the contents from DSM related to partial seq scan node
+ *		and initialize the partial seqscan node.
+ * ----------------------------------------------------------------
+ */
+void
+ExecPartialSeqScanInitParallelScanDesc(PartialSeqScanState *node,
+									   shm_toc *toc)
+{
+	node->pscan = shm_toc_lookup(toc, node->ss.ps.plan->plan_node_id);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecInitPartialSeqScan
+ * ----------------------------------------------------------------
+ */
+PartialSeqScanState *
+ExecInitPartialSeqScan(PartialSeqScan *node, EState *estate, int eflags)
+{
+	PartialSeqScanState *scanstate;
+
+	/*
+	 * Once upon a time it was possible to have an outerPlan of a SeqScan, but
+	 * not any more.
+	 */
+	Assert(outerPlan(node) == NULL);
+	Assert(innerPlan(node) == NULL);
+
+	/*
+	 * create state structure
+	 */
+	scanstate = makeNode(PartialSeqScanState);
+	scanstate->ss.ps.plan = (Plan *) node;
+	scanstate->ss.ps.state = estate;
+
+	/*
+	 * Miscellaneous initialization
+	 *
+	 * create expression context for node
+	 */
+	ExecAssignExprContext(estate, &scanstate->ss.ps);
+
+	/*
+	 * initialize child expressions
+	 */
+	scanstate->ss.ps.targetlist = (List *)
+		ExecInitExpr((Expr *) node->plan.targetlist,
+					 (PlanState *) scanstate);
+	scanstate->ss.ps.qual = (List *)
+		ExecInitExpr((Expr *) node->plan.qual,
+					 (PlanState *) scanstate);
+
+	/*
+	 * tuple table initialization
+	 */
+	ExecInitResultTupleSlot(estate, &scanstate->ss.ps);
+	ExecInitScanTupleSlot(estate, &scanstate->ss);
+
+	/*
+	 * initialize scan relation
+	 */
+	InitPartialScanRelation(scanstate, estate, eflags);
+
+	scanstate->ss.ps.ps_TupFromTlist = false;
+
+	/*
+	 * Initialize result tuple type and projection info.
+	 */
+	ExecAssignResultTypeFromTL(&scanstate->ss.ps);
+	ExecAssignScanProjectionInfo(&scanstate->ss);
+
+	return scanstate;
+}
+
+/* ----------------------------------------------------------------
+ *		ExecPartialSeqScan(node)
+ *
+ *		Scans the relation and returns the next qualifying tuple.
+ *		We call the ExecScan() routine and pass it the appropriate
+ *		access method functions.
+ * ----------------------------------------------------------------
+ */
+TupleTableSlot *
+ExecPartialSeqScan(PartialSeqScanState *node)
+{
+	/*
+	 * Initialize the scan on first execution.  Normally we initialize it
+	 * during the ExecutorStart phase; however, this node needs a
+	 * ParallelHeapScanDesc to initialize the scan, and that is set up by
+	 * the Gather node during the ExecutorRun phase.
+	 */
+	if (!node->ss.ss_currentScanDesc)
+	{
+		node->ss.ss_currentScanDesc =
+			heap_beginscan_parallel(node->ss.ss_currentRelation, node->pscan);
+	}
+
+	return ExecScan((ScanState *) node,
+					(ExecScanAccessMtd) PartialSeqNext,
+					(ExecScanRecheckMtd) PartialSeqRecheck);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecEndPartialSeqScan
+ *
+ *		frees any storage allocated through C routines.
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndPartialSeqScan(PartialSeqScanState *node)
+{
+	Relation	relation;
+	HeapScanDesc scanDesc;
+
+	/*
+	 * get information from node
+	 */
+	relation = node->ss.ss_currentRelation;
+	scanDesc = node->ss.ss_currentScanDesc;
+
+	/*
+	 * Free the exprcontext
+	 */
+	ExecFreeExprContext(&node->ss.ps);
+
+	/*
+	 * clean out the tuple table
+	 */
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+
+	/*
+	 * close heap scan
+	 */
+	if (scanDesc)
+		heap_endscan(scanDesc);
+
+	/*
+	 * close the heap relation.
+	 */
+	ExecCloseScanRelation(relation);
+}
+
+/* ----------------------------------------------------------------
+ *						Join Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecReScanPartialSeqScan
+ *
+ *		Rescans the relation.
+ * ----------------------------------------------------------------
+ */
+void
+ExecReScanPartialSeqScan(PartialSeqScanState *node)
+{
+	HeapScanDesc scan;
+
+	scan = node->ss.ss_currentScanDesc;
+
+	if (scan)
+		heap_rescan(scan,		/* scan desc */
+					NULL);		/* new scan keys */
+
+	ExecScanReScan((ScanState *) node);
+}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index c176ff9..fdcccef 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -384,6 +384,22 @@ _copySeqScan(const SeqScan *from)
 }
 
 /*
+ * _copyPartialSeqScan
+ */
+static PartialSeqScan *
+_copyPartialSeqScan(const PartialSeqScan *from)
+{
+	PartialSeqScan    *newnode = makeNode(PartialSeqScan);
+
+	/*
+	 * copy node superclass fields
+	 */
+	CopyScanFields((const Scan *) from, (Scan *) newnode);
+
+	return newnode;
+}
+
+/*
  * _copySampleScan
  */
 static SampleScan *
@@ -4264,6 +4280,9 @@ copyObject(const void *from)
 		case T_SeqScan:
 			retval = _copySeqScan(from);
 			break;
+		case T_PartialSeqScan:
+			retval = _copyPartialSeqScan(from);
+			break;
 		case T_SampleScan:
 			retval = _copySampleScan(from);
 			break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 3e75cd1..5ce45e2 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -460,6 +460,14 @@ _outSeqScan(StringInfo str, const SeqScan *node)
 }
 
 static void
+_outPartialSeqScan(StringInfo str, const PartialSeqScan *node)
+{
+	WRITE_NODE_TYPE("PARTIALSEQSCAN");
+
+	_outScanInfo(str, (const Scan *) node);
+}
+
+static void
 _outSampleScan(StringInfo str, const SampleScan *node)
 {
 	WRITE_NODE_TYPE("SAMPLESCAN");
@@ -3020,6 +3028,9 @@ _outNode(StringInfo str, const void *obj)
 			case T_SeqScan:
 				_outSeqScan(str, obj);
 				break;
+			case T_PartialSeqScan:
+				_outPartialSeqScan(str, obj);
+				break;
 			case T_SampleScan:
 				_outSampleScan(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 94ba6dc..3d3448d 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1606,6 +1606,19 @@ _readSeqScan(void)
 }
 
 /*
+ * _readPartialSeqScan
+ */
+static PartialSeqScan *
+_readPartialSeqScan(void)
+{
+	READ_LOCALS_NO_FIELDS(PartialSeqScan);
+
+	ReadCommonScan(local_node);
+
+	READ_DONE();
+}
+
+/*
  * _readSampleScan
  */
 static SampleScan *
@@ -2337,6 +2350,8 @@ parseNodeString(void)
 		return_value = _readScan();
 	else if (MATCH("SEQSCAN", 7))
 		return_value = _readSeqScan();
+	else if (MATCH("PARTIALSEQSCAN", 14))
+		return_value = _readPartialSeqScan();
 	else if (MATCH("SAMPLESCAN", 10))
 		return_value = _readSampleScan();
 	else if (MATCH("INDEXSCAN", 9))
diff --git a/src/backend/optimizer/path/Makefile b/src/backend/optimizer/path/Makefile
index 6864a62..6e462b1 100644
--- a/src/backend/optimizer/path/Makefile
+++ b/src/backend/optimizer/path/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = allpaths.o clausesel.o costsize.o equivclass.o indxpath.o \
-       joinpath.o joinrels.o pathkeys.o tidpath.o
+       joinpath.o joinrels.o pathkeys.o parallelpath.o tidpath.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 8fc1cfd..c2ae95d 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -477,6 +477,9 @@ set_plain_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
 	/* Consider sequential scan */
 	add_path(rel, create_seqscan_path(root, rel, required_outer));
 
+	/* Consider parallel scans */
+	create_parallelscan_paths(root, rel, required_outer);
+
 	/* Consider index scans */
 	create_index_paths(root, rel);
 
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 1b61fd9..3239cec 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -227,6 +227,49 @@ cost_seqscan(Path *path, PlannerInfo *root,
 }
 
 /*
+ * cost_partialseqscan
+ *	  Determines and returns the cost of scanning a relation partially.
+ *
+ * 'baserel' is the relation to be scanned
+ * 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
+ * 'nworkers' is the number of workers among which the work will be
+ *			distributed
+ */
+void
+cost_partialseqscan(Path *path, PlannerInfo *root,
+					RelOptInfo *baserel, ParamPathInfo *param_info,
+					int nworkers)
+{
+	Cost		startup_cost = 0;
+	Cost		run_cost = 0;
+
+	cost_seqscan(path, root, baserel, param_info);
+
+	startup_cost = path->startup_cost;
+
+	run_cost = path->total_cost - startup_cost;
+
+	/*
+	 * Account for small cost for communication related to scan via the
+	 * ParallelHeapScanDesc.
+	 */
+	run_cost += 0.01;
+
+	/*
+	 * Runtime cost will be equally shared by all workers.  The assumption
+	 * here is that disk access cost will also be equally shared between
+	 * workers, which is generally true unless there are too many workers
+	 * working on a relatively small number of blocks.  If we come across
+	 * any such case, then we can think of changing the current cost model
+	 * for partial sequential scan.
+	 */
+	run_cost = run_cost / (nworkers + 1);
+
+	path->startup_cost = startup_cost;
+	path->total_cost = startup_cost + run_cost;
+}
+
+/*
  * cost_samplescan
  *	  Determines and returns the cost of scanning a relation using sampling.
  *
diff --git a/src/backend/optimizer/path/parallelpath.c b/src/backend/optimizer/path/parallelpath.c
new file mode 100644
index 0000000..ce25cbf
--- /dev/null
+++ b/src/backend/optimizer/path/parallelpath.c
@@ -0,0 +1,132 @@
+/*-------------------------------------------------------------------------
+ *
+ * parallelpath.c
+ *	  Routines to determine parallel paths for scanning a given relation.
+ *
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/optimizer/path/parallelpath.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heapam.h"
+#include "optimizer/clauses.h"
+#include "optimizer/cost.h"
+#include "optimizer/pathnode.h"
+#include "optimizer/paths.h"
+#include "parser/parsetree.h"
+#include "utils/rel.h"
+
+
+/*
+ * expr_is_parallel_safe
+ *	  checks whether a particular expression is parallel safe
+ *
+ * Conditions checked here:
+ *
+ * 1. The expression must not contain any parallel unsafe or parallel
+ * restricted functions.
+ *
+ * 2. The expression must not contain any initplan or subplan.  We can
+ * probably remove this restriction once we have support of infrastructure
+ * for execution of initplans and subplans at parallel (Gather) nodes.
+ */
+bool
+expr_is_parallel_safe(Node *node)
+{
+	if (check_parallel_safety(node, false))
+		return false;
+
+	if (contain_subplans_or_initplans(node))
+		return false;
+
+	return true;
+}
+
+/*
+ * create_parallelscan_paths
+ *	  Create paths corresponding to parallel scans of the given rel.
+ *	  Currently we only support partial sequential scan.
+ *
+ *	  Candidate paths are added to the rel's pathlist (using add_path).
+ */
+void
+create_parallelscan_paths(PlannerInfo *root, RelOptInfo *rel,
+						  Relids required_outer)
+{
+	int			num_parallel_workers = 0;
+	int			estimated_parallel_workers = 0;
+	Oid			reloid;
+	Relation	relation;
+	Path	   *subpath;
+	ListCell   *l;
+
+	/*
+	 * Parallel scan is possible only if the user has set max_parallel_degree
+	 * to a value greater than 0 and the query is parallel-safe.
+	 */
+	if (max_parallel_degree <= 0 || !root->glob->parallelModeOK)
+		return;
+
+	/*
+	 * There should be at least a thousand pages to scan for each worker. This
+	 * number is somewhat arbitrary, however we don't want to spawn workers
+	 * to scan smaller relations as that will be costly.
+	 */
+	estimated_parallel_workers = rel->pages / 1000;
+
+	if (estimated_parallel_workers <= 0)
+		return;
+
+	reloid = planner_rt_fetch(rel->relid, root)->relid;
+
+	relation = heap_open(reloid, NoLock);
+
+	/*
+	 * Temporary relations can't be scanned by parallel workers as they are
+	 * visible only to local sessions.
+	 */
+	if (RelationUsesLocalBuffers(relation))
+	{
+		heap_close(relation, NoLock);
+		return;
+	}
+
+	heap_close(relation, NoLock);
+
+	/*
+	 * Allow parallel paths only if all the clauses for relation are parallel
+	 * safe.  We can allow execution of parallel restricted clauses in master
+	 * backend, but for that planner should have infrastructure to pull all
+	 * the parallel restricted clauses from below nodes to the Gather node
+	 * which will then execute such clauses in master backend.
+	 */
+	foreach(l, rel->baserestrictinfo)
+	{
+		RestrictInfo *rinfo = (RestrictInfo *) lfirst(l);
+
+		if (!expr_is_parallel_safe((Node *) rinfo->clause))
+			return;
+	}
+
+	num_parallel_workers = Min(max_parallel_degree,
+							   estimated_parallel_workers);
+
+	/*
+	 * Create the partial scan path which each worker backend needs to
+	 * execute.
+	 */
+	subpath = create_partialseqscan_path(root, rel, required_outer,
+										 num_parallel_workers);
+
+	/* Create the gather path which master backend needs to execute. */
+	add_path(rel, (Path *) create_gather_path(root, rel, subpath,
+											  required_outer,
+											  num_parallel_workers));
+}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 791b64e..f860580 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -58,6 +58,8 @@ static Material *create_material_plan(PlannerInfo *root, MaterialPath *best_path
 static Plan *create_unique_plan(PlannerInfo *root, UniquePath *best_path);
 static SeqScan *create_seqscan_plan(PlannerInfo *root, Path *best_path,
 					List *tlist, List *scan_clauses);
+static Scan *create_partialseqscan_plan(PlannerInfo *root, Path *best_path,
+						   List *tlist, List *scan_clauses);
 static SampleScan *create_samplescan_plan(PlannerInfo *root, Path *best_path,
 					   List *tlist, List *scan_clauses);
 static Gather *create_gather_plan(PlannerInfo *root,
@@ -104,6 +106,8 @@ static List *order_qual_clauses(PlannerInfo *root, List *clauses);
 static void copy_path_costsize(Plan *dest, Path *src);
 static void copy_plan_costsize(Plan *dest, Plan *src);
 static SeqScan *make_seqscan(List *qptlist, List *qpqual, Index scanrelid);
+static PartialSeqScan *make_partialseqscan(List *qptlist, List *qpqual,
+					Index scanrelid);
 static SampleScan *make_samplescan(List *qptlist, List *qpqual, Index scanrelid,
 				TableSampleClause *tsc);
 static Gather *make_gather(List *qptlist, List *qpqual,
@@ -237,6 +241,7 @@ create_plan_recurse(PlannerInfo *root, Path *best_path)
 	switch (best_path->pathtype)
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
 		case T_SampleScan:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
@@ -357,6 +362,13 @@ create_scan_plan(PlannerInfo *root, Path *best_path)
 												scan_clauses);
 			break;
 
+		case T_PartialSeqScan:
+			plan = (Plan *) create_partialseqscan_plan(root,
+													   best_path,
+													   tlist,
+													   scan_clauses);
+			break;
+
 		case T_SampleScan:
 			plan = (Plan *) create_samplescan_plan(root,
 												   best_path,
@@ -567,6 +579,7 @@ disuse_physical_tlist(PlannerInfo *root, Plan *plan, Path *path)
 	switch (path->pathtype)
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
 		case T_SampleScan:
 		case T_IndexScan:
 		case T_IndexOnlyScan:
@@ -1184,6 +1197,46 @@ create_seqscan_plan(PlannerInfo *root, Path *best_path,
 }
 
 /*
+ * create_partialseqscan_plan
+ *
+ * Returns a partial seqscan plan for the base relation scanned by
+ * 'best_path' with restriction clauses 'scan_clauses' and targetlist
+ * 'tlist'.
+ */
+static Scan *
+create_partialseqscan_plan(PlannerInfo *root, Path *best_path,
+						   List *tlist, List *scan_clauses)
+{
+	Scan	   *scan_plan;
+	Index		scan_relid = best_path->parent->relid;
+
+	/* it should be a base rel... */
+	Assert(scan_relid > 0);
+	Assert(best_path->parent->rtekind == RTE_RELATION);
+
+	/* Sort clauses into best execution order */
+	scan_clauses = order_qual_clauses(root, scan_clauses);
+
+	/* Reduce RestrictInfo list to bare expressions; ignore pseudoconstants */
+	scan_clauses = extract_actual_clauses(scan_clauses, false);
+
+	/* Replace any outer-relation variables with nestloop params */
+	if (best_path->param_info)
+	{
+		scan_clauses = (List *)
+			replace_nestloop_params(root, (Node *) scan_clauses);
+	}
+
+	scan_plan = (Scan *) make_partialseqscan(tlist,
+											 scan_clauses,
+											 scan_relid);
+
+	copy_path_costsize(&scan_plan->plan, best_path);
+
+	return scan_plan;
+}
+
+/*
  * create_samplescan_plan
  *	 Returns a samplescan plan for the base relation scanned by 'best_path'
  *	 with restriction clauses 'scan_clauses' and targetlist 'tlist'.
@@ -3481,6 +3534,24 @@ make_seqscan(List *qptlist,
 	return node;
 }
 
+static PartialSeqScan *
+make_partialseqscan(List *qptlist,
+					List *qpqual,
+					Index scanrelid)
+{
+	PartialSeqScan *node = makeNode(PartialSeqScan);
+	Plan	   *plan = &node->plan;
+
+	/* cost should be inserted by caller */
+	plan->targetlist = qptlist;
+	plan->qual = qpqual;
+	plan->lefttree = NULL;
+	plan->righttree = NULL;
+	node->scanrelid = scanrelid;
+
+	return node;
+}
+
 static SampleScan *
 make_samplescan(List *qptlist,
 				List *qpqual,
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 536b55e..d5329fb 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -202,13 +202,13 @@ standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
 	glob->hasRowSecurity = false;
 
 	/*
-	 * Assess whether it's feasible to use parallel mode for this query.
-	 * We can't do this in a standalone backend, or if the command will
-	 * try to modify any data, or if this is a cursor operation, or if any
+	 * Assess whether it's feasible to use parallel mode for this query. We
+	 * can't do this in a standalone backend, or if the command will try to
+	 * modify any data, or if this is a cursor operation, or if any
 	 * parallel-unsafe functions are present in the query tree.
 	 *
-	 * For now, we don't try to use parallel mode if we're running inside
-	 * a parallel worker.  We might eventually be able to relax this
+	 * For now, we don't try to use parallel mode if we're running inside a
+	 * parallel worker.  We might eventually be able to relax this
 	 * restriction, but for now it seems best not to have parallel workers
 	 * trying to create their own parallel workers.
 	 *
@@ -225,7 +225,7 @@ standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
 		parse->commandType == CMD_SELECT && !parse->hasModifyingCTE &&
 		parse->utilityStmt == NULL && !IsParallelWorker() &&
 		!IsolationIsSerializable() &&
-		!contain_parallel_unsafe((Node *) parse);
+		!check_parallel_safety((Node *) parse, true);
 
 	/*
 	 * glob->parallelModeOK should tell us whether it's necessary to impose
@@ -238,9 +238,9 @@ standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
 	 *
 	 * (It's been suggested that we should always impose these restrictions
 	 * whenever glob->parallelModeOK is true, so that it's easier to notice
-	 * incorrectly-labeled functions sooner.  That might be the right thing
-	 * to do, but for now I've taken this approach.  We could also control
-	 * this with a GUC.)
+	 * incorrectly-labeled functions sooner.  That might be the right thing to
+	 * do, but for now I've taken this approach.  We could also control this
+	 * with a GUC.)
 	 *
 	 * FIXME: It's assumed that code further down will set parallelModeNeeded
 	 * to true if a parallel path is actually chosen.  Since the core
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 48d6e6f..293e735 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -447,6 +447,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 	switch (nodeTag(plan))
 	{
 		case T_SeqScan:
+		case T_PartialSeqScan:
 			{
 				SeqScan    *splan = (SeqScan *) plan;
 
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 82414d4..99dacde 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2234,6 +2234,7 @@ finalize_plan(PlannerInfo *root, Plan *plan, Bitmapset *valid_params,
 			break;
 
 		case T_SeqScan:
+		case T_PartialSeqScan:
 			context.paramids = bms_add_members(context.paramids, scan_params);
 			break;
 
diff --git a/src/backend/optimizer/util/clauses.c b/src/backend/optimizer/util/clauses.c
index f2c8551..2355cc6 100644
--- a/src/backend/optimizer/util/clauses.c
+++ b/src/backend/optimizer/util/clauses.c
@@ -87,16 +87,25 @@ typedef struct
 	char	   *prosrc;
 } inline_error_callback_arg;
 
+typedef struct
+{
+	bool		allow_restricted;
+}	check_parallel_safety_arg;
+
 static bool contain_agg_clause_walker(Node *node, void *context);
 static bool count_agg_clauses_walker(Node *node,
 						 count_agg_clauses_context *context);
 static bool find_window_functions_walker(Node *node, WindowFuncLists *lists);
 static bool expression_returns_set_rows_walker(Node *node, double *count);
 static bool contain_subplans_walker(Node *node, void *context);
+static bool contain_subplans_or_initplans_walker(Node *node, void *context);
 static bool contain_mutable_functions_walker(Node *node, void *context);
 static bool contain_volatile_functions_walker(Node *node, void *context);
 static bool contain_volatile_functions_not_nextval_walker(Node *node, void *context);
-static bool contain_parallel_unsafe_walker(Node *node, void *context);
+static bool check_parallel_safety_walker(Node *node,
+							 check_parallel_safety_arg * context);
+static bool parallel_too_dangerous(char proparallel,
+					   check_parallel_safety_arg * context);
 static bool contain_nonstrict_functions_walker(Node *node, void *context);
 static bool contain_leaked_vars_walker(Node *node, void *context);
 static Relids find_nonnullable_rels_walker(Node *node, bool top_level);
@@ -1204,13 +1213,16 @@ contain_volatile_functions_not_nextval_walker(Node *node, void *context)
  *****************************************************************************/
 
 bool
-contain_parallel_unsafe(Node *node)
+check_parallel_safety(Node *node, bool allow_restricted)
 {
-	return contain_parallel_unsafe_walker(node, NULL);
+	check_parallel_safety_arg context;
+
+	context.allow_restricted = allow_restricted;
+	return check_parallel_safety_walker(node, &context);
 }
 
 static bool
-contain_parallel_unsafe_walker(Node *node, void *context)
+check_parallel_safety_walker(Node *node, check_parallel_safety_arg * context)
 {
 	if (node == NULL)
 		return false;
@@ -1218,7 +1230,7 @@ contain_parallel_unsafe_walker(Node *node, void *context)
 	{
 		FuncExpr   *expr = (FuncExpr *) node;
 
-		if (func_parallel(expr->funcid) == PROPARALLEL_UNSAFE)
+		if (parallel_too_dangerous(func_parallel(expr->funcid), context))
 			return true;
 		/* else fall through to check args */
 	}
@@ -1227,7 +1239,7 @@ contain_parallel_unsafe_walker(Node *node, void *context)
 		OpExpr	   *expr = (OpExpr *) node;
 
 		set_opfuncid(expr);
-		if (func_parallel(expr->opfuncid) == PROPARALLEL_UNSAFE)
+		if (parallel_too_dangerous(func_parallel(expr->opfuncid), context))
 			return true;
 		/* else fall through to check args */
 	}
@@ -1236,7 +1248,7 @@ contain_parallel_unsafe_walker(Node *node, void *context)
 		DistinctExpr *expr = (DistinctExpr *) node;
 
 		set_opfuncid((OpExpr *) expr);	/* rely on struct equivalence */
-		if (func_parallel(expr->opfuncid) == PROPARALLEL_UNSAFE)
+		if (parallel_too_dangerous(func_parallel(expr->opfuncid), context))
 			return true;
 		/* else fall through to check args */
 	}
@@ -1245,7 +1257,7 @@ contain_parallel_unsafe_walker(Node *node, void *context)
 		NullIfExpr *expr = (NullIfExpr *) node;
 
 		set_opfuncid((OpExpr *) expr);	/* rely on struct equivalence */
-		if (func_parallel(expr->opfuncid) == PROPARALLEL_UNSAFE)
+		if (parallel_too_dangerous(func_parallel(expr->opfuncid), context))
 			return true;
 		/* else fall through to check args */
 	}
@@ -1254,7 +1266,7 @@ contain_parallel_unsafe_walker(Node *node, void *context)
 		ScalarArrayOpExpr *expr = (ScalarArrayOpExpr *) node;
 
 		set_sa_opfuncid(expr);
-		if (func_parallel(expr->opfuncid) == PROPARALLEL_UNSAFE)
+		if (parallel_too_dangerous(func_parallel(expr->opfuncid), context))
 			return true;
 		/* else fall through to check args */
 	}
@@ -1268,12 +1280,12 @@ contain_parallel_unsafe_walker(Node *node, void *context)
 		/* check the result type's input function */
 		getTypeInputInfo(expr->resulttype,
 						 &iofunc, &typioparam);
-		if (func_parallel(iofunc) == PROPARALLEL_UNSAFE)
+		if (parallel_too_dangerous(func_parallel(iofunc), context))
 			return true;
 		/* check the input type's output function */
 		getTypeOutputInfo(exprType((Node *) expr->arg),
 						  &iofunc, &typisvarlena);
-		if (func_parallel(iofunc) == PROPARALLEL_UNSAFE)
+		if (parallel_too_dangerous(func_parallel(iofunc), context))
 			return true;
 		/* else fall through to check args */
 	}
@@ -1282,7 +1294,7 @@ contain_parallel_unsafe_walker(Node *node, void *context)
 		ArrayCoerceExpr *expr = (ArrayCoerceExpr *) node;
 
 		if (OidIsValid(expr->elemfuncid) &&
-			func_parallel(expr->elemfuncid) == PROPARALLEL_UNSAFE)
+			parallel_too_dangerous(func_parallel(expr->elemfuncid), context))
 			return true;
 		/* else fall through to check args */
 	}
@@ -1294,28 +1306,77 @@ contain_parallel_unsafe_walker(Node *node, void *context)
 
 		foreach(opid, rcexpr->opnos)
 		{
-			if (op_volatile(lfirst_oid(opid)) == PROPARALLEL_UNSAFE)
+			if (parallel_too_dangerous(op_volatile(lfirst_oid(opid)), context))
 				return true;
 		}
 		/* else fall through to check args */
 	}
 	else if (IsA(node, Query))
 	{
-		Query *query = (Query *) node;
+		Query	   *query = (Query *) node;
 
 		if (query->rowMarks != NULL)
 			return true;
 
 		/* Recurse into subselects */
 		return query_tree_walker(query,
-								 contain_parallel_unsafe_walker,
+								 check_parallel_safety_walker,
 								 context, 0);
 	}
 	return expression_tree_walker(node,
-								  contain_parallel_unsafe_walker,
+								  check_parallel_safety_walker,
 								  context);
 }
 
+static bool
+parallel_too_dangerous(char proparallel, check_parallel_safety_arg * context)
+{
+	if (context->allow_restricted)
+		return proparallel == PROPARALLEL_UNSAFE;
+	else
+		return proparallel != PROPARALLEL_SAFE;
+}
+
+/*
+ * contain_subplans_or_initplans
+ *	  Recursively search for initplan or subplan nodes within a clause.
+ *
+ * A special purpose function for prohibiting subplan or initplan clauses
+ * in parallel query constructs.
+ *
+ * If we see any form of SubPlan node, we will return TRUE.  For InitPlan's,
+ * we return true when we see the Param node, apart from that InitPlan
+ * can contain a simple NULL constant for MULTIEXPR subquery (see comments
+ * in make_subplan), however it is okay not to care about the same as that
+ * is only possible for Update statement which is anyway prohibited.
+ *
+ * Returns true if any subplan or initplan is found.
+ */
+bool
+contain_subplans_or_initplans(Node *clause)
+{
+	return contain_subplans_or_initplans_walker(clause, NULL);
+}
+
+static bool
+contain_subplans_or_initplans_walker(Node *node, void *context)
+{
+	if (node == NULL)
+		return false;
+	if (IsA(node, SubPlan) ||
+		IsA(node, AlternativeSubPlan) ||
+		IsA(node, SubLink))
+		return true;			/* abort the tree traversal and return true */
+	else if (IsA(node, Param))
+	{
+		Param	   *paramval = (Param *) node;
+
+		if (paramval->paramkind == PARAM_EXEC)
+			return true;
+	}
+	return expression_tree_walker(node, contain_subplans_or_initplans_walker, context);
+}
+
 /*****************************************************************************
  *		Check clauses for nonstrict functions
  *****************************************************************************/
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 1895a68..2fd7ae5 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -712,6 +712,28 @@ create_seqscan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer)
 }
 
 /*
+ * create_partialseqscan_path
+ *	  Creates a path corresponding to a partial sequential scan, returning the
+ *	  pathnode.
+ */
+Path *
+create_partialseqscan_path(PlannerInfo *root, RelOptInfo *rel,
+						   Relids required_outer, int nworkers)
+{
+	Path	   *pathnode = makeNode(Path);
+
+	pathnode->pathtype = T_PartialSeqScan;
+	pathnode->parent = rel;
+	pathnode->param_info = get_baserel_parampathinfo(root, rel,
+													 required_outer);
+	pathnode->pathkeys = NIL;	/* partialseqscan has unordered result */
+
+	cost_partialseqscan(pathnode, root, rel, pathnode->param_info, nworkers);
+
+	return pathnode;
+}
+
+/*
  * create_samplescan_path
  *	  Creates a path node for a sampled table scan.
  */
diff --git a/src/include/executor/nodePartialSeqscan.h b/src/include/executor/nodePartialSeqscan.h
new file mode 100644
index 0000000..77e5311
--- /dev/null
+++ b/src/include/executor/nodePartialSeqscan.h
@@ -0,0 +1,31 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodePartialSeqscan.h
+ *		prototypes for nodePartialSeqscan.c
+ *
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodePartialSeqscan.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEPARTIALSEQSCAN_H
+#define NODEPARTIALSEQSCAN_H
+
+#include "nodes/execnodes.h"
+
+extern void ExecPartialSeqScanEstimate(PartialSeqScanState *node,
+						   ParallelContext *pcxt);
+extern void ExecPartialSeqScanInitializeDSM(PartialSeqScanState *node,
+								ParallelContext *pcxt);
+extern void ExecPartialSeqScanInitParallelScanDesc(PartialSeqScanState *node,
+									   shm_toc *toc);
+extern PartialSeqScanState *ExecInitPartialSeqScan(PartialSeqScan *node,
+					   EState *estate, int eflags);
+extern TupleTableSlot *ExecPartialSeqScan(PartialSeqScanState *node);
+extern void ExecEndPartialSeqScan(PartialSeqScanState *node);
+extern void ExecReScanPartialSeqScan(PartialSeqScanState *node);
+
+#endif   /* NODEPARTIALSEQSCAN_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 939bc0e..d50caf3 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -16,6 +16,7 @@
 
 #include "access/genam.h"
 #include "access/heapam.h"
+#include "access/parallel.h"
 #include "executor/instrument.h"
 #include "lib/pairingheap.h"
 #include "nodes/params.h"
@@ -1254,6 +1255,18 @@ typedef struct ScanState
  */
 typedef ScanState SeqScanState;
 
+/*
+ * PartialSeqScanState extends ScanState by storing additional information
+ * related to scan.
+ */
+typedef struct PartialSeqScanState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	ParallelHeapScanDesc	pscan;	/* parallel heap scan descriptor
+									 * for partial scan */
+	Size		pscan_len;		/* size of parallel heap scan descriptor */
+} PartialSeqScanState;
+
 /* ----------------
  *	 SampleScanState information
  * ----------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 94bdb7c..71496b9 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -51,6 +51,7 @@ typedef enum NodeTag
 	T_BitmapOr,
 	T_Scan,
 	T_SeqScan,
+	T_PartialSeqScan,
 	T_SampleScan,
 	T_IndexScan,
 	T_IndexOnlyScan,
@@ -99,6 +100,7 @@ typedef enum NodeTag
 	T_BitmapOrState,
 	T_ScanState,
 	T_SeqScanState,
+	T_PartialSeqScanState,
 	T_SampleScanState,
 	T_IndexScanState,
 	T_IndexOnlyScanState,
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 6b28c8e..bb41b8e 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -72,7 +72,7 @@ typedef struct PlannedStmt
 
 	bool		hasRowSecurity; /* row security applied? */
 
-	bool		parallelModeNeeded; /* parallel mode required to execute? */
+	bool		parallelModeNeeded;		/* parallel mode required to execute? */
 } PlannedStmt;
 
 /* macro for fetching the Plan associated with a SubPlan node */
@@ -287,6 +287,12 @@ typedef struct Scan
 typedef Scan SeqScan;
 
 /* ----------------
+ *		partial sequential scan node
+ * ----------------
+ */
+typedef SeqScan PartialSeqScan;
+
+/* ----------------
  *		table sample scan node
  * ----------------
  */
diff --git a/src/include/optimizer/clauses.h b/src/include/optimizer/clauses.h
index 5ac79b1..747b05b 100644
--- a/src/include/optimizer/clauses.h
+++ b/src/include/optimizer/clauses.h
@@ -62,7 +62,8 @@ extern bool contain_subplans(Node *clause);
 extern bool contain_mutable_functions(Node *clause);
 extern bool contain_volatile_functions(Node *clause);
 extern bool contain_volatile_functions_not_nextval(Node *clause);
-extern bool contain_parallel_unsafe(Node *node);
+extern bool check_parallel_safety(Node *node, bool allow_restricted);
+extern bool contain_subplans_or_initplans(Node *clause);
 extern bool contain_nonstrict_functions(Node *clause);
 extern bool contain_leaked_vars(Node *clause);
 
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 25a7303..8640567 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -73,6 +73,9 @@ extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,
 					double index_pages, PlannerInfo *root);
 extern void cost_seqscan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 			 ParamPathInfo *param_info);
+extern void cost_partialseqscan(Path *path, PlannerInfo *root,
+					RelOptInfo *baserel, ParamPathInfo *param_info,
+					int nworkers);
 extern void cost_samplescan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 				ParamPathInfo *param_info);
 extern void cost_index(IndexPath *path, PlannerInfo *root,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 7a4940c..3b97b73 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -32,6 +32,8 @@ extern bool add_path_precheck(RelOptInfo *parent_rel,
 
 extern Path *create_seqscan_path(PlannerInfo *root, RelOptInfo *rel,
 					Relids required_outer);
+extern Path *create_partialseqscan_path(PlannerInfo *root, RelOptInfo *rel,
+						   Relids required_outer, int nworkers);
 extern Path *create_samplescan_path(PlannerInfo *root, RelOptInfo *rel,
 					   Relids required_outer);
 extern IndexPath *create_index_path(PlannerInfo *root,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 87123a5..6cd4479 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -55,6 +55,15 @@ extern void debug_print_rel(PlannerInfo *root, RelOptInfo *rel);
 #endif
 
 /*
+ * parallelpath.c
+ *	  routines to generate parallel scan paths
+ */
+
+extern void create_parallelscan_paths(PlannerInfo *root, RelOptInfo *rel,
+						  Relids required_outer);
+extern bool expr_is_parallel_safe(Node *node);
+
+/*
  * indxpath.c
  *	  routines to generate index paths
  */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index feb821b..5492ba0 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1199,6 +1199,8 @@ OverrideStackEntry
 PACE_HEADER
 PACL
 ParallelExecutorInfo
+PartialSeqScan
+PartialSeqScanState
 PATH
 PBOOL
 PCtxtHandle
#436Haribabu Kommi
kommi.haribabu@gmail.com
In reply to: Amit Kapila (#435)
Re: Parallel Seq Scan

On Tue, Nov 3, 2015 at 9:41 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Oct 23, 2015 at 4:41 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Fri, Oct 23, 2015 at 10:33 AM, Robert Haas <robertmhaas@gmail.com>
wrote:

Please find the rebased partial seq scan patch attached with this
mail.

Robert suggested to me off-list that we should try to see whether we
can use the Seq Scan node instead of introducing a new Partial Seq Scan
node. I analyzed whether we can use the SeqScan node (containing a
parallel flag) instead of introducing a new partial seq scan, and found that
we primarily need to change most of the functions in nodeSeqScan.c to
check the parallel flag and do something special for Partial Seq Scan;
apart from that, we need special handling in the function
ExecSupportsBackwardScan(). In general, I think we can make the
SeqScan node parallel-aware by having some special paths without
introducing much complexity, and that can save us code duplication
between nodeSeqScan.c and nodePartialSeqScan.c. One thing that makes
me slightly uncomfortable with this approach is that for partial seq scan,
currently the plan looks like:

                                QUERY PLAN
--------------------------------------------------------------------------
 Gather  (cost=0.00..2588194.25 rows=9990667 width=4)
   Number of Workers: 1
   ->  Partial Seq Scan on t1  (cost=0.00..89527.51 rows=9990667 width=4)
         Filter: (c1 > 10000)
(4 rows)

Now, if instead of displaying Partial Seq Scan we just display Seq Scan,
it might confuse users, so it is better to add something indicating
a parallel node if we want to go this route.

IMO, the change from Partial Seq Scan to Seq Scan may not confuse users
if we clearly specify in the documentation that all plans under a Gather node
are parallel plans.

This is possible for the execution nodes that execute fully under a
Gather node.
The same is not possible for parallel aggregates, so we have to mention the
aggregate node below the Gather node as partial only.

I suspect this suggestion arose because of the duplicated code between
Partial Seq Scan and Seq Scan. What if we use only the Seq Scan node but
display it as Partial Seq Scan by storing a flag in the plan? This avoids
the need to add new plan nodes.

Regards,
Hari Babu
Fujitsu Australia

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#437Robert Haas
robertmhaas@gmail.com
In reply to: Haribabu Kommi (#436)
Re: Parallel Seq Scan

On Thu, Nov 5, 2015 at 12:52 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:

Now, if instead of displaying Partial Seq Scan we just display Seq Scan,
it might confuse users, so it is better to add something indicating
a parallel node if we want to go this route.

IMO, the change from Partial Seq Scan to Seq Scan may not confuse users
if we clearly specify in the documentation that all plans under a Gather node
are parallel plans.

This is possible for the execution nodes that execute fully under a
Gather node.
The same is not possible for parallel aggregates, so we have to mention the
aggregate node below the Gather node as partial only.

I suspect this suggestion arose because of the duplicated code between
Partial Seq Scan and Seq Scan. What if we use only the Seq Scan node but
display it as Partial Seq Scan by storing a flag in the plan? This avoids
the need to add new plan nodes.

I was thinking about this idea:

1. Add a parallel_aware flag to each plan.

2. If the flag is set, have EXPLAIN print the word "Parallel" before
the node name.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#438Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#437)
Re: Parallel Seq Scan

On Thu, Nov 5, 2015 at 11:54 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I was thinking about this idea:

1. Add a parallel_aware flag to each plan.

Okay, so shall we add it to the generic Plan node or to specific plan nodes
like SeqScan, IndexScan, etc.? To me, it appears that parallelism is
a node-specific property, so we should add it to specific nodes, and
since for now we are parallelising only seq scan, we can add this flag to
the SeqScan node. What do you say?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#439Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#438)
Re: Parallel Seq Scan

On Thu, Nov 5, 2015 at 10:49 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Nov 5, 2015 at 11:54 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I was thinking about this idea:

1. Add a parallel_aware flag to each plan.

Okay, so shall we add it to the generic Plan node or to specific plan nodes
like SeqScan, IndexScan, etc.? To me, it appears that parallelism is
a node-specific property, so we should add it to specific nodes, and
since for now we are parallelising only seq scan, we can add this flag to
the SeqScan node. What do you say?

I think it should go in the Plan node itself. Parallel Append is
going to need a way to test whether a node is parallel-aware, and
there's nothing simpler than if (plan->parallel_aware). That makes
life simple for EXPLAIN, too.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#440Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#428)
1 attachment(s)
Re: Parallel Seq Scan

On Fri, Oct 23, 2015 at 9:22 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

The base rel's consider_parallel flag won't be percolated to childrels, so
even if we mark base rel as parallel capable, while generating the path it
won't be considered. I think we need to find a way to pass on that
information if we want to follow this way.

Fixed in the attached version. I added a max_parallel_degree check,
too, per your suggestion.

True, we can do it that way. What I was trying to convey above is
that we anyway need checks during path creation at least in some
of the cases, so why not do all the checks at that time only, as I
think all the information will be available at that time.

I think if we store parallelism related info in RelOptInfo, that can also
be made to work, but the only worry I have with that approach is we
need to have checks at two levels one at RelOptInfo formation time
and other at Path formation time.

I don't really see that as a problem. What I'm thinking about doing
(but it's not implemented in the attached patch) is additionally
adding a ppi_consider_parallel flag to the ParamPathInfo. This would
be meaningful only for baserels, and would indicate whether the
ParamPathInfo's ppi_clauses are parallel-safe.

If we're thinking about adding a parallel path to a baserel, we need
the RelOptInfo to have consider_parallel set and, if there is a
ParamPathInfo, we need the ParamPathInfo's ppi_consider_parallel flag
to be set also. That shows that both the rel's baserestrictinfo and
the parameterized join clauses are parallel-safe. For a joinrel, we
can add a path if (1) the joinrel has consider_parallel set and (2)
the paths being joined are parallel-safe. Testing condition (2) will
require a per-Path flag, so we'll end up with one flag in the
RelOptInfo, a second in the ParamPathInfo, and a third in the Path.
That doesn't seem like a problem, though: it's a sign that we're doing
this in a way that fits into the existing infrastructure, and it
should be pretty efficient.
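
The three-flag scheme described above can be sketched as follows. This is only an illustration under stated assumptions: the struct types are hypothetical stand-ins, ppi_consider_parallel is the proposed (not yet implemented) field, and parallel_safe is an assumed name for the per-Path flag, which the message does not name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins carrying only the flags discussed above. */
typedef struct SketchRelOptInfo
{
    bool        consider_parallel;      /* baserestrictinfo is parallel-safe */
} SketchRelOptInfo;

typedef struct SketchParamPathInfo
{
    bool        ppi_consider_parallel;  /* ppi_clauses are parallel-safe */
} SketchParamPathInfo;

typedef struct SketchPath
{
    bool        parallel_safe;          /* assumed name for the per-Path flag */
} SketchPath;

/*
 * Baserel: the rel must allow parallelism and, if the path is
 * parameterized (ppi != NULL), so must its ParamPathInfo.
 */
bool
sketch_baserel_parallel_ok(const SketchRelOptInfo *rel,
                           const SketchParamPathInfo *ppi)
{
    assert(rel != NULL);
    if (!rel->consider_parallel)
        return false;
    if (ppi != NULL && !ppi->ppi_consider_parallel)
        return false;
    return true;
}

/*
 * Joinrel: condition (1) is the joinrel's own flag, condition (2) is
 * that both input paths are parallel-safe.
 */
bool
sketch_joinrel_parallel_ok(const SketchRelOptInfo *joinrel,
                           const SketchPath *outer_path,
                           const SketchPath *inner_path)
{
    assert(joinrel != NULL && outer_path != NULL && inner_path != NULL);
    return joinrel->consider_parallel &&
        outer_path->parallel_safe &&
        inner_path->parallel_safe;
}
```

Each check is a handful of boolean tests on data the planner already holds, which is the sense in which the approach should be cheap.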

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

consider-parallel-v2.patch (text/x-diff; charset=US-ASCII)
From e31d5f3f4c53d80e87a74925db88bcaf2e6fa564 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Fri, 2 Oct 2015 23:57:46 -0400
Subject: [PATCH 2/7] Strengthen planner infrastructure for parallelism.

Add a new flag, consider_parallel, to each RelOptInfo, indicating
whether a plan for that relation could conceivably be run inside of
a parallel worker.  Right now, we're pretty conservative: for example,
it might be possible to defer applying a parallel-restricted qual
in a worker, and later do it in the leader, but right now we just
don't try to parallelize access to that relation.  That's probably
the right decision in most cases, anyway.
---
 src/backend/nodes/outfuncs.c          |   1 +
 src/backend/optimizer/path/allpaths.c | 155 +++++++++++++++++++++++++++-
 src/backend/optimizer/plan/planmain.c |  12 +++
 src/backend/optimizer/plan/planner.c  |   9 +-
 src/backend/optimizer/util/clauses.c  | 183 +++++++++++++++++++++++++++-------
 src/backend/optimizer/util/relnode.c  |  21 ++++
 src/backend/utils/cache/lsyscache.c   |  22 ++++
 src/include/nodes/relation.h          |   1 +
 src/include/optimizer/clauses.h       |   2 +-
 src/include/utils/lsyscache.h         |   1 +
 10 files changed, 364 insertions(+), 43 deletions(-)

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 3e75cd1..0030a9b 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -1868,6 +1868,7 @@ _outRelOptInfo(StringInfo str, const RelOptInfo *node)
 	WRITE_INT_FIELD(width);
 	WRITE_BOOL_FIELD(consider_startup);
 	WRITE_BOOL_FIELD(consider_param_startup);
+	WRITE_BOOL_FIELD(consider_parallel);
 	WRITE_NODE_FIELD(reltargetlist);
 	WRITE_NODE_FIELD(pathlist);
 	WRITE_NODE_FIELD(ppilist);
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 8fc1cfd..105e544 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -21,6 +21,7 @@
 #include "access/tsmapi.h"
 #include "catalog/pg_class.h"
 #include "catalog/pg_operator.h"
+#include "catalog/pg_proc.h"
 #include "foreign/fdwapi.h"
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
@@ -71,6 +72,9 @@ static void set_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 				 Index rti, RangeTblEntry *rte);
 static void set_plain_rel_size(PlannerInfo *root, RelOptInfo *rel,
 				   RangeTblEntry *rte);
+static void set_rel_consider_parallel(PlannerInfo *root, RelOptInfo *rel,
+						  RangeTblEntry *rte);
+static bool function_rte_parallel_ok(RangeTblEntry *rte);
 static void set_plain_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 					   RangeTblEntry *rte);
 static void set_tablesample_rel_size(PlannerInfo *root, RelOptInfo *rel,
@@ -158,7 +162,8 @@ make_one_rel(PlannerInfo *root, List *joinlist)
 	set_base_rel_consider_startup(root);
 
 	/*
-	 * Generate access paths for the base rels.
+	 * Generate access paths for the base rels.  set_base_rel_sizes also
+	 * sets the consider_parallel flag for each baserel, if appropriate.
 	 */
 	set_base_rel_sizes(root);
 	set_base_rel_pathlists(root);
@@ -222,9 +227,12 @@ set_base_rel_consider_startup(PlannerInfo *root)
 /*
  * set_base_rel_sizes
  *	  Set the size estimates (rows and widths) for each base-relation entry.
+ *    Also determine whether to consider parallel paths for base relations.
  *
  * We do this in a separate pass over the base rels so that rowcount
- * estimates are available for parameterized path generation.
+ * estimates are available for parameterized path generation, and also so
+ * that the consider_parallel flag is set correctly before we begin to
+ * generate paths.
  */
 static void
 set_base_rel_sizes(PlannerInfo *root)
@@ -234,6 +242,7 @@ set_base_rel_sizes(PlannerInfo *root)
 	for (rti = 1; rti < root->simple_rel_array_size; rti++)
 	{
 		RelOptInfo *rel = root->simple_rel_array[rti];
+		RangeTblEntry *rte;
 
 		/* there may be empty slots corresponding to non-baserel RTEs */
 		if (rel == NULL)
@@ -245,7 +254,19 @@ set_base_rel_sizes(PlannerInfo *root)
 		if (rel->reloptkind != RELOPT_BASEREL)
 			continue;
 
-		set_rel_size(root, rel, rti, root->simple_rte_array[rti]);
+		rte = root->simple_rte_array[rti];
+
+		/*
+		 * If parallelism is allowable for this query in general, see whether
+		 * it's allowable for this rel in particular.  We have to do this
+		 * before set_rel_size, because if this is an inheritance parent,
+		 * set_append_rel_size will pass the consider_parallel flag down to
+		 * inheritance children.
+		 */
+		if (root->glob->parallelModeOK)
+			set_rel_consider_parallel(root, rel, rte);
+
+		set_rel_size(root, rel, rti, rte);
 	}
 }
 
@@ -459,6 +480,131 @@ set_plain_rel_size(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
 }
 
 /*
+ * If this relation could possibly be scanned from within a worker, then set
+ * the consider_parallel flag.  The flag has previously been initialized to
+ * false, so we just bail out if it becomes clear that we can't safely set it.
+ */
+static void
+set_rel_consider_parallel(PlannerInfo *root, RelOptInfo *rel,
+						  RangeTblEntry *rte)
+{
+	/* Don't call this if parallelism is disallowed for the entire query. */
+	Assert(root->glob->parallelModeOK);
+
+	/* Don't call this for non-baserels. */
+	Assert(rel->reloptkind == RELOPT_BASEREL);
+
+	/* Assorted checks based on rtekind. */
+	switch (rte->rtekind)
+	{
+		case RTE_RELATION:
+			/*
+			 * Currently, parallel workers can't access the leader's temporary
+			 * tables.  We could possibly relax this if the leader wrote all of its
+			 * local buffers at the start of the query and made no changes
+			 * thereafter (maybe we could allow hint bit changes), and if we
+			 * taught the workers to read them.  Writing a large number of
+			 * temporary buffers could be expensive, though, and we don't have
+			 * the rest of the necessary infrastructure right now anyway.  So
+			 * for now, bail out if we see a temporary table.
+			 */
+			if (get_rel_persistence(rte->relid) == RELPERSISTENCE_TEMP)
+				return;
+
+			/*
+			 * Table sampling can be pushed down to workers if the sample
+			 * function and its arguments are safe.
+			 */
+			if (rte->tablesample != NULL)
+			{
+				Oid	proparallel = func_parallel(rte->tablesample->tsmhandler);
+
+				if (proparallel != PROPARALLEL_SAFE)
+					return;
+				if (has_parallel_hazard((Node *) rte->tablesample->args,
+										false))
+					return;
+				return;
+			}
+			break;
+
+		case RTE_SUBQUERY:
+			/*
+			 * Subplans currently aren't passed to workers.  Even if they
+			 * were, the subplan might be using parallelism internally, and
+			 * we can't support nested Gather nodes at present.  Finally,
+			 * we don't have a good way of knowing whether the subplan
+			 * involves any parallel-restricted operations.  It would be
+			 * nice to relax this restriction some day, but it's going to
+			 * take a fair amount of work.
+			 */
+			return;
+
+		case RTE_JOIN:
+			/* Shouldn't happen; we're only considering baserels here. */
+			Assert(false);
+			return;
+
+		case RTE_FUNCTION:
+			/* Check for parallel-restricted functions. */
+			if (!function_rte_parallel_ok(rte))
+				return;
+			break;
+
+		case RTE_VALUES:
+			/*
+			 * The data for a VALUES clause is stored in the plan tree itself,
+			 * so scanning it in a worker is fine.
+			 */
+			break;
+
+		case RTE_CTE:
+			/*
+			 * CTE tuplestores aren't shared among parallel workers, so we
+			 * force all CTE scans to happen in the leader.  Also, populating
+			 * the CTE would require executing a subplan that's not available
+			 * in the worker, might be parallel-restricted, and must get
+			 * executed only once.
+			 */
+			return;
+	}
+
+	/*
+	 * If there's anything in baserestrictinfo that's parallel-restricted,
+	 * we give up on parallelizing access to this relation.  We could consider
+	 * instead postponing application of the restricted quals until we're
+	 * above all the parallelism in the plan tree, but it's not clear that
+	 * this would be a win in very many cases, and it might be tricky to make
+	 * outer join clauses work correctly.
+	 */
+	if (has_parallel_hazard((Node *) rel->baserestrictinfo, false))
+		return;
+
+	/* We have a winner. */
+	rel->consider_parallel = true;
+}
+
+/*
+ * Check whether a function RTE is scanning something parallel-restricted.
+ */
+static bool
+function_rte_parallel_ok(RangeTblEntry *rte)
+{
+	ListCell   *lc;
+
+	foreach(lc, rte->functions)
+	{
+		RangeTblFunction *rtfunc = (RangeTblFunction *) lfirst(lc);
+
+		Assert(IsA(rtfunc, RangeTblFunction));
+		if (has_parallel_hazard(rtfunc->funcexpr, false))
+			return false;
+	}
+
+	return true;
+}
+
+/*
  * set_plain_rel_pathlist
  *	  Build access paths for a plain relation (no subquery, no inheritance)
  */
@@ -714,6 +860,9 @@ set_append_rel_size(PlannerInfo *root, RelOptInfo *rel,
 			continue;
 		}
 
+		/* Copy consider_parallel flag from parent. */
+		childrel->consider_parallel = rel->consider_parallel;
+
 		/*
 		 * CE failed, so finish copying/modifying targetlist and join quals.
 		 *
diff --git a/src/backend/optimizer/plan/planmain.c b/src/backend/optimizer/plan/planmain.c
index 848df97..d73e7c0 100644
--- a/src/backend/optimizer/plan/planmain.c
+++ b/src/backend/optimizer/plan/planmain.c
@@ -20,6 +20,7 @@
  */
 #include "postgres.h"
 
+#include "optimizer/clauses.h"
 #include "optimizer/orclauses.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
@@ -70,6 +71,17 @@ query_planner(PlannerInfo *root, List *tlist,
 		/* We need a dummy joinrel to describe the empty set of baserels */
 		final_rel = build_empty_join_rel(root);
 
+		/*
+		 * If query allows parallelism in general, check whether the quals
+		 * are parallel-restricted.  There's currently no real benefit to
+		 * setting this flag correctly because we can't yet reference subplans
+		 * from parallel workers.  But that might change someday, so set this
+		 * correctly anyway.
+		 */
+		if (root->glob->parallelModeOK)
+			final_rel->consider_parallel =
+				!has_parallel_hazard(parse->jointree->quals, false);
+
 		/* The only path for it is a trivial Result path */
 		add_path(final_rel, (Path *)
 				 create_result_path((List *) parse->jointree->quals));
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 569fe55..bddd7ba 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -239,7 +239,8 @@ standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
 	/*
 	 * Assess whether it's feasible to use parallel mode for this query.
 	 * We can't do this in a standalone backend, or if the command will
-	 * try to modify any data, or if this is a cursor operation, or if any
+	 * try to modify any data, or if this is a cursor operation, or if
+	 * GUCs are set to values that don't permit parallelism, or if
 	 * parallel-unsafe functions are present in the query tree.
 	 *
 	 * For now, we don't try to use parallel mode if we're running inside
@@ -258,9 +259,9 @@ standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
 	glob->parallelModeOK = (cursorOptions & CURSOR_OPT_PARALLEL_OK) != 0 &&
 		IsUnderPostmaster && dynamic_shared_memory_type != DSM_IMPL_NONE &&
 		parse->commandType == CMD_SELECT && !parse->hasModifyingCTE &&
-		parse->utilityStmt == NULL && !IsParallelWorker() &&
-		!IsolationIsSerializable() &&
-		!contain_parallel_unsafe((Node *) parse);
+		parse->utilityStmt == NULL && max_parallel_degree > 0 &&
+		!IsParallelWorker() && !IsolationIsSerializable() &&
+		!has_parallel_hazard((Node *) parse, true);
 
 	/*
 	 * glob->parallelModeOK should tell us whether it's necessary to impose
diff --git a/src/backend/optimizer/util/clauses.c b/src/backend/optimizer/util/clauses.c
index f2c8551..915c8a4 100644
--- a/src/backend/optimizer/util/clauses.c
+++ b/src/backend/optimizer/util/clauses.c
@@ -21,6 +21,7 @@
 
 #include "access/htup_details.h"
 #include "catalog/pg_aggregate.h"
+#include "catalog/pg_class.h"
 #include "catalog/pg_language.h"
 #include "catalog/pg_operator.h"
 #include "catalog/pg_proc.h"
@@ -87,6 +88,11 @@ typedef struct
 	char	   *prosrc;
 } inline_error_callback_arg;
 
+typedef struct
+{
+	bool		allow_restricted;
+} has_parallel_hazard_arg;
+
 static bool contain_agg_clause_walker(Node *node, void *context);
 static bool count_agg_clauses_walker(Node *node,
 						 count_agg_clauses_context *context);
@@ -96,7 +102,11 @@ static bool contain_subplans_walker(Node *node, void *context);
 static bool contain_mutable_functions_walker(Node *node, void *context);
 static bool contain_volatile_functions_walker(Node *node, void *context);
 static bool contain_volatile_functions_not_nextval_walker(Node *node, void *context);
-static bool contain_parallel_unsafe_walker(Node *node, void *context);
+static bool has_parallel_hazard_walker(Node *node,
+				has_parallel_hazard_arg *context);
+static bool parallel_too_dangerous(char proparallel,
+				has_parallel_hazard_arg *context);
+static bool typeid_is_temp(Oid typeid);
 static bool contain_nonstrict_functions_walker(Node *node, void *context);
 static bool contain_leaked_vars_walker(Node *node, void *context);
 static Relids find_nonnullable_rels_walker(Node *node, bool top_level);
@@ -1200,63 +1210,159 @@ contain_volatile_functions_not_nextval_walker(Node *node, void *context)
 }
 
 /*****************************************************************************
- *		Check queries for parallel-unsafe constructs
+ *		Check queries for parallel unsafe and/or restricted constructs
  *****************************************************************************/
 
+/*
+ * Check whether a node tree contains parallel hazards.  This is used both
+ * on the entire query tree, to see whether the query can be parallelized at
+ * all, and also to evaluate whether a particular expression is safe to
+ * run in a parallel worker.  We could separate these concerns into two
+ * different functions, but there's enough overlap that it doesn't seem
+ * worthwhile.
+ */
 bool
-contain_parallel_unsafe(Node *node)
+has_parallel_hazard(Node *node, bool allow_restricted)
 {
-	return contain_parallel_unsafe_walker(node, NULL);
+	has_parallel_hazard_arg	context;
+
+	context.allow_restricted = allow_restricted;
+	return has_parallel_hazard_walker(node, &context);
 }
 
 static bool
-contain_parallel_unsafe_walker(Node *node, void *context)
+has_parallel_hazard_walker(Node *node, has_parallel_hazard_arg *context)
 {
 	if (node == NULL)
 		return false;
+
+	/*
+	 * When we're first invoked on a completely unplanned tree, we must
+	 * recurse through Query objects so as to locate parallel-unsafe
+	 * constructs anywhere in the tree.
+	 *
+	 * Later, we'll be called again for specific quals, possibly after
+	 * some planning has been done, and we may encounter SubPlan, SubLink,
+	 * or AlternativeSubPlan nodes.  Currently, there's no need to recurse
+	 * through these; they can't be unsafe, since we've already cleared
+	 * the entire query of unsafe operations, and they're definitely
+	 * parallel-restricted.
+	 */
+	if (IsA(node, Query))
+	{
+		Query *query = (Query *) node;
+
+		if (query->rowMarks != NULL)
+			return true;
+
+		/* Recurse into subselects */
+		return query_tree_walker(query,
+								 has_parallel_hazard_walker,
+								 context, 0);
+	}
+	else if (IsA(node, SubPlan) || IsA(node, SubLink) ||
+			 IsA(node, AlternativeSubPlan) || IsA(node, Param))
+	{
+		/*
+		 * Since we don't have the ability to push subplans down to workers
+		 * at present, we treat subplan references as parallel-restricted.
+		 */
+		if (!context->allow_restricted)
+			return true;
+	}
+
+	/* This is just a notational convenience for callers. */
+	if (IsA(node, RestrictInfo))
+	{
+		RestrictInfo *rinfo = (RestrictInfo *) node;
+		return has_parallel_hazard_walker((Node *) rinfo->clause, context);
+	}
+
+	/*
+	 * It is an error for a parallel worker to touch a temporary table in any
+	 * way, so we can't handle nodes whose type is the rowtype of such a table.
+	 */
+	if (!context->allow_restricted)
+	{
+		switch (nodeTag(node))
+		{
+			case T_Var:
+			case T_Const:
+			case T_Param:
+			case T_Aggref:
+			case T_WindowFunc:
+			case T_ArrayRef:
+			case T_FuncExpr:
+			case T_NamedArgExpr:
+			case T_OpExpr:
+			case T_DistinctExpr:
+			case T_NullIfExpr:
+			case T_FieldSelect:
+			case T_FieldStore:
+			case T_RelabelType:
+			case T_CoerceViaIO:
+			case T_ArrayCoerceExpr:
+			case T_ConvertRowtypeExpr:
+			case T_CaseExpr:
+			case T_CaseTestExpr:
+			case T_ArrayExpr:
+			case T_RowExpr:
+			case T_CoalesceExpr:
+			case T_MinMaxExpr:
+			case T_CoerceToDomain:
+			case T_CoerceToDomainValue:
+			case T_SetToDefault:
+				if (typeid_is_temp(exprType(node)))
+					return true;
+				break;
+			default:
+				break;
+		}
+	}
+
+	/*
+	 * For each node that might potentially call a function, we need to
+	 * examine the pg_proc.proparallel marking for that function to see
+	 * whether it's safe enough for the current value of allow_restricted.
+	 */
 	if (IsA(node, FuncExpr))
 	{
 		FuncExpr   *expr = (FuncExpr *) node;
 
-		if (func_parallel(expr->funcid) == PROPARALLEL_UNSAFE)
+		if (parallel_too_dangerous(func_parallel(expr->funcid), context))
 			return true;
-		/* else fall through to check args */
 	}
 	else if (IsA(node, OpExpr))
 	{
 		OpExpr	   *expr = (OpExpr *) node;
 
 		set_opfuncid(expr);
-		if (func_parallel(expr->opfuncid) == PROPARALLEL_UNSAFE)
+		if (parallel_too_dangerous(func_parallel(expr->opfuncid), context))
 			return true;
-		/* else fall through to check args */
 	}
 	else if (IsA(node, DistinctExpr))
 	{
 		DistinctExpr *expr = (DistinctExpr *) node;
 
 		set_opfuncid((OpExpr *) expr);	/* rely on struct equivalence */
-		if (func_parallel(expr->opfuncid) == PROPARALLEL_UNSAFE)
+		if (parallel_too_dangerous(func_parallel(expr->opfuncid), context))
 			return true;
-		/* else fall through to check args */
 	}
 	else if (IsA(node, NullIfExpr))
 	{
 		NullIfExpr *expr = (NullIfExpr *) node;
 
 		set_opfuncid((OpExpr *) expr);	/* rely on struct equivalence */
-		if (func_parallel(expr->opfuncid) == PROPARALLEL_UNSAFE)
+		if (parallel_too_dangerous(func_parallel(expr->opfuncid), context))
 			return true;
-		/* else fall through to check args */
 	}
 	else if (IsA(node, ScalarArrayOpExpr))
 	{
 		ScalarArrayOpExpr *expr = (ScalarArrayOpExpr *) node;
 
 		set_sa_opfuncid(expr);
-		if (func_parallel(expr->opfuncid) == PROPARALLEL_UNSAFE)
+		if (parallel_too_dangerous(func_parallel(expr->opfuncid), context))
 			return true;
-		/* else fall through to check args */
 	}
 	else if (IsA(node, CoerceViaIO))
 	{
@@ -1268,54 +1374,61 @@ contain_parallel_unsafe_walker(Node *node, void *context)
 		/* check the result type's input function */
 		getTypeInputInfo(expr->resulttype,
 						 &iofunc, &typioparam);
-		if (func_parallel(iofunc) == PROPARALLEL_UNSAFE)
+		if (parallel_too_dangerous(func_parallel(iofunc), context))
 			return true;
 		/* check the input type's output function */
 		getTypeOutputInfo(exprType((Node *) expr->arg),
 						  &iofunc, &typisvarlena);
-		if (func_parallel(iofunc) == PROPARALLEL_UNSAFE)
+		if (parallel_too_dangerous(func_parallel(iofunc), context))
 			return true;
-		/* else fall through to check args */
 	}
 	else if (IsA(node, ArrayCoerceExpr))
 	{
 		ArrayCoerceExpr *expr = (ArrayCoerceExpr *) node;
 
 		if (OidIsValid(expr->elemfuncid) &&
-			func_parallel(expr->elemfuncid) == PROPARALLEL_UNSAFE)
+			parallel_too_dangerous(func_parallel(expr->elemfuncid), context))
 			return true;
-		/* else fall through to check args */
 	}
 	else if (IsA(node, RowCompareExpr))
 	{
-		/* RowCompare probably can't have volatile ops, but check anyway */
 		RowCompareExpr *rcexpr = (RowCompareExpr *) node;
 		ListCell   *opid;
 
 		foreach(opid, rcexpr->opnos)
 		{
-			if (op_volatile(lfirst_oid(opid)) == PROPARALLEL_UNSAFE)
+			Oid	opfuncid = get_opcode(lfirst_oid(opid));
+			if (parallel_too_dangerous(func_parallel(opfuncid), context))
 				return true;
 		}
-		/* else fall through to check args */
 	}
-	else if (IsA(node, Query))
-	{
-		Query *query = (Query *) node;
 
-		if (query->rowMarks != NULL)
-			return true;
-
-		/* Recurse into subselects */
-		return query_tree_walker(query,
-								 contain_parallel_unsafe_walker,
-								 context, 0);
-	}
+	/* ... and recurse to check substructure */
 	return expression_tree_walker(node,
-								  contain_parallel_unsafe_walker,
+								  has_parallel_hazard_walker,
 								  context);
 }
 
+static bool
+parallel_too_dangerous(char proparallel, has_parallel_hazard_arg *context)
+{
+	if (context->allow_restricted)
+		return proparallel == PROPARALLEL_UNSAFE;
+	else
+		return proparallel != PROPARALLEL_SAFE;
+}
+
+static bool
+typeid_is_temp(Oid typeid)
+{
+	Oid				relid = get_typ_typrelid(typeid);
+
+	if (!OidIsValid(relid))
+		return false;
+
+	return (get_rel_persistence(relid) == RELPERSISTENCE_TEMP);
+}
+
 /*****************************************************************************
  *		Check clauses for nonstrict functions
  *****************************************************************************/
diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c
index 68a93a1..996b7fe 100644
--- a/src/backend/optimizer/util/relnode.c
+++ b/src/backend/optimizer/util/relnode.c
@@ -14,6 +14,7 @@
  */
 #include "postgres.h"
 
+#include "optimizer/clauses.h"
 #include "optimizer/cost.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
@@ -102,6 +103,7 @@ build_simple_rel(PlannerInfo *root, int relid, RelOptKind reloptkind)
 	/* cheap startup cost is interesting iff not all tuples to be retrieved */
 	rel->consider_startup = (root->tuple_fraction > 0);
 	rel->consider_param_startup = false;		/* might get changed later */
+	rel->consider_parallel = false;				/* might get changed later */
 	rel->reltargetlist = NIL;
 	rel->pathlist = NIL;
 	rel->ppilist = NIL;
@@ -363,6 +365,7 @@ build_join_rel(PlannerInfo *root,
 	/* cheap startup cost is interesting iff not all tuples to be retrieved */
 	joinrel->consider_startup = (root->tuple_fraction > 0);
 	joinrel->consider_param_startup = false;
+	joinrel->consider_parallel = false;
 	joinrel->reltargetlist = NIL;
 	joinrel->pathlist = NIL;
 	joinrel->ppilist = NIL;
@@ -442,6 +445,24 @@ build_join_rel(PlannerInfo *root,
 							   sjinfo, restrictlist);
 
 	/*
+	 * Set the consider_parallel flag if this joinrel could potentially be
+	 * scanned within a parallel worker.  If this flag is false for either
+	 * inner_rel or outer_rel, then it must be false for the joinrel also.
+	 * Even if both are true, there might be parallel-restricted quals at
+	 * our level.
+	 *
+	 * Note that if there are more than two rels in this relation, they
+	 * could be divided between inner_rel and outer_rel in any arbitrary
+	 * way.  We assume this doesn't matter, because we should hit all the
+	 * same baserels and joinclauses while building up to this joinrel no
+	 * matter which we take; therefore, we should make the same decision
+	 * here however we get here.
+	 */
+	if (inner_rel->consider_parallel && outer_rel->consider_parallel &&
+		!has_parallel_hazard((Node *) restrictlist, false))
+		joinrel->consider_parallel = true;
+
+	/*
 	 * Add the joinrel to the query's joinrel list, and store it into the
 	 * auxiliary hashtable if there is one.  NB: GEQO requires us to append
 	 * the new joinrel to the end of the list!
diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c
index 8d1cdf1..093da76 100644
--- a/src/backend/utils/cache/lsyscache.c
+++ b/src/backend/utils/cache/lsyscache.c
@@ -1787,6 +1787,28 @@ get_rel_tablespace(Oid relid)
 		return InvalidOid;
 }
 
+/*
+ * get_rel_persistence
+ *
+ *		Returns the relpersistence associated with a given relation.
+ */
+char
+get_rel_persistence(Oid relid)
+{
+	HeapTuple		tp;
+	Form_pg_class	reltup;
+	char 			result;
+
+	tp = SearchSysCache1(RELOID, ObjectIdGetDatum(relid));
+	if (!HeapTupleIsValid(tp))
+		elog(ERROR, "cache lookup failed for relation %u", relid);
+	reltup = (Form_pg_class) GETSTRUCT(tp);
+	result = reltup->relpersistence;
+	ReleaseSysCache(tp);
+
+	return result;
+}
+
 
 /*				---------- TRANSFORM CACHE ----------						 */
 
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 6cf2e24..41be9b1 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -452,6 +452,7 @@ typedef struct RelOptInfo
 	/* per-relation planner control flags */
 	bool		consider_startup;		/* keep cheap-startup-cost paths? */
 	bool		consider_param_startup; /* ditto, for parameterized paths? */
+	bool		consider_parallel;		/* consider parallel paths? */
 
 	/* materialization information */
 	List	   *reltargetlist;	/* Vars to be output by scan of relation */
diff --git a/src/include/optimizer/clauses.h b/src/include/optimizer/clauses.h
index 5ac79b1..323f093 100644
--- a/src/include/optimizer/clauses.h
+++ b/src/include/optimizer/clauses.h
@@ -62,7 +62,7 @@ extern bool contain_subplans(Node *clause);
 extern bool contain_mutable_functions(Node *clause);
 extern bool contain_volatile_functions(Node *clause);
 extern bool contain_volatile_functions_not_nextval(Node *clause);
-extern bool contain_parallel_unsafe(Node *node);
+extern bool has_parallel_hazard(Node *node, bool allow_restricted);
 extern bool contain_nonstrict_functions(Node *clause);
 extern bool contain_leaked_vars(Node *clause);
 
diff --git a/src/include/utils/lsyscache.h b/src/include/utils/lsyscache.h
index 450d9fe..dcc421f 100644
--- a/src/include/utils/lsyscache.h
+++ b/src/include/utils/lsyscache.h
@@ -103,6 +103,7 @@ extern Oid	get_rel_namespace(Oid relid);
 extern Oid	get_rel_type_id(Oid relid);
 extern char get_rel_relkind(Oid relid);
 extern Oid	get_rel_tablespace(Oid relid);
+extern char get_rel_persistence(Oid relid);
 extern Oid	get_transform_fromsql(Oid typid, Oid langid, List *trftypes);
 extern Oid	get_transform_tosql(Oid typid, Oid langid, List *trftypes);
 extern bool get_typisdefined(Oid typid);
-- 
2.3.8 (Apple Git-58)

#441Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#440)
Re: Parallel Seq Scan

On Sat, Nov 7, 2015 at 4:11 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Oct 23, 2015 at 9:22 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

The base rel's consider_parallel flag won't be percolated to childrels, so
even if we mark base rel as parallel capable, while generating the path it
won't be considered. I think we need to find a way to pass on that
information if we want to follow this way.

Fixed in the attached version. I added a max_parallel_degree check,
too, per your suggestion.

+ else if (IsA(node, SubPlan) || IsA(node, SubLink) ||
+ IsA(node, AlternativeSubPlan) || IsA(node, Param))
+ {
+ /*
+ * Since we don't have the ability to push subplans down to workers
+ * at present, we treat subplan references as parallel-restricted.
+ */
+ if (!context->allow_restricted)
+ return true;
+ }

You seem to have agreed upthread to change this check for PARAM_EXEC
paramkind. I think you have forgotten to change this code.

@@ -714,6 +860,9 @@ set_append_rel_size(PlannerInfo *root, RelOptInfo *rel,
continue;
}

+ /* Copy consider_parallel flag from parent. */
+ childrel->consider_parallel = rel->consider_parallel;
+

We are changing the childrel quals in this function, and in
adjust_appendrel_attrs()->adjust_appendrel_attrs_mutator() we add an
extra conversion step, in one of the cases, to the Var participating in
the qualification. So even though it is probably safe to assume that this
won't add any new parallel-restricted or parallel-unsafe expression to the
qual, ideally we should check the childrel quals for parallel safety
separately, as they might not be identical to those of the parent rel.
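
A rough sketch of that suggestion, assuming hypothetical stand-in types (none of the names below exist in PostgreSQL): instead of copying the parent's flag verbatim, the child is marked only if the parent is parallel-capable and the child's own, possibly rewritten, quals contain nothing restricted or unsafe.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical three-level marking, mirroring safe/restricted/unsafe. */
typedef enum SketchHazard
{
    SKETCH_SAFE,
    SKETCH_RESTRICTED,
    SKETCH_UNSAFE
} SketchHazard;

/* Stand-in for one qual clause, reduced to its worst marking. */
typedef struct SketchQual
{
    SketchHazard hazard;
} SketchQual;

/* Stand-in for a RelOptInfo with its restriction quals. */
typedef struct SketchRel
{
    bool        consider_parallel;
    const SketchQual *quals;
    size_t      nquals;
} SketchRel;

/* True if any qual is not fully parallel-safe (restricted counts too). */
bool
sketch_quals_have_hazard(const SketchQual *quals, size_t nquals)
{
    for (size_t i = 0; i < nquals; i++)
    {
        if (quals[i].hazard != SKETCH_SAFE)
            return true;
    }
    return false;
}

/* Inherit the parent's decision, then re-check the child's own quals. */
void
sketch_set_child_consider_parallel(const SketchRel *parent, SketchRel *child)
{
    assert(parent != NULL && child != NULL);
    child->consider_parallel =
        parent->consider_parallel &&
        !sketch_quals_have_hazard(child->quals, child->nquals);
}
```

The extra walk over the child quals costs something per child, so whether it is worth doing depends on whether the translation step can ever introduce a hazard.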

True, we can do that way. What I was trying to convey by above is
that we anyway need checks during path creation at least in some
of the cases, so why not do all the checks at that time only as I
think all the information will be available at that time.

I think if we store parallelism related info in RelOptInfo, that can also
be made to work, but the only worry I have with that approach is we
need to have checks at two levels one at RelOptInfo formation time
and other at Path formation time.

I don't really see that as a problem. What I'm thinking about doing
(but it's not implemented in the attached patch) is additionally
adding a ppi_consider_parallel flag to the ParamPathInfo. This would
be meaningful only for baserels, and would indicate whether the
ParamPathInfo's ppi_clauses are parallel-safe.

If we're thinking about adding a parallel path to a baserel, we need
the RelOptInfo to have consider_parallel set and, if there is a
ParamPathInfo, we need the ParamPathInfo's ppi_consider_parallel flag
to be set also. That shows that both the rel's baserestrictinfo and
the parameterized join clauses are parallel-safe. For a joinrel, we
can add a path if (1) the joinrel has consider_parallel set and (2)
the paths being joined are parallel-safe. Testing condition (2) will
require a per-Path flag, so we'll end up with one flag in the
RelOptInfo, a second in the ParamPathInfo, and a third in the Path.

I am already adding a parallel_aware flag in the patch to make the seqscan
node parallel_aware, so if you find that patch better compared to the partial
seq scan node patch, then I can take care of the above in that patch.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#442Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#439)
1 attachment(s)
Re: Parallel Seq Scan

On Fri, Nov 6, 2015 at 10:13 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Nov 5, 2015 at 10:49 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Nov 5, 2015 at 11:54 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I was thinking about this idea:

1. Add a parallel_aware flag to each plan.

Okay, so shall we add it in generic Plan node or to specific plan nodes
like SeqScan, IndexScan, etc. To me, it appears that parallelism is
a node specific property, so we should add it to specific nodes and
for now as we are parallelising seq scan, so we can add this flag in
SeqScan node. What do you say?

I think it should go in the Plan node itself. Parallel Append is
going to need a way to test whether a node is parallel-aware, and
there's nothing simpler than if (plan->parallel_aware). That makes
life simple for EXPLAIN, too.

Okay, I have updated the patch to make the seq scan node parallel-aware.
To make that happen we need a parallel_aware flag in both Plan and
Path, so that we can pass that information from Path to Plan.
I think the right place to copy the parallel_aware info from path to
plan is copy_path_costsize, and ideally we should rename the
function to something like copy_generic_path_info(), but for
now I have retained its original name as it is used in many places.
Let me know if you think we should go ahead and change the name
of the function as well.
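
A minimal sketch of the Path-to-Plan copy described above, using simplified
stand-in structs. The name copy_generic_path_info() is only the rename
suggested in this mail, and the flat width field is an assumption (the real
code reads the width from the path's parent rel).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef double Cost;

/* Simplified stand-ins; the real Path and Plan nodes have many more fields. */
typedef struct Path
{
	Cost		startup_cost;
	Cost		total_cost;
	double		rows;
	int			width;			/* assumption: flattened from parent->width */
	bool		parallel_aware;
} Path;

typedef struct Plan
{
	Cost		startup_cost;
	Cost		total_cost;
	double		plan_rows;
	int			plan_width;
	bool		parallel_aware;
} Plan;

/* Copy cost, size and parallel-awareness from a Path to its Plan. */
static void
copy_generic_path_info(Plan *dest, const Path *src)
{
	assert(dest != NULL);
	if (src != NULL)
	{
		dest->startup_cost = src->startup_cost;
		dest->total_cost = src->total_cost;
		dest->plan_rows = src->rows;
		dest->plan_width = src->width;
		dest->parallel_aware = src->parallel_aware;
	}
	else
	{
		/* No path available: fall back to zeroed, non-parallel defaults. */
		dest->startup_cost = 0;
		dest->total_cost = 0;
		dest->plan_rows = 0;
		dest->plan_width = 0;
		dest->parallel_aware = false;
	}
}
```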

I have changed EXPLAIN as well so that it prints "Parallel Seq Scan" when
the SeqScan node is parallel_aware.
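
The naming rule can be sketched in isolation as follows. The Plan struct is a
stand-in and seqscan_display_name() is a hypothetical helper for this example;
in the patch the same choice is made inline in ExplainNode().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the generic Plan node. */
typedef struct Plan
{
	bool		parallel_aware;
} Plan;

/* Hypothetical helper: pick the node name EXPLAIN would display. */
static const char *
seqscan_display_name(const Plan *plan)
{
	assert(plan != NULL);
	return plan->parallel_aware ? "Parallel Seq Scan" : "Seq Scan";
}
```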

I have not integrated it with the consider-parallel patch, so that this and
the Partial Seq Scan version of the patch can be compared without much
difficulty.

Thoughts?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

parallel_seqscan_partialseqscan_v25.patch (application/octet-stream)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 7fb8a14..3d158d1 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -848,7 +848,11 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			sname = "Hash Join";
 			break;
 		case T_SeqScan:
-			pname = sname = "Seq Scan";
+			sname = "Seq Scan";
+			if (plan->parallel_aware)
+				pname = "Parallel Seq Scan";
+			else
+				pname = sname;
 			break;
 		case T_SampleScan:
 			pname = sname = "Sample Scan";
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 163650c..580c654 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -462,6 +462,10 @@ ExecSupportsBackwardScan(Plan *node)
 			}
 
 		case T_SeqScan:
+			{
+				if (node->parallel_aware)
+					return false;
+			}
 		case T_TidScan:
 		case T_FunctionScan:
 		case T_ValuesScan:
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 99a9de3..c4954b6 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -25,6 +25,7 @@
 
 #include "executor/execParallel.h"
 #include "executor/executor.h"
+#include "executor/nodeSeqscan.h"
 #include "executor/tqueue.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/planmain.h"
@@ -167,10 +168,16 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 	/* Count this node. */
 	e->nnodes++;
 
-	/*
-	 * XXX. Call estimators for parallel-aware nodes here, when we have
-	 * some.
-	 */
+	/* Call estimators for parallel-aware nodes. */
+	switch (nodeTag(planstate))
+	{
+		case T_SeqScanState:
+			ExecSeqScanEstimate((SeqScanState *) planstate,
+								e->pcxt);
+			break;
+		default:
+			break;
+	}
 
 	return planstate_tree_walker(planstate, ExecParallelEstimate, e);
 }
@@ -205,10 +212,16 @@ ExecParallelInitializeDSM(PlanState *planstate,
 	/* Count this node. */
 	d->nnodes++;
 
-	/*
-	 * XXX. Call initializers for parallel-aware plan nodes, when we have
-	 * some.
-	 */
+	/* Call initializers for parallel-aware plan nodes. */
+	switch (nodeTag(planstate))
+	{
+		case T_SeqScanState:
+			ExecSeqScanInitializeDSM((SeqScanState *) planstate,
+									 d->pcxt);
+			break;
+		default:
+			break;
+	}
 
 	return planstate_tree_walker(planstate, ExecParallelInitializeDSM, d);
 }
@@ -575,6 +588,32 @@ ExecParallelReportInstrumentation(PlanState *planstate,
 }
 
 /*
+ * Initialize the PlanState and its descendants with the information
+ * retrieved from shared memory.  This has to be done once the PlanState
+ * is allocated and initialized by the executor for each node, i.e. after
+ * ExecutorStart().
+ */
+static bool
+ExecParallelInitializeWorker(PlanState *planstate, shm_toc *toc)
+{
+	if (planstate == NULL)
+		return false;
+
+	/* Call initializers for parallel-aware plan nodes. */
+	switch (nodeTag(planstate))
+	{
+		case T_SeqScanState:
+			ExecSeqScanInitParallelScanDesc((SeqScanState *) planstate,
+											toc);
+			break;
+		default:
+			break;
+	}
+
+	return planstate_tree_walker(planstate, ExecParallelInitializeWorker, toc);
+}
+
+/*
  * Main entrypoint for parallel query worker processes.
  *
  * We reach this function from ParallelMain, so the setup necessary to create
@@ -610,6 +649,7 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
 
 	/* Start up the executor, have it run the plan, and then shut it down. */
 	ExecutorStart(queryDesc, 0);
+	ExecParallelInitializeWorker(queryDesc->planstate, toc);
 	ExecutorRun(queryDesc, ForwardScanDirection, 0L);
 	ExecutorFinish(queryDesc);
 
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 3cb81fc..2e46745 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -1,11 +1,18 @@
 /*-------------------------------------------------------------------------
  *
  * nodeSeqscan.c
- *	  Support routines for sequential scans of relations.
+ *	  Support routines for sequential and parallel sequential scans
+ * of relations.
  *
  * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
  *
+ * A seqscan node can perform both sequential and parallel scans of
+ * relations.  A parallel-aware seqscan node can be executed either in the
+ * master backend or by a worker backend; it initializes the scan on first
+ * execution because it needs the parallel heap scan descriptor, which is
+ * initialized during the first execution of the Gather node.  The logic
+ * to parallelize the scan is encapsulated in the heap layer (see heapam.c).
  *
  * IDENTIFICATION
  *	  src/backend/executor/nodeSeqscan.c
@@ -19,6 +26,9 @@
  *		ExecInitSeqScan			creates and initializes a seqscan node.
  *		ExecEndSeqScan			releases any storage allocated.
  *		ExecReScanSeqScan		rescans the relation
+ *		ExecSeqScanEstimate		estimates the space to serialize seqscan node
+ *		ExecSeqScanInitializeDSM Initializes the DSM to perform parallel scan
+ *		ExecSeqScanInitParallelScanDesc Initializes the parallel scan descriptor
  */
 #include "postgres.h"
 
@@ -53,10 +63,10 @@ SeqNext(SeqScanState *node)
 	/*
 	 * get information from the estate and scan state
 	 */
-	scandesc = node->ss_currentScanDesc;
-	estate = node->ps.state;
+	scandesc = node->ss.ss_currentScanDesc;
+	estate = node->ss.ps.state;
 	direction = estate->es_direction;
-	slot = node->ss_ScanTupleSlot;
+	slot = node->ss.ss_ScanTupleSlot;
 
 	/*
 	 * get the next tuple from the table
@@ -108,6 +118,18 @@ SeqRecheck(SeqScanState *node, TupleTableSlot *slot)
 TupleTableSlot *
 ExecSeqScan(SeqScanState *node)
 {
+	/*
+	 * For a parallel-aware scan, initialize the scan on first execution.
+	 * Normally we initialize it during the ExecutorStart phase, but this
+	 * node needs a ParallelHeapScanDesc to initialize the scan, and that is
+	 * set up by the Gather node during the ExecutorRun phase.
+	 */
+	if (node->ss.ps.plan->parallel_aware && !node->ss.ss_currentScanDesc)
+	{
+		node->ss.ss_currentScanDesc =
+			heap_beginscan_parallel(node->ss.ss_currentRelation, node->pscan);
+	}
+
 	return ExecScan((ScanState *) node,
 					(ExecScanAccessMtd) SeqNext,
 					(ExecScanRecheckMtd) SeqRecheck);
@@ -130,20 +152,23 @@ InitScanRelation(SeqScanState *node, EState *estate, int eflags)
 	 * open that relation and acquire appropriate lock on it.
 	 */
 	currentRelation = ExecOpenScanRelation(estate,
-									  ((SeqScan *) node->ps.plan)->scanrelid,
+									  ((SeqScan *) node->ss.ps.plan)->scanrelid,
 										   eflags);
 
-	/* initialize a heapscan */
-	currentScanDesc = heap_beginscan(currentRelation,
-									 estate->es_snapshot,
-									 0,
-									 NULL);
+	/* initialize a heapscan for non-parallel aware scan */
+	if (!node->ss.ps.plan->parallel_aware)
+	{
+		currentScanDesc = heap_beginscan(currentRelation,
+										 estate->es_snapshot,
+										 0,
+										 NULL);
+		node->ss.ss_currentScanDesc = currentScanDesc;
+	}
 
-	node->ss_currentRelation = currentRelation;
-	node->ss_currentScanDesc = currentScanDesc;
+	node->ss.ss_currentRelation = currentRelation;
 
 	/* and report the scan tuple slot's rowtype */
-	ExecAssignScanType(node, RelationGetDescr(currentRelation));
+	ExecAssignScanType(&node->ss, RelationGetDescr(currentRelation));
 }
 
 
@@ -167,44 +192,44 @@ ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
 	 * create state structure
 	 */
 	scanstate = makeNode(SeqScanState);
-	scanstate->ps.plan = (Plan *) node;
-	scanstate->ps.state = estate;
+	scanstate->ss.ps.plan = (Plan *) node;
+	scanstate->ss.ps.state = estate;
 
 	/*
 	 * Miscellaneous initialization
 	 *
 	 * create expression context for node
 	 */
-	ExecAssignExprContext(estate, &scanstate->ps);
+	ExecAssignExprContext(estate, &scanstate->ss.ps);
 
 	/*
 	 * initialize child expressions
 	 */
-	scanstate->ps.targetlist = (List *)
+	scanstate->ss.ps.targetlist = (List *)
 		ExecInitExpr((Expr *) node->plan.targetlist,
 					 (PlanState *) scanstate);
-	scanstate->ps.qual = (List *)
+	scanstate->ss.ps.qual = (List *)
 		ExecInitExpr((Expr *) node->plan.qual,
 					 (PlanState *) scanstate);
 
 	/*
 	 * tuple table initialization
 	 */
-	ExecInitResultTupleSlot(estate, &scanstate->ps);
-	ExecInitScanTupleSlot(estate, scanstate);
+	ExecInitResultTupleSlot(estate, &scanstate->ss.ps);
+	ExecInitScanTupleSlot(estate, &scanstate->ss);
 
 	/*
 	 * initialize scan relation
 	 */
 	InitScanRelation(scanstate, estate, eflags);
 
-	scanstate->ps.ps_TupFromTlist = false;
+	scanstate->ss.ps.ps_TupFromTlist = false;
 
 	/*
 	 * Initialize result tuple type and projection info.
 	 */
-	ExecAssignResultTypeFromTL(&scanstate->ps);
-	ExecAssignScanProjectionInfo(scanstate);
+	ExecAssignResultTypeFromTL(&scanstate->ss.ps);
+	ExecAssignScanProjectionInfo(&scanstate->ss);
 
 	return scanstate;
 }
@@ -224,24 +249,25 @@ ExecEndSeqScan(SeqScanState *node)
 	/*
 	 * get information from node
 	 */
-	relation = node->ss_currentRelation;
-	scanDesc = node->ss_currentScanDesc;
+	relation = node->ss.ss_currentRelation;
+	scanDesc = node->ss.ss_currentScanDesc;
 
 	/*
 	 * Free the exprcontext
 	 */
-	ExecFreeExprContext(&node->ps);
+	ExecFreeExprContext(&node->ss.ps);
 
 	/*
 	 * clean out the tuple table
 	 */
-	ExecClearTuple(node->ps.ps_ResultTupleSlot);
-	ExecClearTuple(node->ss_ScanTupleSlot);
+	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+	ExecClearTuple(node->ss.ss_ScanTupleSlot);
 
 	/*
 	 * close heap scan
 	 */
-	heap_endscan(scanDesc);
+	if (scanDesc)
+		heap_endscan(scanDesc);
 
 	/*
 	 * close the heap relation.
@@ -265,10 +291,74 @@ ExecReScanSeqScan(SeqScanState *node)
 {
 	HeapScanDesc scan;
 
-	scan = node->ss_currentScanDesc;
+	scan = node->ss.ss_currentScanDesc;
 
-	heap_rescan(scan,			/* scan desc */
-				NULL);			/* new scan keys */
+	if (scan)
+		heap_rescan(scan,			/* scan desc */
+					NULL);			/* new scan keys */
 
 	ExecScanReScan((ScanState *) node);
 }
+
+/* ----------------------------------------------------------------
+ *						Parallel Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecSeqScanEstimate
+ *
+ *		estimates the space required to serialize seqscan node.
+ * ----------------------------------------------------------------
+ */
+void
+ExecSeqScanEstimate(SeqScanState *node,
+					ParallelContext *pcxt)
+{
+	EState	   *estate = node->ss.ps.state;
+
+	node->pscan_len = heap_parallelscan_estimate(estate->es_snapshot);
+	shm_toc_estimate_chunk(&pcxt->estimator, node->pscan_len);
+
+	/* key for partial scan information. */
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecSeqScanInitializeDSM
+ *
+ *		Initialize the DSM with the contents required to perform
+ *		parallel seqscan.
+ * ----------------------------------------------------------------
+ */
+void
+ExecSeqScanInitializeDSM(SeqScanState *node,
+						 ParallelContext *pcxt)
+{
+	EState	   *estate = node->ss.ps.state;
+
+	/*
+	 * Store parallel heap scan descriptor in dynamic shared memory.
+	 */
+	node->pscan = shm_toc_allocate(pcxt->toc, node->pscan_len);
+	heap_parallelscan_initialize(node->pscan,
+								 node->ss.ss_currentRelation,
+								 estate->es_snapshot);
+	shm_toc_insert(pcxt->toc,
+				   node->ss.ps.plan->plan_node_id,
+				   node->pscan);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecSeqScanInitParallelScanDesc
+ *
+ *		Retrieve the contents from DSM related to seq scan node
+ *		and initialize the seqscan node.
+ * ----------------------------------------------------------------
+ */
+void
+ExecSeqScanInitParallelScanDesc(SeqScanState *node,
+								shm_toc *toc)
+{
+	node->pscan = shm_toc_lookup(toc, node->ss.ps.plan->plan_node_id);
+}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index c176ff9..bb5cfca 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -113,6 +113,7 @@ CopyPlanFields(const Plan *from, Plan *newnode)
 	COPY_SCALAR_FIELD(plan_rows);
 	COPY_SCALAR_FIELD(plan_width);
 	COPY_SCALAR_FIELD(plan_node_id);
+	COPY_SCALAR_FIELD(parallel_aware);
 	COPY_NODE_FIELD(targetlist);
 	COPY_NODE_FIELD(qual);
 	COPY_NODE_FIELD(lefttree);
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 3e75cd1..23ce9fe 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -272,6 +272,7 @@ _outPlanInfo(StringInfo str, const Plan *node)
 	WRITE_FLOAT_FIELD(plan_rows, "%.0f");
 	WRITE_INT_FIELD(plan_width);
 	WRITE_INT_FIELD(plan_node_id);
+	WRITE_BOOL_FIELD(parallel_aware);
 	WRITE_NODE_FIELD(targetlist);
 	WRITE_NODE_FIELD(qual);
 	WRITE_NODE_FIELD(lefttree);
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 94ba6dc..7781ec0 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1413,6 +1413,7 @@ ReadCommonPlan(Plan *local_node)
 	READ_FLOAT_FIELD(plan_rows);
 	READ_INT_FIELD(plan_width);
 	READ_INT_FIELD(plan_node_id);
+	READ_BOOL_FIELD(parallel_aware);
 	READ_NODE_FIELD(targetlist);
 	READ_NODE_FIELD(qual);
 	READ_NODE_FIELD(lefttree);
diff --git a/src/backend/optimizer/path/Makefile b/src/backend/optimizer/path/Makefile
index 6864a62..6e462b1 100644
--- a/src/backend/optimizer/path/Makefile
+++ b/src/backend/optimizer/path/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = allpaths.o clausesel.o costsize.o equivclass.o indxpath.o \
-       joinpath.o joinrels.o pathkeys.o tidpath.o
+       joinpath.o joinrels.o pathkeys.o parallelpath.o tidpath.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 8fc1cfd..8a00d7a 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -475,7 +475,10 @@ set_plain_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
 	required_outer = rel->lateral_relids;
 
 	/* Consider sequential scan */
-	add_path(rel, create_seqscan_path(root, rel, required_outer));
+	add_path(rel, create_seqscan_path(root, rel, required_outer, 0));
+
+	/* Consider parallel scans */
+	create_parallelscan_paths(root, rel, required_outer);
 
 	/* Consider index scans */
 	create_index_paths(root, rel);
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 1b61fd9..7b82d7b 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -181,10 +181,13 @@ clamp_row_est(double nrows)
  *
  * 'baserel' is the relation to be scanned
  * 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
+ * 'nworkers' is the number of workers among which the work will be
+ *			distributed if the scan is a parallel scan
  */
 void
 cost_seqscan(Path *path, PlannerInfo *root,
-			 RelOptInfo *baserel, ParamPathInfo *param_info)
+			 RelOptInfo *baserel, ParamPathInfo *param_info,
+			 int nworkers)
 {
 	Cost		startup_cost = 0;
 	Cost		run_cost = 0;
@@ -222,6 +225,26 @@ cost_seqscan(Path *path, PlannerInfo *root,
 	cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple;
 	run_cost += cpu_per_tuple * baserel->tuples;
 
+	/* account for the parallel workers */
+	if (nworkers > 0)
+	{
+		/*
+		 * Charge a small cost for communication related to the scan via
+		 * the ParallelHeapScanDesc.
+		 */
+		run_cost += 0.01;
+
+		/*
+		 * Runtime cost will be shared equally by all workers.  The assumption
+		 * here is that disk access cost will also be shared equally between
+		 * workers, which is generally true unless there are too many workers
+		 * working on a relatively small number of blocks.  If we come across
+		 * any such case, we can think of changing the current cost model for
+		 * parallel sequential scan.
+		 */
+		 */
+		run_cost = run_cost / (nworkers + 1);
+	}
+
 	path->startup_cost = startup_cost;
 	path->total_cost = startup_cost + run_cost;
 }
diff --git a/src/backend/optimizer/path/parallelpath.c b/src/backend/optimizer/path/parallelpath.c
new file mode 100644
index 0000000..1e84bdc
--- /dev/null
+++ b/src/backend/optimizer/path/parallelpath.c
@@ -0,0 +1,132 @@
+/*-------------------------------------------------------------------------
+ *
+ * parallelpath.c
+ *	  Routines to determine parallel paths for scanning a given relation.
+ *
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/optimizer/path/parallelpath.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heapam.h"
+#include "optimizer/clauses.h"
+#include "optimizer/cost.h"
+#include "optimizer/pathnode.h"
+#include "optimizer/paths.h"
+#include "parser/parsetree.h"
+#include "utils/rel.h"
+
+
+/*
+ * expr_is_parallel_safe
+ *	  is a particular expression parallel safe
+ *
+ * Conditions checked here:
+ *
+ * 1. The expression must not contain any parallel unsafe or parallel
+ * restricted functions.
+ *
+ * 2. The expression must not contain any initplan or subplan.  We can
+ * probably remove this restriction once we have support of infrastructure
+ * for execution of initplans and subplans at parallel (Gather) nodes.
+ */
+bool
+expr_is_parallel_safe(Node *node)
+{
+	if (check_parallel_safety(node, false))
+		return false;
+
+	if (contain_subplans_or_initplans(node))
+		return false;
+
+	return true;
+}
+
+/*
+ * create_parallelscan_paths
+ *	  Create paths corresponding to parallel scans of the given rel.
+ *	  Currently we only support partial sequential scan.
+ *
+ *	  Candidate paths are added to the rel's pathlist (using add_path).
+ */
+void
+create_parallelscan_paths(PlannerInfo *root, RelOptInfo *rel,
+						  Relids required_outer)
+{
+	int			num_parallel_workers = 0;
+	int			estimated_parallel_workers = 0;
+	Oid			reloid;
+	Relation	relation;
+	Path	   *subpath;
+	ListCell   *l;
+
+	/*
+	 * parallel scan is possible only if user has set max_parallel_degree
+	 * to a value greater than 0 and the query is parallel-safe.
+	 */
+	if (max_parallel_degree <= 0 || !root->glob->parallelModeOK)
+		return;
+
+	/*
+	 * There should be at least a thousand pages to scan for each worker. This
+	 * number is somewhat arbitrary; however, we don't want to spawn workers
+	 * to scan smaller relations as that will be costly.
+	 */
+	estimated_parallel_workers = rel->pages / 1000;
+
+	if (estimated_parallel_workers <= 0)
+		return;
+
+	reloid = planner_rt_fetch(rel->relid, root)->relid;
+
+	relation = heap_open(reloid, NoLock);
+
+	/*
+	 * Temporary relations can't be scanned by parallel workers as they are
+	 * visible only to local sessions.
+	 */
+	if (RelationUsesLocalBuffers(relation))
+	{
+		heap_close(relation, NoLock);
+		return;
+	}
+
+	heap_close(relation, NoLock);
+
+	/*
+	 * Allow parallel paths only if all the clauses for relation are parallel
+	 * safe.  We can allow execution of parallel restricted clauses in master
+	 * backend, but for that planner should have infrastructure to pull all
+	 * the parallel restricted clauses from below nodes to the Gather node
+	 * which will then execute such clauses in master backend.
+	 */
+	foreach(l, rel->baserestrictinfo)
+	{
+		RestrictInfo *rinfo = (RestrictInfo *) lfirst(l);
+
+		if (!expr_is_parallel_safe((Node *) rinfo->clause))
+			return;
+	}
+
+	num_parallel_workers = Min(max_parallel_degree,
+							   estimated_parallel_workers);
+
+	/*
+	 * Create the partial scan path which each worker backend needs to
+	 * execute.
+	 */
+	subpath = create_seqscan_path(root, rel, required_outer,
+								  num_parallel_workers);
+
+	/* Create the gather path which master backend needs to execute. */
+	add_path(rel, (Path *) create_gather_path(root, rel, subpath,
+											  required_outer,
+											  num_parallel_workers));
+}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 791b64e..ead928e 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -3408,8 +3408,9 @@ order_qual_clauses(PlannerInfo *root, List *clauses)
 }
 
 /*
- * Copy cost and size info from a Path node to the Plan node created from it.
- * The executor usually won't use this info, but it's needed by EXPLAIN.
+ * Copy cost, size and parallel mode info from a Path node to the Plan node
+ * created from it. The executor usually won't use cost and size info, but
+ * that is needed by EXPLAIN.
  */
 static void
 copy_path_costsize(Plan *dest, Path *src)
@@ -3420,6 +3421,7 @@ copy_path_costsize(Plan *dest, Path *src)
 		dest->total_cost = src->total_cost;
 		dest->plan_rows = src->rows;
 		dest->plan_width = src->parent->width;
+		dest->parallel_aware = src->parallel_aware;
 	}
 	else
 	{
@@ -3427,6 +3429,7 @@ copy_path_costsize(Plan *dest, Path *src)
 		dest->total_cost = 0;
 		dest->plan_rows = 0;
 		dest->plan_width = 0;
+		dest->parallel_aware = 0;
 	}
 }
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 536b55e..880bdbd 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -202,13 +202,13 @@ standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
 	glob->hasRowSecurity = false;
 
 	/*
-	 * Assess whether it's feasible to use parallel mode for this query.
-	 * We can't do this in a standalone backend, or if the command will
-	 * try to modify any data, or if this is a cursor operation, or if any
+	 * Assess whether it's feasible to use parallel mode for this query. We
+	 * can't do this in a standalone backend, or if the command will try to
+	 * modify any data, or if this is a cursor operation, or if any
 	 * parallel-unsafe functions are present in the query tree.
 	 *
-	 * For now, we don't try to use parallel mode if we're running inside
-	 * a parallel worker.  We might eventually be able to relax this
+	 * For now, we don't try to use parallel mode if we're running inside a
+	 * parallel worker.  We might eventually be able to relax this
 	 * restriction, but for now it seems best not to have parallel workers
 	 * trying to create their own parallel workers.
 	 *
@@ -225,7 +225,7 @@ standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
 		parse->commandType == CMD_SELECT && !parse->hasModifyingCTE &&
 		parse->utilityStmt == NULL && !IsParallelWorker() &&
 		!IsolationIsSerializable() &&
-		!contain_parallel_unsafe((Node *) parse);
+		!check_parallel_safety((Node *) parse, true);
 
 	/*
 	 * glob->parallelModeOK should tell us whether it's necessary to impose
@@ -238,9 +238,9 @@ standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
 	 *
 	 * (It's been suggested that we should always impose these restrictions
 	 * whenever glob->parallelModeOK is true, so that it's easier to notice
-	 * incorrectly-labeled functions sooner.  That might be the right thing
-	 * to do, but for now I've taken this approach.  We could also control
-	 * this with a GUC.)
+	 * incorrectly-labeled functions sooner.  That might be the right thing to
+	 * do, but for now I've taken this approach.  We could also control this
+	 * with a GUC.)
 	 *
 	 * FIXME: It's assumed that code further down will set parallelModeNeeded
 	 * to true if a parallel path is actually chosen.  Since the core
@@ -4690,7 +4690,7 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
 	comparisonCost = 2.0 * (indexExprCost.startup + indexExprCost.per_tuple);
 
 	/* Estimate the cost of seq scan + sort */
-	seqScanPath = create_seqscan_path(root, rel, NULL);
+	seqScanPath = create_seqscan_path(root, rel, NULL, 0);
 	cost_sort(&seqScanAndSortPath, root, NIL,
 			  seqScanPath->total_cost, rel->tuples, rel->width,
 			  comparisonCost, maintenance_work_mem, -1.0);
diff --git a/src/backend/optimizer/util/clauses.c b/src/backend/optimizer/util/clauses.c
index f2c8551..2355cc6 100644
--- a/src/backend/optimizer/util/clauses.c
+++ b/src/backend/optimizer/util/clauses.c
@@ -87,16 +87,25 @@ typedef struct
 	char	   *prosrc;
 } inline_error_callback_arg;
 
+typedef struct
+{
+	bool		allow_restricted;
+}	check_parallel_safety_arg;
+
 static bool contain_agg_clause_walker(Node *node, void *context);
 static bool count_agg_clauses_walker(Node *node,
 						 count_agg_clauses_context *context);
 static bool find_window_functions_walker(Node *node, WindowFuncLists *lists);
 static bool expression_returns_set_rows_walker(Node *node, double *count);
 static bool contain_subplans_walker(Node *node, void *context);
+static bool contain_subplans_or_initplans_walker(Node *node, void *context);
 static bool contain_mutable_functions_walker(Node *node, void *context);
 static bool contain_volatile_functions_walker(Node *node, void *context);
 static bool contain_volatile_functions_not_nextval_walker(Node *node, void *context);
-static bool contain_parallel_unsafe_walker(Node *node, void *context);
+static bool check_parallel_safety_walker(Node *node,
+							 check_parallel_safety_arg * context);
+static bool parallel_too_dangerous(char proparallel,
+					   check_parallel_safety_arg * context);
 static bool contain_nonstrict_functions_walker(Node *node, void *context);
 static bool contain_leaked_vars_walker(Node *node, void *context);
 static Relids find_nonnullable_rels_walker(Node *node, bool top_level);
@@ -1204,13 +1213,16 @@ contain_volatile_functions_not_nextval_walker(Node *node, void *context)
  *****************************************************************************/
 
 bool
-contain_parallel_unsafe(Node *node)
+check_parallel_safety(Node *node, bool allow_restricted)
 {
-	return contain_parallel_unsafe_walker(node, NULL);
+	check_parallel_safety_arg context;
+
+	context.allow_restricted = allow_restricted;
+	return check_parallel_safety_walker(node, &context);
 }
 
 static bool
-contain_parallel_unsafe_walker(Node *node, void *context)
+check_parallel_safety_walker(Node *node, check_parallel_safety_arg * context)
 {
 	if (node == NULL)
 		return false;
@@ -1218,7 +1230,7 @@ contain_parallel_unsafe_walker(Node *node, void *context)
 	{
 		FuncExpr   *expr = (FuncExpr *) node;
 
-		if (func_parallel(expr->funcid) == PROPARALLEL_UNSAFE)
+		if (parallel_too_dangerous(func_parallel(expr->funcid), context))
 			return true;
 		/* else fall through to check args */
 	}
@@ -1227,7 +1239,7 @@ contain_parallel_unsafe_walker(Node *node, void *context)
 		OpExpr	   *expr = (OpExpr *) node;
 
 		set_opfuncid(expr);
-		if (func_parallel(expr->opfuncid) == PROPARALLEL_UNSAFE)
+		if (parallel_too_dangerous(func_parallel(expr->opfuncid), context))
 			return true;
 		/* else fall through to check args */
 	}
@@ -1236,7 +1248,7 @@ contain_parallel_unsafe_walker(Node *node, void *context)
 		DistinctExpr *expr = (DistinctExpr *) node;
 
 		set_opfuncid((OpExpr *) expr);	/* rely on struct equivalence */
-		if (func_parallel(expr->opfuncid) == PROPARALLEL_UNSAFE)
+		if (parallel_too_dangerous(func_parallel(expr->opfuncid), context))
 			return true;
 		/* else fall through to check args */
 	}
@@ -1245,7 +1257,7 @@ contain_parallel_unsafe_walker(Node *node, void *context)
 		NullIfExpr *expr = (NullIfExpr *) node;
 
 		set_opfuncid((OpExpr *) expr);	/* rely on struct equivalence */
-		if (func_parallel(expr->opfuncid) == PROPARALLEL_UNSAFE)
+		if (parallel_too_dangerous(func_parallel(expr->opfuncid), context))
 			return true;
 		/* else fall through to check args */
 	}
@@ -1254,7 +1266,7 @@ contain_parallel_unsafe_walker(Node *node, void *context)
 		ScalarArrayOpExpr *expr = (ScalarArrayOpExpr *) node;
 
 		set_sa_opfuncid(expr);
-		if (func_parallel(expr->opfuncid) == PROPARALLEL_UNSAFE)
+		if (parallel_too_dangerous(func_parallel(expr->opfuncid), context))
 			return true;
 		/* else fall through to check args */
 	}
@@ -1268,12 +1280,12 @@ contain_parallel_unsafe_walker(Node *node, void *context)
 		/* check the result type's input function */
 		getTypeInputInfo(expr->resulttype,
 						 &iofunc, &typioparam);
-		if (func_parallel(iofunc) == PROPARALLEL_UNSAFE)
+		if (parallel_too_dangerous(func_parallel(iofunc), context))
 			return true;
 		/* check the input type's output function */
 		getTypeOutputInfo(exprType((Node *) expr->arg),
 						  &iofunc, &typisvarlena);
-		if (func_parallel(iofunc) == PROPARALLEL_UNSAFE)
+		if (parallel_too_dangerous(func_parallel(iofunc), context))
 			return true;
 		/* else fall through to check args */
 	}
@@ -1282,7 +1294,7 @@ contain_parallel_unsafe_walker(Node *node, void *context)
 		ArrayCoerceExpr *expr = (ArrayCoerceExpr *) node;
 
 		if (OidIsValid(expr->elemfuncid) &&
-			func_parallel(expr->elemfuncid) == PROPARALLEL_UNSAFE)
+			parallel_too_dangerous(func_parallel(expr->elemfuncid), context))
 			return true;
 		/* else fall through to check args */
 	}
@@ -1294,28 +1306,77 @@ contain_parallel_unsafe_walker(Node *node, void *context)
 
 		foreach(opid, rcexpr->opnos)
 		{
-			if (op_volatile(lfirst_oid(opid)) == PROPARALLEL_UNSAFE)
+			if (parallel_too_dangerous(func_parallel(get_opcode(lfirst_oid(opid))), context))
 				return true;
 		}
 		/* else fall through to check args */
 	}
 	else if (IsA(node, Query))
 	{
-		Query *query = (Query *) node;
+		Query	   *query = (Query *) node;
 
 		if (query->rowMarks != NULL)
 			return true;
 
 		/* Recurse into subselects */
 		return query_tree_walker(query,
-								 contain_parallel_unsafe_walker,
+								 check_parallel_safety_walker,
 								 context, 0);
 	}
 	return expression_tree_walker(node,
-								  contain_parallel_unsafe_walker,
+								  check_parallel_safety_walker,
 								  context);
 }
 
+static bool
+parallel_too_dangerous(char proparallel, check_parallel_safety_arg * context)
+{
+	if (context->allow_restricted)
+		return proparallel == PROPARALLEL_UNSAFE;
+	else
+		return proparallel != PROPARALLEL_SAFE;
+}
+
+/*
+ * contain_subplans_or_initplans
+ *	  Recursively search for initplan or subplan nodes within a clause.
+ *
+ * A special purpose function for prohibiting subplan or initplan clauses
+ * in parallel query constructs.
+ *
+ * If we see any form of SubPlan node, we return true.  For InitPlans, we
+ * return true when we see a PARAM_EXEC Param node.  An InitPlan can also
+ * appear as a simple NULL constant for a MULTIEXPR subquery (see comments
+ * in make_subplan), but we need not care about that case because it is
+ * only possible for an UPDATE statement, which is prohibited anyway.
+ *
+ * Returns true if any subplan or initplan is found.
+ */
+bool
+contain_subplans_or_initplans(Node *clause)
+{
+	return contain_subplans_or_initplans_walker(clause, NULL);
+}
+
+static bool
+contain_subplans_or_initplans_walker(Node *node, void *context)
+{
+	if (node == NULL)
+		return false;
+	if (IsA(node, SubPlan) ||
+		IsA(node, AlternativeSubPlan) ||
+		IsA(node, SubLink))
+		return true;			/* abort the tree traversal and return true */
+	else if (IsA(node, Param))
+	{
+		Param	   *paramval = (Param *) node;
+
+		if (paramval->paramkind == PARAM_EXEC)
+			return true;
+	}
+	return expression_tree_walker(node, contain_subplans_or_initplans_walker, context);
+}
+
 /*****************************************************************************
  *		Check clauses for nonstrict functions
  *****************************************************************************/
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 1895a68..342020d 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -696,7 +696,8 @@ add_path_precheck(RelOptInfo *parent_rel,
  *	  pathnode.
  */
 Path *
-create_seqscan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer)
+create_seqscan_path(PlannerInfo *root, RelOptInfo *rel,
+					Relids required_outer, int nworkers)
 {
 	Path	   *pathnode = makeNode(Path);
 
@@ -706,7 +707,10 @@ create_seqscan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer)
 													 required_outer);
 	pathnode->pathkeys = NIL;	/* seqscan has unordered result */
 
-	cost_seqscan(pathnode, root, rel, pathnode->param_info);
+	/* presence of workers indicate that this is parallel path */
+	pathnode->parallel_aware = nworkers ? true : false;
+
+	cost_seqscan(pathnode, root, rel, pathnode->param_info, nworkers);
 
 	return pathnode;
 }
@@ -1798,7 +1802,7 @@ reparameterize_path(PlannerInfo *root, Path *path,
 	switch (path->pathtype)
 	{
 		case T_SeqScan:
-			return create_seqscan_path(root, rel, required_outer);
+			return create_seqscan_path(root, rel, required_outer, 0);
 		case T_SampleScan:
 			return (Path *) create_samplescan_path(root, rel, required_outer);
 		case T_IndexScan:
diff --git a/src/include/executor/nodeSeqscan.h b/src/include/executor/nodeSeqscan.h
index 39d12a6..d0dc1f6 100644
--- a/src/include/executor/nodeSeqscan.h
+++ b/src/include/executor/nodeSeqscan.h
@@ -20,5 +20,8 @@ extern SeqScanState *ExecInitSeqScan(SeqScan *node, EState *estate, int eflags);
 extern TupleTableSlot *ExecSeqScan(SeqScanState *node);
 extern void ExecEndSeqScan(SeqScanState *node);
 extern void ExecReScanSeqScan(SeqScanState *node);
+extern void ExecSeqScanEstimate(SeqScanState *node, ParallelContext *pcxt);
+extern void ExecSeqScanInitializeDSM(SeqScanState *node, ParallelContext *pcxt);
+extern void ExecSeqScanInitParallelScanDesc(SeqScanState *node, shm_toc *toc);
 
 #endif   /* NODESEQSCAN_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 58ec889..1f52132 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -16,6 +16,7 @@
 
 #include "access/genam.h"
 #include "access/heapam.h"
+#include "access/parallel.h"
 #include "executor/instrument.h"
 #include "lib/pairingheap.h"
 #include "nodes/params.h"
@@ -1249,10 +1250,28 @@ typedef struct ScanState
 } ScanState;
 
 /*
- * SeqScan uses a bare ScanState as its state node, since it needs
- * no additional fields.
+ * SeqScanState extends ScanState by storing additional information
+ * related to parallel scan for parallel-aware sequential scan plan.
  */
-typedef ScanState SeqScanState;
+typedef struct SeqScanState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	ParallelHeapScanDesc	pscan;	/* parallel heap scan descriptor
+									 * for partial scan */
+	Size		pscan_len;		/* size of parallel heap scan descriptor */
+} SeqScanState;
+
+/*
+ * PartialSeqScanState extends ScanState by storing additional information
+ * related to scan.
+ */
+typedef struct PartialSeqScanState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	ParallelHeapScanDesc	pscan;	/* parallel heap scan descriptor
+									 * for partial scan */
+	Size		pscan_len;		/* size of parallel heap scan descriptor */
+} PartialSeqScanState;
 
 /* ----------------
  *	 SampleScanState information
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 6b28c8e..ed79dae 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -72,7 +72,7 @@ typedef struct PlannedStmt
 
 	bool		hasRowSecurity; /* row security applied? */
 
-	bool		parallelModeNeeded; /* parallel mode required to execute? */
+	bool		parallelModeNeeded;		/* parallel mode required to execute? */
 } PlannedStmt;
 
 /* macro for fetching the Plan associated with a SubPlan node */
@@ -112,6 +112,7 @@ typedef struct Plan
 	 * Common structural data for all Plan types.
 	 */
 	int			plan_node_id;	/* unique across entire final plan tree */
+	bool		parallel_aware; /* is this plan node parallel-aware? */
 	List	   *targetlist;		/* target list to be computed at this node */
 	List	   *qual;			/* implicitly-ANDed qual conditions */
 	struct Plan *lefttree;		/* input plan tree(s) */
@@ -287,6 +288,12 @@ typedef struct Scan
 typedef Scan SeqScan;
 
 /* ----------------
+ *		partial sequential scan node
+ * ----------------
+ */
+typedef SeqScan PartialSeqScan;
+
+/* ----------------
  *		table sample scan node
  * ----------------
  */
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 6cf2e24..f11099f 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -754,6 +754,8 @@ typedef struct Path
 	RelOptInfo *parent;			/* the relation this path can build */
 	ParamPathInfo *param_info;	/* parameterization info, or NULL if none */
 
+	bool		parallel_aware; /* is this path parallel-aware? */
+
 	/* estimated size/costs for path (see costsize.c for more info) */
 	double		rows;			/* estimated number of result tuples */
 	Cost		startup_cost;	/* cost expended before fetching any tuples */
diff --git a/src/include/optimizer/clauses.h b/src/include/optimizer/clauses.h
index 5ac79b1..747b05b 100644
--- a/src/include/optimizer/clauses.h
+++ b/src/include/optimizer/clauses.h
@@ -62,7 +62,8 @@ extern bool contain_subplans(Node *clause);
 extern bool contain_mutable_functions(Node *clause);
 extern bool contain_volatile_functions(Node *clause);
 extern bool contain_volatile_functions_not_nextval(Node *clause);
-extern bool contain_parallel_unsafe(Node *node);
+extern bool check_parallel_safety(Node *node, bool allow_restricted);
+extern bool contain_subplans_or_initplans(Node *clause);
 extern bool contain_nonstrict_functions(Node *clause);
 extern bool contain_leaked_vars(Node *clause);
 
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 25a7303..ac21a3a 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -72,7 +72,7 @@ extern double clamp_row_est(double nrows);
 extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,
 					double index_pages, PlannerInfo *root);
 extern void cost_seqscan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
-			 ParamPathInfo *param_info);
+			 ParamPathInfo *param_info, int nworkers);
 extern void cost_samplescan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 				ParamPathInfo *param_info);
 extern void cost_index(IndexPath *path, PlannerInfo *root,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 7a4940c..f28b4e2 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -31,7 +31,7 @@ extern bool add_path_precheck(RelOptInfo *parent_rel,
 				  List *pathkeys, Relids required_outer);
 
 extern Path *create_seqscan_path(PlannerInfo *root, RelOptInfo *rel,
-					Relids required_outer);
+					Relids required_outer, int nworkers);
 extern Path *create_samplescan_path(PlannerInfo *root, RelOptInfo *rel,
 					   Relids required_outer);
 extern IndexPath *create_index_path(PlannerInfo *root,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 87123a5..6cd4479 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -55,6 +55,15 @@ extern void debug_print_rel(PlannerInfo *root, RelOptInfo *rel);
 #endif
 
 /*
+ * parallelpath.c
+ *	  routines to generate parallel scan paths
+ */
+
+extern void create_parallelscan_paths(PlannerInfo *root, RelOptInfo *rel,
+						  Relids required_outer);
+extern bool expr_is_parallel_safe(Node *node);
+
+/*
  * indxpath.c
  *	  routines to generate index paths
  */
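
For readers skimming the patch, the core of the new safety check is the small parallel_too_dangerous() helper added above. What follows is only a minimal standalone sketch of its decision table, not the patch itself. The PROPARALLEL_* values mirror the pg_proc.proparallel markers; safety_ctx is a hypothetical stand-in for the patch's check_parallel_safety_arg, reduced to the one field used here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Values mirror PostgreSQL's pg_proc.proparallel markers. */
#define PROPARALLEL_SAFE        's'
#define PROPARALLEL_RESTRICTED  'r'
#define PROPARALLEL_UNSAFE      'u'

/* Hypothetical stand-in for check_parallel_safety_arg (one field only). */
typedef struct
{
    bool allow_restricted;  /* may parallel-restricted items be accepted? */
} safety_ctx;

/*
 * Returns true when a function or operator with the given marker must
 * block the plan shape that the caller is checking.
 */
bool
parallel_too_dangerous(char proparallel, const safety_ctx *context)
{
    assert(context != NULL);

    if (context->allow_restricted)
        return proparallel == PROPARALLEL_UNSAFE;   /* only 'u' is rejected */
    else
        return proparallel != PROPARALLEL_SAFE;     /* 'r' and 'u' are rejected */
}
```

With allow_restricted set, only PROPARALLEL_UNSAFE is rejected; without it, anything not marked PROPARALLEL_SAFE is rejected.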
#443Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#442)
Re: Parallel Seq Scan

On Mon, Nov 9, 2015 at 11:15 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Okay, I have updated the patch to make seq scan node parallel aware.
To make that happen we need to have parallel_aware flag both in Plan
as well as Path, so that we can pass that information from Path to Plan.
I think the right place to copy parallel_aware info from path to
plan is copy_path_costsize, and ideally we should change the name
of the function to something like copy_generic_path_info(), but for
now I have retained its original name as it is used in many places;
let me know if you think we should go ahead and change the name
of the function as well.

I have changed Explain as well so that it adds Parallel for Seq Scan if
SeqScan node is parallel_aware.

I have not integrated it with the consider-parallel patch, so that this and
the Partial Seq Scan version of the patch can be compared without much
difficulty.

Thoughts?

I've committed most of this, except for some planner bits that I
didn't like, and after a bunch of cleanup. Instead, I committed the
consider-parallel-v2.patch with some additional planner bits to make
up for the ones I removed from your patch. So, now we have parallel
sequential scan!

For those following along at home, here's a demo:

rhaas=# \timing
Timing is on.
rhaas=# select * from pgbench_accounts where filler like '%a%';
aid | bid | abalance | filler
-----+-----+----------+--------
(0 rows)

Time: 743.061 ms
rhaas=# set max_parallel_degree = 4;
SET
Time: 0.270 ms
rhaas=# select * from pgbench_accounts where filler like '%a%';
aid | bid | abalance | filler
-----+-----+----------+--------
(0 rows)

Time: 213.412 ms

This is all pretty primitive at this point - there are still lots of
things that need to be fixed and improved, and it applies to only the
very simplest cases at present, but, hey, parallel query. Check it
out.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#444Amit Langote
amitlangote09@gmail.com
In reply to: Robert Haas (#443)
Re: Parallel Seq Scan

On Wed, Nov 11, 2015 at 11:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Yay! Great work guys!

Thanks,
Amit

#445Thom Brown
thom@linux.com
In reply to: Robert Haas (#443)
Re: Parallel Seq Scan

On 11 November 2015 at 14:53, Robert Haas <robertmhaas@gmail.com> wrote:

Congratulations to both you and Amit. This is a significant landmark
in PostgreSQL feature development.

Thom

#446Pavel Stehule
pavel.stehule@gmail.com
In reply to: Thom Brown (#445)
Re: Parallel Seq Scan

2015-11-11 16:18 GMT+01:00 Thom Brown <thom@linux.com>:

Congratulations to both you and Amit. This is a significant landmark
in PostgreSQL feature development.

+1

Pavel

#447Pavel Stehule
pavel.stehule@gmail.com
In reply to: Pavel Stehule (#446)
Re: Parallel Seq Scan

Hi

I have a first query

I looked at the EXPLAIN ANALYZE output, and the numbers of filtered rows
differ:

postgres=# set max_parallel_degree to 4;
SET
Time: 0.717 ms
postgres=# EXPLAIN ANALYZE select count(*) from xxx where a % 10 = 0;
QUERY PLAN
------------------------------------------------------------------------------
 Aggregate  (cost=9282.50..9282.51 rows=1 width=0) (actual time=142.541..142.541 rows=1 loops=1)
   ->  Gather  (cost=1000.00..9270.00 rows=5000 width=0) (actual time=0.633..130.926 rows=100000 loops=1)
         Number of Workers: 2
         ->  Parallel Seq Scan on xxx  (cost=0.00..7770.00 rows=5000 width=0) (actual time=0.052..411.303 rows=169631 loops=1)
               Filter: ((a % 10) = 0)
               Rows Removed by Filter: 1526399
 Planning time: 0.167 ms
 Execution time: 144.519 ms
(8 rows)

Time: 145.374 ms
postgres=# set max_parallel_degree to 1;
SET
Time: 0.706 ms
postgres=# EXPLAIN ANALYZE select count(*) from xxx where a % 10 = 0;
QUERY PLAN
------------------------------------------------------------------------------
 Aggregate  (cost=14462.50..14462.51 rows=1 width=0) (actual time=163.355..163.355 rows=1 loops=1)
   ->  Gather  (cost=1000.00..14450.00 rows=5000 width=0) (actual time=0.485..152.827 rows=100000 loops=1)
         Number of Workers: 1
         ->  Parallel Seq Scan on xxx  (cost=0.00..12950.00 rows=5000 width=0) (actual time=0.043..309.740 rows=145364 loops=1)
               Filter: ((a % 10) = 0)
               Rows Removed by Filter: 1308394
 Planning time: 0.129 ms
 Execution time: 165.102 ms
(8 rows)

Rows removed by filter: 1308394 vs. 1526399. Is this expected?

#448Thom Brown
thom@linux.com
In reply to: Pavel Stehule (#447)
Re: Parallel Seq Scan

On 11 November 2015 at 17:59, Pavel Stehule <pavel.stehule@gmail.com> wrote:

Rows removed by filter: 1308394 vs. 1526399. Is this expected?

Yeah, I noticed the same thing, but more pronounced:

With set max_parallel_degree = 4:

# explain (analyse, buffers, timing, verbose, costs) select count(*)
from js where content->'tags'->0->>'term' like 'design%' or
content->'tags'->0->>'term' like 'web%';

QUERY PLAN
------------------------------------------------------------------------------
 Aggregate  (cost=49575.51..49575.52 rows=1 width=0) (actual time=744.267..744.267 rows=1 loops=1)
   Output: count(*)
   Buffers: shared hit=175423
   ->  Gather  (cost=1000.00..49544.27 rows=12496 width=0) (actual time=0.351..731.662 rows=55151 loops=1)
         Output: content
         Number of Workers: 4
         Buffers: shared hit=175423
         ->  Parallel Seq Scan on public.js  (cost=0.00..47294.67 rows=12496 width=0) (actual time=0.030..5912.118 rows=96062 loops=1)
               Output: content
               Filter: (((((js.content -> 'tags'::text) -> 0) ->> 'term'::text) ~~ 'design%'::text) OR ((((js.content -> 'tags'::text) -> 0) ->> 'term'::text) ~~ 'web%'::text))
               Rows Removed by Filter: 2085546
               Buffers: shared hit=305123
 Planning time: 0.123 ms
 Execution time: 759.313 ms
(14 rows)

With set max_parallel_degree = 0:

# explain (analyse, buffers, timing, verbose, costs) select count(*)
from js where content->'tags'->0->>'term' like 'design%' or
content->'tags'->0->>'term' like 'web%';

QUERY PLAN
------------------------------------------------------------------------------
 Aggregate  (cost=212857.25..212857.26 rows=1 width=0) (actual time=1235.082..1235.082 rows=1 loops=1)
   Output: count(*)
   Buffers: shared hit=175243
   ->  Seq Scan on public.js  (cost=0.00..212826.01 rows=12496 width=0) (actual time=0.019..1228.515 rows=55151 loops=1)
         Output: content
         Filter: (((((js.content -> 'tags'::text) -> 0) ->> 'term'::text) ~~ 'design%'::text) OR ((((js.content -> 'tags'::text) -> 0) ->> 'term'::text) ~~ 'web%'::text))
         Rows Removed by Filter: 1197822
         Buffers: shared hit=175243
 Planning time: 0.064 ms
 Execution time: 1235.108 ms
(10 rows)

Time: 1235.517 ms

Rows removed: 2085546 vs 1197822
Buffers hit: 305123 vs 175243

Thom

#449Pavel Stehule
pavel.stehule@gmail.com
In reply to: Thom Brown (#448)
Re: Parallel Seq Scan

2015-11-11 19:03 GMT+01:00 Thom Brown <thom@linux.com>:

Rows removed: 2085546 vs 1197822
Buffers hit: 305123 vs 175243

Yes. Another slightly unclear thing in EXPLAIN is the number of workers. If I
understand the behavior correctly, the query is processed by two processes
when the number of workers shown in EXPLAIN is one.

Regards

Pavel

#450Robert Haas
robertmhaas@gmail.com
In reply to: Pavel Stehule (#447)
Re: Parallel Seq Scan

On Wed, Nov 11, 2015 at 12:59 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:

I have a first query

I looked at the EXPLAIN ANALYZE output, and the numbers of filtered rows
differ

Hmm, I see I was right about people finding more bugs once this was
committed. That didn't take long.

There's supposed to be code to handle this - see the
SharedPlanStateInstrumentation stuff in execParallel.c - but it's
evidently a few bricks shy of a load.
ExecParallelReportInstrumentation is supposed to transfer the counts
from each worker to the DSM:

ps_instrument = &instrumentation->ps_instrument[i];
SpinLockAcquire(&ps_instrument->mutex);
InstrAggNode(&ps_instrument->instr, planstate->instrument);
SpinLockRelease(&ps_instrument->mutex);

And ExecParallelRetrieveInstrumentation is supposed to slurp those
counts back into the leader's PlanState objects:

/* No need to acquire the spinlock here; workers have exited already. */
ps_instrument = &instrumentation->ps_instrument[i];
InstrAggNode(planstate->instrument, &ps_instrument->instr);

This might be a race condition, or it might be just wrong logic.
Could you test what happens if you insert something like a 1-second
sleep in ExecParallelFinish just after the call to
WaitForParallelWorkersToFinish()? If that makes the results
consistent, this is a race. If it doesn't, something else is wrong:
then it would be useful to know whether the workers are actually
calling ExecParallelReportInstrumentation, and whether the leader is
actually calling ExecParallelRetrieveInstrumentation, and if so
whether they are doing it for the correct set of nodes.
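
To make the intended flow concrete, here is a minimal standalone sketch of the two steps described above: each worker folds its per-node counters into a shared slot under a spinlock, and the leader folds the shared totals into its own counters once the workers are gone. This is not the execParallel.c code. NodeCounters, SharedNodeSlot, worker_report() and leader_retrieve() are hypothetical names, and a C11 atomic_flag stands in for PostgreSQL's spinlock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical, cut-down counterpart of a per-node instrumentation record. */
typedef struct
{
    double ntuples;     /* rows emitted by the node */
    double nfiltered;   /* rows removed by the node's filter */
} NodeCounters;

/* One shared slot per plan node, protected by a tiny spinlock. */
typedef struct
{
    atomic_flag  mutex;
    NodeCounters instr;
} SharedNodeSlot;

/* Add one set of counters into another. */
void
counters_add(NodeCounters *dst, const NodeCounters *add)
{
    dst->ntuples += add->ntuples;
    dst->nfiltered += add->nfiltered;
}

/* Zero all shared slots before any worker starts. */
void
shared_init(SharedNodeSlot *slots, size_t nnodes)
{
    for (size_t i = 0; i < nnodes; i++)
    {
        atomic_flag_clear(&slots[i].mutex);
        slots[i].instr.ntuples = 0;
        slots[i].instr.nfiltered = 0;
    }
}

/* Worker side: fold this worker's counters for node i into the shared slot. */
void
worker_report(SharedNodeSlot *slots, size_t i, const NodeCounters *local)
{
    assert(local != NULL);

    while (atomic_flag_test_and_set(&slots[i].mutex))
        ;                       /* spin until the lock is free */
    counters_add(&slots[i].instr, local);
    atomic_flag_clear(&slots[i].mutex);
}

/*
 * Leader side: called once per node after all workers have exited, so no
 * lock is taken.  Calling it more than once would double-count.
 */
void
leader_retrieve(NodeCounters *leader, const SharedNodeSlot *slots, size_t i)
{
    counters_add(leader, &slots[i].instr);
}
```

If every row is scanned by exactly one process and each step runs exactly once per node, the leader's totals should match those of a serial scan.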

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#451Pavel Stehule
pavel.stehule@gmail.com
In reply to: Robert Haas (#450)
Re: Parallel Seq Scan

2015-11-11 20:26 GMT+01:00 Robert Haas <robertmhaas@gmail.com>:

Hmm, I see I was right about people finding more bugs once this was
committed. That didn't take long.

It is a super feature, nobody can wait to try it :). Many more people
can give feedback and run tests now.

This might be a race condition, or it might be just wrong logic.
Could you test what happens if you insert something like a 1-second
sleep in ExecParallelFinish just after the call to
WaitForParallelWorkersToFinish()? If that makes the results
consistent, this is a race. If it doesn't, something else is wrong:
then it would be useful to know whether the workers are actually
calling ExecParallelReportInstrumentation, and whether the leader is
actually calling ExecParallelRetrieveInstrumentation, and if so
whether they are doing it for the correct set of nodes.

I added pg_usleep(1000000L) there, without success:

postgres=# EXPLAIN ANALYZE select count(*) from xxx where a % 10 = 0;
QUERY
PLAN
═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
Aggregate (cost=9282.50..9282.51 rows=1 width=0) (actual
time=154.535..154.535 rows=1 loops=1)
-> Gather (cost=1000.00..9270.00 rows=5000 width=0) (actual
time=0.675..142.320 rows=100000 loops=1)
Number of Workers: 2
-> Parallel Seq Scan on xxx (cost=0.00..7770.00 rows=5000
width=0) (actual time=0.075..445.999 rows=168927 loops=1)
Filter: ((a % 10) = 0)
Rows Removed by Filter: 1520549
Planning time: 0.117 ms
Execution time: 1155.505 ms
(8 rows)

expected

postgres=# EXPLAIN ANALYZE select count(*) from xxx where a % 10 = 0;
QUERY
PLAN
═══════════════════════════════════════════════════════════════════════════════════════════════════════════════
Aggregate (cost=19437.50..19437.51 rows=1 width=0) (actual
time=171.233..171.233 rows=1 loops=1)
-> Seq Scan on xxx (cost=0.00..19425.00 rows=5000 width=0) (actual
time=0.187..162.627 rows=100000 loops=1)
Filter: ((a % 10) = 0)
Rows Removed by Filter: 900000
Planning time: 0.119 ms
Execution time: 171.322 ms
(6 rows)

The test is based on table xxx:

create table xxx(a int);
insert into xxx select generate_series(1,1000000);

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#452Thom Brown
thom@linux.com
In reply to: Robert Haas (#450)
Re: Parallel Seq Scan

On 11 November 2015 at 19:26, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Nov 11, 2015 at 12:59 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:

I have a first query

I looked at the EXPLAIN ANALYZE output and the numbers of filtered rows are
different

Hmm, I see I was right about people finding more bugs once this was
committed. That didn't take long.

There's supposed to be code to handle this - see the
SharedPlanStateInstrumentation stuff in execParallel.c - but it's
evidently a few bricks shy of a load.
ExecParallelReportInstrumentation is supposed to transfer the counts
from each worker to the DSM:

ps_instrument = &instrumentation->ps_instrument[i];
SpinLockAcquire(&ps_instrument->mutex);
InstrAggNode(&ps_instrument->instr, planstate->instrument);
SpinLockRelease(&ps_instrument->mutex);

And ExecParallelRetrieveInstrumentation is supposed to slurp those
counts back into the leader's PlanState objects:

/* No need to acquire the spinlock here; workers have exited already. */
ps_instrument = &instrumentation->ps_instrument[i];
InstrAggNode(planstate->instrument, &ps_instrument->instr);

This might be a race condition, or it might be just wrong logic.
Could you test what happens if you insert something like a 1-second
sleep in ExecParallelFinish just after the call to
WaitForParallelWorkersToFinish()? If that makes the results
consistent, this is a race. If it doesn't, something else is wrong:
then it would be useful to know whether the workers are actually
calling ExecParallelReportInstrumentation, and whether the leader is
actually calling ExecParallelRetrieveInstrumentation, and if so
whether they are doing it for the correct set of nodes.

Hmm.. I made the change, but clearly it's not sleeping properly with
my change (I'm expecting a total runtime in excess of 1 second):

max_parallel_degree = 4:

# explain (analyse, buffers, timing, verbose, costs) select count(*)
from js where content->'tags'->0->>'term' like 'design%' or
content->'tags'->0->>'term' like 'web%';

QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=49578.18..49578.19 rows=1 width=0) (actual
time=797.518..797.518 rows=1 loops=1)
Output: count(*)
Buffers: shared hit=174883 read=540
-> Gather (cost=1000.00..49546.93 rows=12500 width=0) (actual
time=0.245..784.959 rows=55151 loops=1)
Output: content
Number of Workers: 4
Buffers: shared hit=174883 read=540
-> Parallel Seq Scan on public.js (cost=0.00..47296.93
rows=12500 width=0) (actual time=0.019..6153.679 rows=94503 loops=1)
Output: content
Filter: (((((js.content -> 'tags'::text) -> 0) ->>
'term'::text) ~~ 'design%'::text) OR ((((js.content -> 'tags'::text)
-> 0) ->> 'term'::text) ~~ 'web%'::text))
Rows Removed by Filter: 2051330
Buffers: shared hit=299224 read=907
Planning time: 0.086 ms
Execution time: 803.026 ms

max_parallel_degree = 0:

# explain (analyse, buffers, timing, verbose, costs) select count(*)
from js where content->'tags'->0->>'term' like 'design%' or
content->'tags'->0->>'term' like 'web%';

QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=212867.43..212867.44 rows=1 width=0) (actual
time=1278.717..1278.717 rows=1 loops=1)
Output: count(*)
Buffers: shared hit=174671 read=572
-> Seq Scan on public.js (cost=0.00..212836.18 rows=12500
width=0) (actual time=0.018..1272.030 rows=55151 loops=1)
Output: content
Filter: (((((js.content -> 'tags'::text) -> 0) ->>
'term'::text) ~~ 'design%'::text) OR ((((js.content -> 'tags'::text)
-> 0) ->> 'term'::text) ~~ 'web%'::text))
Rows Removed by Filter: 1197822
Buffers: shared hit=174671 read=572
Planning time: 0.064 ms
Execution time: 1278.741 ms
(10 rows)

Time: 1279.145 ms

I did, however, notice that repeated runs of the query with
max_parallel_degree = 4 yield different counts of rows removed by
filter:

Run 1: 2051330
Run 2: 2081252
Run 3: 2065112
Run 4: 2022045
Run 5: 2025384
Run 6: 2059360
Run 7: 2079620
Run 8: 2058541

--
Thom

#453Thom Brown
thom@linux.com
In reply to: Thom Brown (#452)
Re: Parallel Seq Scan

On 11 November 2015 at 19:51, Thom Brown <thom@linux.com> wrote:

On 11 November 2015 at 19:26, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Nov 11, 2015 at 12:59 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:

I have a first query

I looked at the EXPLAIN ANALYZE output and the numbers of filtered rows are
different

Hmm, I see I was right about people finding more bugs once this was
committed. That didn't take long.

There's supposed to be code to handle this - see the
SharedPlanStateInstrumentation stuff in execParallel.c - but it's
evidently a few bricks shy of a load.
ExecParallelReportInstrumentation is supposed to transfer the counts
from each worker to the DSM:

ps_instrument = &instrumentation->ps_instrument[i];
SpinLockAcquire(&ps_instrument->mutex);
InstrAggNode(&ps_instrument->instr, planstate->instrument);
SpinLockRelease(&ps_instrument->mutex);

And ExecParallelRetrieveInstrumentation is supposed to slurp those
counts back into the leader's PlanState objects:

/* No need to acquire the spinlock here; workers have exited already. */
ps_instrument = &instrumentation->ps_instrument[i];
InstrAggNode(planstate->instrument, &ps_instrument->instr);

This might be a race condition, or it might be just wrong logic.
Could you test what happens if you insert something like a 1-second
sleep in ExecParallelFinish just after the call to
WaitForParallelWorkersToFinish()? If that makes the results
consistent, this is a race. If it doesn't, something else is wrong:
then it would be useful to know whether the workers are actually
calling ExecParallelReportInstrumentation, and whether the leader is
actually calling ExecParallelRetrieveInstrumentation, and if so
whether they are doing it for the correct set of nodes.

Hmm.. I made the change, but clearly it's not sleeping properly with
my change (I'm expecting a total runtime in excess of 1 second):

max_parallel_degree = 4:

# explain (analyse, buffers, timing, verbose, costs) select count(*)
from js where content->'tags'->0->>'term' like 'design%' or
content->'tags'->0->>'term' like 'web%';

QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=49578.18..49578.19 rows=1 width=0) (actual
time=797.518..797.518 rows=1 loops=1)
Output: count(*)
Buffers: shared hit=174883 read=540
-> Gather (cost=1000.00..49546.93 rows=12500 width=0) (actual
time=0.245..784.959 rows=55151 loops=1)
Output: content
Number of Workers: 4
Buffers: shared hit=174883 read=540
-> Parallel Seq Scan on public.js (cost=0.00..47296.93
rows=12500 width=0) (actual time=0.019..6153.679 rows=94503 loops=1)
Output: content
Filter: (((((js.content -> 'tags'::text) -> 0) ->>
'term'::text) ~~ 'design%'::text) OR ((((js.content -> 'tags'::text)
-> 0) ->> 'term'::text) ~~ 'web%'::text))
Rows Removed by Filter: 2051330
Buffers: shared hit=299224 read=907
Planning time: 0.086 ms
Execution time: 803.026 ms

max_parallel_degree = 0:

# explain (analyse, buffers, timing, verbose, costs) select count(*)
from js where content->'tags'->0->>'term' like 'design%' or
content->'tags'->0->>'term' like 'web%';

QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=212867.43..212867.44 rows=1 width=0) (actual
time=1278.717..1278.717 rows=1 loops=1)
Output: count(*)
Buffers: shared hit=174671 read=572
-> Seq Scan on public.js (cost=0.00..212836.18 rows=12500
width=0) (actual time=0.018..1272.030 rows=55151 loops=1)
Output: content
Filter: (((((js.content -> 'tags'::text) -> 0) ->>
'term'::text) ~~ 'design%'::text) OR ((((js.content -> 'tags'::text)
-> 0) ->> 'term'::text) ~~ 'web%'::text))
Rows Removed by Filter: 1197822
Buffers: shared hit=174671 read=572
Planning time: 0.064 ms
Execution time: 1278.741 ms
(10 rows)

Time: 1279.145 ms

I did, however, notice that repeated runs of the query with
max_parallel_degree = 4 yield different counts of rows removed by
filter:

Run 1: 2051330
Run 2: 2081252
Run 3: 2065112
Run 4: 2022045
Run 5: 2025384
Run 6: 2059360
Run 7: 2079620
Run 8: 2058541

Here's another oddity, with max_parallel_degree = 1:

# explain (analyse, buffers, timing, verbose, costs) select count(*)
from js where content->'tags'->>'title' like '%design%';
QUERY
PLAN
------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=132489.34..132489.35 rows=1 width=0) (actual
time=382.987..382.987 rows=1 loops=1)
Output: count(*)
Buffers: shared hit=175288
-> Gather (cost=1000.00..132488.34 rows=401 width=0) (actual
time=382.983..382.983 rows=0 loops=1)
Output: content
Number of Workers: 1
Buffers: shared hit=175288
-> Parallel Seq Scan on public.js (cost=0.00..131448.24
rows=401 width=0) (actual time=379.407..1141.437 rows=0 loops=1)
Output: content
Filter: (((js.content -> 'tags'::text) ->>
'title'::text) ~~ '%design%'::text)
Rows Removed by Filter: 1724810
Buffers: shared hit=241201
Planning time: 0.104 ms
Execution time: 403.045 ms
(14 rows)

Time: 403.596 ms

The actual time of the sequential scan was 1141.437ms, but the total
execution time was 403.045ms.

And successive runs with max_parallel_degree = 1 also yield a
different number of rows removed by the filter, as well as a different
number of buffers being hit:

Run: rows removed / buffers hit
1: 1738517 / 243143
2: 1729361 / 241900
3: 1737168 / 242974
4: 1734440 / 242591

Thom

#454Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Robert Haas (#450)
Re: Parallel Seq Scan

On 2015/11/12 4:26, Robert Haas wrote:

On Wed, Nov 11, 2015 at 12:59 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:

I have a first query

I looked at the EXPLAIN ANALYZE output and the numbers of filtered rows are
different

Hmm, I see I was right about people finding more bugs once this was
committed. That didn't take long.

I encountered one more odd behavior:

postgres=# EXPLAIN ANALYZE SELECT abalance FROM pgbench_accounts WHERE aid
= 23466;
QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------
Gather (cost=1000.00..65207.88 rows=1 width=4) (actual
time=17450.595..17451.151 rows=1 loops=1)
Number of Workers: 4
-> Parallel Seq Scan on pgbench_accounts (cost=0.00..64207.78 rows=1
width=4) (actual time=55.934..157001.134 rows=2 loops=1)
Filter: (aid = 23466)
Rows Removed by Filter: 18047484
Planning time: 0.198 ms
Execution time: 17453.565 ms
(7 rows)

The #rows removed here is almost twice the number of rows in the table
(10m). Also, the #rows selected shown is 2 for Parallel Seq Scan whereas
only 1 row is selected.

Thanks,
Amit

#455Amit Kapila
amit.kapila16@gmail.com
In reply to: Pavel Stehule (#447)
1 attachment(s)
Re: Parallel Seq Scan

On Wed, Nov 11, 2015 at 11:29 PM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

Hi

I have a first query

I looked at the EXPLAIN ANALYZE output and the numbers of filtered rows are
different

Thanks for the report. The reason for this problem is that instrumentation
information from workers is getting aggregated multiple times. In
ExecShutdownGatherWorkers(), we call ExecParallelFinish(), which
waits for the workers to finish and then accumulates stats from them.
Now ExecShutdownGatherWorkers() can be called multiple times
(once when we have read all tuples from the workers, and again at end of
node), so repeated calls must not redo the work done by the first call.
That is already ensured for the tuple queues, but not for the parallel
executor info.
I think we can safely assume that we need to call ExecParallelFinish() only
when there are workers started by the Gather node, so the attached patch,
written along those lines, should fix the problem.
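
The double aggregation described above can be illustrated with a small standalone sketch. It is not the PostgreSQL code: GatherSketch, finish_sketch() and the two shutdown functions are hypothetical stand-ins for GatherState, ExecParallelFinish() and ExecShutdownGatherWorkers(), and the row counts are invented for illustration. It only shows why moving the aggregation inside the "readers still exist" check makes a second shutdown call harmless.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for per-node instrumentation counters. */
typedef struct Instr
{
    long nfiltered;         /* rows removed by filter */
} Instr;

/* Hypothetical stand-in for GatherState; only the fields needed here. */
typedef struct GatherSketch
{
    bool  have_readers;     /* plays the role of node->reader != NULL */
    Instr leader;           /* leader's own counters for the scan node */
    Instr shared;           /* counters the workers left in shared memory */
} GatherSketch;

/* Plays the role of ExecParallelFinish(): fold worker counters into the leader's. */
static void
finish_sketch(GatherSketch *g)
{
    assert(g != NULL);
    g->leader.nfiltered += g->shared.nfiltered;
}

/* Unguarded shutdown: every call aggregates the worker counters again. */
static void
shutdown_unguarded(GatherSketch *g)
{
    g->have_readers = false;
    finish_sketch(g);
}

/* Guarded shutdown: aggregation happens only while readers still exist. */
static void
shutdown_guarded(GatherSketch *g)
{
    if (g->have_readers)
    {
        g->have_readers = false;
        finish_sketch(g);
    }
}
```

With hypothetical counts of 300000 rows filtered by the leader and 600000 by the workers, two unguarded calls leave 1500000 in the leader's counter, while two guarded calls leave 900000.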

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

fix_agg_instr_issue_v1.patch (application/octet-stream)
diff --git a/src/backend/executor/nodeGather.c b/src/backend/executor/nodeGather.c
index b368b48..14b991f 100644
--- a/src/backend/executor/nodeGather.c
+++ b/src/backend/executor/nodeGather.c
@@ -403,11 +403,11 @@ ExecShutdownGatherWorkers(GatherState *node)
 		for (i = 0; i < node->nreaders; ++i)
 			DestroyTupleQueueReader(node->reader[i]);
 		node->reader = NULL;
-	}
 
-	/* Now shut down the workers. */
-	if (node->pei != NULL)
-		ExecParallelFinish(node->pei);
+		/* Now shut down the workers. */
+		if (node->pei != NULL)
+			ExecParallelFinish(node->pei);
+	}
 }
 
 /* ----------------------------------------------------------------
#456Thom Brown
thom@linux.com
In reply to: Amit Kapila (#455)
Re: Parallel Seq Scan

On 12 November 2015 at 15:23, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Nov 11, 2015 at 11:29 PM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

Hi

I have a first query

I looked at the EXPLAIN ANALYZE output and the numbers of filtered rows are
different

Thanks for the report. The reason for this problem is that instrumentation
information from workers is getting aggregated multiple times. In
ExecShutdownGatherWorkers(), we call ExecParallelFinish where it
will wait for workers to finish and then accumulate stats from workers.
Now ExecShutdownGatherWorkers() could be called multiple times
(once we read all tuples from workers, at end of node) and it should be
ensured that repeated calls should not try to redo the work done by first
call.
The same is ensured for tuplequeues, but not for parallel executor info.
I think we can safely assume that we need to call ExecParallelFinish() only
when there are workers started by the Gathers node, so on those lines
attached patch should fix the problem.

That fixes the count issue for me, although not the number of buffers
hit, or the actual time taken.

Thom

#457Amit Kapila
amit.kapila16@gmail.com
In reply to: Thom Brown (#456)
Re: Parallel Seq Scan

On Thu, Nov 12, 2015 at 9:05 PM, Thom Brown <thom@linux.com> wrote:

On 12 November 2015 at 15:23, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Nov 11, 2015 at 11:29 PM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

Hi

I have a first query

I looked at the EXPLAIN ANALYZE output and the numbers of filtered rows are
different

Thanks for the report. The reason for this problem is that instrumentation
information from workers is getting aggregated multiple times. In
ExecShutdownGatherWorkers(), we call ExecParallelFinish where it
will wait for workers to finish and then accumulate stats from workers.
Now ExecShutdownGatherWorkers() could be called multiple times
(once we read all tuples from workers, at end of node) and it should be
ensured that repeated calls should not try to redo the work done by first
call.
The same is ensured for tuplequeues, but not for parallel executor info.
I think we can safely assume that we need to call ExecParallelFinish() only
when there are workers started by the Gathers node, so on those lines
attached patch should fix the problem.

That fixes the count issue for me, although not the number of buffers
hit,

The number of shared buffers hit could be different across different runs
because the read sequence of parallel workers can't be guaranteed. I don't
think it is guaranteed even for a Seq Scan node, since other operations
running in parallel could lead to a different number. However, the actual
problem was that in one of the plans shown by you [1], the Buffers hit at the
Gather node (175288) is less than the Buffers hit at the Parallel Seq Scan
node (241201).
Do you still (after applying the above patch) see the Gather node showing
fewer hit buffers than the Parallel Seq Scan node?

[1]
# explain (analyse, buffers, timing, verbose, costs) select count(*)
from js where content->'tags'->>'title' like '%design%';
QUERY
PLAN
------------------------------------------------------------
------------------------------------------------------------------------
Aggregate (cost=132489.34..132489.35 rows=1 width=0) (actual
time=382.987..382.987 rows=1 loops=1)
Output: count(*)
Buffers: shared hit=175288
-> Gather (cost=1000.00..132488.34 rows=401 width=0) (actual
time=382.983..382.983 rows=0 loops=1)
Output: content
Number of Workers: 1
Buffers: shared hit=175288
-> Parallel Seq Scan on public.js (cost=0.00..131448.24
rows=401 width=0) (actual time=379.407..1141.437 rows=0 loops=1)
Output: content
Filter: (((js.content -> 'tags'::text) ->>
'title'::text) ~~ '%design%'::text)
Rows Removed by Filter: 1724810
Buffers: shared hit=241201
Planning time: 0.104 ms
Execution time: 403.045 ms
(14 rows)

Time: 403.596 ms

or the actual time taken.

Exactly which time are you referring to here: the Execution Time, or the
actual time shown on the Parallel Seq Scan node? And what problem do you see
with the reported time?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#458Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#457)
Re: Parallel Seq Scan

On Thu, Nov 12, 2015 at 10:39 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

The number of shared buffers hit could be different across different runs
because the read sequence of parallel workers can't be guaranteed, also
I don't think same is even guaranteed for Seq Scan node,

The number of hits could be different. However, it seems like any
sequential scan, parallel or not, should have a number of accesses
(hit + read) equal to the size of the relation. Not sure if that's
what is happening here.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#459Thom Brown
thom@linux.com
In reply to: Amit Kapila (#457)
Re: Parallel Seq Scan

On 13 November 2015 at 03:39, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Nov 12, 2015 at 9:05 PM, Thom Brown <thom@linux.com> wrote:

On 12 November 2015 at 15:23, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Nov 11, 2015 at 11:29 PM, Pavel Stehule
<pavel.stehule@gmail.com>
wrote:

Hi

I have a first query

I looked at the EXPLAIN ANALYZE output and the numbers of filtered rows are
different

Thanks for the report. The reason for this problem is that
instrumentation
information from workers is getting aggregated multiple times. In
ExecShutdownGatherWorkers(), we call ExecParallelFinish where it
will wait for workers to finish and then accumulate stats from workers.
Now ExecShutdownGatherWorkers() could be called multiple times
(once we read all tuples from workers, at end of node) and it should be
ensured that repeated calls should not try to redo the work done by
first
call.
The same is ensured for tuplequeues, but not for parallel executor info.
I think we can safely assume that we need to call ExecParallelFinish()
only
when there are workers started by the Gathers node, so on those lines
attached patch should fix the problem.

That fixes the count issue for me, although not the number of buffers
hit,

The number of shared buffers hit could be different across different runs
because the read sequence of parallel workers can't be guaranteed, also
I don't think same is even guaranteed for Seq Scan node, the other
operations
in parallel could lead to different number, however the actual problem was
that in one of the plans shown by you [1], the Buffers hit at Gather node
(175288) is lesser than the Buffers hit at Parallel Seq Scan node (241201).
Do you still (after applying above patch) see that Gather node is showing
lesser hit buffers than Parallel Seq Scan node?

Hmm... that's odd, I'm not seeing the problem now, so maybe I'm mistaken there.

[1]
# explain (analyse, buffers, timing, verbose, costs) select count(*)
from js where content->'tags'->>'title' like '%design%';
QUERY
PLAN
------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=132489.34..132489.35 rows=1 width=0) (actual
time=382.987..382.987 rows=1 loops=1)
Output: count(*)
Buffers: shared hit=175288
-> Gather (cost=1000.00..132488.34 rows=401 width=0) (actual
time=382.983..382.983 rows=0 loops=1)
Output: content
Number of Workers: 1
Buffers: shared hit=175288
-> Parallel Seq Scan on public.js (cost=0.00..131448.24
rows=401 width=0) (actual time=379.407..1141.437 rows=0 loops=1)
Output: content
Filter: (((js.content -> 'tags'::text) ->>
'title'::text) ~~ '%design%'::text)
Rows Removed by Filter: 1724810
Buffers: shared hit=241201
Planning time: 0.104 ms
Execution time: 403.045 ms
(14 rows)

Time: 403.596 ms

or the actual time taken.

Exactly what time you are referring here, Execution Time or actual time
shown on Parallel Seq Scan node and what problem do you see with
the reported time?

I'm referring to the Parallel Seq Scan actual time, showing
"379.407..1141.437" with 1 worker, but the total execution time shows
403.045. If one worker is taking over a second, how come the whole
query was less than half a second?

Thom

#460Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#458)
Re: Parallel Seq Scan

On Fri, Nov 13, 2015 at 10:56 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Nov 12, 2015 at 10:39 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

The number of shared buffers hit could be different across different runs
because the read sequence of parallel workers can't be guaranteed, also
I don't think same is even guaranteed for Seq Scan node,

The number of hits could be different. However, it seems like any
sequential scan, parallel or not, should have a number of accesses
(hit + read) equal to the size of the relation. Not sure if that's
what is happening here.

After the patch provided above to fix the issue reported by Pavel, that is
the behaviour, but I think there are a few more things which we might
want to consider. Just refer to the plan below:

Total pages in table
--------------------------------
postgres=# select relname,relpages from pg_class where relname like '%t2%';
relname | relpages
---------+----------
t2 | 5406
(1 row)

Parallel Plan
-----------------------------
postgres=# explain (analyze,buffers,timing) select count(*) from t2 where c1 % 10 = 0;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=8174.90..8174.91 rows=1 width=0) (actual time=1055.294..1055.294 rows=1 loops=1)
Buffers: shared hit=446 read=5054
-> Gather (cost=0.00..8162.40 rows=5000 width=0) (actual time=79.787..959.651 rows=100000 loops=1)
Number of Workers: 2
Buffers: shared hit=446 read=5054
-> Parallel Seq Scan on t2 (cost=0.00..8162.40 rows=5000 width=0) (actual time=30.771..2518.844 rows=100000 loops=1)
Filter: ((c1 % 10) = 0)
Rows Removed by Filter: 900000
Buffers: shared hit=352 read=5054
Planning time: 0.170 ms
Execution time: 1059.400 ms
(11 rows)

Let's focus on Buffers and actual time in the above plan:

Buffers - At the Parallel Seq Scan node, it shows a total of 5406 (352+5054)
buffers, which tallies with what is expected. However, at the Gather node
it shows 5500 (446+5054). The reason is that we accumulate the overall
buffer usage for the parallel execution of a worker, which includes
start of node as well (refer to ParallelQueryMain()), and that
gets counted towards the buffer calculation of the Gather node too.
The theory behind collecting overall buffer usage for parallel execution
was that we need it for pg_stat_statements, where the stats are accumulated
for the overall execution and not on a node-by-node basis (refer to the
queryDesc->totaltime usage in standard_ExecutorRun()).
I think here we need to decide what is the right value to display at
the Gather node:
1. Display the same number of buffers at Gather node as at Parallel
Seq Scan node.
2. Display the number of buffers at Parallel Seq Scan node plus the
additional buffers used by parallel workers for miscellaneous work
like ExecutorStart(), etc.
3. Don't account for buffers used for parallel workers.
4. Anything better?

Also, in conjunction with the above, we need to decide what should be
accounted for in pg_stat_statements.
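
The buffer figures in the plan above can be checked with a few lines of arithmetic. The sketch below is only an illustration; BufCounts, total_accesses() and extra_accesses() are hypothetical names, not PostgreSQL structures. It adds hit and read counts for a node and computes how many accesses an outer node reports beyond its child.

```c
#include <assert.h>

/* Hypothetical holder for the "Buffers: shared hit=... read=..." figures of one node. */
typedef struct BufCounts
{
    long hit;
    long read;
} BufCounts;

/* Total buffer accesses for a node: hits plus reads. */
static long
total_accesses(BufCounts b)
{
    assert(b.hit >= 0 && b.read >= 0);
    return b.hit + b.read;
}

/* Accesses reported at an outer node beyond those reported at its child. */
static long
extra_accesses(BufCounts outer, BufCounts inner)
{
    return total_accesses(outer) - total_accesses(inner);
}
```

For the plan above, the Parallel Seq Scan node gives 352 + 5054 = 5406, matching relpages for t2, while the Gather node gives 446 + 5054 = 5500. The difference of 94 is what the workers touched outside the scan node itself, which is the quantity options 1 to 3 treat differently.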

actual_time -
Actual time at Gather node = 79.787..959.651
Actual time at Parallel Seq Scan node = 30.771..2518.844

Time at the Parallel Seq Scan node is more than the time at the Gather node
because the time for the parallel workers is also accumulated for the
Parallel Seq Scan node, whereas some of it doesn't get accounted for at the
Gather node. Now this could be confusing for users, because the time
displayed at the Parallel Seq Scan node will be equal to
time_taken_by_worker-1 + time_taken_by_worker-2 + ...
This time could be more than the actual time taken by the query, because the
workers execute in parallel but we have accounted for the time as if
each one were executing serially.
I think the time for fetching the tuples from workers is already accounted
for at the Gather node, so maybe for the Parallel Seq Scan node we can omit
adding the time for each of the parallel workers.
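
To make the difference between the two ways of accounting concrete, here is a minimal sketch. summed_time() and longest_time() are hypothetical helpers, not PostgreSQL functions. The first adds per-process times the way the Parallel Seq Scan figure is described above; the second takes the longest single process, which is closer to what a user perceives as elapsed time when the processes run concurrently.

```c
#include <assert.h>
#include <stddef.h>

/* Sum of per-process times: how the Parallel Seq Scan figure is accumulated. */
static double
summed_time(const double *ms, size_t n)
{
    double total = 0.0;

    for (size_t i = 0; i < n; i++)
        total += ms[i];
    return total;
}

/* Longest single per-process time: a lower bound on the elapsed time of the scan. */
static double
longest_time(const double *ms, size_t n)
{
    assert(n > 0);

    double longest = ms[0];

    for (size_t i = 1; i < n; i++)
        if (ms[i] > longest)
            longest = ms[i];
    return longest;
}
```

As an invented example (the per-process split is not reported in the plan), three processes taking 850.0, 840.0 and 828.844 ms would sum to the 2518.844 ms shown at the Parallel Seq Scan node, while the longest of them, 850.0 ms, stays below the 959.651 ms shown at the Gather node.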

Thoughts?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#461Amit Kapila
amit.kapila16@gmail.com
In reply to: Thom Brown (#459)
Re: Parallel Seq Scan

On Fri, Nov 13, 2015 at 6:17 PM, Thom Brown <thom@linux.com> wrote:

The number of shared buffers hit could be different across different runs
because the read sequence of parallel workers can't be guaranteed, also
I don't think same is even guaranteed for Seq Scan node, the other
operations
in parallel could lead to different number, however the actual problem was
that in one of the plans shown by you [1], the Buffers hit at Gather node
(175288) is lesser than the Buffers hit at Parallel Seq Scan node (241201).
Do you still (after applying above patch) see that Gather node is showing
lesser hit buffers than Parallel Seq Scan node?

Hmm... that's odd, I'm not seeing the problem now, so maybe I'm mistaken
there.

Thanks for confirming the same.

or the actual time taken.

Exactly what time you are referring here, Execution Time or actual time
shown on Parallel Seq Scan node and what problem do you see with
the reported time?

I'm referring to the Parallel Seq Scan actual time, showing
"379.407..1141.437" with 1 worker, but the total execution time shows
403.045. If one worker is taking over a second, how come the whole
query was less than half a second?

Yeah, this is possible due to the way time is currently accumulated;
see the mail I sent just before this one. I think we need to do
something about it, otherwise it could be confusing for users.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#462Amit Kapila
amit.kapila16@gmail.com
In reply to: Pavel Stehule (#449)
Re: Parallel Seq Scan

On Wed, Nov 11, 2015 at 11:40 PM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

Yes - another slightly unclear thing in EXPLAIN is the number of workers. If I
understand the behaviour correctly, the query is processed by two processes
when the number of workers in the explain is one.

You are right, and I think that is the current working model of the Gather
node, which seems okay. I think the more serious thing here
is the possibility that Explain Analyze can show the
number of workers as more than the actual workers working for the Gather
node. We have already discussed that Explain Analyze should show
the actual number of workers used in query execution; the patch for
that is still pending.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#463Thom Brown
thom@linux.com
In reply to: Amit Kapila (#462)
Re: Parallel Seq Scan

On 13 November 2015 at 13:38, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Nov 11, 2015 at 11:40 PM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

yes - the another little bit unclean in EXPLAIN is number of workers. If I
understand to the behave, the query is processed by two processes if workers
in the explain is one.

You are right and I think that is current working model of Gather
node which seems okay. I think the more serious thing here
is that there is possibility that Explain Analyze can show the
number of workers as more than actual workers working for Gather
node. We have already discussed that Explain Analyze should show
the actual number of workers used in query execution, patch for
the same is still pending.

This may have already been discussed before, but in a verbose output,
would it be possible to see the nodes for each worker?

e.g.

# explain (analyse, buffers, timing, verbose, costs) select count(*)
from js where content->'tags'->>'title' like '%de%';
                                 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=105557.59..105557.60 rows=1 width=0) (actual time=400.752..400.752 rows=1 loops=1)
   Output: count(*)
   Buffers: shared hit=175333
   ->  Gather  (cost=1000.00..104931.04 rows=250621 width=0) (actual time=400.748..400.748 rows=0 loops=1)
         Output: content
         Number of Workers: 2
         Buffers: shared hit=175333
         ->  Parallel Seq Scan on public.js  (cost=0.00..39434.47 rows=125310 width=0) (actual time=182.256..398.14 rows=0 loops=1)
               Output: content
               Filter: (((js.content -> 'tags'::text) ->> 'title'::text) ~~ '%de%'::text)
               Rows Removed by Filter: 626486
               Buffers: shared hit=87666
         ->  Parallel Seq Scan on public.js  (cost=0.00..39434.47 rows=1253101 width=0) (actual time=214.11..325.31 rows=0 loops=1)
               Output: content
               Filter: (((js.content -> 'tags'::text) ->> 'title'::text) ~~ '%de%'::text)
               Rows Removed by Filter: 6264867
               Buffers: shared hit=876667
 Planning time: 0.085 ms
 Execution time: 414.713 ms
(14 rows)

And perhaps associated PIDs?

Thom

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#464Amit Kapila
amit.kapila16@gmail.com
In reply to: Thom Brown (#463)
Re: Parallel Seq Scan

On Fri, Nov 13, 2015 at 7:59 PM, Thom Brown <thom@linux.com> wrote:

On 13 November 2015 at 13:38, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Nov 11, 2015 at 11:40 PM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

yes - the another little bit unclean in EXPLAIN is number of workers. If I
understand to the behave, the query is processed by two processes if workers
in the explain is one.

You are right and I think that is current working model of Gather
node which seems okay. I think the more serious thing here
is that there is possibility that Explain Analyze can show the
number of workers as more than actual workers working for Gather
node. We have already discussed that Explain Analyze should show
the actual number of workers used in query execution, patch for
the same is still pending.

This may have already been discussed before, but in a verbose output,
would it be possible to see the nodes for each worker?

There will be hardly any difference in nodes for each worker and it could
be a very long plan for a large number of workers. What kind of additional
information do you want which can't be shown in the current format?

And perhaps associated PIDs?

Yeah, that can be useful, if others also feel like it is important, I can
look into preparing a patch for the same.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#465Thom Brown
thom@linux.com
In reply to: Amit Kapila (#464)
Re: Parallel Seq Scan

On 13 November 2015 at 15:22, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Nov 13, 2015 at 7:59 PM, Thom Brown <thom@linux.com> wrote:

On 13 November 2015 at 13:38, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Nov 11, 2015 at 11:40 PM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

yes - the another little bit unclean in EXPLAIN is number of workers. If I
understand to the behave, the query is processed by two processes if workers
in the explain is one.

You are right and I think that is current working model of Gather
node which seems okay. I think the more serious thing here
is that there is possibility that Explain Analyze can show the
number of workers as more than actual workers working for Gather
node. We have already discussed that Explain Analyze should show
the actual number of workers used in query execution, patch for
the same is still pending.

This may have already been discussed before, but in a verbose output,
would it be possible to see the nodes for each worker?

There will be hardly any difference in nodes for each worker and it could
be a very long plan for a large number of workers. What kind of additional
information do you want which can't be shown in the current format?

For explain plans, not that useful, but it's useful to see how long
each worker took for explain analyse. And I imagine as more
functionality is added to scan partitions and foreign scans, it will
perhaps be more useful when the plans won't be identical. (or would
they?)

And perhaps associated PIDs?

Yeah, that can be useful, if others also feel like it is important, I can
look into preparing a patch for the same.

Thanks.

Thom

#466Jeff Janes
jeff.janes@gmail.com
In reply to: Robert Haas (#443)
Re: Parallel Seq Scan

On Wed, Nov 11, 2015 at 6:53 AM, Robert Haas <robertmhaas@gmail.com> wrote:

I've committed most of this, except for some planner bits that I
didn't like, and after a bunch of cleanup. Instead, I committed the
consider-parallel-v2.patch with some additional planner bits to make
up for the ones I removed from your patch. So, now we have parallel
sequential scan!

Pretty cool. All I had to do is mark my slow plperl functions as
being parallel safe, and bang, parallel execution of them for seq
scans.

But, there does seem to be a memory leak.

The setup (warning: 20GB of data):

create table foobar as select md5(floor(random()*1500000)::text) as
id, random() as volume
from generate_series(1,200000000);

set max_parallel_degree TO 8;

explain select count(*) from foobar where volume >0.9;
                                      QUERY PLAN
---------------------------------------------------------------------------------------
 Aggregate  (cost=2626202.44..2626202.45 rows=1 width=0)
   ->  Gather  (cost=1000.00..2576381.76 rows=19928272 width=0)
         Number of Workers: 7
         ->  Parallel Seq Scan on foobar  (cost=0.00..582554.56 rows=19928272 width=0)
               Filter: (volume > '0.9'::double precision)

Now running this query leads to an OOM condition:

explain (analyze, buffers) select count(*) from foobar where volume >0.9;
WARNING: terminating connection because of crash of another server process

Running it without the explain also causes the problem.

A memory dump taken at some point before the crash looks like:

TopMemoryContext: 62496 total in 9 blocks; 16976 free (60 chunks); 45520 used
TopTransactionContext: 8192 total in 1 blocks; 4024 free (8 chunks); 4168 used
ExecutorState: 1795153920 total in 223 blocks; 4159872 free (880 chunks); 1790994048 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
Operator class cache: 8192 total in 1 blocks; 1680 free (0 chunks); 6512 used
....other insignificant stuff...

I don't have enough RAM for each of 7 workers to use all that much more than 2GB.

work_mem is 25MB, maintenance_work_mem is 64MB.

Cheers,

Jeff

#467Amit Kapila
amit.kapila16@gmail.com
In reply to: Thom Brown (#465)
Re: Parallel Seq Scan

On Fri, Nov 13, 2015 at 9:16 PM, Thom Brown <thom@linux.com> wrote:

On 13 November 2015 at 15:22, Amit Kapila <amit.kapila16@gmail.com> wrote:

There will be hardly any difference in nodes for each worker and it could
be a very long plan for a large number of workers. What kind of additional
information do you want which can't be shown in the current format?

For explain plans, not that useful, but it's useful to see how long
each worker took for explain analyse.

The statistics related to buffers, timing and in fact rows filtered will be
different for each worker, so it sounds sensible to me to display separate
plan info for each worker, or at least display the same in verbose or some
other mode, and then display aggregated information at the Gather node. The
only point that needs more thought is that parallel plans will look big for
a large number of workers. I think this will need somewhat more substantial
changes than what is done currently for parallel seq scan, so it is better
if others also share their opinion about this form of display of information
for parallel queries.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#468Amit Kapila
amit.kapila16@gmail.com
In reply to: Jeff Janes (#466)
2 attachment(s)
Re: Parallel Seq Scan

On Fri, Nov 13, 2015 at 11:05 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

On Wed, Nov 11, 2015 at 6:53 AM, Robert Haas <robertmhaas@gmail.com>
wrote:

I've committed most of this, except for some planner bits that I
didn't like, and after a bunch of cleanup. Instead, I committed the
consider-parallel-v2.patch with some additional planner bits to make
up for the ones I removed from your patch. So, now we have parallel
sequential scan!

Pretty cool. All I had to do is mark my slow plperl functions as
being parallel safe, and bang, parallel execution of them for seq
scans.

But, there does seem to be a memory leak.

Thanks for the report.

I think the main reason for the leak in workers is that one of the buffers
used while sending tuples from worker to master (in function BuildRemapInfo)
is not getting freed, and it is allocated for each tuple the worker sends
back to master. I couldn't find any use of such a buffer, so I think we can
avoid the allocation, or at least we need to free it. Attached patch
remove_unused_buf_allocation_v1.patch should fix the issue.
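
To make the mechanism concrete, here is a toy Python model (not PostgreSQL code) of a scratch buffer that is allocated once per tuple in a long-lived context and never freed. The `MemoryContext` class, the function names and the 1024-byte buffer size are assumptions made for illustration; only the shape of the bug is taken from the description above.

```python
# Toy model of a per-tuple allocation leak in a long-lived memory context.
# All names and sizes here are illustrative assumptions, not PostgreSQL APIs.

SCRATCH_BUF_SIZE = 1024  # assumed size of the unused scratch buffer


class MemoryContext:
    """Tracks outstanding bytes, like a context that lives for the whole query."""

    def __init__(self):
        self.used = 0

    def alloc(self, nbytes):
        self.used += nbytes
        return nbytes  # the "pointer" is just the size, which is all free() needs

    def free(self, handle):
        self.used -= handle


def build_remap_info_leaky(ctx, natts):
    ctx.alloc(SCRATCH_BUF_SIZE)        # scratch buffer: allocated, never freed
    remapinfo = ctx.alloc(8 * natts)   # per-attribute mapping array
    ctx.free(remapinfo)                # no-op mapping: freed again right away
    return None


def build_remap_info_fixed(ctx, natts):
    remapinfo = ctx.alloc(8 * natts)   # same as above, minus the scratch buffer
    ctx.free(remapinfo)
    return None


def bytes_left_after(build, ntuples, natts=2):
    """Call `build` once per tuple and report what is still allocated."""
    ctx = MemoryContext()
    for _ in range(ntuples):
        build(ctx, natts)
    return ctx.used
```

Under these assumptions, `bytes_left_after(build_remap_info_leaky, n)` grows linearly with `n`, while the fixed variant stays at zero however many tuples are sent.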

Another thing I have noticed is that we need to build the remap info only
when the target list contains record type attrs, so ideally it should not
even go in this path when such attrs are not present. The reason it did was
that the tuple descriptor stored in TQueueDestReceiver was not updated;
attached patch fix_initialization_tdesc_v1 fixes this issue.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

remove_unused_buf_allocation_v1.patch (application/octet-stream)
diff --git a/src/backend/executor/tqueue.c b/src/backend/executor/tqueue.c
index d68666c..5735acf 100644
--- a/src/backend/executor/tqueue.c
+++ b/src/backend/executor/tqueue.c
@@ -892,9 +892,6 @@ BuildRemapInfo(TupleDesc tupledesc)
 	Size		size;
 	AttrNumber	i;
 	bool		noop = true;
-	StringInfoData buf;
-
-	initStringInfo(&buf);
 
 	size = offsetof(RemapInfo, mapping) +
 		sizeof(RemapClass) * tupledesc->natts;
@@ -917,7 +914,6 @@ BuildRemapInfo(TupleDesc tupledesc)
 
 	if (noop)
 	{
-		appendStringInfo(&buf, "noop");
 		pfree(remapinfo);
 		remapinfo = NULL;
 	}

fix_initialization_tdesc_v1.patch (application/octet-stream)
diff --git a/src/backend/executor/tqueue.c b/src/backend/executor/tqueue.c
index 5735acf..08e0da1 100644
--- a/src/backend/executor/tqueue.c
+++ b/src/backend/executor/tqueue.c
@@ -131,11 +131,13 @@ tqueueReceiveSlot(TupleTableSlot *slot, DestReceiver *self)
 	 * adopt it here as well.
 	 */
 	if (tqueue->tupledesc != tupledesc ||
-		tqueue->remapinfo->natts != tupledesc->natts)
+		(tqueue->remapinfo &&
+		 tqueue->remapinfo->natts != tupledesc->natts))
 	{
 		if (tqueue->remapinfo != NULL)
 			pfree(tqueue->remapinfo);
 		tqueue->remapinfo = BuildRemapInfo(tupledesc);
+		tqueue->tupledesc = tupledesc;
 	}
 
 	tuple = ExecMaterializeSlot(slot);
#469Robert Haas
robertmhaas@gmail.com
In reply to: Thom Brown (#465)
Re: Parallel Seq Scan

On Fri, Nov 13, 2015 at 10:46 AM, Thom Brown <thom@linux.com> wrote:

And perhaps associated PIDs?

Yeah, that can be useful, if others also feel like it is important, I can
look into preparing a patch for the same.

Thanks.

Thom, what do you think the EXPLAIN output should look like,
specifically? Or anyone else who feels like answering.

I don't think it would be very useful to repeat the entire EXPLAIN
output n times, once per worker. That sounds like a loser. But we
could add additional lines to the output for each node, like this:

Parallel Seq Scan on foo (cost=0.00..XXX rows=YYY width=ZZZ) (actual time=AAA..BBB rows=CCC loops=1)
  Leader: actual time=AAA..BBB rows=CCC loops=1
  Worker 0: actual time=AAA..BBB rows=CCC loops=1
  Worker 1: actual time=AAA..BBB rows=CCC loops=1
  Worker 2: actual time=AAA..BBB rows=CCC loops=1

If "buffers" is specified, we could display the summary information
after the Parallel Seq Scan as normal and then display an additional
per-worker line after the "Leader" line and each "Worker N" line. I
think displaying the worker index is more useful than displaying the
PID, especially if we think that a plan tree like this might ever get
executed multiple times with different PIDs on each pass.

Like? Dislike? Other ideas?
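
As a rough illustration of the layout proposed above, the following Python sketch renders one summary line followed by a "Leader" line and one "Worker N" line per worker. It is only a mock-up of the text format: the `Instr` record and its field names are hypothetical, the cost estimates are left out, and nothing here reflects how EXPLAIN is actually implemented.

```python
from collections import namedtuple

# Hypothetical per-process instrumentation record; field names are assumptions.
Instr = namedtuple("Instr", "start total rows loops")


def actual(instr):
    """Format the 'actual time=..' fragment for one process."""
    return "actual time=%.3f..%.3f rows=%d loops=%d" % (
        instr.start, instr.total, instr.rows, instr.loops)


def format_parallel_node(label, summary, leader, workers):
    """Return the node's summary line plus indented per-process lines."""
    lines = ["%s (%s)" % (label, actual(summary))]
    lines.append("  Leader: " + actual(leader))
    for index, instr in enumerate(workers):
        # Workers are identified by index rather than PID, as suggested above.
        lines.append("  Worker %d: %s" % (index, actual(instr)))
    return lines
```

Calling `"\n".join(format_parallel_node(...))` with one `Instr` for the summary, one for the leader and a list for the workers yields output in the shape of the example above.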

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#470Gavin Flower
GavinFlower@archidevsys.co.nz
In reply to: Robert Haas (#469)
Re: Parallel Seq Scan

On 16/11/15 12:05, Robert Haas wrote:

On Fri, Nov 13, 2015 at 10:46 AM, Thom Brown <thom@linux.com> wrote:

And perhaps associated PIDs?

Yeah, that can be useful, if others also feel like it is important, I can
look into preparing a patch for the same.

Thanks.

Thom, what do you think the EXPLAIN output should look like,
specifically? Or anyone else who feels like answering.

I don't think it would be very useful to repeat the entire EXPLAIN
output n times, once per worker. That sounds like a loser. But we
could add additional lines to the output for each node, like this:

Parallel Seq Scan on foo (cost=0.00..XXX rows=YYY width=ZZZ) (actual time=AAA..BBB rows=CCC loops=1)
  Leader: actual time=AAA..BBB rows=CCC loops=1
  Worker 0: actual time=AAA..BBB rows=CCC loops=1
  Worker 1: actual time=AAA..BBB rows=CCC loops=1
  Worker 2: actual time=AAA..BBB rows=CCC loops=1

If "buffers" is specified, we could display the summary information
after the Parallel Seq Scan as normal and then display an additional
per-worker line after the "Leader" line and each "Worker N" line. I
think displaying the worker index is more useful than displaying the
PID, especially if we think that a plan tree like this might ever get
executed multiple times with different PIDs on each pass.

Like? Dislike? Other ideas?

Possibly have an option to include the PID?

Consider altering the format field width of the Worker number (depending
on the number of workers) so you don't get:
Worker 9 ...
Worker 10 ...
but something like
Worker  9 ...
Worker 10 ...
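
One way to get that alignment, sketched in Python under the assumption that worker indexes run from 0 to `nworkers - 1`, is to derive the field width from the largest index and right-align every index to it. The function name is made up for this example.

```python
def worker_labels(nworkers):
    """Return 'Worker N' labels with N right-aligned to a common width."""
    width = len(str(max(nworkers - 1, 0)))  # digits in the largest index
    # '%*d' takes the field width from the argument tuple.
    return ["Worker %*d" % (width, i) for i in range(nworkers)]
```

With eleven workers the labels for indexes 9 and 10 have the same length, so the text following them lines up.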

Cheers,
Gavin

#471Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#468)
Re: Parallel Seq Scan

On Sun, Nov 15, 2015 at 1:12 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Thanks for the report.

I think main reason of the leak in workers seems to be due the reason
that one of the buffer used while sending tuples (in function
BuildRemapInfo)
from worker to master is not getting freed and it is allocated for each
tuple worker sends back to master. I couldn't find use of such a buffer,
so I think we can avoid the allocation of same or atleast we need to free
it. Attached patch remove_unused_buf_allocation_v1.patch should fix the
issue.

Oops. Committed.

Another thing I have noticed is that we need to build the remap info
target list contains record type of attrs, so ideally it should not even go
in
this path when such attrs are not present. The reason for the same was
that the tuple descriptor stored in TQueueDestReceiver was not updated,
attached patch fix_initialization_tdesc_v1 fixes this issue.

I don't understand this part.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#472Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#469)
Re: Parallel Seq Scan

On Mon, Nov 16, 2015 at 4:35 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Nov 13, 2015 at 10:46 AM, Thom Brown <thom@linux.com> wrote:

And perhaps associated PIDs?

Yeah, that can be useful, if others also feel like it is important, I can
look into preparing a patch for the same.

Thanks.

Thom, what do you think the EXPLAIN output should look like,
specifically? Or anyone else who feels like answering.

I don't think it would be very useful to repeat the entire EXPLAIN
output n times, once per worker. That sounds like a loser.

Yes, it doesn't seem a good idea to repeat the information, but what
about the cases where different workers scan different relations
(partitions in case of an Append node) or maybe perform a different
operation in Sort or join node parallelism?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#473Pavel Stehule
pavel.stehule@gmail.com
In reply to: Amit Kapila (#472)
Re: Parallel Seq Scan

2015-11-16 14:17 GMT+01:00 Amit Kapila <amit.kapila16@gmail.com>:

On Mon, Nov 16, 2015 at 4:35 AM, Robert Haas <robertmhaas@gmail.com>
wrote:

On Fri, Nov 13, 2015 at 10:46 AM, Thom Brown <thom@linux.com> wrote:

And perhaps associated PIDs?

Yeah, that can be useful, if others also feel like it is important, I can
look into preparing a patch for the same.

Thanks.

Thom, what do you think the EXPLAIN output should look like,
specifically? Or anyone else who feels like answering.

I don't think it would be very useful to repeat the entire EXPLAIN
output n times, once per worker. That sounds like a loser.

Yes, it doesn't seem a good idea to repeat the information, but what
about the cases where different workers scan different relations
(partitions in case of an Append node) or maybe perform a different
operation in Sort or join node parallelism?

+1

Pavel

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#474Jeff Janes
jeff.janes@gmail.com
In reply to: Amit Kapila (#468)
Re: Parallel Seq Scan

On Sat, Nov 14, 2015 at 10:12 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Nov 13, 2015 at 11:05 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

On Wed, Nov 11, 2015 at 6:53 AM, Robert Haas <robertmhaas@gmail.com>
wrote:

I've committed most of this, except for some planner bits that I
didn't like, and after a bunch of cleanup. Instead, I committed the
consider-parallel-v2.patch with some additional planner bits to make
up for the ones I removed from your patch. So, now we have parallel
sequential scan!

Pretty cool. All I had to do is mark my slow plperl functions as
being parallel safe, and bang, parallel execution of them for seq
scans.

But, there does seem to be a memory leak.

Thanks for the report.

I think main reason of the leak in workers seems to be due the reason
that one of the buffer used while sending tuples (in function
BuildRemapInfo)
from worker to master is not getting freed and it is allocated for each
tuple worker sends back to master. I couldn't find use of such a buffer,
so I think we can avoid the allocation of same or atleast we need to free
it. Attached patch remove_unused_buf_allocation_v1.patch should fix the
issue.

Thanks, that patch (as committed) has fixed the problem for me. I
don't understand the second one.

Cheers,

Jeff

#475Bert
biertie@gmail.com
In reply to: Jeff Janes (#474)
Re: Parallel Seq Scan

Hey,

I've just pulled and compiled the new code.
I'm running a TPC-DS like test on different PostgreSQL installations, but
running (max) 12 queries in parallel on a server with 12 cores.
I've configured max_parallel_degree to 2, and I get messages that backend
processes crash.
I am running the same test now with 6 queries in parallel, and parallel
degree set to 2, and they seem to work. For now. :)

This is the output I get in /var/log/messages
Nov 16 20:40:05 woludwha02 kernel: postgres[22918]: segfault at
7fa3437bf104 ip 0000000000490b56 sp 00007ffdf2f083a0 error 6 in
postgres[400000+5b5000]

Is there something else I should get?

cheers,
Bert

On Mon, Nov 16, 2015 at 6:06 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

On Sat, Nov 14, 2015 at 10:12 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Fri, Nov 13, 2015 at 11:05 PM, Jeff Janes <jeff.janes@gmail.com>
wrote:

On Wed, Nov 11, 2015 at 6:53 AM, Robert Haas <robertmhaas@gmail.com>
wrote:

I've committed most of this, except for some planner bits that I
didn't like, and after a bunch of cleanup. Instead, I committed the
consider-parallel-v2.patch with some additional planner bits to make
up for the ones I removed from your patch. So, now we have parallel
sequential scan!

Pretty cool. All I had to do is mark my slow plperl functions as
being parallel safe, and bang, parallel execution of them for seq
scans.

But, there does seem to be a memory leak.

Thanks for the report.

I think main reason of the leak in workers seems to be due the reason
that one of the buffer used while sending tuples (in function
BuildRemapInfo)
from worker to master is not getting freed and it is allocated for each
tuple worker sends back to master. I couldn't find use of such a buffer,
so I think we can avoid the allocation of same or atleast we need to free
it. Attached patch remove_unused_buf_allocation_v1.patch should fix the
issue.

Thanks, that patch (as committed) has fixed the problem for me. I
don't understand the second one.

Cheers,

Jeff

--
Bert Desmet
0477/305361

#476Robert Haas
robertmhaas@gmail.com
In reply to: Bert (#475)
Re: Parallel Seq Scan

On Mon, Nov 16, 2015 at 2:51 PM, Bert <biertie@gmail.com> wrote:

I've just pulled and compiled the new code.
I'm running a TPC-DS like test on different PostgreSQL installations, but
running (max) 12queries in parallel on a server with 12cores.
I've configured max_parallel_degree to 2, and I get messages that backend
processes crash.
I am running the same test now with 6queries in parallel, and parallel
degree to 2, and they seem to work. for now. :)

This is the output I get in /var/log/messages
Nov 16 20:40:05 woludwha02 kernel: postgres[22918]: segfault at 7fa3437bf104
ip 0000000000490b56 sp 00007ffdf2f083a0 error 6 in postgres[400000+5b5000]

Is there something else I should get?

Can you enable core dumps e.g. by passing the -c option to pg_ctl
start? If you can get a core file, you can then get a backtrace
using:

gdb /path/to/postgres /path/to/core
bt full
q

That should be enough to find and fix whatever the bug is. Thanks for testing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#477Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#471)
Re: Parallel Seq Scan

On Mon, Nov 16, 2015 at 7:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Sun, Nov 15, 2015 at 1:12 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

Thanks for the report.

I think main reason of the leak in workers seems to be due the reason
that one of the buffer used while sending tuples (in function
BuildRemapInfo)
from worker to master is not getting freed and it is allocated for each
tuple worker sends back to master. I couldn't find use of such a buffer,
so I think we can avoid the allocation of same or atleast we need to free
it. Attached patch remove_unused_buf_allocation_v1.patch should fix the
issue.

Oops. Committed.

Thanks!

Another thing I have noticed is that we need to build the remap info
target list contains record type of attrs, so ideally it should not even go
in
this path when such attrs are not present. The reason for the same was
that the tuple descriptor stored in TQueueDestReceiver was not updated,
attached patch fix_initialization_tdesc_v1 fixes this issue.

I don't understand this part.

The code in question is as below:

tqueueReceiveSlot(TupleTableSlot *slot, DestReceiver *self)
{
..
	if (tqueue->tupledesc != tupledesc ||
		tqueue->remapinfo->natts != tupledesc->natts)
	{
		if (tqueue->remapinfo != NULL)
			pfree(tqueue->remapinfo);
		tqueue->remapinfo = BuildRemapInfo(tupledesc);
	}
..
}

Here the above check always passes because tqueue->tupledesc is never
set, due to which it always tries to build the remap info. Is there any
reason for doing so?
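
The effect can be reproduced with a small Python model (again, not PostgreSQL code) that counts how often the remap info is rebuilt under the check as quoted and under the check from fix_initialization_tdesc_v1. The classes and function names are stand-ins invented for this sketch, and the build step is assumed to return nothing because no record-type attributes are present.

```python
class TupleDesc:
    def __init__(self, natts):
        self.natts = natts


class TQueue:
    def __init__(self):
        self.tupledesc = None   # descriptor the cached remap info was built for
        self.remapinfo = None   # None models "no remapping needed"
        self.builds = 0         # how many times the remap info was rebuilt


def build_remap_info(tqueue, tupledesc):
    tqueue.builds += 1
    return None  # assumption: no record-type attributes, so nothing to remap


def receive_slot_as_quoted(tqueue, tupledesc):
    # tqueue.tupledesc is never assigned, so the first test is always true
    # and the second one is never even evaluated.
    if (tqueue.tupledesc is not tupledesc
            or tqueue.remapinfo.natts != tupledesc.natts):
        tqueue.remapinfo = build_remap_info(tqueue, tupledesc)


def receive_slot_with_fix(tqueue, tupledesc):
    if (tqueue.tupledesc is not tupledesc
            or (tqueue.remapinfo is not None
                and tqueue.remapinfo.natts != tupledesc.natts)):
        tqueue.remapinfo = build_remap_info(tqueue, tupledesc)
        tqueue.tupledesc = tupledesc  # remember what the cache was built for


def count_builds(receive, ntuples):
    """Send `ntuples` slots with the same descriptor and count rebuilds."""
    tqueue, tupledesc = TQueue(), TupleDesc(natts=2)
    for _ in range(ntuples):
        receive(tqueue, tupledesc)
    return tqueue.builds
```

In this model the quoted check rebuilds once per tuple, while the patched check rebuilds only for the first tuple as long as the descriptor does not change.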

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#478Amit Kapila
amit.kapila16@gmail.com
In reply to: Jeff Janes (#474)
Re: Parallel Seq Scan

On Mon, Nov 16, 2015 at 10:36 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

On Sat, Nov 14, 2015 at 10:12 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Fri, Nov 13, 2015 at 11:05 PM, Jeff Janes <jeff.janes@gmail.com>
wrote:

I think main reason of the leak in workers seems to be due the reason
that one of the buffer used while sending tuples (in function
BuildRemapInfo)
from worker to master is not getting freed and it is allocated for each
tuple worker sends back to master. I couldn't find use of such a buffer,
so I think we can avoid the allocation of same or atleast we need to free
it. Attached patch remove_unused_buf_allocation_v1.patch should fix the
issue.

Thanks, that patch (as committed) has fixed the problem for me.

Thanks to you as well for verification.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#479Bert
biertie@gmail.com
In reply to: Robert Haas (#476)
Re: Parallel Seq Scan

Hi,

this is the backtrace:
gdb /var/lib/pgsql/9.6/data/ /var/lib/pgsql/9.6/data/core.7877
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-64.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
/var/lib/pgsql/9.6/data/: Success.
[New LWP 7877]
Missing separate debuginfo for the main executable file
Try: yum --enablerepo='*debug*' install
/usr/lib/debug/.build-id/02/20b77a9ab8f607b0610082794165fccedf210d
Core was generated by `postgres: postgres tpcds [loca'.
Program terminated with signal 11, Segmentation fault.
#0 0x0000000000490b56 in ?? ()
(gdb) bt full
#0 0x0000000000490b56 in ?? ()
No symbol table info available.
#1 0x0000000000003668 in ?? ()
No symbol table info available.
#2 0x00007f956249a008 in ?? ()
No symbol table info available.
#3 0x000000000228c498 in ?? ()
No symbol table info available.
#4 0x0000000000000001 in ?? ()
No symbol table info available.
#5 0x000000000228ad00 in ?? ()
No symbol table info available.
#6 0x0000000000493fdf in ?? ()
No symbol table info available.
#7 0x00000000021a8e50 in ?? ()
No symbol table info available.
#8 0x0000000000000000 in ?? ()
No symbol table info available.
(gdb) q

Is there something else I can do?

On Mon, Nov 16, 2015 at 8:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Nov 16, 2015 at 2:51 PM, Bert <biertie@gmail.com> wrote:

I've just pulled and compiled the new code.
I'm running a TPC-DS like test on different PostgreSQL installations, but
running (max) 12queries in parallel on a server with 12cores.
I've configured max_parallel_degree to 2, and I get messages that backend
processes crash.
I am running the same test now with 6queries in parallel, and parallel
degree to 2, and they seem to work. for now. :)

This is the output I get in /var/log/messages
Nov 16 20:40:05 woludwha02 kernel: postgres[22918]: segfault at 7fa3437bf104
ip 0000000000490b56 sp 00007ffdf2f083a0 error 6 in postgres[400000+5b5000]

Is there something else I should get?

Can you enable core dumps e.g. by passing the -c option to pg_ctl
start? If you can get a core file, you can then get a backtrace
using:

gdb /path/to/postgres /path/to/core
bt full
q

That should be enough to find and fix whatever the bug is. Thanks for
testing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Bert Desmet
0477/305361

#480Bert
biertie@gmail.com
In reply to: Bert (#479)
Re: Parallel Seq Scan

edit: maybe this is more useful? :)

(gdb) bt full
#0 0x0000000000490b56 in heap_parallelscan_nextpage ()
No symbol table info available.
#1 0x0000000000493fdf in heap_getnext ()
No symbol table info available.
#2 0x00000000005c0733 in SeqNext ()
No symbol table info available.
#3 0x00000000005ac5d9 in ExecScan ()
No symbol table info available.
#4 0x00000000005a5c08 in ExecProcNode ()
No symbol table info available.
#5 0x00000000005b5298 in ExecGather ()
No symbol table info available.
#6 0x00000000005a5aa8 in ExecProcNode ()
No symbol table info available.
#7 0x00000000005b68b9 in MultiExecHash ()
No symbol table info available.
#8 0x00000000005b7256 in ExecHashJoin ()
No symbol table info available.
#9 0x00000000005a5b18 in ExecProcNode ()
No symbol table info available.
#10 0x00000000005b0ac9 in fetch_input_tuple ()
No symbol table info available.
#11 0x00000000005b1eaf in ExecAgg ()
No symbol table info available.
#12 0x00000000005a5ad8 in ExecProcNode ()
No symbol table info available.
#13 0x00000000005c11e1 in ExecSort ()
No symbol table info available.
#14 0x00000000005a5af8 in ExecProcNode ()
No symbol table info available.
#15 0x00000000005ba164 in ExecLimit ()
No symbol table info available.
#16 0x00000000005a5a38 in ExecProcNode ()
No symbol table info available.
#17 0x00000000005a2343 in standard_ExecutorRun ()
No symbol table info available.
#18 0x000000000069cb08 in PortalRunSelect ()
No symbol table info available.
#19 0x000000000069de5f in PortalRun ()
No symbol table info available.
#20 0x000000000069bc16 in PostgresMain ()
No symbol table info available.
#21 0x0000000000466f55 in ServerLoop ()
No symbol table info available.
#22 0x0000000000648436 in PostmasterMain ()
No symbol table info available.
#23 0x00000000004679f0 in main ()
No symbol table info available.

On Tue, Nov 17, 2015 at 12:38 PM, Bert <biertie@gmail.com> wrote:

Hi,

this is the backtrace:
gdb /var/lib/pgsql/9.6/data/ /var/lib/pgsql/9.6/data/core.7877
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-64.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
/var/lib/pgsql/9.6/data/: Success.
[New LWP 7877]
Missing separate debuginfo for the main executable file
Try: yum --enablerepo='*debug*' install
/usr/lib/debug/.build-id/02/20b77a9ab8f607b0610082794165fccedf210d
Core was generated by `postgres: postgres tpcds [loca'.
Program terminated with signal 11, Segmentation fault.
#0 0x0000000000490b56 in ?? ()
(gdb) bt full
#0 0x0000000000490b56 in ?? ()
No symbol table info available.
#1 0x0000000000003668 in ?? ()
No symbol table info available.
#2 0x00007f956249a008 in ?? ()
No symbol table info available.
#3 0x000000000228c498 in ?? ()
No symbol table info available.
#4 0x0000000000000001 in ?? ()
No symbol table info available.
#5 0x000000000228ad00 in ?? ()
No symbol table info available.
#6 0x0000000000493fdf in ?? ()
No symbol table info available.
#7 0x00000000021a8e50 in ?? ()
No symbol table info available.
#8 0x0000000000000000 in ?? ()
No symbol table info available.
(gdb) q

Is there something else I can do?

On Mon, Nov 16, 2015 at 8:59 PM, Robert Haas <robertmhaas@gmail.com>
wrote:

On Mon, Nov 16, 2015 at 2:51 PM, Bert <biertie@gmail.com> wrote:

I've just pulled and compiled the new code.
I'm running a TPC-DS like test on different PostgreSQL installations, but
running (max) 12queries in parallel on a server with 12cores.
I've configured max_parallel_degree to 2, and I get messages that backend
processes crash.
I am running the same test now with 6queries in parallel, and parallel
degree to 2, and they seem to work. for now. :)

This is the output I get in /var/log/messages
Nov 16 20:40:05 woludwha02 kernel: postgres[22918]: segfault at
7fa3437bf104 ip 0000000000490b56 sp 00007ffdf2f083a0 error 6 in
postgres[400000+5b5000]

Is there something else I should get?

Can you enable core dumps e.g. by passing the -c option to pg_ctl
start? If you can get a core file, you can then get a backtrace
using:

gdb /path/to/postgres /path/to/core
bt full
q

That should be enough to find and fix whatever the bug is. Thanks for
testing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Bert Desmet
0477/305361

#481Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#477)
1 attachment(s)
Re: Parallel Seq Scan

On Mon, Nov 16, 2015 at 9:49 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I don't understand this part.

The code in question is as below:

tqueueReceiveSlot(TupleTableSlot *slot, DestReceiver *self)
{
..
    if (tqueue->tupledesc != tupledesc ||
        tqueue->remapinfo->natts != tupledesc->natts)
    {
        if (tqueue->remapinfo != NULL)
            pfree(tqueue->remapinfo);
        tqueue->remapinfo = BuildRemapInfo(tupledesc);
    }
..
}

Here the above check always passes because tqueue->tupledesc is never
set, so it always tries to build the remap info. Is there any reason
for doing so?

Groan. The problem here is that tqueue->tupledesc never gets set. I
think this should be fixed as in the attached.
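
To make the failure mode concrete, here is a minimal standalone sketch of the same caching pattern. It is not the tqueue.c code: Desc, RemapInfo, Receiver, receiver_init() and receiver_receive() are hypothetical stand-ins, and malloc/free stand in for palloc/pfree. The point is only that derived data keyed on a pointer must record that pointer after a rebuild, otherwise the comparison fails on every call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the tuple descriptor and remap info. */
typedef struct Desc { int natts; } Desc;
typedef struct RemapInfo { int natts; } RemapInfo;

typedef struct Receiver
{
    const Desc *cached_desc;    /* descriptor the remap info was built for */
    RemapInfo  *remapinfo;      /* derived data, rebuilt when descriptor changes */
    int         rebuild_count;  /* instrumentation for this sketch only */
} Receiver;

/* Build derived data for a descriptor (assumed to be the expensive step). */
static RemapInfo *
build_remap_info(const Desc *desc)
{
    RemapInfo *info = malloc(sizeof(RemapInfo));

    if (info == NULL)
        abort();
    info->natts = desc->natts;
    return info;
}

void
receiver_init(Receiver *r)
{
    r->cached_desc = NULL;
    r->remapinfo = NULL;
    r->rebuild_count = 0;
}

void
receiver_receive(Receiver *r, const Desc *desc)
{
    assert(desc != NULL);

    if (r->cached_desc != desc)
    {
        free(r->remapinfo);         /* free(NULL) is a no-op */
        r->remapinfo = build_remap_info(desc);
        r->cached_desc = desc;      /* without this, every call rebuilds */
        r->rebuild_count++;
    }
}
```

In this sketch, calling receiver_receive() twice with the same descriptor rebuilds once; dropping the cached_desc assignment would make it rebuild on every call, which is the behaviour reported above.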

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

init-tqueue-tupledesc.patch (text/x-diff; charset=US-ASCII)
diff --git a/src/backend/executor/tqueue.c b/src/backend/executor/tqueue.c
index 5735acf..d625b0d 100644
--- a/src/backend/executor/tqueue.c
+++ b/src/backend/executor/tqueue.c
@@ -127,15 +127,15 @@ tqueueReceiveSlot(TupleTableSlot *slot, DestReceiver *self)
 	 * new tupledesc.  This is a strange test both because the executor really
 	 * shouldn't change the tupledesc, and also because it would be unsafe if
 	 * the old tupledesc could be freed and a new one allocated at the same
-	 * address.  But since some very old code in printtup.c uses this test, we
-	 * adopt it here as well.
+	 * address.  But since some very old code in printtup.c uses a similar
+	 * test, we adopt it here as well.
 	 */
-	if (tqueue->tupledesc != tupledesc ||
-		tqueue->remapinfo->natts != tupledesc->natts)
+	if (tqueue->tupledesc != tupledesc)
 	{
 		if (tqueue->remapinfo != NULL)
 			pfree(tqueue->remapinfo);
 		tqueue->remapinfo = BuildRemapInfo(tupledesc);
+		tqueue->tupledesc = tupledesc;
 	}
 
 	tuple = ExecMaterializeSlot(slot);
#482Robert Haas
robertmhaas@gmail.com
In reply to: Bert (#480)
Re: Parallel Seq Scan

On Tue, Nov 17, 2015 at 6:52 AM, Bert <biertie@gmail.com> wrote:

edit: maybe this is more useful? :)

Definitely. But if you've built with --enable-debug and not stripped
the resulting executable, we ought to get line numbers as well, plus
the arguments to each function on the stack. That would help a lot
more. The only things that get dereferenced in that function are
"scan" and "parallel_scan", so it's a good bet that one of those
pointers is pointing off into never-never land. I can't immediately
guess how that's happening, though.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#483Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#455)
Re: Parallel Seq Scan

On Thu, Nov 12, 2015 at 10:23 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Thanks for the report. The reason for this problem is that instrumentation
information from workers is getting aggregated multiple times. In
ExecShutdownGatherWorkers(), we call ExecParallelFinish where it
will wait for workers to finish and then accumulate stats from workers.
Now ExecShutdownGatherWorkers() could be called multiple times
(once we read all tuples from workers, at end of node) and it should be
ensured that repeated calls should not try to redo the work done by first
call.
The same is ensured for tuplequeues, but not for parallel executor info.
I think we can safely assume that we need to call ExecParallelFinish() only
when there are workers started by the Gathers node, so on those lines
attached patch should fix the problem.

I suggest that we instead fix ExecParallelFinish() to be idempotent.
Add a "bool finished" flag to ParallelExecutorInfo and return at once
if it's already set. Get rid of the exposed
ExecParallelReinitializeTupleQueues() interface and have
ExecParallelReinitialize(pei) instead. Have that call
ReinitializeParallelDSM(), ExecParallelSetupTupleQueues(pei->pcxt,
true), and set pei->finished = false. I think that would give us a
slightly cleaner separation of concerns between nodeGather.c and
execParallel.c.
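
As a rough illustration of the shape being suggested, the following standalone sketch shows an idempotent finish step guarded by a flag, plus a reinitialize step that clears it. ExecInfo, exec_info_finish() and exec_info_reinitialize() are hypothetical simplifications, not the PostgreSQL structures or functions; the integer counters merely stand in for the per-worker instrumentation that was being accumulated twice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical simplification of the parallel executor info. */
typedef struct ExecInfo
{
    bool finished;       /* set once worker results have been collected */
    int  worker_tuples;  /* stand-in for stats reported by workers */
    int  total_tuples;   /* leader-side accumulated total */
} ExecInfo;

/* Collect worker stats exactly once, however often this is called. */
void
exec_info_finish(ExecInfo *info)
{
    assert(info != NULL);

    if (info->finished)
        return;

    info->total_tuples += info->worker_tuples;
    info->finished = true;
}

/* Prepare for a rescan: reset per-run state and allow finish to run again. */
void
exec_info_reinitialize(ExecInfo *info)
{
    assert(info != NULL);

    info->worker_tuples = 0;
    info->finished = false;
}
```

With the flag in place the caller does not need to know whether workers were ever launched or whether shutdown has already happened, which is the separation of concerns argued for above.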

Your fix seems a little fragile. You're relying on node->reader !=
NULL to tell you whether the readers need to be cleaned up, but in
fact node->reader is set to a non-NULL value AFTER the pei has been
created. Granted, we currently always create a reader unless we don't
get any workers, and if we don't get any workers then failing to call
ExecParallelFinish is currently harmless, but nonetheless I think we
should be more explicit about this so it doesn't accidentally get
broken later.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#484Bert
biertie@gmail.com
In reply to: Robert Haas (#482)
Re: Parallel Seq Scan

Hey Robert,

Thank you for the help. As you might (not) know, I'm quite new to the
community, but I'm learning, with the help of people like you.
Anyhow, find attached a third attempt at a valid backtrace file.

This run is compiled from commit 5f10b7a604c87fc61a2c20a56552301f74c9bd5f
and your latest patch attached in this mail thread.

cheers,
Bert
full_backtrace.log
<https://drive.google.com/file/d/0B_qnY25RovTmM0NtdkNSejByVGs/view?usp=drive_web>

On Tue, Nov 17, 2015 at 6:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Nov 17, 2015 at 6:52 AM, Bert <biertie@gmail.com> wrote:

edit: maybe this is more useful? :)

Definitely. But if you've built with --enable-debug and not stripped
the resulting executable, we ought to get line numbers as well, plus
the arguments to each function on the stack. That would help a lot
more. The only things that get dereferenced in that function are
"scan" and "parallel_scan", so it's a good bet that one of those
pointers is pointing off into never-never land. I can't immediately
guess how that's happening, though.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Bert Desmet
0477/305361

#485Robert Haas
robertmhaas@gmail.com
In reply to: Bert (#484)
Re: Parallel Seq Scan

On Tue, Nov 17, 2015 at 4:51 PM, Bert <biertie@gmail.com> wrote:

Hey Robert,

Thank you for the help. As you might (not) know, I'm quite new to the
community, but I'm learning, with the help of people like you.
Anyhow, find attached a third attempt at a valid backtrace file.

This run is compiled from commit 5f10b7a604c87fc61a2c20a56552301f74c9bd5f
and your latest patch attached in this mail thread.

Thanks. This is great. Can you also run these commands:

frame 1
p *scan

The first command should select the heap_parallelscan_nextpage frame. The
second command should print the contents of the scan object.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#486Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#481)
Re: Parallel Seq Scan

On Tue, Nov 17, 2015 at 11:22 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Nov 16, 2015 at 9:49 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I don't understand this part.

Here the above check always passes because tqueue->tupledesc is never
set, so it always tries to build the remap info. Is there any reason
for doing so?

Groan. The problem here is that tqueue->tupledesc never gets set.

Yes that was the problem.

I think this should be fixed as in the attached.

Works for me!

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#487Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#483)
1 attachment(s)
Re: Parallel Seq Scan

On Wed, Nov 18, 2015 at 12:59 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Nov 12, 2015 at 10:23 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Thanks for the report. The reason for this problem is that instrumentation
information from workers is getting aggregated multiple times. In
ExecShutdownGatherWorkers(), we call ExecParallelFinish where it
will wait for workers to finish and then accumulate stats from workers.
Now ExecShutdownGatherWorkers() could be called multiple times
(once we read all tuples from workers, at end of node) and it should be
ensured that repeated calls should not try to redo the work done by first
call.
The same is ensured for tuplequeues, but not for parallel executor info.
I think we can safely assume that we need to call ExecParallelFinish() only
when there are workers started by the Gathers node, so on those lines
attached patch should fix the problem.

I suggest that we instead fix ExecParallelFinish() to be idempotent.
Add a "bool finished" flag to ParallelExecutorInfo and return at once
if it's already set. Get rid of the exposed
ExecParallelReinitializeTupleQueues() interface and have
ExecParallelReinitialize(pei) instead. Have that call
ReinitializeParallelDSM(), ExecParallelSetupTupleQueues(pei->pcxt,
true), and set pei->finished = false. I think that would give us a
slightly cleaner separation of concerns between nodeGather.c and
execParallel.c.

Okay, attached patch fixes the issue as per above suggestion.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

fix_finish_parallel_executor_info_v1.patch (application/octet-stream)
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index eae13c5..6730037 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -277,13 +277,15 @@ ExecParallelSetupTupleQueues(ParallelContext *pcxt, bool reinitialize)
 }
 
 /*
- * Re-initialize the response queues for backend workers to return tuples
- * to the main backend and start the workers.
+ * Re-initialize the parallel executor info such that it can be reused by
+ * workers.
  */
-shm_mq_handle **
-ExecParallelReinitializeTupleQueues(ParallelContext *pcxt)
+void
+ExecParallelReinitialize(ParallelExecutorInfo *pei)
 {
-	return ExecParallelSetupTupleQueues(pcxt, true);
+	ReinitializeParallelDSM(pei->pcxt);
+	pei->tqueue = ExecParallelSetupTupleQueues(pei->pcxt, true);
+	pei->finished = false;
 }
 
 /*
@@ -308,6 +310,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate, int nworkers)
 
 	/* Allocate object for return value. */
 	pei = palloc0(sizeof(ParallelExecutorInfo));
+	pei->finished = false;
 	pei->planstate = planstate;
 
 	/* Fix up and serialize plan to be sent to workers. */
@@ -469,6 +472,9 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
 {
 	int		i;
 
+	if (pei->finished)
+		return;
+
 	/* First, wait for the workers to finish. */
 	WaitForParallelWorkersToFinish(pei->pcxt);
 
@@ -480,6 +486,8 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
 	if (pei->instrumentation)
 		ExecParallelRetrieveInstrumentation(pei->planstate,
 											pei->instrumentation);
+
+	pei->finished = true;
 }
 
 /*
diff --git a/src/backend/executor/nodeGather.c b/src/backend/executor/nodeGather.c
index b368b48..b6e82d1 100644
--- a/src/backend/executor/nodeGather.c
+++ b/src/backend/executor/nodeGather.c
@@ -456,11 +456,7 @@ ExecReScanGather(GatherState *node)
 	node->initialized = false;
 
 	if (node->pei)
-	{
-		ReinitializeParallelDSM(node->pei->pcxt);
-		node->pei->tqueue =
-				ExecParallelReinitializeTupleQueues(node->pei->pcxt);
-	}
+		ExecParallelReinitialize(node->pei);
 
 	ExecReScan(node->ps.lefttree);
 }
diff --git a/src/include/executor/execParallel.h b/src/include/executor/execParallel.h
index 23c29eb..b43af1d 100644
--- a/src/include/executor/execParallel.h
+++ b/src/include/executor/execParallel.h
@@ -27,12 +27,13 @@ typedef struct ParallelExecutorInfo
 	BufferUsage *buffer_usage;
 	SharedExecutorInstrumentation *instrumentation;
 	shm_mq_handle **tqueue;
+	bool	finished;
 }	ParallelExecutorInfo;
 
 extern ParallelExecutorInfo *ExecInitParallelPlan(PlanState *planstate,
 					 EState *estate, int nworkers);
 extern void ExecParallelFinish(ParallelExecutorInfo *pei);
 extern void ExecParallelCleanup(ParallelExecutorInfo *pei);
-extern shm_mq_handle **ExecParallelReinitializeTupleQueues(ParallelContext *pcxt);
+extern void ExecParallelReinitialize(ParallelExecutorInfo *pei);
 
 #endif   /* EXECPARALLEL_H */
#488Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#487)
Re: Parallel Seq Scan

On Wed, Nov 18, 2015 at 12:48 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I suggest that we instead fix ExecParallelFinish() to be idempotent.
Add a "bool finished" flag to ParallelExecutorInfo and return at once
if it's already set. Get rid of the exposed
ExecParallelReinitializeTupleQueues() interface and have
ExecParallelReinitialize(pei) instead. Have that call
ReinitializeParallelDSM(), ExecParallelSetupTupleQueues(pei->pcxt,
true), and set pei->finished = false. I think that would give us a
slightly cleaner separation of concerns between nodeGather.c and
execParallel.c.

Okay, attached patch fixes the issue as per above suggestion.

Thanks, committed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#489Amit Kapila
amit.kapila16@gmail.com
In reply to: Bert (#475)
1 attachment(s)
Re: Parallel Seq Scan

On Tue, Nov 17, 2015 at 1:21 AM, Bert <biertie@gmail.com> wrote:

Hey,

I've just pulled and compiled the new code.
I'm running a TPC-DS like test on different PostgreSQL installations, but
running (max) 12queries in parallel on a server with 12cores.
I've configured max_parallel_degree to 2, and I get messages that backend
processes crash.
I am running the same test now with 6queries in parallel, and parallel
degree to 2, and they seem to work. for now. :)

This is the output I get in /var/log/messages
Nov 16 20:40:05 woludwha02 kernel: postgres[22918]: segfault at
7fa3437bf104 ip 0000000000490b56 sp 00007ffdf2f083a0 error 6 in
postgres[400000+5b5000]

Thanks for reporting the issue.

I think what's going on here is that when any of the sessions doesn't
get any workers, we shut down the Gather node, which internally destroys
the dynamic shared memory segment as well. However, the same is
needed as per current design for doing scan by master backend as
well. So I think the fix would be to just do shutdown of workers which
actually won't do anything in this scenario. I have tried to reproduce
this issue with a simpler test case as below:

Create two tables with large data:
CREATE TABLE t1(c1, c2) AS SELECT g, repeat('x', 5) FROM
generate_series(1, 10000000) g;

CREATE TABLE t2(c1, c2) AS SELECT g, repeat('x', 5) FROM
generate_series(1, 1000000) g;

Set max_worker_processes = 2 in postgresql.conf

Session-1
set max_parallel_degree=4;
set parallel_tuple_cost=0;
set parallel_setup_cost=0;
Explain analyze select count(*) from t1 where c1 > 10000;

Session-2
set max_parallel_degree=4;
set parallel_tuple_cost=0;
set parallel_setup_cost=0;
Explain analyze select count(*) from t2 where c1 > 10000;

The trick to reproduce is that the Explain statement in Session-2
needs to be executed immediately after Explain statement in
Session-1.

Attached patch fixes the issue for me.

I think here we can go for a somewhat more invasive fix as well, which is:
if the statement didn't find any workers, then reset the DSM and also
reset the execution tree (which in case of seq scan means clear the
parallel scan desc and maybe a few more fields in scan desc) such that it
performs seq scan. I am not sure how future-proof such a change would
be, because resetting some of the fields in execution tree and expecting
it to work in all cases might not be feasible for all nodes.
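
The distinction between the two shutdown paths can be sketched in a few lines of standalone C. This is only an illustration under simplifying assumptions: Gather, SharedScan and the gather_* functions are hypothetical, malloc/free stand in for creating and destroying the DSM segment, and the assert marks the spot where the real backend would dereference freed shared memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for the parallel scan descriptor living in shared memory. */
typedef struct SharedScan
{
    int next_block;
    int nblocks;
} SharedScan;

typedef struct Gather
{
    SharedScan *shared;    /* stand-in for the DSM segment */
    int         nworkers;  /* workers actually launched */
} Gather;

void
gather_init(Gather *g, int nblocks)
{
    g->shared = malloc(sizeof(SharedScan));
    if (g->shared == NULL)
        abort();
    g->shared->next_block = 0;
    g->shared->nblocks = nblocks;
    g->nworkers = 0;       /* assume no worker could be started */
}

/* Detach from workers only; the shared scan state survives. */
void
gather_shutdown_workers(Gather *g)
{
    g->nworkers = 0;
}

/* Full shutdown: also releases the shared state. */
void
gather_shutdown(Gather *g)
{
    gather_shutdown_workers(g);
    free(g->shared);
    g->shared = NULL;
}

/* Leader-side scan step: returns the next block number, or -1 when done. */
int
gather_leader_next_block(Gather *g)
{
    assert(g->shared != NULL);  /* full shutdown before this point is the bug */

    if (g->shared->next_block >= g->shared->nblocks)
        return -1;
    return g->shared->next_block++;
}
```

In this sketch, calling gather_shutdown() before the leader has finished scanning trips the assert, while gather_shutdown_workers() leaves the leader free to complete the scan on its own.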

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

fix_early_dsm_destroy_v1.patch (application/octet-stream)
diff --git a/src/backend/executor/nodeGather.c b/src/backend/executor/nodeGather.c
index b368b48..8d23205 100644
--- a/src/backend/executor/nodeGather.c
+++ b/src/backend/executor/nodeGather.c
@@ -190,7 +190,7 @@ ExecGather(GatherState *node)
 
 			/* No workers?  Then never mind. */
 			if (!got_any_worker)
-				ExecShutdownGather(node);
+				ExecShutdownGatherWorkers(node);
 		}
 
 		/* Run plan locally if no workers or not single-copy. */
#490Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#489)
Re: Parallel Seq Scan

On Wed, Nov 18, 2015 at 10:41 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think what's going on here is that when any of the sessions doesn't
get any workers, we shut down the Gather node, which internally destroys
the dynamic shared memory segment as well. However, the same is
needed as per current design for doing scan by master backend as
well. So I think the fix would be to just do shutdown of workers which
actually won't do anything in this scenario.

It seems silly to call ExecShutdownGatherWorkers() here when that's
going to be a no-op. I think we should just remove that line and the
if statement before it altogether and replace it with a comment
explaining why we can't nuke the DSM at this stage.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#491Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#490)
1 attachment(s)
Re: Parallel Seq Scan

On Thu, Nov 19, 2015 at 9:29 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Nov 18, 2015 at 10:41 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think what's going on here is that when any of the sessions doesn't
get any workers, we shut down the Gather node, which internally destroys
the dynamic shared memory segment as well. However, the same is
needed as per current design for doing scan by master backend as
well. So I think the fix would be to just do shutdown of workers which
actually won't do anything in this scenario.

It seems silly to call ExecShutdownGatherWorkers() here when that's
going to be a no-op. I think we should just remove that line and the
if statement before it altogether and replace it with a comment
explaining why we can't nuke the DSM at this stage.

Isn't it better to destroy the memory for readers array as that gets
allocated even if there are no workers available for execution?

Attached patch fixes the issue by just destroying readers array.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

fix_early_dsm_destroy_v2.patch (application/octet-stream)
diff --git a/src/backend/executor/nodeGather.c b/src/backend/executor/nodeGather.c
index b368b48..f090b2b 100644
--- a/src/backend/executor/nodeGather.c
+++ b/src/backend/executor/nodeGather.c
@@ -188,9 +188,16 @@ ExecGather(GatherState *node)
 				}
 			}
 
-			/* No workers?  Then never mind. */
+			/*
+			 * It is very well possible that no workers are available for
+			 * execution, but still we can't destroy the DSM as that will
+			 * be required for execution in master backend.
+			 */
 			if (!got_any_worker)
-				ExecShutdownGather(node);
+			{
+				pfree(node->reader);
+				node->reader = NULL;
+			}
 		}
 
 		/* Run plan locally if no workers or not single-copy. */
@@ -402,6 +409,8 @@ ExecShutdownGatherWorkers(GatherState *node)
 
 		for (i = 0; i < node->nreaders; ++i)
 			DestroyTupleQueueReader(node->reader[i]);
+
+		pfree(node->reader);
 		node->reader = NULL;
 	}
 
#492Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#491)
Re: Parallel Seq Scan

On Thu, Nov 19, 2015 at 11:59 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Isn't it better to destroy the memory for readers array as that gets
allocated even if there are no workers available for execution?

Attached patch fixes the issue by just destroying readers array.

Well, then you're making ExecShutdownGatherWorkers() not a no-op any
more. I'll go commit a combination of your two patches.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#493Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#492)
Re: Parallel Seq Scan

On Fri, Nov 20, 2015 at 11:34 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Nov 19, 2015 at 11:59 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Isn't it better to destroy the memory for readers array as that gets
allocated even if there are no workers available for execution?

Attached patch fixes the issue by just destroying readers array.

Well, then you're making ExecShutdownGatherWorkers() not a no-op any
more. I'll go commit a combination of your two patches.

Thanks!

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#494Michael Paquier
michael.paquier@gmail.com
In reply to: Amit Kapila (#493)
Re: Parallel Seq Scan

On Sun, Nov 22, 2015 at 3:25 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Nov 20, 2015 at 11:34 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Nov 19, 2015 at 11:59 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Isn't it better to destroy the memory for readers array as that gets
allocated even if there are no workers available for execution?

Attached patch fixes the issue by just destroying readers array.

Well, then you're making ExecShutdownGatherWorkers() not a no-op any
more. I'll go commit a combination of your two patches.

Thanks!

There is still an entry in the CF app for this thread as "Parallel Seq
scan". The basic infrastructure has been committed, and I understand
that this is a never-ending task and that there will be many
optimizations. Still, are you guys fine to switch this entry as
committed for now?
--
Michael

#495Amit Kapila
amit.kapila16@gmail.com
In reply to: Michael Paquier (#494)
Re: Parallel Seq Scan

On Wed, Dec 2, 2015 at 12:06 PM, Michael Paquier <michael.paquier@gmail.com> wrote:

On Sun, Nov 22, 2015 at 3:25 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Nov 20, 2015 at 11:34 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Nov 19, 2015 at 11:59 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Isn't it better to destroy the memory for readers array as that gets
allocated even if there are no workers available for execution?

Attached patch fixes the issue by just destroying readers array.

Well, then you're making ExecShutdownGatherWorkers() not a no-op any
more. I'll go commit a combination of your two patches.

Thanks!

There is still an entry in the CF app for this thread as "Parallel Seq
scan". The basic infrastructure has been committed, and I understand
that this is a never-ending task and that there will be many
optimizations. Still, are you guys fine to switch this entry as
committed for now?

I am fine with it. I think the further optimizations can be done
separately.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#496Michael Paquier
michael.paquier@gmail.com
In reply to: Amit Kapila (#495)
Re: Parallel Seq Scan

On Wed, Dec 2, 2015 at 5:45 PM, Amit Kapila wrote:

I am fine with it. I think the further optimizations can be done
separately.

Done.
--
Michael
